BigQuery count distinct vs count of group by colx
I have been under the impression that if you were to do a
COUNT(DISTINCT xyz) on some column, it would be equal to the regular count of a
GROUP BY that column.
However, when I do that over a very large dataset in BigQuery, with the exact same conditions, it is showing a large difference in results:
Query Type Count ---------------------------------- - count(distinct ColX) > 7 million - count(ColX) ... GROUP BY ColX ~ 6.5 million
In google bigquery - If you use the DISTINCT keyword, the function returns the number of distinct values for the specified field. Note that the returned value for DISTINCT is a statistical approximation and is not guaranteed to be exact - the documentation is also clear about this.
- To compute the exact number of distinct values, use
- For a more scalable approach, consider using GROUP BY on the relevant field(s) and then applying COUNT(*). The GROUP BY approach is more scalable but might incur a slight up-front performance penalty
Read great educational content like this and a lot more !
Members get free exclusive access to content, new courses, and discounts. Signup for a free account to write a post / comment / upvote posts. Creating an account takes less than 5 seconds and you can start earning badges & points too