How to count distinct values over multiple columns using SQL

Often we want to count the number of distinct items from this table but the distinct is over multiple columns

Method-1 Using a derived table (subquery)

You can simply create a select distinct query and wrap it inside of a select count(*) sql, like shown below:

SELECT COUNT(*) 
FROM ( SELECT DISTINCT DocumentId, DocumentSessionId
FROM Table ) AS internalQuery

Method-2 Using Concatenated columns

A simpler method would be to use concatenated columns

SELECT COUNT(DISTINCT(CONCAT(DocumentId,DocumentSessionId))) FROM Table;

Method-3 If performance is a factor

If you end up doing this often and over large tables, you can consider adding an additional column to the table itself (physicalize), which concatenates all the columns and computes a hash on it

Eg: you can add a new column to the table and store 
MD5(CONCAT(DocumentId,DocumentSessionId)), 
so you can easily count distinct on this new column going forward.

You can even consider creating indexes and or compute statistics on this new column in order to improve performance

You can use these techniques in,

  • Snowflake Data Warehouse
  • Google Bigquery
  • Teradata
  • Any Other SQL database

nVector

posted on 20 Jun 20

Enjoy great content like this and a lot more !

Signup for a free account to write a post / comment / upvote posts. Its simple and takes less than 5 seconds




kahumphrey16-Jul-20

Maybe I am a lil old fashion, but why would not the following query work:

SELECT  DocumentId, DocumentSessionId,count(*)

FROM Table

group by DocumentId, DocumentSessionId

nVector17-Jul-20

Sure. This query will produce the list of distinct groups and count per each group (the result will be one or more rows). The article is about, how to get distinct count for entire table (returns consolidated one row)