20 Jun 20 · npack · #general ·   Bookmark   ×

How to count distinct values over multiple columns using SQL

Often we want to count the number of distinct items from this table but the distinct is over multiple columns

Method-1 Using a derived table (subquery)

You can simply create a select distinct query and wrap it inside of a select count(*) sql, like shown below:

SELECT COUNT(*) 
FROM ( SELECT DISTINCT DocumentId, DocumentSessionId
FROM Table ) AS internalQuery

Method-2 Using Concatenated columns

A simpler method would be to use concatenated columns

SELECT COUNT(DISTINCT(CONCAT(DocumentId,DocumentSessionId))) FROM Table;

Method-3 If performance is a factor

If you end up doing this often and over large tables, you can consider adding an additional column to the table itself (physicalize), which concatenates all the columns and computes a hash on it

Eg: you can add a new column to the table and store 
MD5(CONCAT(DocumentId,DocumentSessionId)), 
so you can easily count distinct on this new column going forward.

You can even consider creating indexes and or compute statistics on this new column in order to improve performance

You can use these techniques in,

  • Snowflake Data Warehouse
  • Google Bigquery
  • Teradata
  • Any Other SQL database

npack

posted on 20 Jun 20

Enjoy great content like this and a lot more !

Signup for a free account to write a post / comment / upvote posts. Its simple and takes less than 5 seconds