How source data’s ordering can affect a table's performance in Snowflake
Snowflake recommends not to define a cluster key unless the table performs poorly. With no Cluster key, how is the table partitioned? Here's the secret,
Snowflake micro-partitions, and stores the data in the same order as they arrive from the source!
Here are the good things about it:
- If you loading data as you go, like for example Sales data. The data arrives every day and stored in its own micro-partitions and also maintains metadata about the partition like the min(date), max(date) etc. So if you query the sales table based on Sale_Date, It is going to only check a few partitions to retrieve your results.
- Less overhead - you don't have to maintain the cluster keys/partitioning. Snowflake does it all
Some bad things about Snowflake's clustering:
If your source data is not arriving in the same order, then its a mess. For example, Let us assume, Our Sales table has 5000 partitions, You will end up having the same Sale_Date present in most / all of the partitions as our data is completely in random order. And when you query the table you will end up doing an all partition scan, which is a very costly time-consuming operation
Uhh, Well, What if I take control with manual clustering?
Even if you specify a cluster key, Snowflake won't automatically cluster your data on insertion into the table, in order to do that you need to run a "recluster" command manually every time. But it does fix our random data problem
So, Here are some best practices wrt clustering while you load the data
If you're inserting data into a table and you know what the clustering key is then I think it is better to sort before you insert so you don't need to manage cluster keys as much
If you have multiple columns that could be used to query the table, Its also a good practice to sort the source data by all those columns in the same order. For example, Sort Sales data by Sale_Date ascending, Customer_Nbr ascending, and so on...
Loading historical data? Sort your data based on the date field first before you load it to snowflake.
If you already have a table with data in random partitions. Simply create another table and insert the data to it in sorted order from the old table. You can drop the old table and swap their names
What are the best practices that you follow with the data sorting and clustering? Tell us in the comments