Efficient sampling of a fixed number of rows in Google BigQuery
I have a large data set of size N, and want to get a (uniformly) random sample of size n. There are two possible solutions:
SELECT foo FROM mytable WHERE RAND() < n/N
This is fast, but doesn't give me exactly n rows (only approximately).
SELECT foo, RAND() as r FROM mytable ORDER BY r LIMIT n
This gives exactly n rows, but it requires sorting all N rows, which seems unnecessary and wasteful (especially if n << N).
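To make the trade-off between the two queries concrete, here is a minimal Python sketch that mimics them on an in-memory list. It is an illustration only, not how BigQuery executes either query; the function names and the toy data are made up for this example. The second function uses heapq.nsmallest, which keeps only n candidates in memory, to show that picking the n smallest random keys does not strictly need a full sort.

```python
import heapq
import random


def bernoulli_sample(rows, n, rng):
    """Mimics WHERE RAND() < n/N: keep each row independently with probability n/N."""
    total = len(rows)
    return [row for row in rows if rng.random() < n / total]


def order_by_rand_sample(rows, n, rng):
    """Mimics ORDER BY RAND() LIMIT n: tag each row with a random key, keep the n smallest keys."""
    keyed = ((rng.random(), row) for row in rows)
    # nsmallest tracks only n candidates at a time instead of sorting everything.
    return [row for _, row in heapq.nsmallest(n, keyed)]


rng = random.Random(0)          # fixed seed so the run is repeatable
rows = list(range(100_000))     # hypothetical stand-in for mytable (N = 100,000)
n = 100

approx = bernoulli_sample(rows, n, rng)
exact = order_by_rand_sample(rows, n, rng)

print(len(approx))  # close to n, but usually not exactly n
print(len(exact))   # always exactly n
```

The first approach returns a row count that fluctuates around n, while the second always returns exactly n distinct rows, at the cost of generating and comparing a random key for every row.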
An easy way to get a sample from BigQuery is to use LIMIT without ORDER BY:
If you execute the following query several times with cached results disabled, you can get different results each time.
SELECT * FROM `bigquery-samples.wikipedia_benchmark.Wiki1B` LIMIT 5
Note that the rows returned this way are arbitrary rather than uniformly random. So, depending on how random the sample needs to be, this may be a better solution.