31 May 2019 · Asran

How to efficiently sample a fixed number of rows in Google BigQuery

I have a large data set of size N, and want to get a (uniformly) random sample of size n. There are two possible solutions:

SELECT foo FROM mytable WHERE RAND() < n/N

This is fast, but doesn't give me exactly n rows (only approximately).
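As a rough illustration, the first query could be written out in BigQuery standard SQL as below. This is only a sketch: `mytable` and `foo` are the placeholder names from the question, the sample size n = 1000 is an assumed value, and N is computed with a scalar subquery rather than hard-coded.

```sql
-- Sketch only: mytable and foo are placeholders, n = 1000 is an assumed sample size.
-- RAND() returns a FLOAT64 in [0, 1) and is evaluated once per row, so each row
-- is kept independently with probability n/N.
SELECT foo
FROM mytable
WHERE RAND() < 1000 / (SELECT COUNT(*) FROM mytable);  -- "/" yields FLOAT64 in standard SQL
```

Because every row is kept independently, the number of rows returned follows a binomial distribution: it averages n but typically varies by about sqrt(n) from run to run.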

SELECT foo, RAND() as r FROM mytable ORDER BY r LIMIT n

This requires sorting all N rows, which seems unnecessary and wasteful (especially if n << N).
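One commonly suggested way to combine the two ideas, which is not part of the original question, is to filter first to a small multiple of n and then sort only the survivors. The sketch below assumes BigQuery standard SQL, the placeholder names `mytable` and `foo`, an assumed n = 1000, and an oversampling factor of 2 chosen arbitrarily.

```sql
-- Sketch only: mytable and foo are placeholders, n = 1000 and the factor 2 are assumptions.
-- Step 1 (inner query): keep each row with probability 2n/N, so about 2n rows survive.
-- Step 2 (outer query): order only those survivors by a random key and keep exactly n.
SELECT foo
FROM (
  SELECT foo, RAND() AS r
  FROM mytable
  WHERE RAND() < 2 * 1000 / (SELECT COUNT(*) FROM mytable)
)
ORDER BY r
LIMIT 1000;
```

The sort now touches roughly 2n rows instead of N. If the filter happens to keep fewer than n rows, the query returns fewer than n; a larger oversampling factor makes that less likely at the cost of a bigger sort.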

A simpler option, if the sample does not need to be strictly uniform:

If you run the following query several times with cached results disabled, you can get different rows each time, because without an ORDER BY clause BigQuery does not guarantee which rows a LIMIT returns.

SELECT *
FROM `bigquery-samples.wikipedia_benchmark.Wiki1B`
LIMIT 5

So, depending on how random the sample needs to be, this may be a better solution: it returns exactly n rows without sorting, but the rows are arbitrary rather than uniformly random.

Asran

posted on 31 May 2019
