27 September 2018 · npac ·       Add to Favorites   Report

How to remove duplicate rows in BigQuery based on a unique identifier

Consider an example table:

id  value    
1 4 1 2 1 3
2 5
3 7
4 8

What would be the process of removing duplicate records based on a unique ID in BQ?. You can use the ROW_NUMBER function

Deleting the duplicates in the same table

delete from dataset.tablename where (id,value) in 
(
select a.* from
(
select id, value, ROW_NUMBER() over (Partition by id order by value desc) as rn from dataset.tablename
)a where a.rn <> 1
);

And the result:

id  value    
1 4
2 5
3 7
4 8 

NOTE: The above method can be used only when the value column is different across rows. In the case of full-row duplicates, you will not be able to use the above method. Instead, create a new table and only insert the unique ids into the new table using ROW_NUMBER.

npac

posted on 27 September 2018

Read great educational content like this and a lot more !

Members get free exclusive access to content, new courses, and discounts. Signup for a free account to write a post / comment / upvote posts. Creating an account takes less than 5 seconds and you can start earning badges & points too

Copied