Hashing functions in Google Bigquery (MD5, SHA1..) - Generate Unique Id concatenating multiple fields

A hash function can be used to map data of arbitrary size to fixed-size values. The values returned by a hash function are called hash values, hash codes, digests, or simply hashes and they do not change over time (are constant). Hashing is a one way algorithm, you cannot decrypt a hashed key.

Why do you need to generate Hashed keys?

In certain tables, the key columns can contain sensitive data like SSN, or other PII. Instead of exposing them as join keys, its a good idea to hash them and then store them in the table as the join field

In this guide we will go through the various hashing functions in BigQuery

FARM_FINGERPRINT

This function hashes the input string and returns a fixed integer. The output of this function for a particular input will never change.

WITH example AS (
SELECT 1 AS x, "foo" AS y, true AS z UNION ALL
SELECT 2 AS x, "apple" AS y, false AS z UNION ALL
SELECT 3 AS x, "" AS y, true AS z
)
SELECT
*,
FARM_FINGERPRINT(CONCAT(CAST(x AS STRING), y, CAST(z AS STRING)))
AS row_fingerprint
FROM example;
+---+-------+-------+----------------------+
| x | y | z | row_fingerprint |
+---+-------+-------+----------------------+
| 1 | foo | true | -1541654101129638711 |
| 2 | apple | false | 2794438866806483259 |
| 3 | | true | -4880158226897771312 |
+---+-------+-------+----------------------+

MD5 Algorithm

The MD5 Algorithm computes the hash of the input. The input can either be STRING or BYTES. The string version treats the input as an array of bytes. This function returns 16 bytes hashed output

SELECT MD5("Hello World") as md5;
+--------------------------+
| md5 |
+--------------------------+
| sQqNsWTgdUEFt6mb5y4/5Q== |
+--------------------------+

SHA1 Algorithm

Computes the hash of the input using the SHA-1 algorithm. The input can either be STRING or BYTES. The string version treats the input as an array of bytes. This function returns 20 bytes

SELECT SHA1("Hello World") as sha1;
+------------------------------+
| sha1 |
+------------------------------+
| Ck1VqNd45QIvq3AZd8XYQLvEhtA= |
+------------------------------+

SHA256 & SHA512 Algorithm

Computes the hash of the input using the SHA-256 / 512 algorithms respectively. The input can either be STRING or BYTES. The string version treats the input as an array of bytes. This function returns 32 bytes

SELECT SHA256("Hello World") as sha256;
SELECT SHA512("Hello World") as sha512;

nVector

posted on 05 Sep 20

Enjoy great content like this and a lot more !

Signup for a free account to write a post / comment / upvote posts. Its simple and takes less than 5 seconds