24 Jul 20 · npack · #Bigquery

Google Cloud Professional Data Engineer Certification Exam - Study guide and Practice tests

I recently cleared the Professional Data Engineer certification exam, and while the question topics are still fresh in my mind, I want to share some tips to help you prepare for and ace it.

Why would you want to do a Google Cloud Professional Data Engineer Certification?

You may already have the skills to use Google Cloud, but how do you demonstrate this to a future employer or client? There are two ways: through a portfolio of projects or through a certification.

A certificate says to future clients and employers, ‘Hey, I’ve got the skills and I’ve put in the effort to get accredited.’

Who is it for?

If you’re already a data scientist, data engineer, data analyst, or machine learning engineer, or you’re looking for a career change into the world of data, the Google Cloud Professional Data Engineer Certification is for you.

How much does it cost?

Sitting the certification exam costs 200 USD. If you fail, you have to pay the fee again to resit.

What is the exam format?

  • Length: 2 hours
  • Registration fee: $200 (plus tax where applicable)
  • Languages: English, Japanese
  • Exam format: 50 multiple-choice and multiple-select questions
  • Exam delivery method: online-proctored from your home, or onsite-proctored at a testing center
  • Prerequisites: None
  • Recommended experience: 3+ years of industry experience including 1+ years designing and managing solutions using GCP.

Course Topics to Prepare:

You will need to be familiar with:

  • 5 database products (BigQuery, Bigtable, Cloud SQL, Cloud Spanner, Datastore)
  • 3 ETL tools (Dataflow, Dataprep, Dataproc)
  • 5 misc products (Stackdriver, Cloud Composer, Cloud Storage, Pub/Sub, IoT Core)
  • Machine learning basics and cloud basics (e.g. zones, regions)

Overwhelmed? Don't worry, let's take it one small step at a time:

BigQuery:

Bigtable:

  • Understand Google Bigtable: a NoSQL, low-latency database
  • Bigtable schema design: learn the best practices for choosing row keys
  • In general, Bigtable performs well with tall, narrow tables, so limit the number of columns
  • Common use cases: time-series data, IoT data, real-time stock market data
  • Understand nodes, and how to scale read and write throughput by choosing the number of nodes and the storage type (SSD vs HDD)
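
As a rough illustration of the row key advice above, here is a small plain-Python sketch for time-series data. It does not use the Bigtable client. The `sensor_id#reversed_timestamp` layout, the function name, and the `MAX_TS_MS` constant are my own illustrative assumptions, not an official schema. The idea is that leading with the device ID instead of the timestamp spreads writes across the key space, and reversing the timestamp makes the newest reading for a device sort first.

```python
# Hypothetical row key builder for time-series data; all names are illustrative.
MAX_TS_MS = 10**13  # assumed upper bound on epoch milliseconds (13 digits)


def make_row_key(sensor_id: str, event_ts_ms: int) -> str:
    """Build a key of the form '<sensor_id>#<reversed timestamp>'."""
    if not 0 <= event_ts_ms < MAX_TS_MS:
        raise ValueError("timestamp out of supported range")
    # Newer events get a smaller number, so they sort first within a sensor.
    reversed_ts = MAX_TS_MS - 1 - event_ts_ms
    # Zero-pad so that lexicographic order matches numeric order.
    return f"{sensor_id}#{reversed_ts:013d}"


older = make_row_key("sensor-42", 1_595_548_800_000)
newer = make_row_key("sensor-42", 1_595_548_860_000)
print(sorted([older, newer])[0] == newer)  # the newer reading sorts first
```

A key that starts with the raw timestamp would send all current writes to the same part of the key space, which is the hotspotting pattern the schema design guidance warns about.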

Cloud SQL:

  • Cloud SQL is a good fit for transactional databases where, as a rule of thumb, the data volume is under about 4 TB and there are at most about 4,000 concurrent connections
  • Choose Cloud SQL when the customer wants to move the application to the cloud as is and does not want to spend on rewriting it
  • Understand how to install the Stackdriver agent to monitor database performance

Cloud Spanner:

  • Cloud Spanner is a good fit for transactional databases that need horizontal scalability, a large number of concurrent connections, and global consistency
  • Cloud Spanner supports primary keys and secondary indexes. Learn how to choose them and how they affect data distribution across nodes and retrieval
  • Learn how to improve performance by tuning options such as the number of nodes
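
The Cloud SQL vs Cloud Spanner choice tends to show up as scenario questions, so it can help to write the rule of thumb down as code. The sketch below is only that: the thresholds are the ones quoted in this guide (real product limits change over time), and the `Workload` class and function name are hypothetical.

```python
from dataclasses import dataclass

# Thresholds quoted in this guide; treat them as a rule of thumb, not limits.
CLOUD_SQL_MAX_TB = 4
CLOUD_SQL_MAX_CONNECTIONS = 4000


@dataclass
class Workload:
    """Hypothetical description of a transactional workload."""
    data_tb: float
    peak_connections: int
    needs_global_consistency: bool = False


def pick_transactional_db(w: Workload) -> str:
    """Return 'Cloud SQL' or 'Cloud Spanner' using the guide's rule of thumb."""
    if w.needs_global_consistency:
        return "Cloud Spanner"
    if w.data_tb >= CLOUD_SQL_MAX_TB or w.peak_connections > CLOUD_SQL_MAX_CONNECTIONS:
        return "Cloud Spanner"
    return "Cloud SQL"


print(pick_transactional_db(Workload(data_tb=0.5, peak_connections=200)))
print(pick_transactional_db(Workload(data_tb=20, peak_connections=200)))
```

In an exam scenario, also look for wording such as "lift and shift" or "no code changes", which points towards Cloud SQL regardless of the numbers.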

Datastore:

  • Understand what Datastore is and its use cases

Dataflow:

  • Understand what Dataflow is (it runs Apache Beam pipelines)
  • Get familiar with Dataflow (Apache Beam) concepts such as I/O connectors, side inputs, side outputs, and runners
  • Get familiar with the types of windowing: fixed vs sliding vs session windows
  • How to increase the performance of Dataflow jobs
  • How to update a running streaming job when the code changes
  • Understand the difference between stopping a Dataflow job by cancelling it vs draining it
  • Get familiar with PCollections and how they are used
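
To make the three window types concrete, here is a minimal plain-Python sketch of how an event timestamp could be assigned to each kind of window. This is not the Apache Beam API; the function names are my own, timestamps are plain integers in seconds, and windows are half-open intervals `[start, end)`.

```python
from typing import List, Tuple

Window = Tuple[int, int]  # half-open interval [start, end) in seconds


def fixed_window(ts: int, size: int) -> Window:
    """Each event belongs to exactly one non-overlapping window."""
    start = ts - ts % size
    return (start, start + size)


def sliding_windows(ts: int, size: int, period: int) -> List[Window]:
    """Windows of length `size` start every `period`; an event can be in several."""
    last_start = ts - ts % period
    # Walk backwards over every start that still satisfies start > ts - size.
    starts = range(last_start, ts - size, -period)
    return sorted((s, s + size) for s in starts)


def session_windows(timestamps: List[int], gap: int) -> List[Window]:
    """Events closer together than `gap` are merged into one session."""
    sessions: List[Window] = []
    for ts in sorted(timestamps):
        if sessions and ts < sessions[-1][1]:
            # Event falls before the session times out: extend the session.
            sessions[-1] = (sessions[-1][0], ts + gap)
        else:
            sessions.append((ts, ts + gap))
    return sessions


print(fixed_window(65, size=60))
print(sliding_windows(65, size=60, period=30))
print(session_windows([1, 3, 20], gap=5))
```

Fixed windows suit regular reports (for example per-minute counts), sliding windows suit moving averages, and session windows suit bursts of user activity separated by idle time.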

Dataprep:

  • Understand what Dataprep is. It is mostly used by analysts who prefer a GUI, and is commonly used for cleansing data whose schema changes frequently
  • Dataprep internally uses Dataflow to run the pipeline

Dataproc:

  • Understand what Dataproc is
  • On-prem Hadoop clusters can be migrated to Dataproc. Use Cloud Storage in lieu of HDFS and Bigtable in place of HBase
  • It is a best practice to use a Dataproc cluster for only one type of job and delete the cluster on completion
  • If cost is an issue, you can use preemptible VMs for worker nodes
  • Understand how to run initialization actions on your cluster
  • Understand Spark job processing and how to improve the performance of the Dataproc cluster

Cloud Storage:

  • Go through Cloud Storage thoroughly. Most questions touch upon it
  • Understand the difference between regional and multi-regional buckets, the associated costs, and what high availability means
  • Understand the various storage classes - Standard, Nearline, Coldline, Archive
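
One way to remember the storage classes is by their minimum storage durations: none for Standard, 30 days for Nearline, 90 days for Coldline, and 365 days for Archive. The sketch below turns that into a simple picker. The function name and the idea of choosing purely on "days between reads" are my own simplification; a real decision would also weigh retrieval and storage prices.

```python
# Minimum storage duration in days for each class (Standard has none).
MIN_STORAGE_DAYS = {"STANDARD": 0, "NEARLINE": 30, "COLDLINE": 90, "ARCHIVE": 365}


def suggest_storage_class(days_between_reads: int) -> str:
    """Pick the coldest class whose minimum duration fits the access pattern."""
    choice = "STANDARD"
    # Dicts keep insertion order, so this walks from warmest to coldest.
    for name, min_days in MIN_STORAGE_DAYS.items():
        if days_between_reads >= min_days:
            choice = name
    return choice


for days in (7, 45, 120, 400):
    print(days, suggest_storage_class(days))
```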

Pub/Sub:

  • Decouples senders and receivers. Stores unacknowledged messages for up to 7 days. Delivers each message at least once, and messages may arrive out of order
  • Used for most real-time application use cases
  • Understand Pub/Sub topics and subscriptions
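
Because messages may arrive out of order, and Pub/Sub delivery is at-least-once so the same message can show up more than once, subscribers are usually written to tolerate both. Here is a minimal plain-Python sketch of that idea. The message dictionaries with `message_id` and `event_time` keys are a hypothetical shape for illustration, not the Pub/Sub client library's message object.

```python
from typing import Dict, Iterable, List, Set


def process_batch(messages: Iterable[Dict], seen_ids: Set[str]) -> List[Dict]:
    """Drop redelivered messages, then order the rest by event time."""
    fresh = []
    for msg in messages:
        if msg["message_id"] in seen_ids:
            continue  # duplicate caused by redelivery: skip it
        seen_ids.add(msg["message_id"])
        fresh.append(msg)
    # Sort by when the event happened, not by when it was received.
    return sorted(fresh, key=lambda m: m["event_time"])


seen: Set[str] = set()
batch = [
    {"message_id": "b", "event_time": 20},
    {"message_id": "a", "event_time": 10},
    {"message_id": "b", "event_time": 20},  # redelivered copy
]
print([m["message_id"] for m in process_batch(batch, seen)])
```

In a streaming pipeline this bookkeeping is typically handled by the processing layer (for example Dataflow) instead of hand-written code, but the underlying idea is the same.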

Misc Products:

Briefly touch upon Stackdriver, Cloud Composer, and IoT Core.

Machine Learning Basics:

  • Refresh your machine learning concepts using this quick guide
  • Understand when to build your own model vs using AutoML vs Google's pre-trained ML APIs (Vision API, Translation API, etc.)
  • Learn about Google AI Platform and how to use it to train and serve models
  • Learn about the hardware that Google offers for training a model: CPUs, GPUs, and TPUs

Good luck with your exam!

