21 Dec 18 · npack · #snowflake ·   Bookmark  

×

Real experience stories from Snowflake end users

I stumbled upon this post on Hackernews, where the end-users discuss their experience with Snowflake. User DeGi shares his story, in his own words,

"We have ~80TB of (compressed) data in Snowflake at Celtra and I'm working with Snowflake on a daily basis. We've been using it for the last ~1 year in production. Overall maintenance is minimal and the product is very stable.

Pros:

  • Support for semi-structured nested data (think json, avro, parquet) and querying this in-database with custom operators
  • Separation of compute from storage. Since S3 is used for storage, you can just spawn as many compute clusters as needed - no congestion for resources.
  • CLONE capability. Basically, Snowflake allows you to do a zero-copy CLONE, which copies just the metadata, but not the actual data (you can clone a whole database, a particular schema or a particular table). This is particularly useful for QA scenarios because you don't need to retain/backup/copy over a large table - you just CLONE and can run some ALTERs on the clone of the data. Truth be told, there are some privilege bugs there, but I've already reported those and Snowflake is working on them.
  • Support for UDFs and Javascript UDFs. We've had to do a full ~80TB table rewrite and being able to do this without copying data outside of Snowflake was a massive gain.
  • Pricing model. We did not like query-based model of BigQuery a lot, because it's harder to control the costs.
  • Storage on Snowflake costs the same as S3 ($27/TB compressed), BigQuery charges for scans of uncompressed data.
  • Database-level atomicity and transactions (instead of table-level on BigQuery)
  • Seamless S3 integration. With BigQuery, we'd have to copy all data over to GCS first.
  • JDBC/ODBC connectivity. At the time we were evaluating Snowflake vs. BigQuery (1.5 years ago, BigQuery didn't support JDBC)
  • You can define separate ACLs for storage and compute
  • Snowflake was faster when the data size scanned was smaller (GBs)
  • Concurrent DML (insert into the same table from multiple processes - locking happens on a partition level)
  • Vendor support
  • ADD COLUMN, DROP COLUMN, RENAME all work as you would expect from a columnar database
  • Some cool in-database analytics functions, like HyperLogLog objects (that are aggregatable)

Cons:

  • Nested data is not first-class. It's supported by semi-structured VARIANT data type, but there is no schema if you use this. So you can't have nested data + define a schema both at the same time, you have to pick just one.
  • Snowflake uses a proprietary data storage format and you can't access data directly (even though it sits on S3). For example, when using Snowflake-Spark connector, there is a lot of copying of data going on: S3 -> Snowflake -> S3 -> Spark cluster, instead of just S3 -> Spark cluster.
  • BigQuery was faster for full table scans (TBs)
  • Does not release locks if the connection drops. It's pain to handle that yourself, especially if you can't control the clients which are killed.
  • No indexes. Also, no materialized views. Snowflake allows you to define Clustering keys, which will retain sort order (not global!), but it has certain bugs and we've not been using it seriously yet. Particularly, it doesn't seem to be suited for small tables, or tables with frequent small inserts, as it doesn't do file compaction (number of files just grows, which hits performance).
  • Which brings me to the next point. If your use case is more streaming in nature (more frequent inserts, but smaller ones), I don't think Snowflake would handle this well. For one use case, we're inserting every minute, and we're having problems with the number of files. For another use case, we're ingesting once per hour and this works okay.

Some (non-obvious) limitations:

  • 50 concurrent queries/user
  • 150 concurrent queries/account
  • streaming use cases (look above)
  • 1s connection times on ODBC driver (JDBC seems to be better) 


Snowflake responded to this review,

  • The concurrency limits mentioned above are soft limits that can be raised on customer request (those defaults are there so that runaway applications can be detected easily). Snowflake can handle very high concurrency--we have customers running hundreds of concurrent queries.
  • We’ve recently released a new Spark connector with a bunch of optimizations, including additional push-down capabilities that speed up performance significantly.
  • The clustering capability is currently in "preview", we're definitely taking input and have been working on incorporating feedback we've received so far into it.
  • One important thing to note when it comes to full table scans is that Snowflake allows you to choose how much horsepower you apply to the job, so you can easily adjust the horsepower to get faster scans

Whats you story ? How do you like snowflake ? tell us in the comments.

npack

posted on 21 Dec 18

Enjoy great content like this and a lot more !

Signup for a free account to write a post / comment / upvote posts. Its simple and takes less than 5 seconds




gazulasJul 18 01:45

Though its a old post, but it has good info


Powered by Hittly.com
Copied