At Fishtown Analytics, we provide analytics consulting for venture-funded startups. Our clients range from A-round funded companies finding product-market-fit to late-stage companies in hyper-growth mode.
Because of this range of client profiles, we get exposure to data stacks of companies who are just starting out with analytics all the way up to some of the most sophisticated data organizations in the world. And the biggest change we’ve seen in this cohort over the past two years is a shift towards Snowflake and away from Redshift.
We are wholeheartedly in support of this shift. For a large number of reasons, which I’ll detail in an upcoming post, we’re heavily investing in Snowflake ourselves. We’re improving Snowflake support in dbt and we’re using it as the default when a new client needs to spin up a warehouse.
A year ago, we considered Snowflake a promising technology; today we consider it a default.
We’re not alone in this viewpoint. The dbt Slack community has been broadly engaged around the topic of Snowflake, with more and more Snowflake users popping up every week. And Snowflake as a company is clearly experiencing pretty serious growth with its recent capital raise.
We’ve now been involved in several Redshift >> Snowflake migrations. The driver here is typically that a company scales up their data organization and starts hitting Redshift concurrency issues. While it’s certainly possible to scale Redshift a very long way, it simply requires more effort to maintain a high-concurrency Redshift cluster than it does a similarly high-concurrency Snowflake cluster. When companies hit this stage, we frequently help them migrate.
So far, we’ve had a tremendous amount of success in the migrations we’ve done. We’ve found that query response times decrease, costs decrease, and concurrency issues are mitigated (all at the same time!). And these are meaningful Snowflake installations — we’ve migrated companies who have hundreds of scheduled jobs running and hundreds of data consumers in their warehouses.
It’s actually easier to do the migration than you might think. Many off-the-shelf ETL tools today support both warehouses, so loading raw data into Snowflake via these tools just requires some configuration changes. From there, it’s all about refactoring SQL.
Snowflake and Redshift both support a very similar flavor of ANSI-SQL, and Snowflake, for the most part, supports a superset of the functionality that’s included in Redshift. But when you attempt to migrate 25,000 lines of SQL from running on Redshift to running on Snowflake, you’ll come across plenty of language inconsistencies that you didn’t expect.
I wanted to provide my notes as I went through this process in the hopes that it will make other folks’ experiences smoother. The gap between the two SQL dialects is still larger than I’d like (as you’ll see in my notes below), but my hope is that it will continue to close over time. And rest assured: if you make it through this transition process, the benefits are quite significant.
Let’s dive in.
SQL Inconsistencies when Transitioning from Redshift to Snowflake
…in order from biggest to smallest. Please note that this list does not include data ingestion, only DML and SQL.
Snowflake Window Functions
This was the biggest bummer for me. Snowflake window functions just aren’t all the way there. They’re wholly adequate for most use cases, but I found some instances where they break down.
- Some Snowflake window functions — for example, avg() — don’t support sliding window frames. This means you can’t do trailing 30-day averages. Or, rather, you have to write a trailing 30-day average as a sum divided by a count. This inconsistency in implementation is irritating, and it’s not something you have to worry about when writing SQL on Redshift.
- Some Snowflake aggregate functions — for example, stddev_samp() — don’t support windowing at all. This became particularly painful for me in one instance where I was doing outlier detection in a timeseries (something we do semi-frequently) and realized that, without access to a windowed version of standard deviation, I was totally hosed. This was a very uncomfortable situation. I think it’s possible to write a windowed version of stddev_samp() using nothing but sum and count window functions, but I shouldn’t have to re-derive statistical formulas to write SQL.
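Both workarounds look something like this in practice. This is only a sketch: the daily_revenue table and its columns are hypothetical, and the standard deviation is re-derived from the identity stddev_samp(x) = sqrt((n * sum(x^2) - sum(x)^2) / (n * (n - 1))).

```sql
-- SNOWFLAKE: sum/count workarounds for missing sliding-window support
select
    date_day,

    -- trailing 30-day average, written as a sum divided by a count
    sum(revenue) over (order by date_day rows between 29 preceding and current row)
      / nullif(count(revenue) over (order by date_day rows between 29 preceding and current row), 0)
      as trailing_30d_avg,

    -- trailing 30-day sample standard deviation, rebuilt from sums and counts
    -- (nullif avoids a divide-by-zero on the first row, where n = 1)
    sqrt((
        count(revenue) over (order by date_day rows between 29 preceding and current row)
          * sum(revenue * revenue) over (order by date_day rows between 29 preceding and current row)
        - square(sum(revenue) over (order by date_day rows between 29 preceding and current row))
      ) / nullif(
        count(revenue) over (order by date_day rows between 29 preceding and current row)
          * (count(revenue) over (order by date_day rows between 29 preceding and current row) - 1)
      , 0)) as trailing_30d_stddev

from daily_revenue
```

It works, but repeating the frame clause six times to get one statistic is exactly the kind of thing you shouldn’t have to do.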
The lack of complete window function support isn’t just a headache; it’s a real pain point. When you adopt a data warehouse platform, you need to have confidence that it’s going to have 100% SQL support, not 99%. That final 1% is actually very important.
The only reason I’m not more up in arms about this is that the Snowflake product team does have a good track record of rounding out their SQL support over the past couple of years, and I’m hopeful that trend will continue here. If this hasn’t improved by EOY 2018, I will be much more upset.
Recommendation: Assess your exposure to this issue prior to making the decision to transition. It’s possible that this could be a show-stopper on your transition to Snowflake.
Date and Timestamp Math
Do you like to write Postgres-style timestamp math?
--REDSHIFT
SELECT '2011-12-31 01:00:00'::timestamp - '2010-09-17 23:00:00'::timestamp
I personally don’t, but a lot of Redshift users do. Snowflake doesn’t support this. Instead, use this format:
--SNOWFLAKE
SELECT DATEDIFF(day, '2010-09-17 23:00:00', '2011-12-31 01:00:00')
Similarly, you can do Postgres interval math on Redshift, like this:
--REDSHIFT
SELECT timestamp '2001-09-28 23:00' - interval '23 days'
Again, not supported. Rewrite this as follows:
--SNOWFLAKE
SELECT dateadd(day, -23, '2001-09-28 23:00')
Redshift supports both function syntaxes, so you might not have a lot of issues here if you already used the SQL Server-style syntax. But rewriting all of your date math can be a huge headache if you use a lot of Postgres-style syntax.
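As a rough cheat sheet, here are the rewrites that came up most often for me. The table and column names are hypothetical, and the pairs are roughly (not always exactly) equivalent:

```sql
-- SNOWFLAKE-friendly rewrites of common Postgres-style expressions
select
    datediff(day, created_at, updated_at)  as days_between,      -- was: updated_at::date - created_at::date
    dateadd(hour, 1, created_at)           as one_hour_later,    -- was: created_at + interval '1 hour'
    dateadd(month, -3, current_date)       as three_months_ago   -- was: current_date - interval '3 months'
from my_table
```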
I personally prefer the syntax that Snowflake has adopted, and I don’t mind that they force you into a single syntax instead of allowing multiple (as Redshift does). This forces analysts into common SQL patterns and promotes code readability within teams. But it can cause some pain when migrating.
Recommendation: Re-write all of your Postgres-style timestamp math.
Dealing with Time Zones
As any software engineer will tell you, time zones are always annoying. The Redshift approach to time zones is one that I find quite straightforward and is inherited directly from Postgres; either a timestamp has a time zone associated with it or it doesn’t. Timestamps without time zones are automatically assumed to be in UTC for the purposes of any time zone conversions unless you specify otherwise. Easy.
In Snowflake, things get more complicated. There are actually three timestamp formats instead of two: TIMESTAMP_NTZ (no time zone), TIMESTAMP_TZ (with time zone), and TIMESTAMP_LTZ (local time zone). Rather than attempting to explain these here, feel free to read the docs. On top of the extra data type, Snowflake also has an account-level time zone setting, which is (perplexingly) not defaulted to UTC.
In my experience, this additional functionality actually ends up creating more problems than it solves. For example, here’s how you get a current UTC timestamp (no time zone) in Redshift:
-- REDSHIFT
select current_timestamp::timestamp
Here’s how you do the same in Snowflake:
-- SNOWFLAKE
select convert_timezone('UTC', current_timestamp)::timestamp
It took me surprisingly long to figure out that simple snippet: navigating the docs and parsing out how the account default time zone interacts with the various timestamp data types is quite tricky. I think it’s actually better to skip all of this complexity and just use UTC timestamps in your warehouse. This will likely incur some refactoring costs in migrating from Redshift if that’s not the strategy you’re already using, as you’ll need to check how each of your data ingestion pipelines store their output timestamps. Once you have your data all stored in TIMESTAMP_NTZ, you’ll find that time zone logic is straightforward.
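Concretely, a normalization step at the edge of the warehouse might look like this. The raw_events table and event_at column are hypothetical; the pattern is what matters:

```sql
-- SNOWFLAKE: normalize an ingested timestamp to UTC and strip the time zone
select
    convert_timezone('UTC', event_at)::timestamp_ntz as event_at_utc
from raw_events;

-- and override the (non-UTC) account default for the current session
alter session set timezone = 'UTC';
```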
Recommendation: Convert all of the timestamps in your ingested data to UTC and store them as TIMESTAMP_NTZ. Let your BI layer deal with any timestamp conversions as the data gets presented.
Casing and Quotes
Snowflake handles case sensitivity very differently from Redshift or Postgres. When identifiers in Redshift aren’t quoted, Redshift will downcase them by default. So, for example:
-- REDSHIFT
create table test_table as select ...;
select * from test_table;    --works
select * from "test_table";  --works
select * from "TEST_TABLE";  --fails
Snowflake, on the other hand, upcases unquoted identifiers. This leads to some very surprising behavior, and means that you need to be very careful about quoting:
-- SNOWFLAKE
create table test_table as select ...;
select * from test_table;    --works
select * from "TEST_TABLE";  --works
select * from "test_table";  --fails
This is confusing going the other direction, too:
-- SNOWFLAKE
create table "test_table" as select ...;
select * from "test_table";  --works
select * from test_table;    --fails
select * from "TEST_TABLE";  --fails
This is very jarring at first. In other databases, you can get away with using quoting loosely. In Snowflake, you have to be really deliberate about what you do and don’t quote. Our general recommendation has been to quote as little as possible in Snowflake, since it can be annoying to have to type out your identifiers in all caps!
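The flip side is that unquoted identifiers are case-insensitive, so as long as you never quote, any capitalization resolves to the same table:

```sql
-- SNOWFLAKE
create table test_table as select ...;  --stored internally as TEST_TABLE
select * from test_table;  --works
select * from TEST_TABLE;  --works
select * from Test_Table;  --works
```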
Recommendation: Don’t quote your identifiers in Snowflake.
Snowflake’s functions are often less forgiving
In many cases, Redshift is more “forgiving” than Snowflake is. Here’s an example to illustrate the point:
--REDSHIFT
select coalesce('a')       --'a'
select coalesce('a', 'b')  --'a'
Note that Redshift allows you to coalesce a value with…nothing! This is a useless function call, but it’s very common that small idiosyncrasies like this show up when you’re converting an entire data warehouse installation and all the corresponding SQL. Snowflake throws an error if you attempt to do the same thing:
--SNOWFLAKE
select coalesce('a')       --error: "not enough arguments for function"
select coalesce('a', 'b')  --'a'
The Redshift method signature for coalesce() is just more forgiving — it accepts a single argument, whereas Snowflake’s requires at least two. This ends up occurring fairly frequently: Redshift always seems to be more forgiving, allowing SQL that isn’t always perfect and just making decisions about what the user probably intended. I don’t have strong feelings on what is “right” here — there are real benefits to Snowflake’s strictness as well. But it’s definitely something you’ll run into when doing a migration.
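In this particular case the fix is mechanical: a single-argument coalesce() is a no-op, so you can either delete the wrapper or, if you want a minimal diff, pad it with an explicit null:

```sql
--SNOWFLAKE
select 'a'                  --'a': the coalesce() wrapper was a no-op
select coalesce('a', null)  --'a': or keep the call and satisfy the two-argument signature
```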
Recommendation: Allocate plenty of time to troubleshoot minor SQL issues in your existing code. Snowflake will likely find many opportunities to yell at you about small things.
Snowflake’s data type conversions are often less forgiving
Take a look at this query:
with test_data as (
    select '2018-04-26 13:26:38' as ts
    union all
    select '2018-04-26 1:26:38 PM'
)

select ts::timestamp from test_data
This is a pretty common situation when you have heterogeneous data sources: two different data sources define the same column with slightly different timestamp formats, one in 24-hour time and one in HH12 AM. When Redshift sees both of those formats in the same column and you’ve attempted to cast the entire column as a timestamp, it auto-recognizes the differences and converts both outputs to 2018-04-26 13:26:38. This makes life very easy. When Snowflake sees the same, it throws an error:
Timestamp '2018-04-26 1:26:38 PM' is not recognized.
This continues the trend of Snowflake being overall less forgiving than Redshift. It’s not a problem, per se, but it’ll force you to write a regular expression that checks the timestamp format and then applies a specific conversion format string based on what it sees. This same situation occurs when converting to numeric data types as well.
Recommendation: If a column has heterogeneous timestamp formats, explicitly convert each different format using regexp_replace() before casting the whole column.
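One way to sketch that logic is to use a regex test to choose a conversion format string, per the example above. This handles only the two formats in that example; real data will need more branches:

```sql
-- SNOWFLAKE
with test_data as (
    select '2018-04-26 13:26:38' as ts
    union all
    select '2018-04-26 1:26:38 PM'
)

select
    case
        when regexp_like(ts, '\\d{4}-\\d{2}-\\d{2} \\d{1,2}:\\d{2}:\\d{2} (AM|PM)')
            then to_timestamp(ts, 'YYYY-MM-DD HH12:MI:SS AM')
        else to_timestamp(ts, 'YYYY-MM-DD HH24:MI:SS')
    end as ts_clean
from test_data
```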
Assigning Column Names in DDL
Snowflake requires you to explicitly name all columns when you are creating tables and views. This is good practice, but often Redshift DDL can get a little sloppy and Redshift is happy to provide default column names for you. Here’s how this plays out in practice:
--SNOWFLAKE
select current_timestamp();  --works
create table test_table as select current_timestamp() as ts;  --works
create table test_table as select current_timestamp();  --fails, "Missing column specification"
Redshift has no problems with the last of those statements and will just name the column current_timestamp for you. This might not be a big problem for you, but the code I was transitioning actually had this come up quite a lot, as we were running tons of DDL on Redshift that had implicit column names.
Recommendation: Make sure all of your columns are explicitly named in your DDL.