The great migration from Redshift to Snowflake

Rafay Aleem · Published in The Craft · Feb 28, 2023

Faire was founded in 2017 and has since grown rapidly from a handful of people to more than 1,000 team members, supporting over 600K retailers and over 85K brands around the world.

This fast-paced growth brought a lot of technical challenges, and in turn, our dependency on data grew much faster than we had ever imagined.

Our internal data stakeholders are embedded across a wide variety of roles, such as Business Analytics, Recommendations for ML, Data Science, A/B testing, Marketing, and Sales, all of whom require some level of access to data in order to “seek the truth”, one of our core values at Faire.

When our initial data capabilities were built, we used Redshift as our data warehouse. In this article, we’ll walk through the challenges we faced as we scaled and how we ultimately made the decision to migrate to Snowflake.

Why did we migrate to Snowflake?

Scaling pain

In our experience, Redshift had a very high maintenance cost and needed constant effort to keep it fast enough to support the needs of a growing data org. I was hired as the second data engineer, when the entire company was around 300 Faire folk. The workloads we were trying to support with our Redshift cluster were roughly the following:

  • More than 100 Airflow ETLs
  • An experimentation and machine learning platform supported by Airflow ETLs
  • Data analytics reports and dashboards powered by Mode Analytics
  • SageMaker and local Jupyter notebooks
  • Machine learning apps running in AWS Batch

Redshift couples compute and storage, unless you are using RA3 instances. Therefore, you cannot isolate different workloads over the same data, and this coupling meant that we had to be very careful about table/query design and space management.

If you are not controlling the design of each and every query or table in Redshift, answering “why is my query so slow?” becomes a tedious process. To understand the extent of the work involved, answering that question required a Data Engineer (DE) to work through a multi-step triage flow from our on-call handbook.

With so many systems working off the data warehouse, it became very challenging to optimize the cluster and Workload Management (WLM) queues while maintaining SLAs.

Redshift’s serializable isolation level

At that time, Redshift only supported serializable isolation level (Snapshot isolation level was introduced in May 2022), which proved problematic for our use cases in particular. To fully comprehend this, it’s pertinent to understand how Redshift locking works. Redshift has 3 types of locks:

  • AccessExclusiveLock: Acquired primarily during DDL operations, such as ALTER TABLE, DROP, or TRUNCATE. AccessExclusiveLock blocks all other locking attempts.
  • AccessShareLock: Acquired during UNLOAD, SELECT, UPDATE, or DELETE operations. AccessShareLock blocks only AccessExclusiveLock attempts. AccessShareLock doesn’t block other sessions that are trying to read or write on the table.
  • ShareRowExclusiveLock: Acquired during COPY, INSERT, UPDATE, or DELETE operations. ShareRowExclusiveLock blocks AccessExclusiveLock and other ShareRowExclusiveLock attempts, but doesn’t block AccessShareLock attempts.

Now, imagine a use case where a scheduled BI dashboard (Mode report) runs at 9AM EST daily and relies on a complex long-running SELECT query on my_table.

There is another ETL that starts at 9:15AM EST which creates a new temporary table and then swaps that with my_table using ALTER TABLE RENAME TO. Since AccessShareLock blocks AccessExclusiveLock, the ETL gets blocked until the Mode report is complete.

Now, imagine a second user that manually triggers another Mode report that performs SELECT on my_table. Would they be able to start a query? No.

Since Redshift processes queries in the order they are received, this issue is further exacerbated.

We are stuck with the following order:

  1. The first Mode report’s SELECT is running and holds an AccessShareLock on my_table.
  2. The ETL’s ALTER TABLE RENAME TO is queued, waiting to acquire an AccessExclusiveLock.
  3. The second Mode report’s SELECT is queued behind the AccessExclusiveLock request.

Since the second SELECT came after the AccessExclusiveLock request in the queue, it has to wait for the table rename to finish before it can run, and the rename itself has to wait for the first SELECT. In other words, nobody else can read from that table for the combined duration of the first SELECT and the table rename.

From an engineer’s perspective, this is non-trivial to resolve. The usual workarounds are:

  1. Making the rename step its own Airflow task with a very short timeout (10 seconds) and a large number of retries, so that the AccessExclusiveLock is acquired and released quickly on a best-effort basis (see the sketch after this list)
  2. Changing to a DELETE FROM / INSERT INTO pattern
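
As a rough sketch of option 1 (not our exact implementation: the DAG, connection, table names, and the stock PostgresOperator stand in for our in-house Redshift operator), the swap gets its own task with a tight timeout and generous retries:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator

with DAG(
    dag_id="rebuild_my_table",
    start_date=datetime(2023, 1, 1),
    schedule_interval="15 9 * * *",
    catchup=False,
) as dag:
    # The rename is isolated so the AccessExclusiveLock is either acquired and
    # released quickly, or the task times out and retries on a best-effort basis.
    swap_table = PostgresOperator(
        task_id="swap_my_table",
        postgres_conn_id="redshift",  # assumed connection name
        sql="""
            BEGIN;
            ALTER TABLE my_table RENAME TO my_table_old;
            ALTER TABLE my_table_staging RENAME TO my_table;
            DROP TABLE my_table_old;
            COMMIT;
        """,
        execution_timeout=timedelta(seconds=10),
        retries=30,
        retry_delay=timedelta(minutes=1),
    )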

However, the solutions described above don’t necessarily work for all use cases. For instance, if you have a lot of ETLs with ALTER TABLE RENAME patterns and you switch them to option 1, you are at least doubling the number of Airflow tasks your cluster has to support. How does that affect your DAG SLAs? Does Airflow scale well enough to absorb this?

Finally, if you want to be a data-driven organization, you also need to understand that not everyone working with data is a Data Engineer. It’s unfair to expect every person to know these best practices and follow them religiously.

Since we had no notion of Data Marts at Faire, our warehouse workloads weren’t isolated by business function; each and every request to the warehouse had to be served by the same Redshift cluster, which would only have compounded these issues as we grew.

Scope of the migration

To put it very succinctly, there were two main phases to this migration.

  • 100% of tables migrated to Snowflake with at least 95% data parity with Redshift
  • All data-dependent workflows like Airflow ETLs, Mode reports and ML models migrated to reference Snowflake instead of Redshift

This was an enormous undertaking for the entire data warehouse and needed to be done with zero downtime to limit disruption to both our business and infrastructure. The general approach we took to make this possible was to run the two warehouses, Redshift and Snowflake, in parallel for the entire length of the migration.

This allowed us to gradually migrate data and its consumers and also bought us enough time to perform data parity checks as we moved along the process.

Source data migration

We broadly classified our sources into AWS Kinesis event streams and relational data sources that were synced via Stitch.

Event streams

We specified a cutoff date from when we started to replicate these streams into both Redshift and Snowflake. Kinesis stream ingestion in Snowflake was set up using Snowpipes. We also took full S3 backups of all Kinesis streams up to the cutoff date and loaded them manually into the same Snowflake tables that were used by the Snowpipes. This method ensured that we had both historical and new streaming data available in Snowflake, therefore maintaining simultaneous parity with Redshift.
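
A hedged sketch of this kind of setup (the pipe, stages, table names, and file format below are assumptions, not our exact configuration): a Snowpipe keeps loading newly landed Kinesis files, and a one-off COPY INTO backfills the historical S3 backup into the same table.

import os

import snowflake.connector

# Continuous ingestion of Kinesis events that land in S3 (post-cutoff data).
SNOWPIPE_DDL = """
CREATE PIPE IF NOT EXISTS raw.events.orders_pipe
  AUTO_INGEST = TRUE
AS
  COPY INTO raw.events.orders
  FROM @raw.events.kinesis_stage/orders/
  FILE_FORMAT = (TYPE = 'JSON')
"""

# One-off backfill of the full S3 backup taken up to the cutoff date.
HISTORICAL_BACKFILL = """
COPY INTO raw.events.orders
FROM @raw.events.kinesis_backup_stage/orders/
FILE_FORMAT = (TYPE = 'JSON')
"""

conn = snowflake.connector.connect(
    account=os.environ["SNOWFLAKE_ACCOUNT"],
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
    role="LOADER",
)
with conn.cursor() as cur:
    cur.execute(SNOWPIPE_DDL)         # new events flow in continuously
    cur.execute(HISTORICAL_BACKFILL)  # pre-cutoff events loaded once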

MySQL + integration data models

Stitch was our main data integration tool that was used to sync data from multiple different sources into Redshift. We set up a new Stitch deployment to replicate all these sources separately into Snowflake. This was fairly quick to set up with the Stitch Connect API, which we used to retrieve parameters for all the sources in Redshift and replicated them by performing POST requests to create sources in the Stitch deployment for Snowflake. That way, we didn’t have to create each integration manually in the new deployment.
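
A hedged sketch of that replication script (the endpoint path, authentication, and payload fields are simplified assumptions; the Connect API documentation is the source of truth here):

import requests

# Assumed Connect API endpoint for listing and creating sources.
STITCH_SOURCES_URL = "https://api.stitchdata.com/v4/sources"
REDSHIFT_DEPLOYMENT_TOKEN = "..."   # existing deployment (loads into Redshift)
SNOWFLAKE_DEPLOYMENT_TOKEN = "..."  # new deployment (loads into Snowflake)


def list_sources(token):
    """Retrieve all sources configured in a Stitch deployment."""
    resp = requests.get(
        STITCH_SOURCES_URL, headers={"Authorization": f"Bearer {token}"}
    )
    resp.raise_for_status()
    return resp.json()


def clone_source(source, token):
    """Recreate a source (type, name, properties) in another deployment."""
    payload = {
        "type": source["type"],
        "display_name": source["display_name"],
        "properties": source.get("properties", {}),
    }
    resp = requests.post(
        STITCH_SOURCES_URL, json=payload, headers={"Authorization": f"Bearer {token}"}
    )
    resp.raise_for_status()


for source in list_sources(REDSHIFT_DEPLOYMENT_TOKEN):
    clone_source(source, SNOWFLAKE_DEPLOYMENT_TOKEN)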

Introducing Transformation Layer in the new Snowflake Warehouse

As part of transitioning to Snowflake, we introduced the concept of a transformation layer in our Snowflake warehouse, described below.

This was a fundamental design change from how we had been doing things in Redshift and was paramount in enabling the migration and several of our future use cases.

In this section, we describe how and why the notion of a transformation layer is fundamental in new-age data engineering practices and why every organization should consider implementing one.

What is the Transformation Layer?

  • A logical separation between the RAW and ANALYTICS database layers (sketched below)
  • Comprised of Airflow hooks with Snowflake tags
  • Airflow staging DAGs that maintain all sorts of transformations
  • Enables us to introduce PII masking and to restrict access to certain schemas
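
To make the layering concrete, here is a minimal, hypothetical sketch of the pattern: downstream models read from a staging view maintained by an Airflow staging DAG, never directly from RAW (all object and column names are illustrative).

# View maintained by a staging DAG; ANALYTICS models reference this view,
# so changes in RAW can be absorbed here without touching downstream ETLs.
STAGING_VIEW_SQL = """
CREATE OR REPLACE VIEW analytics.staging.orders AS
SELECT
    id,
    retailer_id,
    brand_id,
    CONVERT_TIMEZONE('UTC', created_at) AS created_at_utc
FROM raw.stitch_mysql.orders
"""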

Managing things you can’t control — third party tools

When it comes to controlling the behavior of external tools, there is usually little you can do other than dealing with them strategically. During the migration, we ran into a number of challenges with such use cases. This is not to call out any particular third-party tool; it’s just that integrations across different systems work in different ways.

Column renames in source data

It’s not uncommon for source table columns to get renamed over the course of their lifecycle. However, migrating all downstream queries to reference new columns is tedious. Moreover, your downstream ETLs might still need to reference data that was written under the previous column name. If they also have to start referencing the new column after a cutoff date, that becomes a nightmare to maintain.

Do you now introduce COALESCE or CASE statements in your ETLs to build the right referencing logic? Or do you simply make that change in the Transformation Layer and keep things transparent to the downstream, as if nothing ever changed? Well, the latter is the answer and you know that.

We have found this to be a very powerful concept that enables many possibilities with the least amount of friction on downstream ETLs. Imagine that your company is planning to move to a different third-party or open-source data extraction platform and you encounter cases where landed data has slightly different formats. How would you mitigate those differences without migrating all downstream ETLs? Transformation layer to the rescue!
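
As a hedged example of the column-rename case (the table and column names are made up), the staging view absorbs the rename so downstream queries keep referencing a single, stable column:

# Hypothetical scenario: the source renamed `query` to `search_query` at some
# cutoff date. The staging view reconciles both, so downstream ETLs never
# need to know the rename happened.
STAGING_VIEW_SQL = """
CREATE OR REPLACE VIEW analytics.staging.brand_impressions AS
SELECT
    id,
    COALESCE(search_query, query) AS search_query,  -- older rows only populate `query`
    requested_at
FROM raw.events.brand_impressions
"""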

Future-proofing the warehouse

In the coming months, this transformation layer enabled us to implement several enhancements that were completely transparent to our downstream users. Here, we briefly discuss Snowflake object tagging and how we are leveraging that to enable PII masking in our entire data warehouse.

Snowflake’s PII masking at the source

As part of our data governance initiatives, we wanted to enable PII masking on the entire data warehouse. This can get prohibitively difficult to manage if you start tracking down each and every downstream table column to perform masking on.

Our transformation layer combined with Snowflake’s object tagging made this task quite achievable.

We created a metadata registry of tags for our production tables specifying whether each column was PII or not. Each time our staging layer is updated, we read these tags from the metadata registry and apply them to each column of source tables in RAW.

Once our source tables are tagged, we use them to propagate these tags to ANALYTICS views and tables that reference source RAW tables.

This fundamental design allows us to achieve PII masking for the entire warehouse by just enabling the masking policy on the tag. Any downstream queries reading from PII tagged columns automatically start seeing masked column values if their role doesn’t allow them access, ensuring tight data governance across the entire warehouse and any data sourced from it.
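
In Snowflake terms, the moving pieces look roughly like the following (a hedged sketch; our actual roles, policies, and tag values are more involved):

# Hedged sketch of tag-based masking: create a PII tag, attach a masking
# policy to it, and tag columns in RAW from the metadata registry. Any role
# not allowed to see PII then gets masked values everywhere the tag applies.
PII_MASKING_SETUP = """
CREATE TAG IF NOT EXISTS governance.tags.pii;

CREATE MASKING POLICY IF NOT EXISTS governance.policies.mask_pii
  AS (val STRING) RETURNS STRING ->
  CASE
    WHEN CURRENT_ROLE() IN ('PII_READER') THEN val
    ELSE '***MASKED***'
  END;

ALTER TAG governance.tags.pii SET MASKING POLICY governance.policies.mask_pii;

-- applied per column from the metadata registry whenever staging is updated
ALTER TABLE raw.stitch_mysql.users
  MODIFY COLUMN email SET TAG governance.tags.pii = 'email';
"""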

Maintaining parity between the two warehouses

The core migration spanned roughly nine months, which is a fairly long time to maintain two parallel warehouses at scale. In this section, we explain how we managed to maintain this redundancy without introducing overhead for our stakeholders.

Parallel branch off master

We maintained two parallel branches during the migration process: master, the main branch used to deploy code to the Airflow cluster for Redshift, and a snowflake branch for Snowflake’s Airflow cluster.

We set up GitHub Actions to automate merges to the snowflake branch whenever PRs were merged into master. The action would also replace Redshift Airflow operators with Snowflake ones and handle any other Snowflake-specific parsing of the diff files.

The general workflow is outlined below:

  • PR merged to master
  • Automatic PR raised against the snowflake branch, with diff files parsed and modified to ensure Airflow operator compatibility. For example, the action would replace RedshiftOperatorFaire with SnowflakeOperatorFaire (a stripped-down sketch of this step follows the list)
  • Ensure that new DAGs deployed in the Snowflake Airflow cluster are turned on
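
The substitution itself is plain text rewriting of the changed files; a stripped-down sketch of what such an action step can run (the file handling and the list of replacements here are illustrative):

import pathlib
import sys

# Redshift-specific names and their Snowflake equivalents in DAG code.
REPLACEMENTS = {
    "RedshiftOperatorFaire": "SnowflakeOperatorFaire",
    # ...other Snowflake-specific substitutions would go here
}

# Changed files are passed in by the GitHub Action.
for path in sys.argv[1:]:
    dag_file = pathlib.Path(path)
    source = dag_file.read_text()
    for redshift_name, snowflake_name in REPLACEMENTS.items():
        source = source.replace(redshift_name, snowflake_name)
    dag_file.write_text(source)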

This strategy was crucial in maintaining data parity and will become clearer in the next few sections.

SQL Query parser

Although both Redshift and Snowflake are ANSI SQL compatible, there are still quite a few subtle differences between the two. For example, Redshift’s DATEPART function is DATE_PART in Snowflake.

Similarly, JSON parsing between the two is quite different, and much simpler in Snowflake.

Redshift:

JSON_EXTRACT_PATH_TEXT(a.last_touch_brand_impression_request, 'query') AS search_string

Snowflake:

a.last_touch_brand_impression_request:query AS search_string

During a good chunk of the migration period, our source of truth was still Redshift while we were moving data to Snowflake, which meant all code development was still happening in the Redshift dialect.

To maintain SQL compatibility with Snowflake, we wrote a custom parser that translates Redshift SQL into Snowflake SQL and leveraged it in our custom Snowflake Airflow hooks behind a flag, parse_from_redshift=True. This let us translate Redshift-style SQL for Snowflake at runtime while only maintaining the Redshift version in code. With this approach, we were able to simultaneously run our ETLs in two separate Airflow clusters, one for Redshift and the other for Snowflake.
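
A hedged sketch of the idea (the real parser handles far more dialect differences than the two shown here, and the hook wiring is simplified):

import re


def parse_redshift_to_snowflake(sql):
    """Translate a couple of Redshift-isms into Snowflake SQL (illustrative only).

    Our custom Snowflake Airflow hook applies this kind of translation at
    runtime when parse_from_redshift=True.
    """
    # DATEPART(...) -> DATE_PART(...)
    sql = re.sub(r"\bDATEPART\s*\(", "DATE_PART(", sql, flags=re.IGNORECASE)
    # JSON_EXTRACT_PATH_TEXT(col, 'key') -> col:key
    sql = re.sub(
        r"JSON_EXTRACT_PATH_TEXT\(\s*([\w.]+)\s*,\s*'(\w+)'\s*\)",
        r"\1:\2",
        sql,
        flags=re.IGNORECASE,
    )
    return sql


print(parse_redshift_to_snowflake(
    "SELECT JSON_EXTRACT_PATH_TEXT(a.last_touch_brand_impression_request, 'query') AS search_string"
))
# -> SELECT a.last_touch_brand_impression_request:query AS search_string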

We are open-sourcing some of these utils in the snowflake-migration repository linked at the end of this article.

Data Model Equivalency Testing (Data Diff)

Let’s reiterate how far we have come with this migration. We have:

  • Migrated source data to Snowflake and ensured this is continuously replicated
  • Introduced a new Transformation Layer to keep structural changes transparent to downstream
  • Maintained ETL code parity between the two warehouses in parallel

Now that we have established a parallel data warehouse with all data pipelines running through a separate Airflow cluster for Snowflake, how do we ensure that data landing in Snowflake is at parity with data in Redshift?

We decided to use a third-party tool called Datafold to perform data parity checks between tables in Redshift and Snowflake (huge shoutout to the folks at Datafold for supporting us through the migration process!).

This was a bit more complicated than we initially expected, as there was no way to perform joins across the two different warehouses. As a solution, we created a tool that would dump a Redshift table into S3 and then load that file into Snowflake using its COPY INTO syntax.
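
Per table, the transfer boiled down to two statements, roughly like this (a hedged sketch; the bucket, IAM role, stage, and table names are placeholders):

# Run against Redshift: dump the table to S3 as Parquet.
REDSHIFT_UNLOAD = """
UNLOAD ('SELECT * FROM analytics.my_table')
TO 's3://my-datadiff-bucket/my_table/'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-unload'
FORMAT AS PARQUET
ALLOWOVERWRITE
"""

# Run against Snowflake: load the dump next to the natively built table so
# Datafold can diff the two within one warehouse.
SNOWFLAKE_COPY = """
COPY INTO datadiff.redshift_copies.my_table
FROM @datadiff.redshift_copies.s3_stage/my_table/
FILE_FORMAT = (TYPE = 'PARQUET')
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
"""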

This proved to be a very tedious process:

  1. We were running all ETLs in parallel in two warehouses, which meant that a data diff run yesterday wasn’t necessarily valid today
  2. We needed to crowdsource data validation for non-core tables to expedite data migration. It wasn’t scalable for data engineers to perform data validation for all tables in the warehouse

We resolved (1) by automating the entirety of the data-diffing process via a combination of an Airflow DAG, the Datafold API, and the S3 transfer process described earlier.

To resolve (2), we incorporated simple YAML-based configs that could be defined in the DataDiff framework. This allowed anyone in the company, such as business analysts and data scientists, to define their source Redshift and Snowflake tables for data parity checks (a hypothetical example follows).
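
A hypothetical illustration of such a config and how the framework might consume it (our real schema and fields differed):

import yaml

# Anyone in the company could declare a table pair to compare; the framework
# scheduled the diff and surfaced the results in the parity dashboard.
EXAMPLE_CONFIG = """
table_pairs:
  - redshift_table: analytics.orders
    snowflake_table: analytics.public.orders
    primary_key: id
    owner: business-analytics
"""

config = yaml.safe_load(EXAMPLE_CONFIG)
for pair in config["table_pairs"]:
    print(f"diffing {pair['redshift_table']} against {pair['snowflake_table']}")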

Since all data parity metrics, such as parity percentage, were surfaced in a Mode dashboard, including Datafold report links, following through and ensuring data parity for your own data models became super easy across the entire company.

This crowdsourcing empowered our entire data organization to investigate their own data models and get to the bottom of any data discrepancies they encountered.

Automated data parity checks allowed us to easily track our goal of having all data models maintain a parity of > 95% right before the cutoff date.

BI Layer (Mode Analytics) migration

Our BI layer is built using a third-party tool called Mode Analytics. We had more than 350 core Mode reports that needed to be migrated. This wasn’t a matter of simply pointing reports at Snowflake instead of Redshift.

Besides converting Redshift style SQL to Snowflake, we discovered that chart column references in Mode had to be uppercase in order to work with Snowflake.

The process to migrate a single Mode report had multiple steps.

  • Clone report
  • Update all query connections
  • Fix all chart column references (upper case)
  • Get final approval from report owner

We realized this was a manual and tedious process, so one of our platform engineers started searching for a solution based on Mode’s API. After some back-and-forth with the API, we realized that although it was fairly easy to clone the report, API support and documentation were lacking for use cases, such as converting chart references within the chart objects.

Chrome Dev Tools to the rescue

After confirming the lack of support in the documented API, our platform engineer tried his luck with Chrome Dev Tools and managed to reverse engineer a good chunk of the undocumented API to fix chart column references.

We ended up building some tooling around this and migrated all core Mode reports in bulk, leveraging the reverse-engineered APIs.

For future Mode reports, we also created a self-serve process using GitHub Actions that would receive a Mode report token as input and take care of cloning the report, replacing Redshift-specific SQL with Snowflake SQL, converting chart column references to uppercase, and performing a dry run of the cloned report. Once approved by the report owner, the old report would be archived.

Quality-of-life improvements post-migration

Query tags in Snowflake

Snowflake supports query tagging, which is a simple but powerful feature when used correctly. We have enabled query tagging across all the ways Snowflake is queried in our infrastructure, including Airflow operators (via Airflow hooks), Mode reports, SageMaker, and ML apps. This has allowed us to track query metrics such as cost and runtime at a very granular level. Here is an example of a query tag construct from our Airflow hook:

"QUERY_TAG": f"""
{{
"system": "airflow",
"dag": "{query_tag.get("dag_name")}",
"task": "{query_tag.get("task_name")}",
"owner":"{query_tag.get("owner")}",
"execution_date": "{query_tag.get("execution_date")}",
}}
"""

Just by being thoughtful about our design, we can find the most costly DAGs and tasks in Airflow by simply querying information_schema.query_history in Snowflake.
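
For example, a query along the following lines (hedged; the retention window and cost attribution are simplified) surfaces the most expensive Airflow DAGs by total execution time:

# Aggregate Snowflake query history by the DAG name embedded in QUERY_TAG.
COSTLY_DAGS_SQL = """
SELECT
    TRY_PARSE_JSON(query_tag):dag::STRING AS dag,
    SUM(total_elapsed_time) / 1000 AS total_seconds
FROM TABLE(information_schema.query_history(result_limit => 10000))
WHERE TRY_PARSE_JSON(query_tag):system::STRING = 'airflow'
GROUP BY 1
ORDER BY 2 DESC
LIMIT 20
"""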

Transformation Layer goes a long way

We cannot emphasize enough the benefits of having a transformation layer between your raw and analytics databases. If you have one, that’s great! If you don’t, we highly recommend our readers prioritize this work for the future maintainability and pluggability of your data warehouse.

Write task-specific Airflow operators

Ever since moving to Snowflake, we have emphasized task-specific Airflow operators over generic ones. A simple Snowflake operator that can run any query sounds easy to maintain at first, but it doesn’t implicitly enforce best practices (users can do whatever they want). In contrast, having specific rebuild and amend table/schema operators ensures SQL consistency by limiting user-provided SQL to CTEs. This prevents anti-patterns such as using ad-hoc, user-defined date columns instead of an org-wide date column, which keeps future operations, such as backfilling date-partitioned snapshots, consistent.

If you can control data model creation within your Airflow operator, do it and limit user-defined SQL to CTEs. This enables QoL use cases such as SQL validation without any side effects.
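
A hedged sketch of what such an operator can look like (heavily simplified; our production operators also handle schemas, clustering, and validation):

# The user supplies only CTEs (ending in one named `final` by convention here);
# the operator owns the final SELECT, the org-wide date column, and the rebuild,
# so every model is created the same way.
class RebuildTableOperatorFaire:  # simplified stand-in, not our production operator
    def __init__(self, table, cte_sql, date_column="ds"):
        self.table = table
        self.cte_sql = cte_sql          # user-provided CTEs only, no DDL/DML
        self.date_column = date_column  # org-wide date column enforced here

    def render_sql(self):
        return f"""
        CREATE OR REPLACE TABLE {self.table} AS
        WITH {self.cte_sql}
        SELECT *, CURRENT_DATE() AS {self.date_column}
        FROM final
        """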

Project your raw data using views with correct clustering

We have learned that having views on top of your raw data does not drastically affect your view performance in Snowflake, even for large tables, as long as you use the right clustering keys on the source table. Having views on RAW also eliminates the need for backfilling in case of an outage.

We use views (that also do transformations) on top of raw database for both event data loaded via Snowpipes, and for Stitch loaded data.

We have enforced clustering on these raw tables, and simply choosing the right cluster key (usually a date column in our case) has been enough to ensure adequate performance for our needs.
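
In practice this amounts to a clustering key on the raw table and a plain (transforming) view over it, roughly like this hedged example with illustrative names:

# Cluster the raw table on its date column; queries through the view that
# filter on the date benefit from pruning without materializing anything.
CLUSTER_AND_VIEW_SQL = """
ALTER TABLE raw.events.orders CLUSTER BY (TO_DATE(received_at));

CREATE OR REPLACE VIEW analytics.staging.orders AS
SELECT
    payload:id::NUMBER          AS id,
    payload:retailer_id::NUMBER AS retailer_id,
    TO_DATE(received_at)        AS order_date
FROM raw.events.orders
"""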

Key learnings

Getting company-wide buy-in

If you are planning a data warehouse migration, ensure that your entire company is on-board, not just your data organization. Data migration is an extremely complicated task that, if not planned right, can wreak havoc with your data customers.

Migration work is a multi-team collaborative process, and therefore, teams that work with data, be it visual analytics or ELTs, should incorporate related work in their OKRs.

Enable self-serve processes where possible

As much as this is a technical process, it serves your teams well if you can automate and make processes self-serve (through documentation or tooling) wherever possible. This also lets you get help by crowdsourcing some of the migration work. Two of the biggest examples in our case were data validation between the two warehouses and Mode report migration via GitHub Actions.

Final thoughts

With careful planning and work, we were able to make a full cutover from Redshift to Snowflake in around nine months, including all ETLs, BI tools, machine learning, and offline and online feature serving, with all data models validated against Redshift.

We have been very pleased with Snowflake’s performance overall. Scaling Snowflake for our needs has been fairly simple, and we’ve seen a significant improvement in some of our metrics, such as query queuing times and Airflow job runtimes.

Today, our data infrastructure serves an average of over 4 million ETL queries per month, and our feature store calculations power 100 million online predictions per day, all while maintaining strict data SLA requirements across our entire data infrastructure.

Curious about what it’s like to work for a world-class Data Org? Learn more about our open roles.

Repository containing some open-source utilities related to the Snowflake migration: https://github.com/Faire/snowflake-migration

Acknowledgements

This was an incredible team effort and we are thankful for all the Faire folk that made this endeavor possible, including but not limited to the following:

Jonathan Medwig for his incredible work, direction and planning of the entire migration, cost-benefit analysis and extensive design and technical guidance.

Ivan Rudovol for his amazing work on ETL migration, ML apps migration, streamlining migration workflows and making a lot of processes self-served.

Ben Poland for hacking his way through the Mode API and making Mode reports migrations possible.

Ben Thompson and Alex Jang for their work on custom Airflow operators and enabling our stakeholders to migrate their Mode reports.

Eason Wang for his extensive work on the snowflake parser and other internal tooling.

Matthew Garrelick for helping us migrate Finance ETLs.

Marc Kelechava for his help with migrating search and discovery ETLs.

Venkat Gandur and Chen Peng for supporting this migration on the organization level.

View all articles from the Faire Data Science team here.
