You may have come across "Type 2 Slowly Changing Dimensions" or this idea of "change tracking" within your data warehouse. Regardless of the nomenclature, all data analysts have faced the same problem: how do I model data when the record in the data warehouse keeps changing? In this post, I'll cover how to solve this using dbt snapshots.
A tale of two data types
Source data tables generally come in two different flavors:
- Mutable: Records are updated in place over time. A typical example is an `orders` table, where the `status` column changes as the order is processed.
- Immutable: Once a record is created, it is never updated again. A typical example is clickstream data, like a `link_clicked` event. Once that event is recorded, it won't be updated again.
Applications often store data in mutable tables. The engineers that design these applications typically want to read and modify the current state of a row – recording and managing the historical values for every row in a database is extra work that costs brain power and CPU cycles.
For analysts, however, mutable data can be difficult to work with. It’s best explained with an example.
Imagine you have an `orders` table where the `status` field can be overwritten as the order is processed. One day you query the table and get back a result like this:

| order_id | status  | updated_at |
|----------|---------|------------|
| 1        | pending | 2019-01-01 |

A day later, that order goes from "pending" to "shipped". Now, the same query gives you back the following:

| order_id | status  | updated_at |
|----------|---------|------------|
| 1        | shipped | 2019-01-02 |
Since this record was mutated, the information about the order in the pending state has been overwritten. Straight away, you’ve lost the ability to answer questions like “when did that order ship?”, and “how long did it take to change status?”.
If you’ve ever had a KPI for last week change out from under you, or if a member of the finance team tells you that the values they exported to Excel no longer match what your BI tool is saying, you’re probably experiencing the pain of mutable data. Working with mutable data can make you feel like you’re building your analytics on top of quicksand.
You realize that all your problems would be solved if there were a way to turn your order data into immutable records, tracking the different values in your table over time. You ask your API team, "can we build an `order_history` table in the backend?", but get told you'll have to wait months before they can look into it.
Fortunately, this is where dbt snapshots come in – they make it possible to generate a stream of record changes from a source table, effectively turning mutable data into immutable data. As an analyst using dbt, you can add snapshots to your dbt project in an afternoon, without needing to wait on engineering resources from your core product team.
How do dbt snapshots work?
Snapshots are simple `select` statements which define a dataset. Every time you run the `dbt snapshot` command, dbt will run your select statement and check whether the dataset has changed compared to its last known state. If a record has changed, dbt will create a new entry in the snapshot table for the new state of the record. If there are net-new records in the source dataset, dbt will create an initial entry for each of them.
Behind the scenes, dbt runs all the appropriate DDL and DML to insert, update, and merge records into your snapshot table. If new columns appear in your source query, then dbt will add these columns to your snapshot table.
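A snapshot lives in its own file in your project's `snapshots/` directory. Here's a minimal sketch of one, assuming an illustrative source named `jaffle_shop.orders` with `id` and `updated_at` columns:

```sql
{% snapshot orders_snapshot %}

{{
    config(
      target_schema='snapshots',
      unique_key='id',
      strategy='timestamp',
      updated_at='updated_at'
    )
}}

-- Any valid select statement can go here
select * from {{ source('jaffle_shop', 'orders') }}

{% endsnapshot %}
```

Running `dbt snapshot` builds (and thereafter incrementally updates) a table called `orders_snapshot` in the `snapshots` schema.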
The particular mechanism for tracking changes that dbt uses is often referred to as "Type 2 Slowly Changing Dimensions".
What the heck is a Type 2 Slowly Changing Dimension?
If you're already familiar with the term, jump ahead! If the term is new to you, or if you've always wondered what it meant, let's work through it step by step:
- "Dimension", in Kimball parlance, is a data model where a single record represents a person, place, or thing – customers, products, locations, etc. This is in contrast to a "fact" which represents a process – orders, web sessions, transactions, etc.
- Dimensions are often considered "Slowly Changing" because their details update unpredictably and over a long period of time – a customer changes their address, a product is renamed, or an agent is assigned to a new team. Facts, on the other hand, tend to have a point in time at which they are no longer updated. Once an order is received, the order won't be updated again; once a user leaves her computer, the web session is finished.
- "Type 2" because there are a number of ways to handle mutable data, and the approach that dbt uses (i.e. inserting changed records, with `valid_from` and `valid_to` metadata columns) has been arbitrarily defined as the second type – you can check out the other ways of handling mutable data here.
Tracking data changes
dbt ships with two different strategies for determining if rows have changed in a table. In both cases, dbt relies on a provided `unique_key` to match rows in your snapshot query to rows in your snapshot table. Each strategy is examined in further detail below.
The `timestamp` strategy uses an `updated_at` field to determine if a row has changed. When you run the `dbt snapshot` command, dbt checks whether the `updated_at` column for a row is more recent than the last time the snapshot ran. If it is, dbt will invalidate the old record and insert the new one. If the timestamps are unchanged, dbt will not take any action, as the row has presumably not changed.
Some data sources do not include a reliable timestamp that indicates when a record has changed. If this is the case for your data, you can instead use the `check` strategy. This strategy works by comparing the values of a list of columns between their current and snapshotted values. If any of these column values have changed, then dbt will invalidate the old record and insert the new one. If the column values are identical, then dbt will not take any action.
What should I snapshot?
Snapshots, like models, are defined as select queries. You can use `ref` in these queries like you would in any model query. With this flexibility, you can snapshot effectively any part of your dbt DAG.
Snapshots should almost always be run against source tables. Your models should then select from these snapshots, using the `ref` function. As much as possible, snapshot your source data in its raw form and use downstream models to clean up the data. This is an important rule, but it's not one that dbt enforces.
So, why run snapshots against source tables? Why not snapshot the results of models?
Snapshots, by their very nature, are not idempotent. The results of a snapshot operation will vary depending on whether you run `dbt snapshot` once per hour or once per day. Further, there's no way to go back in time and re-snapshot historical data. Once a source record has been mutated, the previous state of that record is effectively lost forever. By snapshotting your sources, you maximize the amount of data that you track, and in turn, maximize your modeling optionality.
By contrast, a snapshot that runs on top of a model will record changes to “transformed” data. If the logic in a model upstream of a snapshot changes, then the snapshot operation will record different data. In some cases this can be desirable, but for typical modeling use cases, it presents an opportunity for data loss. Logic errors in data models are inevitable, but dbt works hard to make sure that you can fix these bugs and rerun your dbt project. This isn’t the case if you have a snapshot in the middle of your DAG: that bug will likely result in data loss.
Getting started with snapshots
Snapshots are so powerful because they generate new information. The earlier you add snapshots to your dbt project, the more you can make use of their power. The docs are the best place to go for all the info about adding snapshots to your dbt project – make sure you check out the section on “Snapshot queries” to understand the best way to set up your snapshot.
With great power comes great responsibility. The usefulness of snapshots relies on them being run regularly and reliably – once you've added snapshots to your project, make sure you have a way to run the `dbt snapshot` command on a schedule (it's a good idea to schedule this job separately from your `dbt run`). Further, ensure that you have monitoring and alerting set up so you find out quickly if something goes wrong. We use dbt Cloud to schedule our snapshots and leverage Cloud's built-in email and Slack notifications for alerting.
If you have mutable data sources in your dbt project, you should make the time to set up snapshots – it only takes an afternoon! Remember:
The best time to start snapshotting your data was twenty years ago.
The second best time is today.
Already added snapshots to your dbt project? Check out our tips on building models on top of snapshots!