At early stage tech companies, the application database is often the richest source of data available for analysis. Getting this data into a data warehouse like Redshift enables your analysts to examine nearly every part of the business. Unfortunately, while this data is rich in meaning, it is also usually chaotic. Records are regularly deleted and updated as dealing with business edge cases takes priority over having the cleanest possible database.
A common example of this is order status. When an item is shipped out of your warehouse, or a customer service representative cancels an order, or a developer runs a mass update to reconcile a bug, records get updated in-place. While this makes it easy to update orders in day-to-day operations, it also robs you of the crucial ability to look at historical data.
This is a really big problem! Without this data, there are entire areas of your business that you simply cannot explore. Luckily, there are some ways to deal with it.
First, you can ask your developer team to create historical tables, restrict access, refine practices, and so on. This ends up being very difficult to do effectively. There are always more edge cases that require editing the database. You lose the benefit of the “just update the database” quick fix. And, most significantly, it requires building a massive table in the production database for analytics purposes only.
Fortunately for us, there is an alternative: we can create a historical record right in the warehouse. In the rest of this post, we’ll explore a simple example and look at how to recreate this historical data.
The rest of this post uses Postgres/Redshift-compatible SQL.
Let’s look at some data to clarify the problem. We have an
orders table, which contains information about orders that have been placed, and what stage of processing they are in:
If we look at the first order (id 1), we can see that it was placed on July 8 and completed on July 17 (when the order was last updated). We can infer that this order was shipped at some point between July 8 and July 17, but we don’t know exactly when!
What we need is a table that looks like this:
Here we can see the historical statuses, and the associated dates. Order 1 was placed on July 8, shipped on July 10, and marked completed on July 17.
null for that last record, meaning that this is the current version of this record in the production data. This gives us a complete view of how these orders have progressed. So, how do we get this in the warehouse?
It’s totally possible to create this intermediate table. We need a query that identifies changed records. Then, for each record that has changed, we need to insert a new row into the log table, and update the previous “current” row to make it not the current row any longer.
I’ve come up with a monster query to generate & update the desired table. Brace yourself, the SQL to accomplish this is somewhat nasty. (Don’t fret too much over how this SQL works, there’s a better way to be found below.)
If you run this whole thing periodically, it will keep updating the
orders_logtable over time with more and more historical data. But, it’s painful to run and manage this query. Each time the schema changes in the source data, we’ll have to update it. Each additional log table is equally burdensome to set up and maintain. Fortunately, there’s a simpler way…
Enter dbt archive
I’m one of the developers of an open source data modeling and warehouse management tool called dbt that does all this out of the box.
All we do is create a dbt project, and specify some YAML configuration:
Now, you can simply run:$ dbt archive
to generate the log table. You can also see it in action below.
Not only does this accomplish everything we laid out here, it also handles schema changes, and can support additional tables easily.
Of course, historical data will only be available from the date you begin running this process. So, if you have frequently changing tables in your warehouse, you should set up this process sooner rather than later!