Each week, more data teams use dbt to solve more problems. With this growth, new members are joining the dbt community from an ever wider range of backgrounds in data transformation.

Since we first started building dbt, we’ve been very opinionated about our product design philosophy. We have strong opinions about how the overall data stack that dbt fits into should be constructed, as well as about dbt’s role within that stack. As a result, dbt is not a generic data transformation tool — we’ve built it to be extremely good at some things and unsuited to others.

We believe that having strong opinions makes dbt better: constraints are what make products great. And it seems that there are a lot of people who agree with our opinions, given dbt’s organic growth over the past several years.

As dbt is adopted by a wider community, we want to be extremely clear about when dbt is, and is not, a fit for your project. We believe that dbt is the right tool for your data transformations if:

  1. Your analytics warehouse[1] runs on separate hardware from the database used by your product;
  2. A copy of your source data is already loaded into your analytics warehouse;
  3. Your data transformation logic can be expressed in SQL;
  4. Your data can be transformed in batch; and
  5. You write idempotent data transformations.

If you found yourself nodding your head at each one of those points, great! dbt is likely a good fit for your project. If not, implementing dbt will likely feel like paddling upstream.

An explanation of each of these assumptions, and the reasons we believe them to be useful, follows.

dbt’s Assumptions

1. Your analytics warehouse runs on separate hardware from the database used by your product

By using a warehouse separate from your production database, data teams are empowered to work independently of the engineers focused on product development. The data team should own this warehouse, so that they can create new users, grant privileges, and create objects without being blocked by other teams. Each analyst should have a set of user credentials with permission to build objects in the warehouse and to query source and transformed data.
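As a sketch of what that ownership can look like, assuming a Postgres-style warehouse where source data lives in a hypothetical `raw` schema (all role, schema, and password values here are placeholders, not prescribed names):

```sql
-- Hypothetical Postgres-style setup, run by the data team as warehouse owner.
-- One such role (or one per analyst) can query source data and build objects.
create role analyst with login password 'change_me';

-- Read access to the already-loaded source data:
grant usage on schema raw to analyst;
grant select on all tables in schema raw to analyst;

-- A schema the data team owns, where transformed models are built:
create schema analytics;
grant all on schema analytics to analyst;
```

Because the data team controls these grants, adding a new analyst or a new schema never requires a ticket to the engineers running the production database.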

2. A copy of your source data is already in your analytics warehouse

dbt is not an ETL tool. As Tristan explains:

[dbt] doesn’t extract or load data, but it’s extremely good at transforming data that’s already loaded into your warehouse. This “transform after load” architecture is becoming known as ELT (extract, load, transform). dbt is the T in ELT.

This means that dbt expects that the data you wish to transform (i.e. your source data) is already in your warehouse[2]. dbt, as a product, is agnostic to how you get your source data into your warehouse; however, we generally recommend using an off-the-shelf tool such as Stitch or Fivetran to do this (see more here).

dbt builds transformed data models in the same warehouse as your source data. Data should not move between warehouses in the process of data transformation.
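A dbt model is, at its core, a SELECT statement over data that is already in the warehouse. As an illustrative sketch — the `raw.orders` source table and its columns are hypothetical, standing in for whatever your loader ships into the warehouse:

```sql
-- models/orders.sql — a dbt model: a SELECT over already-loaded source data.
-- dbt reads from and writes to the same warehouse; no data leaves it.
select
    id as order_id,
    customer_id,
    amount_cents / 100.0 as amount,
    created_at::date as order_date
from raw.orders
where status != 'cancelled'
```

The transformed relation this model builds lands right next to the source data, which is what makes the "transform after load" architecture work.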

3. Your data transformation logic can be expressed in SQL

dbt uses SQL to define data transformations, and exclusively works with data warehouses that use SQL as their query language[3]. The reasoning behind this is probably best expressed in this post:

Don’t move data to code, move code to data.

If your transformations cannot be expressed in SQL, they should happen outside of dbt. For example, getting weights from a lead scoring model is typically a data transformation best accomplished using a programming language like Python. dbt can happily sit in a data stack alongside a separate tool that performs such transformations, while dbt performs the SQL-friendly ones.

4. Your data can be transformed in batch

dbt is a batch processing tool — a production deployment of dbt will run data transformations on a schedule. It is not a streaming transformation tool.

As a result, there will be latency in your transformed data based on the frequency of your dbt runs (and the frequency with which source data are ingested!). While lower latency is generally a good thing, we believe that batch processing data is a worthwhile tradeoff in analytics:

  • The entire modern analytics stack that dbt participates in is built around batch-based processing.
    Stitch, Fivetran, Snowflake, BigQuery, Looker, Mode, etc.—they’re all targeting a batch-based use case. This may evolve over time, but that’s the current state of the industry.
  • Most analytics use cases do not need real-time data.
Decisions based on analytics data can often be made on top of data that are up to a day old — for example, a marketing team may review ad performance weekly to decide which channels they will focus on in the upcoming week. Even Spotify tolerates up to 24 hours of delay in their data platform!

You can push dbt to perform transformations that are closer to real time. As with anything in programming, such a decision comes with tradeoffs, so it’s important to find the balance to best serve your needs. For instance, materializing your models as views removes all of the latency associated with dbt batch processing, but may introduce unacceptable latency at query-time.
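For example, dbt lets you choose a materialization per model via its `config` block. A sketch of the view-based tradeoff described above (the `raw.events` source table and its columns are hypothetical):

```sql
-- models/recent_events.sql
-- Materialized as a view: queries always see the latest source data, so the
-- batch-latency of scheduled dbt runs disappears — but the transformation is
-- recomputed on every query, which may be too slow for heavy logic.
{{ config(materialized='view') }}

select *
from raw.events
where occurred_at > current_timestamp - interval '1 day'
```

Switching `materialized='view'` to `materialized='table'` flips the tradeoff: fast queries against precomputed results, at the cost of data only being as fresh as the last dbt run.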

5. Your data transformations are idempotent

In computing, an operation is idempotent when calling it multiple times does not have a different result from calling it once. A real-life example of idempotence is a crosswalk button — pressing the button once, or a dozen times, has the same effect.

Applying the concept of idempotence to data transformations can be a little difficult, since the inputs (in this case, your source data) are always changing. In data transformations, idempotence is demonstrated in the following scenarios:

  • If your source data were to stop updating, successive runs of your transformations would still result in the same tables and views in your warehouse.
  • If your production deployment of your transformations were interrupted, the next run of the transformations would result in the same tables and views as if the deployment had not been interrupted.
  • If you manually triggered transformations between scheduled runs, the scheduled run would result in the same tables and views as if the manual runs had not been triggered.
  • If the objects that your data transformations build were dropped, you could rebuild them from scratch.

dbt is built on the assumption that your data transformations are idempotent — this is why models are defined as SELECT statements, while dbt handles wrapping the query in the appropriate data definition language (DDL) to create or update the relations in your warehouse (read more here). You don’t need to write CREATE, INSERT, or MERGE statements when using dbt.
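To make that contract concrete, here is a sketch: you write only the SELECT, and dbt generates DDL roughly like the commented statement below. (The exact SQL varies by warehouse, materialization, and configuration; the model, schema, and column names here are hypothetical.)

```sql
-- models/customers.sql — what you write: just a SELECT.
-- {{ ref('orders') }} resolves to the already-built orders model.
select
    customer_id,
    min(order_date) as first_order_date
from {{ ref('orders') }}
group by 1

-- What dbt runs for a table materialization is roughly:
--
--   create table analytics.customers as (
--       select customer_id, min(order_date) as first_order_date
--       from analytics.orders
--       group by 1
--   );
--
-- Re-running this replaces the table with the same result — idempotent.
```

Because the generated DDL rebuilds the relation from source each time, running the transformation once or ten times leaves the warehouse in the same state.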

There are acceptable exceptions to this assumption. For example:

  • Writing to a table each time you build a model (as demonstrated here)
  • Including a _dbt_updated_at field in your tables, populated from the current timestamp when dbt executes (and therefore different on each run).

In both of these cases, these transformations are creating metadata about your transformations, as opposed to being the transformations themselves.

Even if the data transformations you have today are not idempotent, you can often redesign them so that they are. We highly recommend it.

What if these assumptions don’t work for my project?

dbt may not be the right tool for this particular project! Before declaring a total loss, though, it may be worth doing a gut check: there are 450 companies today who believe that this approach works great, and their transformation use cases are incredibly diverse and demanding.

dbt is definitely a different approach to data transformation, and it may feel a little unusual when you first get started. We encourage those who are just picking it up to find ways to work with it instead of attempting to hack around it: you may find that your code looks a little different than you’re used to, but we think you’ll be happier for it over the long run.

Even if you think that dbt isn’t a fit for your current project, we’d love it if you would say hi on Slack. There are tons of smart people in the dbt community who would love to be a sounding board for your decision and would benefit from your experience!

I agree with all of these assumptions! What do I do next?

Amazing! Wonderful! Fantastic! If you aren’t already using dbt, head over to our documentation to get started. We also have an incredible Slack community of dbt users and we’d love it if you’d say hi!


[1] A note on what we mean by “warehouse”:
We use the term “warehouse” a lot in this article, so it’s worth defining what we mean by it. We consider a data warehouse to be a single logical location for your data: one that contains all the objects a superuser account can query without changing credentials.

For the data warehouses that dbt supports, this translates to:

  • Postgres: a single database within a Postgres cluster.
  • Redshift: a single database within a Redshift cluster, which may include external schemas that can be accessed via Spectrum.
  • Snowflake: a single Snowflake account, which may include multiple databases, as well as shares and stages.
  • BigQuery: a single BigQuery account, with access to one or more projects.

[2] A note on the seed functionality:
While dbt can load CSVs of raw data into your warehouse (through the `seed` command), this should only be used to load fairly static CSVs used in data transformations — for example, a list of mappings of country codes to country names. It should not be used to load frequently-changing source data.
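For example, assuming a hypothetical seed file `country_codes.csv` with `country_code` and `country_name` columns, a model could join to it after running `dbt seed`:

```sql
-- models/orders_with_country.sql
-- After `dbt seed` loads country_codes.csv into the warehouse, the seed is
-- referenced just like any other model via {{ ref(...) }}.
-- The orders model and all column names here are hypothetical.
select
    orders.*,
    codes.country_name
from {{ ref('orders') }} as orders
left join {{ ref('country_codes') }} as codes
    on orders.country_code = codes.country_code
```

Because the mapping of country codes to names changes rarely (if ever), it fits the seed use case; a table that grows every day does not.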

[3] A note on Python or Javascript UDFs:
Depending on your warehouse, you may be able to write user-defined functions (UDFs) in Python or JavaScript. UDFs are often computationally expensive, so their use should be limited. However, they can be extremely useful for solving a specific transformation need, in which case their use is entirely appropriate. A guide on using dbt to version control your UDFs can be found here.