I’m going to assume you’re already familiar with both dbt and Looker. If you haven’t heard of dbt before, you should probably start here.
What exactly is modeling, anyway?
The modern data stack is modular — composed of multiple tools all dedicated to doing one thing extremely well. My friend David Wallace at Mode wrote about this better than I ever could, but the short version is that modularity gives you choice and prevents lock-in. These are very good things.
For all its good qualities, though, the modular architecture can be a bit overwhelming because of the number of different tools working in concert. It can be unclear where the boundaries of one tool end and the next one's begin. Nowhere is this more true than in modeling.
Talking about data modeling is confusing because it can mean several different things. Here are four distinct data activities that are all frequently referred to as “data modeling”:
- Cleansing data to fix problems that exist in the source. For example, removing test records so that they’re not included in final results.
- Re-shaping or pre-aggregating data so that it can be more reliably, conveniently, or performantly analyzed. For example, aggregating a table of website page views into a smaller table of sessions.
- Creating metadata about data structures and their relationships that allow tools to act on that data with context. For example, setting field data types and join conditions in LookML.
- Applying algorithms to a dataset to perform a classification or prediction task. For example, predicting a user’s 30-day churn likelihood using a logistic regression.
Last week I got the question “If I’m already using dbt, do I really need to use Looker?” I also get the inverse — “I’m already using Looker, do I really need to use dbt?” The answer, in both cases, is yes.
dbt and Looker actually accomplish two distinct, and complementary, types of data modeling.
Analysts use dbt to clean, re-shape, and pre-aggregate data. Analysts use LookML to build a metadata layer that allows Looker to write SQL queries against your schema. Use dbt to build your foundation, then use LookML to teach Looker how to navigate it.
Let’s make this a bit more concrete.
A Practical Example: Page Views and Sessions
A web session is simply a sequence of page views that happen within some defined time window. Web analytics tools typically set a session timeout of 30 minutes, so any 30-minute break in activity indicates the end of a session. When you pipe web analytics data into your warehouse, you'll usually just get a stream of events with timestamps and some other information, but that's rarely a useful level of granularity to analyze. If you want to know how many people came to your site, how long they stayed, whether they converted, and where they came from, you're asking questions about sessions.
It turns out that “sessionizing” a stream of page views in SQL is a non-trivial task, and it’s especially challenging to do in a performant way when you’re dealing with a ton of data. In fact, the dbt code required to performantly sessionize page view events emitted by Snowplow is over a thousand lines long. If you’re curious, take a look.
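The core idea, stripped of all the incremental-loading and performance machinery, is a window-function pattern. Here's a minimal sketch — not the actual Snowplow package code — assuming a hypothetical page_views table with user_id and viewed_at columns:

```sql
-- Sketch only: assumes a hypothetical page_views(user_id, viewed_at) table.
with flagged as (

    select
        user_id,
        viewed_at,

        -- A session begins on a user's first page view, or after a
        -- gap of more than 30 minutes since their previous one.
        case
            when lag(viewed_at) over (
                     partition by user_id order by viewed_at
                 ) is null
              or viewed_at - lag(viewed_at) over (
                     partition by user_id order by viewed_at
                 ) > interval '30 minutes'
            then 1
            else 0
        end as new_session

    from page_views

)

select
    user_id,
    viewed_at,

    -- A running sum of the flags numbers each user's sessions.
    sum(new_session) over (
        partition by user_id
        order by viewed_at
        rows unbounded preceding
    ) as session_index

from flagged
```

The production package layers incremental processing and performance optimizations on top of this basic pattern, which is where most of those thousand lines go.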
Creating this sessions table isn't something that Looker is really able to do. You can't sessionize data without writing custom SQL; the query is too complex for Looker's query model. You could load some version of this code into several PDTs, but that wouldn't work very well: PDTs are missing basic features like incremental loading and dependency resolution that are absolutely required when you're doing this kind of heavy lifting.
This is a perfect use case for dbt. dbt is excellent at handling this type of transformation — it makes complex chains of dependencies easy to manage, and makes incremental loading seamless and fast.
Once you have a sessions table, it's time to write some LookML. Describe your sessions table in a LookML view (data types, field groups, etc.), then join it to your users table in your LookML model. All of a sudden you have an incredibly powerful source of behavioral analytics and marketing attribution. It's time to build some dashboards.
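As a sketch, that LookML might look something like this (table, field, and join names are all hypothetical):

```lookml
# Sketch only — table, field, and join names are hypothetical.
# In practice, the view and the explore live in separate files.
view: sessions {
  sql_table_name: analytics.sessions ;;

  dimension: session_id {
    primary_key: yes
    type: string
    sql: ${TABLE}.session_id ;;
  }

  dimension: user_id {
    type: string
    sql: ${TABLE}.user_id ;;
  }

  dimension_group: session_start {
    type: time
    timeframes: [time, date, week, month]
    sql: ${TABLE}.session_start ;;
  }

  measure: session_count {
    type: count
  }
}

explore: sessions {
  join: users {
    sql_on: ${sessions.user_id} = ${users.user_id} ;;
    relationship: many_to_one
  }
}
```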
Build your foundation with dbt, then write LookML to teach Looker how to join everything together.
If you’ve read up to this point, you have the basics. Feel free to stop reading if you just wanted a high-level overview. But if you’re an analyst who works with both dbt and Looker, you probably have more detailed questions. Here are some I’ve been asked and our beliefs about how to best navigate them.
Should I only use dbt for heavyweight transformations?
Ideally, no. We never point a Looker view to a raw data table; 100% of Looker views we write are pointed at dbt models.
Imagine the following scenario. You were using Alooma to load data from your Postgres database, but you decided to switch to Stitch. Alooma was loading data into schema alooma_postgres and Stitch is now loading data into schema stitch_postgres. You need to switch everything over. If you have solid dbt base models built and have implemented tests on top of them, you can make this schema name change in a single configuration file, then re-run all of your tests to validate that the data checks out. Looker never has to know that you made the switch — no LookML code changes required.
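In current versions of dbt, this is exactly what the sources feature is for: base models reference a source defined once in a YAML file, so the schema name lives in a single place. A sketch, with hypothetical source and table names:

```yaml
# models/src_postgres.yml (sketch) — the only line that changes
# when you swap loaders is the schema.
version: 2

sources:
  - name: app_postgres
    schema: stitch_postgres   # previously: alooma_postgres
    tables:
      - name: users
      - name: orders
```

Each base model then selects from `{{ source('app_postgres', 'users') }}`, so nothing downstream, Looker included, ever mentions the schema name directly.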
Here’s another scenario. In order to filter certain records out of a table, you add a sql_always_where to your explore. You reason that, since users can’t change it, you’re guaranteed to eliminate those bogus records from your metrics. But what about internal users who use Mode? Or Periscope? Or Postico? Most companies use a variety of analytics products. Implementing this filter in a dbt model instead will filter out those records for all users of your data warehouse, not just the Looker users. Every tool that connects to your warehouse has access to the logic you build in your dbt models.
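The same filter, implemented once in a dbt model (a sketch; the model and column names are hypothetical):

```sql
-- models/orders.sql (sketch): every tool that queries the warehouse
-- sees orders with the bogus records already removed.
select *
from {{ ref('base_orders') }}
where is_test_order = false
```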
Finally, imagine this: you stop paying for Looker. Looker is, after all, not an inexpensive product. Software engineers generally prefer to write code on open platforms: they invest so much time in their code that they want control over it. This is why dbt is open source: you own your code and the tool used to run it. Looker is amazing, but you should generally limit lock-in where you can.
Use dbt for all transformations, large and small. Implement tests. Create a solid foundation that every data user at your company can share.
Should I create calculated fields in a Looker view or in dbt?
It feels so easy to write a case statement in a Looker view. And it is! You’re already in your browser; just hit dev mode, add the field, and you’re good to go. The problem, as above, is that non-Looker users won’t have access to this field. And, of course, you can’t test it.
You may respond “But everyone at our company uses Looker!” And that may even be true — today. It won’t be true in the future, though, when you hire that data scientist who does all of their work in Jupyter. If you suggest that they access your modeled data via the Looker API instead of simply writing SQL against your warehouse, you’ll have one unhappy and inefficient data scientist on your hands.
Don’t develop in Looker without a terminal window open. When you need new fields, build them in dbt, and build tests for them at the same time.
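For example, a hypothetical calculated field and the tests that ship with it might look like this (model, column, and tier names are illustrative):

```sql
-- models/users.sql (sketch): a calculated field every tool can use.
select
    *,
    case
        when lifetime_revenue >= 1000 then 'high'
        when lifetime_revenue >= 100  then 'medium'
        else 'low'
    end as value_tier
from {{ ref('base_users') }}
```

```yaml
# models/schema.yml (sketch): the field ships with its own tests.
version: 2

models:
  - name: users
    columns:
      - name: value_tier
        tests:
          - not_null
          - accepted_values:
              values: ['high', 'medium', 'low']
```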
Should I calculate metrics in dbt or Looker?
I’ve pushed you towards using dbt for a lot of use cases, but this is one that is definitely Looker’s job. Looker is absolutely best-suited to calculate metrics.
Let’s do another scenario. Imagine you’re an ecommerce company and the primary table in your warehouse is your orders table. You have three KPIs that you get from the orders table: revenue, orders, and customers. You could aggregate them either in dbt or LookML.
What if one user wants to look at monthly revenue and another wants to look at daily revenue? What if your users want to group by all kinds of customer segments, product groupings, geographies, and acquisition channels? Do you want to pre-aggregate all permutations of this data? You’d have to if you wanted to do all of this in dbt.
dbt is a batch-based tool. It executes a series of data transformations against your database on a schedule; it doesn’t respond to user input. The whole point of an interactive BI tool is that the end user, who often isn’t familiar with your data warehouse and doesn’t know how to write SQL, is able to formulate their own questions, and the BI tool answers those questions on the fly. This implies that you need to give your BI tool access to data at a granularity that allows the user to actually make meaningful choices. Then, once the user has applied whatever filtering and grouping they choose, it’s up to Looker to run the sum(amount) query to get revenue.
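In LookML, those three KPIs are just measures defined over the granular orders table, so Looker can aggregate them on the fly with whatever filters and groupings the user picks. A sketch, with hypothetical table and field names:

```lookml
# Sketch only — metrics are defined once and aggregated at query time.
view: orders {
  sql_table_name: analytics.orders ;;

  dimension: amount {
    type: number
    sql: ${TABLE}.amount ;;
  }

  measure: revenue {
    type: sum
    sql: ${amount} ;;
  }

  measure: order_count {
    type: count
  }

  measure: customer_count {
    type: count_distinct
    sql: ${TABLE}.customer_id ;;
  }
}
```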
Don’t limit user choice by calculating metrics ahead of time in dbt. Give Looker clean, performant data to work with and then let it do its thing!
Anything I’m missing?
There are a lot of people today using dbt and Looker together, and it’s not always clear which tool is best at what. Hopefully this post gives you a sense of how the tools fit together and how we manage some of the overlapping areas in our own work.
If you have other questions in this vein that the post didn’t cover, feel free to respond below or to hit me up directly in Slack. I’d love to add them.