dbt, our open source product for data modeling, is SQL-only. SQL can, by itself, accomplish a majority of analytical (non-data science) workloads. With dbt’s built-in Jinja2 templating engine, including its ability to introspect the schema at runtime, SQL becomes even more capable.
dbt functions as a compiler and a runner, while the database is responsible for execution. This decision to outsource execution to the database means that dbt gets a lot “for free”:
- Free “upgrades”. As new database technologies come out, dbt supports them natively, simply by building new adapters. With all of the innovation in MPP databases, this is very important.
- Free scalability. dbt is as scalable as your database. dbt doesn’t introduce an additional layer of technology to your data stack that you have to think about scaling.
- Free user and access control. Databases already have battle-tested security models. Rather than introducing another permissions model, dbt relies on the one you already have.
- Free interoperability. If you have a data warehouse, you’ve probably figured out how to get data into it and how to visualize that data. Operating within that warehouse allows dbt to plug into that ecosystem without needing to think about how to get data in and out.
Getting all of this “for free” is what has allowed us to build dbt into such a useful and capable product in so little time and with so few resources. Supporting SQL also means that dbt is accessible to the largest possible set of analysts, as SQL knowledge is widespread. It also makes dbt very easy for analysts to set up, as there is no local execution environment to configure.
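To make the compile-then-delegate split concrete, here is a minimal sketch in Python. It is not how dbt is implemented: `string.Template` stands in for Jinja2, an in-memory SQLite database stands in for the warehouse, and the table, column, and function names (`raw_orders`, `orders_clean`, `compile_model`, `run_model`) are made up for illustration. The point is only that the tool introspects the schema, renders SQL, and hands that SQL to the database, which does all of the actual work.

```python
import sqlite3
from string import Template

# Hypothetical model definition: a SELECT with a placeholder for the column list.
# dbt itself uses Jinja2; string.Template is a stand-in to keep this stdlib-only.
MODEL_SQL = Template("select $columns from raw_orders where status = 'complete'")


def introspect_columns(conn, table):
    """Ask the database for a table's column names at run time."""
    rows = conn.execute(f"pragma table_info({table})").fetchall()
    return [row[1] for row in rows]  # second field of each row is the column name


def compile_model(conn, name, exclude=()):
    """Render the template into a plain SQL statement. No data is touched here."""
    columns = [c for c in introspect_columns(conn, "raw_orders") if c not in exclude]
    select = MODEL_SQL.substitute(columns=", ".join(columns))
    return f"create table {name} as {select}"


def run_model(conn, compiled_sql):
    """Hand the compiled SQL to the database, which performs the execution."""
    conn.execute(compiled_sql)


# SQLite stands in for the warehouse in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table raw_orders (id integer, status text, amount real, internal_note text)"
)
conn.executemany(
    "insert into raw_orders values (?, ?, ?, ?)",
    [(1, "complete", 10.0, "a"), (2, "cancelled", 5.0, "b"), (3, "complete", 7.5, "c")],
)

compiled = compile_model(conn, "orders_clean", exclude={"internal_note"})
run_model(conn, compiled)
result = conn.execute("select id, amount from orders_clean order by id").fetchall()
```

Everything that touches data happens inside `conn.execute`; the Python side only produces a string. That is why scaling, permissions, and interoperability all come from the database rather than from the tool.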
But SQL can’t run a logistic regression. It can’t train a TensorFlow model. And while 80% of data science is actually data munging, we do also care about (and run into) these types of problems.
Just recently we were working with a client to build a predictive model. The model is a fairly straightforward logistic regression, written in Python in a Jupyter Notebook. The weights need to be updated periodically as new data comes in. Pretty typical stuff.
It is not hard to stick this Python on an EC2 box, run it in cron, and have it output weights to the client’s database, but this solution is not at all elegant. The frustrating thing is that there doesn’t seem to be a great solution today.
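For a sense of what that cron job amounts to, here is a rough sketch. It is not the client's model, which isn't shown here: the logistic regression is fit with plain batch gradient descent using only the standard library, SQLite stands in for the client's database, and the table and column names (`training_data`, `model_weights`, `feature_a`, `feature_b`, `converted`) are hypothetical.

```python
import math
import sqlite3


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(rows, labels, lr=0.1, epochs=2000):
    """Fit by batch gradient descent; returns [bias, w1, ..., wn]."""
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for x, y in zip(rows, labels):
            pred = sigmoid(weights[0] + sum(w * xi for w, xi in zip(weights[1:], x)))
            err = pred - y
            grad[0] += err
            for j, xi in enumerate(x):
                grad[j + 1] += err * xi
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad)]
    return weights


def refresh_weights(conn):
    """The body of the scheduled job: read data, refit, overwrite stored weights."""
    data = conn.execute(
        "select feature_a, feature_b, converted from training_data"
    ).fetchall()
    rows = [(a, b) for a, b, _ in data]
    labels = [y for _, _, y in data]
    weights = fit_logistic(rows, labels)
    conn.execute(
        "create table if not exists model_weights (name text primary key, value real)"
    )
    names = ["bias", "feature_a", "feature_b"]
    conn.executemany(
        "insert or replace into model_weights values (?, ?)", zip(names, weights)
    )
    conn.commit()
    return weights


# Toy data so the sketch runs end to end; a real job would connect to the warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("create table training_data (feature_a real, feature_b real, converted integer)")
conn.executemany(
    "insert into training_data values (?, ?, ?)",
    [(0.1, 0.2, 0), (0.4, 0.1, 0), (0.3, 0.9, 1), (0.8, 0.7, 1),
     (0.9, 0.3, 1), (0.2, 0.6, 0), (0.7, 0.2, 0), (0.6, 0.8, 1)],
)
weights = refresh_weights(conn)
stored = dict(conn.execute("select name, value from model_weights").fetchall())
```

The code itself is short. What's missing is everything around it: the box it runs on, the schedule, the Python environment, and the credentials all live outside the warehouse and have to be managed separately.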
We’ve thought a lot about this problem, because we’d love to come up with a solution that feels more like building models in dbt. Here’s what we’ve considered so far and what we do and do not like about these options:
- Google Cloud Dataproc, Dataflow, AWS EMR: These are great services, and give you free scalability. Unfortunately, they’re difficult for analysts to use because jobs have to be written for a distributed framework such as PySpark (or Apache Beam, in Dataflow’s case) rather than in plain Python. While this makes sense, it introduces a hurdle for analysts (“What do you mean I can’t use Prophet?”). It also makes the problem of local environment configuration more challenging.
- Domino Data Lab (or similar): There are a handful of products out there that provide entire data science stacks, including associated execution environments. These solutions are pricey and they lock you into a closed source ecosystem controlled by one vendor. Under the hood, they’re really just packaging up individual EC2 instances for you to run your jobs on and wrapping them in some nice UI. Meh.
- AWS Lambda: Lambda’s serverless model is promising, and is well suited to some analytic workloads. There is minimal configuration required and analysts can write code in plain Python. The problem is that each Lambda worker has strict resource limits, and while it’s totally possible to split some jobs across many workers, this is not a good option for large-scale data processing.
This represents our understanding of the solution set today. While large-scale data processing in Python is increasingly capable, it still doesn’t feel increasingly accessible. In order to bridge the gap to a point where millions of analysts have Jupyter Notebook installed and are deploying analytics to production, something is going to have to change.
It’s not clear whether dbt will play a role in this process. We take this problem seriously and are looking for opportunities to solve it in a way that is consistent with our view of how analytics products should be built. We’re open to ideas.