dbt's metrics capabilities, recently announced at Coalesce but not yet launched, are already generating a lot of buzz. As this train continues to gather steam, here's one question, asked by Mikael at Drew's presentation, that I realized I hadn't spent enough (really, any) time talking about publicly: "Will this be open-source? The dbt server thing?"
Let's start at the beginning: why did we open source dbt Core in the first place? Why are we not just a typical proprietary software company? There are really two answers to this question: values, and strategy.
If you've never read our values before, you should. I wrote them prior to starting the company, and they explain a lot of our somewhat unusual take on the world. Our values define our success not in terms of our own metrics, but rather in terms of the impact we make on the world around us: creating knowledge, creating value, relying on our users to pay those gifts forward, and trusting that somewhere in this process there will be room for us to create a financial return for ourselves.
Open source could not be more consistent with these values. Open source isn't a tactic; it's the only licensing model that made sense for our core innovation.
It was clear from the outset that dbt was fundamentally a compiler for a new programming language. And no one wants to learn a new language if that language is owned by someone else. No one wants to lock up all of their business logic (thousands, tens of thousands, even hundreds of thousands of hours of work) in a language that they fundamentally don't control. Imagine a team of ten people spending a year building a dbt project and then having us, the sole owners of the proprietary compiler for dbt-SQL, triple your price because we know how locked in you are.
Open source meant freedom for our users; that freedom in turn created trust between us and our users. It is in large part this scalable trust that has enabled dbt to become the standard that it has.
Enter: dbt Cloud
In 2017 we launched Sinter Data, the product we later renamed to dbt Cloud. It was the first time we had ever built a cloud service, the first time we wrote proprietary software. We thought a lot about the line between dbt Core, our Apache 2.0-licensed code base, and Sinter. Licensing was a hot topic back then—this was in the run-up to the period when Redis, Elastic, Mongo, and many others re-licensed their projects—so we had the opportunity to learn from many much larger communities. What a gift! We are deeply indebted to the open source maintainers that have come before us.
While we haven't always been exactly consistent with this definition, the line we drew was as follows:
- open source products will be focused on languages and integrations with data platforms.
- open source products will be stateless.
- open source products will be interacted with either via APIs or the CLI, not GUIs.
- the compiler for any language we ever create will be permissive-OSS; there will be no "proprietary language extensions." C++ doesn't require a license to use the for loop, so dbt won't require you to pay for parts of any of our languages either. this was critical for ecosystem adoption.
- proprietary products will focus on ease-of-use and technical infrastructure required to utilize core language constructs in production.
- proprietary products will typically be stateful and present themselves as cloud services.
- proprietary products will be built on top of open source products and will not re-implement their functionality. users of open source products and commercial products will always use the same compiler. anything else represents too great an opportunity for a schism between open source and commercial user bases over time.
There are exceptions—for instance, dbt docs was an example of us breaking our understood rules! We just felt like it was too important for users to have access to this basic level of catalog and graph visualization as dbt projects started to get more complex in 2018. But by and large you can look at the products that we build today and see these same standards applied very consistently.
What's worth stating here is that we are not dogmatic about open source. Some of the code that we write makes sense to be open; some of it makes more sense to reserve for ourselves to build a sustainable business around.
Even users who only use our OSS code should care about the health of our commercial business: as we've seen commercial success, our investment in open source has dramatically increased. Our commitment to OSS never exceeded roughly 1 FTE while we were exclusively funded by consulting revenue; in the past two years that has grown to ~8 FTE. Double that if you count the humans we now employ to support the OSS-centered dbt Community! There will always be a direct relationship between our commercial success and our ability to contribute more value into the world.
From Batch to Interactive
If you watched Drew's talk at Coalesce, you know that dbt is growing its footprint from batch to interactive queries. That shift will support a lot of initiatives, but initially it's about building metrics.
One of the things this means is a jump in technical complexity. Batch is fundamentally pretty simple: ask dbt to do some work, it spins up, does its thing, and spins back down. Because response time in a batch context is not that critical, the entirety of the dbt program and the entirety of your codebase can be fully read into memory, operated upon, and then flushed at the completion of each thing you ask dbt to do. If you do two successive dbt run operations, your operating system has to load dbt into memory twice. Not exactly efficient.
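To make that cost concrete, here is a toy Python sketch. ToyProject, batch_invocation, and PersistentProcess are hypothetical names invented for this illustration and are not dbt Core classes; the only point is that every batch invocation pays the load cost again, while a long-lived process pays it once.

```python
class ToyProject:
    """Hypothetical stand-in for a fully loaded program plus codebase."""

    load_count = 0  # how many times the expensive load has happened

    def __init__(self):
        # In a real tool this is where files would be read and parsed.
        ToyProject.load_count += 1

    def handle(self, request: str) -> str:
        return f"handled {request}"


def batch_invocation(request: str) -> str:
    """Batch style: load everything, do the work, then throw it all away."""
    project = ToyProject()
    return project.handle(request)


class PersistentProcess:
    """Long-lived style: load once at startup, reuse for every request."""

    def __init__(self):
        self.project = ToyProject()

    def handle(self, request: str) -> str:
        return self.project.handle(request)
```

Calling batch_invocation twice increments ToyProject.load_count twice, whereas a single PersistentProcess serving two requests increments it only once.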
In an interactive world, that doesn't fly—interactivity requires orders of magnitude better performance. One of the things we're building in order to support metrics, then, is the dbt server. To return to Mikael's original question:
Will this be open-source? The dbt server thing?
Before fully addressing this, let's do a final detour and talk about how metrics will actually be processed. There are 5 distinct codebases involved in responding to a request for a metric.
- dbt Core - As stated above, dbt Core is ultimately responsible for compiling 100% of dbt-generated SQL. Metrics will use this same engine. This code is and has always been Apache 2.0-OSS.
- Your database adapter - dbt's database adapters teach Core how to connect to and write SQL for each database that dbt works with. Metrics will rely on new cross-database SQL compilation functionality to be built into each adapter. This code is and has always been Apache 2.0-OSS.
- Metrics standard library - Just as dbt understands how to materialize datasets today and this functionality is written as macros, dbt will need to understand how to write interactive SQL queries. This logic will also be written in macros. This net new codebase will be licensed Apache 2.0-OSS.
The above three codebases are all Apache 2.0-licensed open source software and represent everything that you need in order to generate and execute the SQL for a canonical dbt metrics query from Python or the CLI. There are two additional pieces of infrastructure required to respond to such queries in an interactive environment:
- dbt Server - dbt Server wraps dbt Core in a persistent server that is responsible for handling RESTful API requests for dbt operations. It's a thin interface that is primarily responsible for performance and reliability in production environments. This is a new codebase and will be licensed under a new license for us—the Business Source License (BSL). This license will enable all users to run the server on their own behalf without limitation but will prevent it from being sold as a cloud service.
- The proxy server - The proxy server is what enables dbt to intercept requests to a database and compile dbt-SQL into raw SQL that the database understands. It is not a very dbt-specific piece of technology and needs to run with extremely high performance and reliability. This is a new codebase; it will be proprietary and delivered as a component of dbt Cloud.
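Since dbt Server is not yet released, its real interface is unknown here. The following is only a minimal sketch of the general shape described above: a thin, long-lived HTTP layer that delegates all compilation to an underlying engine. The /compile route, the JSON payload shape, the port, and toy_compile are all assumptions made up for this example.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Tuple


def toy_compile(dbt_sql: str) -> str:
    """Hypothetical placeholder for the compilation that dbt Core performs."""
    return dbt_sql.strip()


def handle_compile_request(path: str, raw_body: bytes) -> Tuple[int, bytes]:
    """Request logic: validate input, delegate to the compiler, serialize."""
    if path != "/compile":  # hypothetical route, not a documented endpoint
        return 404, json.dumps({"error": "not found"}).encode()
    try:
        payload = json.loads(raw_body or b"{}")
    except ValueError:  # covers malformed JSON and undecodable bytes
        return 400, json.dumps({"error": "invalid JSON"}).encode()
    if not isinstance(payload, dict):
        return 400, json.dumps({"error": "expected a JSON object"}).encode()
    compiled = toy_compile(str(payload.get("sql", "")))
    return 200, json.dumps({"sql": compiled}).encode()


class CompileHandler(BaseHTTPRequestHandler):
    """Thin HTTP adapter around handle_compile_request."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        status, body = handle_compile_request(self.path, self.rfile.read(length))
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(port: int = 8580) -> HTTPServer:
    """Build the long-lived server; call serve_forever() on the result to run it."""
    return HTTPServer(("127.0.0.1", port), CompileHandler)
```

Keeping the request logic in a plain function, separate from the HTTP adapter, mirrors the idea that the server layer is thin: the interesting work stays in the open source compiler underneath.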
To answer Mikael's original question specifically: the dbt Server part of this stack will be "source available," not "open source." That said, the BSL provides nearly all of the openness that users would typically get from an Apache 2.0 license:
- It provides access to the source code, just like Apache 2.0.
- It provides the ability for users to modify the source code, just like Apache 2.0.
- It provides the ability for users to run the source code on their own behalf, just like Apache 2.0.
- It has a very cool provision whereby all code released actually graduates into true Apache 2.0 open source after a certain period (we plan to use 3 years).
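The graduation provision in the last point is simple date arithmetic. Here is a small Python sketch, assuming the three-year period is measured from each version's release date (the exact mechanics would be set by the license terms for each release); the function names and the dates in the usage note are hypothetical.

```python
from datetime import date

CHANGE_AFTER_YEARS = 3  # the period the post says it plans to use


def change_date(released: date, years: int = CHANGE_AFTER_YEARS) -> date:
    """Date on which a given release would convert to Apache 2.0."""
    try:
        return released.replace(year=released.year + years)
    except ValueError:
        # Released on Feb 29 and the target year is not a leap year.
        return released.replace(year=released.year + years, day=28)


def license_on(released: date, today: date) -> str:
    """License that applies to a release on a given day, under the assumption above."""
    return "Apache-2.0" if today >= change_date(released) else "BSL"
```

Under these assumptions, a version released on 2022-01-15 would be BSL-licensed on 2024-06-01 and Apache 2.0-licensed from 2025-01-15 onward, while newer versions would still be under the BSL.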
The main difference between the BSL and Apache 2.0 for the dbt Community will be that recent versions of the dbt Server will not be able to be incorporated into other commercial products. A user will be able to deploy and use all versions of the server for themselves / their company just as they would with Apache 2.0 code.
We chose this license because it gives users what they need (control) while reserving for ourselves enough space for commercialization. Large cloud providers cannot simply take the latest version of the dbt Server, fork it, and sell it to their existing customer base without giving anything back to the dbt Community.
This licensing approach is also highly consistent with the standards we've applied to our licensing model for the past five years: all SQL compilation and database connectivity are happening in open source software, and all cloud services are either more restrictive or proprietary. It's possible that we may use the BSL in future code bases as a way to bridge the gap between OSS and proprietary, as it maximally empowers users while reserving some rights for ourselves.
So I'll need to use dbt Cloud?
So where does that land you as a user? Will you need to use dbt Cloud in order to utilize dbt's upcoming metrics capabilities?
You won't. dbt Cloud will make the process far easier and it will also come with a bunch of added functionality—like authentication and authorization!—and integrations with BI tools that rely on other dbt Cloud functionality. But you'll certainly be able to run this infra yourself and generate canonical dbt metrics responses if you decide you'd prefer to run a high-availability real-time service. Upon release, we'll put documentation together for running the dbt Server.
If you also want to intercept requests to your database and inject dbt compilation logic in this process (which isn't required), you'd just need to write your own reverse proxy. The simplest possible version of this is not so hard to do, and if you are thinking about running this type of infrastructure in production this is likely well within your capabilities.
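As a rough idea of what the simplest version could look like, here is a Python sketch of the interception step only: inspect an incoming query, compile it if it contains templating, and forward the result. Real compilation would be done by dbt Core; toy_resolve_refs is a made-up regex stand-in that only rewrites ref() calls, and the "analytics" schema, function names, and run_on_database callback are assumptions for illustration. A production proxy would also need to speak the database's wire protocol, which is out of scope here.

```python
import re
from typing import Callable

# Matches templated references such as {{ ref('orders') }}.
REF_PATTERN = re.compile(r"\{\{\s*ref\(\s*['\"](\w+)['\"]\s*\)\s*\}\}")


def toy_resolve_refs(query: str, schema: str = "analytics") -> str:
    """Toy stand-in for real compilation: turn ref() calls into table names."""
    return REF_PATTERN.sub(lambda m: f"{schema}.{m.group(1)}", query)


def proxy_query(query: str, run_on_database: Callable[[str], object]) -> object:
    """Intercept a query, compile it only if needed, then forward it."""
    if "{{" in query:  # cheap check: plain SQL passes through untouched
        query = toy_resolve_refs(query)
    return run_on_database(query)
```

For example, passing "select * from {{ ref('orders') }}" through proxy_query forwards "select * from analytics.orders" to the database callback, while "select 1" is forwarded unchanged.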
While dbt Cloud isn't a requirement, it's true that responding to interactive queries is a heck of a lot more complex from a technical perspective than running dbt run on a schedule, and so it's likely that a lot more folks will choose to outsource the problem to us.
During the early days, we're going to include metrics capabilities with all dbt Cloud plans, including the Developer plan, for free (subject to generous caps to prevent damage to our infrastructure). Our goal is just to drive adoption and experimentation. Over time as this part of the product matures and we start to understand usage patterns, we'll roll out some type of pricing that ensures that it's accessible to the broadest possible audience while monetizing appropriately.
Navigating Trust, Community, Profit, and Standardization
Licensing choices live at the very heart of the open core business model. In some ways they're tremendously boring (I have now spent countless hours reading licenses and legal opinions about licenses), but they're what everything else is built upon.
Open source is all about creating a non-traditional incentive structure for a large group of humans—these licenses determine who can do what, and the downstream impacts of that decision are enormous. As Charlie Munger said, "Never, ever, think about something else when you should be thinking about the power of incentives."
So the balancing act is this: give as much value as possible to the community while saving some space for ourselves to build a sustainable business. Create the user trust required to bring a new standard into existence and make that standard easily accessible to everyone through both open codebases and scalable, performant, and easy-to-use commercial offerings. Create space for the Community to build the future of the modern data stack (MDS) with us and ensure that cloud providers can't take this work and repackage it without also giving back to the Community.
These licensing choices do the best job of balancing these interests, and I hope that they act as a framework for many years and decades into the future.