Advice from analysts and data engineers at HubSpot, Managed by Q, GitLab, Wong Decision Intelligence, AskWonder, and Calogica
Tim Finkel, Manager of Data & Analytics at Managed by Q, didn’t have a “data testing problem”; he had a bunch of business users who didn’t quite trust the data they were getting. These end users had access to Looker and, in theory, could do their own analysis, but they were still coming to analysts to verify that the KPIs and dimensions were actually correct.
This fear about data quality wasn’t unfounded. The company was growing and the business was changing, and all of these changes (eventually) showed up in the data:
- The engineering team was shipping new functionality every day; if a change affected analytics code, analysts wouldn’t know about it until a report broke.
- Changes in business systems would break analytics code. One memorable moment: an update to the order processing system produced a report showing that order volume had just dropped by $100k 😬
The data team had no way to keep tabs on all the changes happening in the business, so they spent an enormous amount of time reacting to problems as end users reported them. And every problem eroded trust. It’s a position that is so familiar to analysts that many of us assume it is simply part of the job, but Tim’s CTO encouraged him to question that assumption.
What if instead of reacting, you could put proactive systems in place that would give you visibility into data problems before those changes impacted the end user?
This is the mindset software engineers use to build systems for high reliability. And when Tim started applying it to analytics problems, the solution became clear: they needed data tests.
What is data testing?
Data testing is the practice of making assertions about your data, and then testing whether those assertions are valid. The concept can be used both to test the quality of your source data and to validate that the code in your data models is working as intended.
In dbt, data tests are defined in easy-to-read YAML. When you run a test, dbt compiles this configuration into a query, runs it against your database, and returns results (and prints a big red FAIL message!) if your assertion turns out to be false. Testing is one of the things our power users like best about dbt.
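For example, here’s a sketch of what these assertions look like in a dbt schema file. The model and column names are hypothetical; the test names (`unique`, `not_null`, `accepted_values`) are dbt’s built-in schema tests:

```yaml
version: 2

models:
  - name: orders              # hypothetical model name
    columns:
      - name: order_id
        tests:
          - unique            # no two rows may share an order_id
          - not_null          # every row must have an order_id
      - name: status
        tests:
          - accepted_values:  # catch unexpected values
              values: ['placed', 'shipped', 'completed', 'returned']
```

When you run `dbt test`, each of these assertions is compiled to a query that counts rows violating it; a nonzero count fails the test.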
Why is data testing important?
dbt’s biggest testers all talk about trust. Without tests, end users and analysts can never quite trust the data being delivered.
Gordon Wong, Principal Consultant & Founder at Wong Decision Intelligence, says that, just like Maslow’s Hierarchy of Needs, there is a Data Hierarchy of Needs: “The foundation of my pyramid is quality, then reliability, user experience, and at the top is coverage.” Data that is high quality and reliable can be trusted, and the only way to have high quality and reliable data is to test it.
Data testing builds trust with end users, but it also means that analysts can trust the data, and that can have a dramatic impact on the productivity of your data team.
Tested data is trusted data.
When should I test my data?
There are regular points in time at which you’ll want to test your data. During development, you’ll be testing for data quality as well as testing data transformation logic. When you push new analytics code to production, you’ll want to run your tests to make sure that your new code isn’t breaking anything in the existing code base. And once your code is in production, you’ll want automated tests running at regular intervals to make sure that everything in your ETL pipeline and transformation logic is still working as expected.
1. During development
This is when you’ll be writing new tests. Claus Herther, Principal Consultant and Founder at Calogica, uses a simple data testing framework to guide the types of tests you might want to write at each stage of the data transformation process:
- Sources: Use simple schema tests like uniqueness and not_null to check for data quality and test your basic assumptions about the source data
- Staging models: Transformation tests to make sure you didn't create duplicates or fan out a join
- dim/fact models: Test core business logic using relationship tests, expression tests, recency tests, etc.
- Across models: e.g. making sure every order contains at least one item from a certain product category
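The layers above can be sketched in a single dbt schema file. This is an illustrative example, not Claus’s actual project; the model names are hypothetical, and the tests shown (`unique`, `not_null`, `relationships`) are dbt built-ins:

```yaml
version: 2

models:
  # Staging model: make sure a join didn't fan out or create duplicates
  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null

  # Fact model: core business logic
  - name: fct_orders
    columns:
      - name: customer_id
        tests:
          - relationships:            # every order must belong to a known customer
              to: ref('dim_customers')
              field: customer_id
```

Source tables can get the same treatment by declaring them under a `sources:` block and attaching `unique`/`not_null` tests to their key columns.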
2. During pull requests
Before merging changes into your analytics code base, test your data. We use dbt Cloud to do this: whenever a Pull Request is made against our internal-analytics repo, dbt Cloud kicks off a job that builds and tests all models in the development branch in a scratch schema. If any of the models or tests fail, dbt Cloud puts a big red X on the Pull Request; if everything works as intended, it shows a nice green check mark. These contextualized test successes and failures help guide code review, and act as a last check before code gets merged into production. In practice, it looks like this:
However you manage the process, make sure that new data models and data transformation code never enter your code base untested.
3. In production
Once your data tests are built, you want to run them on a regularly scheduled cadence. Things will change in your data:
- An engineer will push a new feature into production that changes your source data
- A business user will add a new field in the CRM, breaking the business logic of your data transformation
- Your ETL pipeline will experience an issue, pushing duplicate or missing data into your warehouse
Automated tests ensure that when something changes about your business or the data, you – not your end user – are the first to know. We use dbt Cloud to regularly run automated tests on our data models. If any of the tests fail, dbt Cloud sends us email and Slack notifications.
Other popular tools for running data tests in production include Airflow, automation servers like GitLab CI/CD or CodeBuild, and scheduled cron jobs (read more about the pros and cons of these different options here).
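If you go the cron route, a minimal sketch looks like the entry below. The paths, schedule, and log location are all illustrative assumptions, not a recommended setup:

```
# Illustrative crontab entry: run the test suite every morning at 6am
# and append the output to a log file for review.
0 6 * * * cd /opt/analytics && dbt test >> /var/log/dbt_test.log 2>&1
```

A real deployment would also want some way to alert on failures (e.g. a wrapper script that checks `dbt test`’s exit code and posts to Slack), which is a large part of what the managed schedulers above give you out of the box.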
Ok, so how do I get started?
Our testers have a few pieces of advice:
Write one test.
Taylor Murphy from GitLab says, “Don’t think you have to do a big refactor to get going. Two tests are better than one and one is better than none. If you’re using dbt (which, really, you should) then start with the basics of unique and not null tests. Keep adding as you’re building. If you find a bug in your data, then write a test for it so you can catch it sooner. Start small and build the habit. Eventually you have a good number of tests and then you can invest even more in better coverage.”
Commit to adding tests to your most used data sources or to all new models.
Jillian Corkin from HubSpot recommends starting by adding tests to all new data models moving forward or adding tests to your most heavily consumed data.
If necessary, commit time to doing a refactor.
Kriselle Sanchez, Analytics Lead at AskWonder, had been using dbt for four months and had about 100 data models before she wrote her first test. Business users were noticing problems, and her first tests revealed a few incorrect assumptions. She dedicated two weeks to the necessary refactoring: adding basic tests to all of her models and updating models that were built on flawed assumptions.
Talk to your end users.
Today, when Kriselle gets a new data set, her process looks a lot different: “Instead of starting with modeling the data based on what I’m seeing, I talk to stakeholders and understand their workflow. They often have gut feelings about how the data will look that I can then use to inform my first tests.”
Emilie Schario, a Data Engineer at GitLab, recommends talking to end users as well; she specifically asks about manual inputs in their process. Manual inputs are rife with human error, but if she knows where they exist, she can easily write data tests that catch mistakes before they cause a problem in the analysis.
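For instance, if a field is typed by hand in a CRM, an `accepted_values` test can catch typos and off-script entries. This is a hypothetical sketch (the model, column, and values are made up), not GitLab’s actual configuration:

```yaml
version: 2

models:
  - name: crm_deals               # hypothetical model over manually-entered CRM data
    columns:
      - name: deal_stage          # typed by hand by sales reps
        tests:
          - accepted_values:      # fail if a rep invents a new stage name
              values: ['prospect', 'demo', 'negotiation', 'closed_won', 'closed_lost']
```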
So...what’s a “normal” amount of data testing?
At Fishtown Analytics, we typically have a not_null and unique test on almost every model, plus some extras (tests for additional nulls, relationships, etc.). So what’s normal for most of our clients is a 1:2 or 1:3 model:test ratio. That’s normal for us, but not necessarily for the whole community.
Managed by Q has five tests for every model.
GitLab has six tests for every model.
Claus Herther says his largest project has 315 models, with at least one test on every dim/fact model.
HubSpot, still early in both its testing and dbt journey, has 40 tests for its 400+ models.
So don’t feel overwhelmed: start with your first test, knowing that your test coverage will grow over time.
Once you have the basics in place, nearly everyone eventually starts adding custom tests to prevent particular errors. Tim Finkel calls this “building the immune system.” An example he gives is a test his team has that makes sure that an invoiced order has a corresponding invoice in the payments system. They built this test after an issue where the product broke and no one knew. If this ever happens again, the data team will be the first to know.
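A custom test like Tim’s can be expressed in dbt as a SQL query that returns the rows violating the assertion; the test passes when the query returns zero rows. The sketch below is hypothetical (table and column names are assumed), not Managed by Q’s actual test:

```sql
-- tests/assert_invoiced_orders_have_invoices.sql
-- A dbt custom data test: passes when this query returns no rows.
-- Table and column names here are illustrative.
select o.order_id
from {{ ref('orders') }} as o
left join {{ ref('invoices') }} as i
  on o.order_id = i.order_id
where o.status = 'invoiced'    -- order has been invoiced...
  and i.invoice_id is null     -- ...but no invoice exists in the payments system
```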
Tim said the model:test ratio has grown from 1:4 about a year ago to 1:5 today. That increase comes primarily from adding custom tests to critical data sources, ensuring the same problem never happens twice.
GitLab’s experience has been similar. Emilie says that of the 18 custom tests they have today, 11 of these are tests on subscription revenue data from Zuora. Accuracy on this data is incredibly important to GitLab’s business, and each new data test builds stronger immunity against future problems.
🎁All tests, even the most simple tests, are a gift you give your future self. And they are gifts that keep on giving: every new test you add today will make you more immune to problems in your data tomorrow. 🎁
For more thinking from the dbt community on the role of testing, check out the following: