Test-driven development has been popular on software engineering teams for so long that it’s hard to imagine a mature software codebase without thorough automated testing. There are just too many moving pieces in even a modestly-sized codebase to assume that a software engineer will be able to keep all of the interactions in their head.

Mature engineering teams spend real time and energy maintaining their test frameworks and writing good tests. And yet, until recently, analytics teams didn’t apply similar rigor when asserting the quality of their often equally-sophisticated code bases.

In fact, until just a few years ago, almost no analytics teams had any code quality processes at all. This led to all kinds of issues: lack of trust in data, reporting silos with different definitions of truth, and a breakdown of internal data cultures. Many a promising data project has been laid low by this fundamental inability to create consistently high-quality outputs.

Here’s the simple truth: analytics code is code. Which means that it will have bugs. And without rigorous processes in place to find and eliminate those bugs, code quality will degrade over time.

Fortunately, the practice of analytics engineering is beginning to bring rigor to the modern data team, and data testing is a big part of that story. As of today, there are roughly 600 companies using dbt to test the data in their warehouses (close to 40% of the total install base), while Great Expectations has ~1k stars on Github.

This level of interest in data quality is exciting! But it’s not very revealing. In this nascent practice, we wanted to know how practitioners were actually going about asserting the quality of their pipelines. What types of errors were they looking for? How did they operationalize both writing tests and responding to failures?

What follows are summarized responses from a group of dbt power users to questions like these. As an industry, we have a long way to go when it comes to institutionalizing data testing, but we’re starting to see exciting progress. It’s only by sharing knowledge like this that we’re going to figure it out together, so thanks to everyone who took the time to participate!

Emilie Schario

Data Engineering, Analytics at GitLab

🚦Testing process
✅require data tests in order to merge any new data models into production
✅run automated data tests in production

🧪Model coverage
373%

💰Most valuable test
“Zuora CRM ID. All of our retention analyses are based on Zuora data. Unfortunately, there are a lot of manual processes around this data. Our custom data tests help us alert finance around a manual error. This has been so useful that we’re now rolling similar tests out to other parts of the org.”

💡Advice
“You will save yourself infinite amounts of bug chasing if you test your assumptions.”

🛠Testing tools
dbt, GitLab CI, Snowflake’s Zero Copy Clones

Brandyn Lee

Data Science and Analytics, Rover

🚦Testing process
✅require data tests in order to merge any new data models into production
✅run automated data tests in production

🧪Model coverage
1000%

💰Most valuable data test
“The default tests that ship with dbt are great. We super-powered them by adding an optional `where` clause. For example, if you are looking at an orders table and you have a column specifying someone's first order, then the purchaser's ID should be unique if you filter down to "where is_first_order = 1”. Once this helped me validate that a third party vendor was dropping records as it piped data from our production database to our analytics warehouse. We worked with our vendor to get things patched."

💡Advice on getting started
“There are more bugs in your queries than you know. Even if your queries were perfect when you first wrote them, sometimes underlying data generation processes change and your logic needs to be updated. Anticipate how things could break and write tests.”

🛠Testing tools
dbt, Luigi, cron jobs, visualization with Periscope Data

Michelle Ballen-Griffin

Director of Data & Analytics at Billie

🚦Testing process
✅require data tests on our most important sources and models
✅run automated data tests in production

🧪Model coverage
200%

💰Most valuable test
“Our most valuable data test involves making sure that our transactional data is up to date. Our transactional data is used to help us keep a pulse on the business, evaluate marketing efficiencies, and even define audience segments for trigger-based marketing communications. If this data were to become stale, that could result in inaccurate marketing analyses and even incorrect communications being sent to our customers. If the test ever fails (indicating data staleness) we have procedures in place for updating dependencies accordingly. “

💡Advice
“Spend some time thinking through all of your data dependencies and all of the things that *could* go wrong with your data. It helps to fully understand the nuances of how data is tracked and loaded into your warehouse. From there, assign priority to each test and start tackling the top priority ones first.”

🛠Testing tools
dbt, Looker

Joe Naso

Technical Product Manager at pymetrics

🚦Testing process
✅run one-off data tests when something breaks
✅require data tests in order to merge any new data models into production
✅run automated data tests in production

🧪Model coverage
129%

💰Most valuable test
“Validating relations between dependent and tangential models. We use this for pipeline validation of directly customer-facing reports.”

💡Advice
“Testing is daunting at first, but the lift of getting some simple tests in place is much smaller  than attempting to fix things downstream when problems arise. Even simple validation is useful, and at this point in the maturity of the analytics stack, there is no reason not to include tests in your analysis pipelines.”

🛠Testing tools
dbt, Travis CI, Airflow

Martin Guindon

Assistant Director, Business Intelligence at Absolunet

🚦Testing process
✅require data tests in order to merge any new data models into production
✅run automated data tests in production

🧪Model coverage
1098%

💰Most valuable test
“A test that checks for overlapping rows for the employees table We do not have an HRIS at this time, and therefore the employees list we import in our data warehouse is maintained manually in a Google Sheet. This sheet contains the start/end dates of teams and roles assignations, and as such there is a high risk of human error in creating overlapping rows, which would cause fanouts when we later join this table to other critical operational data.”

💡Advice on getting started
“Start with having schema tests in every model/source, then build your way up with simple data tests like asserting that certain rows or certain aggregations are returning correct known values. This will be a good foundation to build on. And finally fully investigate any test failure to fix the root cause.”

🛠Testing tools
dbt

Josh Temple

Analytics Engineer at Spotify (Previously Senior Data Engineer at Milk Bar)

🚦Testing process
✅run one-off data tests when something breaks
✅require data tests in order to merge any new data models into production
✅run automated data tests in production

🧪Model coverage
203%

💰Most valuable data test
“Primary key uniqueness. It's the quickest way to catch bad model design, especially around joins. Duplicate data is a good indicator of database unhealthiness or bad modeling.”

💡Advice on getting started
“Testing data isn't usually that fun, but investing time and tooling in testing pays off. Accurate data builds trust with end users and  leads to good business decisions. If you don't test your data, you risk jeopardizing both of those things with mistakes–and we all make mistakes eventually.”

🛠Testing tools
GitLab CI/CD, dbt, pytest

Travis Dunlop

Data Analyst at Bowery Farming

🚦Testing process
✅require data tests on our most important sources and models
✅run automated data tests in production

🧪Model coverage
568%

💰Most valuable data test
“Data freshness tests. Several times, a service has been down, and the data freshness check was the first thing to catch it. We are building more monitoring for these services, but dbt tests were the easiest to implement in these instances. So, they ended up being the only type of testing for awhile.”

💡Advice on getting started
“If you are feeling intimidated, start with schema tests. They are super easy to implement in dbt.”

🛠Testing tools
dbt

Ben Edwards

Technical Lead at PageUp

🚦Testing process
✅run automated data tests in production

🧪Model coverage
76%

💰Most valuable data test
“Relationship tests. Our source database does not have reliable referential integrity, our models enforce this and the tests are a way to ensure we are doing it correctly.”

🛠Testing tools
dbt

Tim Finkel

Manager, Data & Analytics at Managed by Q

🚦Testing process
✅require data tests in order to merge any new data models into production
✅run automated data tests in production

🧪Model coverage
549%

💰Most valuable data test
“Unique. Fan-outs (duplicate "primary keys") can have catastrophic consequences, leading to outages within your data warehouse when joins lead to exponential increases in row counts. We use the vanilla version of unique extensively (every dbt model must have one) and we also have a unique_where custom test for cases where we join on the first instance of a related model.”

💡Advice on getting started
"Read the blog post from a few months ago." 😁

🛠Testing tools
dbt

Ted Conbeer

SVP, Strategy at MakeSpace

🚦Testing process
✅require data tests in order to merge any new data models into production
✅run automated data tests in production

🧪Model coverage
545%

💰Most valuable data test
“Uniqueness Testing. Uniqueness is an incredibly powerful way to gain confidence that your model is what you think it is, and that your joins haven't caused fan-outs that will inflate your metrics.”

💡Advice on getting started
Start!

🛠Testing tools
dbt and dbt Cloud

dbt-Learn---World-Tour-5