Version control is a critical part of the software engineering workflow—it’s impossible to produce high quality code by an even moderately-sized team without a strong code review processes in place. Fortunately, software engineering teams have decades of experience in figuring out best practices, and cultural norms are at least reasonably consistent throughout the software industry. In fact, whole posts have been written about what a great pull request looks like.
But like so many of the norms of software engineering, good pull request habits haven’t been broadly adopted by people writing analytics code. Many of dbt’s early adopters have some software engineering background, but increasingly, the community includes data analysts (like me!) who had never made a pull request before using dbt. Analysts participating in the analytics engineering workflow are being asked to catch up fast.
An important element of reviewing PRs for these analysts is providing feedback to help them become better analytics engineers. As the Director of Consulting at Fishtown Analytics, I review 5-10 pull requests a day for code in dbt projects. And, if it is a good day, I tend to submit at least 1 pull request myself.
Here are the things I look for when sitting down to review an analytics pull request.
✅ Represents one logical unit of work
We’ve all been there–you just have a few quick changes to make and you already have a kind-of related branch open. So...just pop a few more changes in, write a rambling description, and submit the pull request. It’s fast, easy, but it comes with big downsides:
- understanding multiple unrelated changes is an enormous pain for your reviewer,
- rolling back changes becomes much more challenging,
- and any delays in getting part of the PR approved end up delaying the entire batch of changes.
A good pull request should solve a specific thing. While this might take a bit more git work—creating new branches and switching between branches if you context switch—this route pays off in increased PR throughput and less reviewer frustration.
Watch out for a PR that is trying to do too much. If it is, help the submitter think about how to break it up into more modular components.
✅ Description provides context
The submitter should spend the time to write up a good description of the thing they are solving. What’s the rationale for this change, is it written in plain language? This will set the context for your review. You should be able to read the description and know what to expect before even looking at the code. If you can’t, ask for more detail from the submitter!
Writing a good description is best practice even if you are a team of one. My early PRs had one or two sentences and made it super hard for me to go back and understand why I did certain things. Good descriptions help your future self.
⚡️Tip: I keep the description of the PR open on one screen and have the files changed open on another. This allows me to refer back to the notes written about the changes so that I don’t ask a redundant question in my review that the submitter has already provided details on. See the example below (and yes, I’m going to get to PR templates in a bit):
✅ Continuous integration tests have completed successfully
You should have a CI environment set up to test pull requests. Your eyes can review all of the items in this checklist, but if a project has 150 models, you can’t trust your feeble human brain to catch all errors.
When a PR is opened, the full project should be run and tested, with the status reported back into the PR interface. dbt Cloud has functionality to do this with Github (with more services planned for the future). Getting a green check that the PR ran and passed the CI is the best fail safe to ensure there aren’t breaking changes.
✅ Code adheres to the project’s style guide
If you don’t have an internal style guide, create one! When I review a PR that has inconsistent formatting, it takes longer for my brain to read and understand what the code is trying to accomplish.
A style guide may feel tyrannical—it’s as if you’re saying “write code my way!”–but it’s critical to share common coding style when a code base is maintained by a team. If there are no shared conventions, every single contributor will be annoyed at everyone else’s wonky conventions. It’s for this reason that many style guides have become common in the software engineering industry: check out Python’s PEP-8 for one great (and super-detailed) example.
If code in the PR is outside of the style guide, add a note in your review and link to the style guide to have it updated.
✅ Every new model has a test
I scan the list of files changed and make sure that every completely new model created in the pull request has a corresponding test for at least uniqueness and that the unique key is not null.
There are instances where the above may not be necessary, but I err on the side of testing every model and flagging to the analyst that they haven’t tested something. If their response for why they didn’t is sufficient, I will remove the review note.
If you need help convincing analysts on your team that data testing is important, share this article with them.
✅ Documentation is added
If the change is doing something complex, adding documentation in the project is helpful for the same reasons the pull request description is important – the submitter will know more about the change now, when they are making it, than they will a year from now. Gotchas and intricacies of the data should be explained when the transformation logic is still fresh in the mind.
✅ Code is DRY
The hardest part of reviewing another analysts’ work is understanding the problem the person is trying to solve and ensuring that they way they structured their solution is actually the best way to solve the problem.
This is challenging! The solution presented may provide the correct results but it is your goal as the reviewer to make recommendations for the long term health of the project, as well. Note: you will often not get this 100% correct the first time (#refactoring) but the below are helpful things to keep in mind:
1.) Can the analyst rewrite the model in a more DRY way?
- Does the analyst repeat code?
- Does the work duplicate logic that already exists? (this requires a pretty broad understanding of the entire project)
2.) Can the DAG be modified so that fields can be defined earlier and then cascaded more seamlessly?
3.) Should parts of the code by pulled in macros? Or, can macros from dbt-utils be utilized?
4.) In a complex model, can CTEs be broken up into models of their own? This is helpful if:
- The piece of the query can be used elsewhere in the project
- The piece of the query should be tested to avoid future errors
✅ Data returns expected results
For more junior analysts, I do a more thorough code check and actually query the models built in the CI environment to familiarize myself with the work. This will help me think through if there is a more efficient way to solve the problem.
- If I have a feeling the work they did altered the grain of the data, or that a join wasn’t correct and fanned out the data, I query for uniqueness on the unique key defined on the test for the model.
- If the change is supposed to refactor or replace something that already exists, I query the prod version and the dev version and make sure that the general results are the same (row counts, fields, etc)
One of the many benefits of running dbt in a CI environment is being able to query the production data and see how it compares to the data built by the changes in the PR. Take advantage of this when reviewing!
For more senior analysts, I spend less time querying their data but still query from time to time to make sure things work as they should!
✅ Models are materialized appropriately
If a model is built to be queried in the business intelligence layer, the model should be materialized as a table. This ensures good performance for the user. Some of this is also going to be warehouse specific, for example, if your data warehouse is Redshift, models materialized as tables should also have sort and dist keys defined.
✅ Naming is clear and consistent
Naming. Is. Hard.
It’s also critically important–earning a spot as one of the five principles of good data warehouse design. Getting naming right ensures that the submitter, along with other analysts on the team, will be able to quickly incorporate changes into their analysis.
While reviewing a pull request, I always ask three questions about naming:
- Do model names appropriately represent the grain of the model? For example, if have a table called
quickbooks_invoices, it’s important that there is one record in that table per invoice, not per invoice line item (a recent catch!). Getting model naming granularity correct saves a tremendous amount of confusion for all subsequent users of that model.
- Do the new or changed model names follow the conventions of the project? For example, we have specific internal conventions around model prefixes such as
- Are newly created fields in models straightforward?
number_of_times_user_completed_action_prior_to_churningis really descriptive, but do you want to write that out from now until eternity when querying this?! It is easiest to rename now, when reviewing the PR instead of six months from now when a lot work is built on top of that field.
⚡Tip: Create a pull request template in your project
Pull requests are submitted at the end of a whole lot of work, we’re excited to ship! And remembering all of the above can be difficult. The best thing you can do to level up the quality of pull requests you’re getting from your team is to create a pull request template that you use for all analytics code.
Create a pull request template today and you will save your future self a whole lot of time reviewing future requests.
If you have your own advice to share on how to write and review analytics PRs, or just better manage the code review process, jump over to Discourse and share your thoughts. Sharing what works for you makes the whole community better 🤓