You just started your dream job – you are the first data scientist at a meal kit startup. The engineers connected all the data sources to the warehouse, the pipeline runs twice a day, and there's so much fun work to be done, from forecasting meal kit sales to A/B testing landing pages.
But before you can focus on these “narrow-and-deep” problems, your colleagues need you to build some basic dashboards so that they can see the “wide-and-shallow” (check out this phenomenal article from the Harvard Business Review about the distinction).
Finance wants to know monthly recurring revenue before they need your forecasts.
Operations wants to report on average transit time.
Marketing wants to establish a baseline for conversion rate.
Quicker than you can say triple exponential smoothing, network optimization, and multi-arm bandit you find yourself thinking about how to clean and model the data so you can build those dashboards. What’s the most optimal way to do this? Where should you begin?
I believe in three things:
- starting with a swimlane
- building a data mart
- using Kimball’s conformed dimensional modeling approach
These are highly personal preferences–I like the old-school name "data mart" because it brings up images of happy slushie customers at a Kwik-e-Mart. I am also decidedly partial to Kimball, and others share my preference, Shopify hands a copy of Kimball’s book to every new data hire! But Kimball is not the only modeling approach–the dbt community has compiled a good list of the other notable data modeling schools out there. What matters is that you have a repeatable approach for how you build your data models. Here's mine...
🏊 Start with a swimlane
I start with mapping out the customer lifecycle on a swimlane diagram because, in the words of Lean Analytics,
...metrics follow a natural, logical order. Consider the customer’s lifecycle: the company acquires a user through viral or paid marketing. Hopefully, that user continues to use the service, and eventually pays for a subscription. The user invites others and, and perhaps upgrades to a higher tier. As a customer, she may have issues. In the end, she stops using the service – at which point, we know how much revenue she contributed to the business. Describing the customer lifecycle in this way is a good method for understanding the key metrics that drive your business.
What is a swimlane? It is a type of flowchart that helps you see one customer’s journey (their flow across the company) and understand how each department views the slice of the journey they touch. It has vertical lanes representing each discipline and customers move from one end to another horizontally across those vertical lanes.
In the case of our imaginary meal kit startup, we have a marketing website where users can subscribe. Our marketing team is thinking a great deal about the acquisition funnel, conversion, and A/B tests. Once users become subscribers, finance wants to calculate monthly recurring revenue (MRR) and other SaaS metrics. Each week, as orders ship out, the operations team monitors things like the ratio of on-time shipments to all shipments and average transit times.
Why a swimlane? Clarifying each discrete step in the process helps me familiarize myself with our product offerings and other relevant elements of the business plan canvas. It also helps me find the various stakeholders – my internal customers – and involve them in the process of dimensional modeling. The diagram can help break down an otherwise overwhelming process into discrete steps and serve as my roadmap. And finally, it helps me work towards a conformed dimensional model (more on this below).
🏗 Add height to create the data mart
Once I have a swimlane draft with the customer moving left to right and crossing each vertical department, I add some horizontal lanes. For each discipline, I go through the following steps:
- List the key metrics – what KPIs are in use? You can flesh them out in a neat spreadsheet following this Data School tutorial or you can use the BEAM example data table to capture similar information.
- Identify the base units – what are we measuring with these metrics? What is an observable measurable interaction event in this lane of the customer lifecycle? Subscriptions? Orders? Pageviews?
- Find the data sources – where do these interactions between your customers and the company happen? Where does the data come from? This step may ferret out many gotchas involved in the data cleaning process.
- Add any fact tables you may need beyond the base units – how do we aggregate these subscriptions? If our KPI is monthly recurring revenue (MRR), do we want to compare this number on a day-to-day or an hourly basis? Fact tables should be aggregated at their most granular form to give us flexibility creating derived tables downstream.
- Ask your stakeholders for acceptance criteria – when I build this table, how will I know that it is correct? What is the North Star you navigate by? Do you currently look at recurring revenue reports in the datasource (e.g. Stripe), a third-party tool (e.g. ChartMogul), or in an internal spreadsheet? What is the current solution this dashboard is going to replace?
📋 List your conformed dimensions on the side
Now that I know what metrics are used by what departments and I have a list of fact tables, data sources, and acceptance criteria, I can identify the dimensions – what else do we care about? When we describe these facts in more detail, what do we want to know about the “who, what, where, when, why, and how”? How do we want to slice and dice the data? I like to work with conformed dimensions, meaning dimension tables that are reused across all fact tables and departments.
What’s the advantage of a conformed dimensional model? It helps answer questions that cross disciplines.
- Wouldn’t it be nice if we included
customer_idfor every identifiable
anonymous_idwebsite visitor so marketing could see which sites refer customers with the highest lifetime value?
- If operations could see on-time shipment rates for different MRR tiers?
Conformed dimensions help align disciplines and will prevent the kind of metric knife fights described (and illustrated!) so aptly in Chapter 4 of The Analytics Setup Guidebook.
But perhaps most importantly, it helps you write and maintain fewer tables. As many people pointed out, storing data is cheaper now than humans spending hours thinking about data models. I still like to think of this problem as having a New York apartment vs. a Montana ranch. Yes, with the latter, I have a lot more space and don’t have to spend a lot of time thinking about what I keep and what I throw out, but in a year or two, will I be able to find what I am looking for?
Once you reach that point and you need to bring order to your Montana ranch, dimensional modeling will help break the work ahead of you into manageable chunks, with the star schema as its well-defined unit of delivery.
Principles to follow:
- Define your metrics (and commit them to writing).
- Work with your stakeholders to find the base units of their metrics and primary data sources.
- Group your fact tables by functional areas, name them, and use the same words in your transformation layer and Looker models.
- Use conformed dimensions to align metrics and definitions.
- Define the acceptance criteria.