Once you’ve mastered the basics—SQL, Excel and your business intelligence tool—it can be really hard to figure out how to increase analytics skills. For many analysts this can lead to frustrating periods where you know you need to level up, but aren’t sure exactly what to focus on next.
If you’re a hotshot junior analyst and you want a guide for breaking through this skills plateau, this guide is for you.
If you’re anything like me, when things get tough, your first step is to double down on the thing that made you successful — cranking out more and more analysis. The result?
You start to feel overwhelmed as requests pile up on top of each other. Your data flows break and your dashboards run slowly. It starts to feels like you’re running as fast as you can on a treadmill that just keeps going faster.
You have arrived at the stage in your career where more analysis no longer equals better analyst. You need to figure out how to scale yourself.
I’ve been lucky enough to have a world class mentor, Tristan Handy of Fishtown Analytics, who has guided me through this process. In this article, I’ve collected the most impactful wisdom I’ve learned over the past year to create a roadmap and actionable tips for any talented junior analyst who wants to take their skills to the next level.
Becoming an expert analyst is tough, because you need both technical competence and business sense, and the highest level you rise to will be determined by whichever of these you are worse at. As my old boss Mike Duda always says — you can’t just write great code or be an expert communicator: you have to be able to do both.
You need to think like the CTO and the COO, then explain yourself to both of them. Luckily, there are some tips, tactics and strategies which can help you down that path, In this guide, you’ll learn to write SQL like an analytics champ, organize your data structure so you don’t get lost in it, and explain the importance of the work you’re doing to everyone in your organization.
This guide is split into three parts:
- World Class SQL
- Professional Data Modeling Workflow
- Building a Data Driven Company Culture
This is by no means an exhaustive list of things you can do to improve. But by taking these steps, you can be sure you’re moving in the right direction for your career and for your analytics team.
World Class SQL
Use Common Table Expressions for extremely readable SQL
Have you ever written a complex query with multiple subqueries and tried to go back a few months later and figure out what it does? If you’re anything like me, you just sat there staring a your monitor trying to make sense of a quadruple nested subquery and hating yourself.
You can avoid this by making one simple change: use common table expressions.
The single most important way to ensure you are writing readable SQL is to use Common Table Expressions (CTEs) instead of subqueries.* CTEs allow you to define a number of temporary tables at the beginning of your SQL statement instead of defining your subqueries in the middle of your query. Sounds like a small change, but it makes your SQL infinitely more readable.
It takes literally half an hour maximum to start using CTEs.
Instead ofselect *
left join (foo) as subquery1
You write:with cte1 as(foo)select * from table1
left join cte1
In a simple query like this, it’s hard to spot the benefit of using CTEs. The benefits shine through when you’re writing a complex query with many layers of nested subqueries. While you can always get what you need with nested subqueries, they have two huge drawbacks which CTE’s improve upon.
- Subqueries are inflexible. If you realize you need to change something deep in a nested subquery it can mean hours of refactoring. Changing a CTE is much less likely to break your query and even if it does, debugging is simpler.
- Readability. Anything more than two or three subqueries makes a SQL statement a nightmare for anyone who hasn’t written it to understand (or for you to come back to six months later to explain why you the number you did for a certain metric).
Both of these benefits stem from the fact that while subqueries are by nature nested within a larger, more complex statement, CTEs are predefined at the beginning of your query and atomized in nature.
There is no easier way to improve the quality of your SQL code and your value as an analyst than to ditch the subqueries and switch to CTEs.
*check out this guide more info on CTEs.
Learn Window functions and ditch the spreadsheets
“Pretty much anything you can do in Excel, you can do with a window function.” — Tristan Handy
I will admit — window functions can be scary. With all the rows unbounded preceding and the partition by X over Y it can be tough to get a sense of how to write a window function. But once you try them out, they aren’t actually very difficult. Also, they are really freaking cool.
Window functions allow you to treat rows and columns in a SQL query like rows on a spreadsheet. With window functions you can:
- Pull in the results from the row above, letting you keep a running total of orders from a customer over time
- Create a trailing 7 day average of visitors to a certain page on your site
- Calculate median values without having to leave your SQL editor
One of the only difficulties with window functions is that they are so flexible it’s hard to know where to start with them. Running totals, first_value and last_value are all great entry points into the world of window functions.
Once you’ve got those down, its easy to know when you should check and see if a window function is applicable. Every time you are thinking about using Excel for a piece of an analysis, check and make sure that there aren’t any window functions that could do it instead.
The value in window functions compared to Excel is that once you’ve created a window function for a specific metric, you can use it consistently across analyses without manually recalculating the values every time. For performance metrics you track over time, like customer health or retention, this is a big time saver.
Use Aggregate Case Statements for Easy Summaries of Data
Very often when writing a query you’ll find yourself wanting to know an aggregate such as a sum or a count, not for the entire dataset you are selecting, but for a subset of it. Maybe it’s a count of all trial sign up users from Ireland or total revenue from paid advertising customers.
The way you’re probably doing that today is using a where clause to to fetch just the records that you’re trying to calculate. Something like:select count(distinct customer_id) as ireland_trials
where country = ‘Ireland’
This works fine when you’re only looking to find aggregate values off of a specific table and only need that information within the query. But what if instead you wanted a single query that returned total customers, customers from Ireland, and customers all of from Europe?
Since you getting trials by country and trials by continent are based off of separate fields, you can’t just use a simple
group by statement. This can start to get messy and can easily devolve into a gross nest of subqueries (I attempted to write an example of how this query could go wrong and got so frustrated I gave up. Seriously, save yourself the trouble).
Using case statements in your aggregates makes this a trivial task. You simply write a normal aggregate function:sum(revenue)
But you use a case statement to make sure you’re only summing the revenue you want to see:sum(case when lead_source = 'paid' then revenue else null end) as paid_revenue
else null part of this statement that does the work, when the query runs revenue will be added only if lead source is paid, otherwise it will be ignored.
Let’s look at another example. Here’s what it would look like to write a query for all customers, customers from Ireland, and customers from Turkey.select count(distinct customer_id) as trials,
count(distinct case when country = ‘Ireland’ then customer_id else null end) as ireland_trials,
count(distinct case when continent = ‘Europe’ then customer_id else null end) as europe_trials
Easy as that you’re able to conditionally aggregate across multiple columns. Each
case statement is creating a new column that’s null if it doesn’t match the conditions you’re trying to count, then counts up all matching values at execution.
Check out this guide for more info on CASE statements and examples of aggregate case statements.
Don’t forget that this works not just for count, but for sum, average, and many other aggregate functions. Any time you’re writing multiple subqueries to try and grab conditional numbers like this, you’ll have a much easier time if you do it with aggregate case statements.
Simplify joins with md5 ID’s
When you first start as an analyst, odds are most of your data is being piped in directly from your product and will have unique keys associated with each table. As time goes on though and you start pulling in other data sources, such as event data or marketing spend, you might find yourself operating in columns without a unique key.
Not only is this bad practice, it actively harms your ability to maintain data consistency, tie data sources together, and even makes using some BI tools totally impossible.
I first encountered this function when trying to join two tables together using about eight separate fields. Not ideal.
The natural inclination is to create your own ID by simply concatenating a bunch of fields together. Doing this leaves you with strange Frankenstein IDs that look something like
google12122015retargeting3954. These columns are bad because they kind of look like data but operate as an ID. It’s important to have a column whose sole function is to be a unique identifier for that row. IDs that are obviously IDs reduce confusion among junior analyst and end users by removing semi-comprehensible data strings throughout your database.
Enter the md5 function.
All it does is take your Frankenstein field and cryptographically hash it into alphanumeric gibberish. Now there is no more fuzziness about what your ID field is for or what it does. It identifies a row, that’s it. Plain and simple.
Professional Data Modeling Workflow
Use Git for version control and code reviews
The first rule of an expert analyst is to always backup your code. Version control has innumerable benefits for your organization.
First, and most importantly, being that you won’t lose all of your analytics code.
Beyond that, the benefit I’ve seen from implementing version control comes from code reviews and increased collaboration. Code reviews are your first line of defense for spotting bugs or mistaken logic. Trust me, it’s much better to have one of your errors called out at a code review than at a presentation to your CEO.
It’s so much more than just bug fixes though — using git makes you a better analyst. If you have a team of analysts working on separate issues, it can be easy to lose the thread of where the data model as a whole is progressing. By performing code reviews and merge requests within git you ensure that more members of your team are exposed to the way your analytics are built and architected.
Systematize your data transformation
One day while you try to manage eighteen cron jobs and augment them with five manual csv loads and four separate automatic data connections (two of which are broken) you might realize it’s time to get your data transformation process in order. This will either happen as a forward looking strategic initiative (better) or because your Macgyvered together process is slowly beginning to crush your soul (more likely).
Instead of a hodge-podge of different processes to transform your data, you need one central, consistent data transformation tool that allows you to tame the monster that is your data transformation process.
We use dbt (Data Build Tool), an open source solution that provides a simple but immensely powerful toolset for you to define your data transformations. (Full disclosure: dbt was created by Fishtown Analytics and Tristan Handy, who is a mentor of mine, but I have no financial stake in dbt and am only mentioning it because it’s made me a better analyst.)
Dbt allows you to super easily maintain, persist and reference views and derived tables. It’s far easier than managing your own scripting and more performant and flexible than using a BI tool like Looker.
It works like this — you define a dbt model, which are basically SQL queries that you can persist either as tables or views. From there, it is very simple to reference other models, so you can naturally and organically build queries that run on top of other queries. This makes it simple to define a new metric, build it once, and then integrate it into any other query you need to run.
No more digging through old queries trying to figure out how you defined daily active users that one time. Instead, simply build the metric in dbt and reference / update it as needed.
This is one of the highest effort ideas in this guide, but it is also one of the highest impact. Whether or not you go with dbt, the idea is the same — create a strategy for managing your data transformations so that they are modular, reusable, and automated.
Find your abstraction layers
Many hotheaded junior analysts (myself included) find themselves with a thorny mess of tables which require complex joins and many tables which shouldn’t rightly be joined at all. You might think you can keep all of the complexities in your head forever, but if even if you somehow pull that off, you’ll eventually need to train others on your system.
The easiest way to create a data model that is comprehensible to anyone other than the person that built it is to create a system of abstraction layers.
For us, it’s a simple breakdown of companies, users, and events. Every table fits into one of these abstraction layers. This makes it very easy to know how to build reports and self service BI tools for your team around the different layers. For example, we use Looker explores at the company, user, and event layers. Since most of our reports are on the company level, it’s much easier for us to simply showcase this part of the data model.
Create fact tables
Of all the tips within this guide, nothing will be a bigger immediate quality of life improvement for you and your team than the creation of facts tables. Facts tables roll up all of the information that you have defined in different tables in your data model into a singular ‘facts’ table that contains everything you need to know about a given customer, product, user, or whatever else you have facts about.
Facts tables dramatically reduce the number of tables you have to deal with when writing analytical queries. Instead of having to go to 5 or 10 different tables, you can instead work directly off of the facts table. This makes writer queries easier, reduces the possibilities of making a mistake, and decreases the database load by significantly reducing the number of joins you have to make.
Creating a facts table is simple. For example, our companies table is built simply by selecting all of our company IDs from our companies table, and left joining in a number of tables which have information about companies. Creating a facts table should not take you more than a few hours and you will begin to see the time savings almost instantly.
Building a Data Driven Company Culture
Build trust in your numbers
If people don’t believe the metrics you send them are correct, then every other thing you do is a waste of time.
Building trust in your numbers is hard. A single error on your part could do real harm to your credibility. What’s potentially worse is that there are also a whole host of things that can damage your credibility even if everything you do is technically correct.
While you spend your entire day looking at and attempting to understand these numbers, everyone else is doing a different job. What will often happen is someone will see a particular metric change across different reports without having the proper context to understand why.
So when you report that your ARPA (Average Revenue per Account) is $1,000 and someone interrupts you saying that they thought the ARPA was $1,200 based off of another report you did, they aren’t necessarily trying to be difficult. They’re trying to wrap their head around the metrics without as much time as you have to look at the data.
So you explain to them that the $1,200 number was based off of a subsection of your leads or a different date cohort. Issues like this are unavoidable, but they can be reduced.
First, and most importantly, is to have a clear, understandable, and definitive definition for each metric. You will constantly see people trying to define one metric in multiple different ways and you need to be ruthless about insisting that there is one definition per metric.
Second is to always show the date ranges for the data you are presenting. One of the most common issues we see causing confusion is two people who have reports run off of different date ranges.
Third is to provide as much context as you can when reporting numbers without becoming burdensome. If people understand that there are many different views, contexts, and applications of the metrics you’ve clearly defined, then there is less of a chance they will doubt the validity of the numbers when they see something that doesn’t seem to add up, and more of a chance they’ll come to you and ask for clarification.
There is no silver bullet here, but by having clear definitions and necessary context, over time you will begin to gain the organizational trust you need to build a data driven company.
Proactively monitor issues
There are two ways to find out that there is something wrong in your reporting process — you can find it yourself or someone can tell you because they found something wrong. Obviously it is preferable to discover issues yourself.
At its core, there are three ways your reports can go wrong:
- Your ETL process can break.
- You can build a report incorrectly.
- The underlying business logic can change.
For most analysts, a breakdown in the ETL process is probably the most frequent of these issues. The way to fix this is to have automated tests to ensure accurate data flows, and pair this with manual checks. I start every day looking at my data integrity report, which is a very simple listing of how many new rows have been added to my most important tables. It’s not fancy, but it catches 9 out of 10 ETL issues without becoming not a huge burden.
Whatever your method, find a way to quickly ensure your data flows are accurate.
For improperly built reports, there is no silver bullet. Following the SQL best practices and data model tips listed above will get you most of the way there. Beyond that, you will begin to develop an intuitive sense of where your numbers should be (remember I told you that you need both business and technical sense). Learn to listen to the little voice in your head that tells you something isn’t right, then dig in.
Errors due to changes in business logic are next to impossible to catch by yourself. Maybe the sales team started changing how they store things in Salesforce. Maybe marketing put in a new URL tagging structure for their ad campaigns. These sort of issues are tough. The important thing is to have a key ally on any team that feeds you intel. Drill into their heads that any time they make important changes, they need to let you know so you can update the data model. You won’t get it 100% of the time, but you can minimize these errors.
Monitoring issues takes a lot of hard work, but once the pieces of this system begin to fit in place, your life becomes far easier.
Balance Governance, Ad Hoc Tasks, and Strategic Objectives
One of the more difficult tasks as your company grows is balancing data governance, ad hoc tasks and strategic objectives. Each of these is critical and it takes a lot of work to make sure you aren’t letting any of them slack. But before you can do that, you need a strong understanding of what each these actually means.
- Data Governance: This is the stuff you do to keep the lights on and data flowing through all your systems. This can mean debugging an ETL error, optimizing your Redshift cluster or performing manual csv uploads. Data governance is the kind of work that no one notices when you do well, but EVERYONE notices when you don’t.
- Ad Hoc Tasks: These are the smaller requests you need to knock out day by day to get people the numbers they need. Could be a small, one off report, an update to a dashboard or a simple Excel analysis. These tend to not be particularly difficult, but if you aren’t careful they can eat up all of your time.
- Strategic Objectives: The projects you actually want to be working on. Could be a new dashboard for a team that hasn’t been getting as much attention as you’d like, incorporating a new data source into your data model or a comprehensive deep dive into a question the business needs to answer. This is where your team can have the most leverage and highest impact.
On a growing data team at a scaling company, odds are you are going to have to be data engineer, project manager, analyst, spreadsheet troubleshooter, and just about anything else you can think of.
If you slack on the data governance, nothing else matters because all your numbers will be wrong and your reports won’t run. If you let the ad hoc requests pile up, then people will get frustrated and begin to doubt the efficiency of your team. But if you always focus on governance and ad hoc requests, then accomplishing your long term strategic goals is off the table.
There’s no single right answer here, but there are some principles which can help alleviate this:
- Following best practices in data management like those listed in this guide. This will help you get your data in order so hopefully governance can be managed in a few hours per week.
- Make a determination of what constitutes an ad hoc request (for us it is anything which doesn’t require us to alter underlying data model). Then create a process and SLA around fulfilling ad hoc requests, as well as an understanding of what happens to ad hoc requests that are a more significant time burden.
Schedule time every week to focus on strategic goals. For us, it’s a few hours every Friday where we work on long term initiatives. Sometimes this ends up being the only time all week we get to make progress in these areas, but by making a firm commitment to do this weekly, you’ll be shocked by just how much better you get at hitting your long term targets.
Educate, educate, educate
No matter how awesome you are, you can’t build a data-driven culture alone.
In order to help make your organization data-driven, you’ll have to educate an ever-expanding number of people. But not everyone needs to know the exact same things and you can’t drill the minutiae of database intricacies into your exec team. So how do you manage to get everyone exactly what they need to know?
There are three main types of education you need to master before you can call yourself an expert analyst: mentoring your data team, empowering your end users and enlightening your executives.
You’ll spend most of your educating time mentoring your team. For these people your job is to take their innate curiosity and help them use it to find answers. On a practical level this means lots and lots of questions to answer. Be patient (remember how long it took you to get this stuff!), but also avoid immediately showing them the answer as soon as they ask. Rather, point your junior analyst in the right direction and let them figure it out themselves.
Your end users are the people embedded in different departments who are going to be consuming your reports. The important thing for these people is to make sure they understand the metrics and information contained in your reports and have a sense of the nuances contained. You want them to be able to spot trends, notice irregularities, and point out mistakes you may have made. The difficulty with these people is that although they tend to be analytical, they don’t have nearly as much time to spend on reviewing metrics as you do.
With these folks, it’s best to focus on the business applications of the individual reports you make for them. The best way to do this is to have a number of review sessions after creating a report where you view it together with them and point out interesting factors. That not only helps your end user gain a sense of how the report works, but helps you spot things you may have missed.
The most difficult, but ultimately most important educational job you have is for the executive team. They are less likely to care about individual reports and more interested in the overall story of the business and the insights which metrics can bring them.
At the end of the day, the success or failure of your analytics team will hinge upon your ability to educate your team, your stakeholders and your executives. It’s a tough balancing act at the best of times and even more difficult when you are trying to decipher an arcane SQL script you wrote a year ago or your ETL process has ground to a halt.
When everything fits into place and your team is firing on all cylinders then being an analyst is exhilarating and rewarding. You can become one of the highest leverage individuals at your company, finding new ways to look at data and grow the business. I hope the steps laid out in this guide will present some small insight on your journey to becoming a master data analyst.
Best of luck.
Huge thanks to Tristan Handy and Mike Duda for being incredible mentors, Aniket Tambay being an awesome analytics teammate, Janessa Lantz and Jason Peck for providing feedback on this guide and to James Quigley for building a data driven company.