I get these questions all the time:

  • “Should I make a new explore or add more joins to an existing one?”
  • “Which table should I start my explore with?”
  • “Which explore should I use to build this Look?”

Looker created an extremely powerful tool with LookML. And while understanding the basics of building explores is easy — “just join together your views!” — building well-designed explores is the thing we see people get wrong most often when writing LookML. This can lead to poor query performance, inconsistent data, and a confusing UX for your users, and it’s a frequent reason we’re called in to help clean up an existing Looker instance.

It’s not surprising that this happens, either. Understanding how to design your explores requires a reasonably sophisticated understanding of Looker’s query model — how it actually writes the SQL for you—and your database’s query optimizer. This stuff isn’t obvious when you first get up and running with the tool.

The thing is, once you build an explore, build dashboards on top of it, share those dashboards with users, and have users start building their own dashboards…well, it’s very challenging to walk back from a set of poor initial design decisions. Making material changes to an existing explore will likely break many or all of the artifacts built on top of it. Once you push that new explore to production, it may very well be there to stay.

This post contains my heuristics for how to design good explores. These are guidelines I’ve found work well in practice working with dozens of Looker instances. Some of these I’ve had to come by after much frustration and keyboard-banging, so hopefully they help you avoid the same 😊

What makes a good explore

There are three things you should be optimizing for when building your explores:

  1. Correctness
    Do looks built on top of this explore return correct data? This is the very lowest bar for design, as Looker makes it fairly easy to get correct results as long as your join conditions are correct.
  2. Performance
    While there may be multiple ways to get Looker to write a query, there will typically be one way that has the optimal performance. Designing your explores with performance in mind can be the difference between 15 second and 15 minute dashboard load times.
  3. Usability
    When you write LookML, you’re a software engineer and you deliver your users a software application. And when you design software, it not only has to work, it has to be usable. When a user who is not an expert in your data loads your explore, it should be immediately clear how to get the answers they’re looking for.

My guidelines for building good explores

Make multiple small explores instead of a single big one

I mention this first because it seems like this is every first-time Looker user’s instinct: join everything together!

It’s true, it’s possible to join 20 views together in a single explore and hit Run. I think of that approach as the “cowboy” approach — I literally imagine a guy with a hat and spurs on a horse shooting a revolver into the air and saying “Yeeeee-haw! Git ‘er done!” This approach does the job done quickly but creates huge problems down the road.

Massive explores like this perform abysmally because of the way that you’re forcing Looker to write the underlying query. Let’s look at a simple example. Imagine the following tables:

customers
>> orders (via customer_id)
>>>> order_items (via order_id)

And these two explores:

# explore 1
explore: customers {...}

# explore 2
explore: order_items {
  join: orders {...}
  join: customers {...}
}

Let’s say the analysis we actually wanted to produce was a time series of new customers by their created_at. It’s possible to get that via either of the above explores, but in Explore 2 the query that Looker writes will actually start off selecting from order_items, then join to orders and finally customers, where it can finally get the relevant measure and dimension. As a result, there are as many records in the query’s intermediate results as their are records in order_items, even though we only really needed the customers data to get what we needed.

If what you want to do is run a query across 1mm rows, building a 100mm row intermediate results set is a bad thing. You’d prefer to keep your query lightweight, and to do that, Explore 1 works better.

In the above case, the performance implications probably aren’t the end of the world. But, if you join across 20 tables to get an answer that you really only needed two tables for, your intermediate query results have the potential to be truly massive, sometimes exceeding the ability of your database to process. This is the most common culprit of dashboard slowness across all Looker customers.

Instead, keep your explores small. Design each explore to answer a specific set of questions. Single-view explores are fine! One of my most commonly-used explores is explore: snowplow_sessions {}. That’s it. You can do so much with just that.

2–3 views joined together is my sweet spot, and 5–6 is my absolute upper limit. If you find yourself needing more, you probably need to be modeling the underlying data more effectively.

Aggressively curate field lists

When you push an explore to production, users throughout your company have the ability to poke and prod it. You can bet they’re going to try their best to break it and then come back to you and say “When I did this the answer didn’t make sense! I don’t trust the data.”

You want users to have predictable, positive interactions with the explores you give them.

When you build software, every feature is “surface area” — stuff that can break. The more features, the more surface area, the more opportunities for bugs. Simple applications work better because there are fewer things to break.

In your case, every single dimension and measure is a feature. If you have an explore that joins five views and each view has 20 fields, there are 100 factorial ways for users to combine them together! That is a lot of ways to break something. And trust me, combining random fields from different explores together really can result in unintended outcomes — I’ve had this cause a hiccup on client calls more times than I care to admit.

To avoid this, you should specify the fields for every view you join (using fields) and keep the list tight. If you’re making an orders explore and just need to join to customers to grab a customer segment, only include that field. This will feel strange at first, but you’ll slowly start to realize how much better the feedback from your users is. A big part of software design is hiding complexity from users so they don’t have to think about it — that’s your job! So pre-design their experiences to guide them down reasonable paths.

And remember: it’s easy to add fields later but it’s almost impossible to take them away once present since you might be breaking existing looks. Start with as few as possible and add as needed.

Don’t use full outer joins

Imagine you’re analyzing data about advertising spend. Your tables (which have been heavily modeled in dbt or another transformation pipeline outside of Looker) are:

ad_spend
>> sessions (via a composite key made up of utm parameters)
>>>> trial_signups (via user_id)

Joining these three tables together allows you to see how advertising spend eventually flows through to trial signups and allows you to calculate CAC, one of the most important metrics in marketing analytics. But these tables also have the following property: many ads don’t result in trials, and many trials don’t come from ads. This makes it challenging to create an appropriate explore — what should you use as your base view? Either place you start you’re going to be leaving out some records.

Many analysts think about this question from a purely technical perspective and say “just choose one and then full outer join the others!” This is tempting, but it doesn’t work well in practice. There are two primary reasons:

  • It causes aggregates to act in strange ways that are hard for users to reason about. It’s easy to think about analyzing ad performance, but then when the count of trials includes ones that aren’t attributed to ads, then what exactly are we analyzing? Remember, users don’t typically understand the underlying data; you have to make exploration intuitive.
  • It allows users to form questions that don’t make sense. What’s the advertising CAC for leads that came in from non-advertising sources? That’s a non-sensical question, but this explore allows users to ask it.

Instead, create two explores. One should use ad_spend as its base and the other should use trial_signups as its base. Both explores might join in the exact same views, but they should be left joined instead of full outer joined.

With this strategy, each explore has a clear scope: one is to analyze everything related to the performance of an ad campaign (including how many trials it generated) and the other is to analyze everything related to trial signups (including attribution information). Because of the joins, sensible data will show in each. Use the separate explores to further curate the user experience by only keeping the fields that are relevant; in our example we’d want to only keep CAC in the ad_spend explore.

Don’t use many-to-many joins

Looker will let you join together two tables on a key that isn’t unique in either table. This is a bad idea. Don’t do it. It can be absolutely crushing from a performance perspective.

The most common reason that this comes up is when you want to join two tables together on a date. Sometimes you just don’t have granular data to work with and the best join key you have between two different data sets is the date that something happened on. Sometimes this happens with ad spend analysis when companies haven’t implemented sophisticated web analytics yet. Returning to the example above, instead of joining from spend to sessions to trials, we would instead of this:

ad_spend
>> trial_signups (via date)

Using this strategy, you can still get a daily blended CAC, you just don’t know exactly which campaign drove a particular trial. This is totally legit business logic; the problem here is the many-to-many join. Looker will write the query like this:

select [whatever]
from ad_spend
join trials on ad_spend.date = trials.date

This creates a truly massive fanout in your intermediate result set. How many campaigns do you run on a given day? 20? 200? How many signups do you get on a single day? 200? 2,000? Each row will cross-join with every other row for that day. You can easily find this query generating half a million rows of data for every single day where data exists on both sides.

If you were to write this query by hand, you’d pre-aggregate one of the sides of the join to eliminate the fanout like this:

with ad_spend_aggregated as (
  select date, sum(spend) as spend from ad_spend group by 1
)

select [whatever]
from ad_spend_aggregated
join trials on ad_spend_aggregated.date = trials.date

But Looker doesn’t write queries like that. It joins everything together first and then aggregates afterwards. That works fine as long as one of the sides of the join is unique and you don’t get cartesian products, but not in many-to-many situations. That’s an easy way to get your query to hang forever.

If you find yourself in this situation, you have to pre-aggregate one side of the join before joining it into your explore. Write a quick derived table. This is a bit annoying in practice because it forces you to pre-aggregate in ways that can be hard to anticipate, but it’s a necessary evil if you want your queries to actually return.

What have you found works best?

I’d love to learn from you — leave a response below!