I love group by 1. And every so often you'll see me use group by 1, 2. My love for this little snippet of code developed in the summer of 2016. I was a fledgling Β data analyst, and I was writing BI queries on Redshift.

Here were my two options:

select
  date_trunc('month', order_created_at) as order_month,

  count(order_id) as n_orders,
  sum(revenue) as total_revenue

from core.fct_orders

group by date_trunc('month', order_created_at)
not_group_by_1.sql
select
  date_trunc('month', order_created_at) as order_month,

  count(order_id) as n_orders,
  sum(revenue) as total_revenue

from core.fct_orders

group by 1
group_by_1.sql

Already, we can start to see the problem of not using group by 1: I have to repeat my code.

What if I wanted to convert the timezone? Here are my two options again:

select
  date_trunc(
    'month',
    convert_timezone('Australia/Sydney', created_at)
  ) as order_month,
    
  count(order_id) as n_orders,
  sum(revenue) as total_revenue

from core.fct_orders

group by date_trunc('month', convert_timezone('Australia/Sydney', order_created_at))
not_group_by_1_even_more_complicated.sql
select
  date_trunc(
    'month',
    convert_timezone('Australia/Sydney', created_at)
  ) as order_month,
    
  count(order_id) as n_orders,
  sum(revenue) as total_revenue

from core.fct_orders

group by 1
still_just_group_by_1.sql

When you're intent on writing out your group by columns, as the complexity of that first column increases, it gets harder to maintain and read the group by statement.

In my opinion, the second query is cleaner and easier to read. If I want to know what grain my report is at, it's easy for me to parse the 1, jump up a couple of lines of SQL and realize it's at the grain of order_month.

Plus, as someone that writes a lot of SQL, I very much appreciate the fact that 1 is exactly one character.

In late 2018, Redshift introduced lateral column aliasing, meaning I could instead group by order_month, surely I'd switch to that!

-- post August 2018 option, thanks to lateral column aliasing
select
  date_trunc(
    'month',
    convert_timezone('Australia/Sydney', created_at)
  ) as order_month,
  
  count(order_id) as n_orders,
  sum(revenue) as total_revenue

from core.fct_orders

group by order_month
group_by_order_month.sql

I admit, that's pretty darn readable! And readability trumps DRY code. But yet, I still continue to group by 1. In fact I have become even more ardent in my love of it. So much so that I may or may not own this keytag:

Various Keytags are actually the easiest way to win Christmas.

So why do I still group by 1? Well, there's a few reasons:

  • Not every warehouse that I regularly use supports lateral column aliasing (I'm looking at you BigQuery and Postgres πŸ‘€).
  • If you happen to alias a column with the name of an existing column, your SQL engine will assume you are referring to the existing column in your group by clause (it's a rare case, but happens from time to time when you write enough SQL).
  • 1 is a single character (and I'm lazy, OK?!).
  • I believe group by 1 helps you to write better SQL.

Let's dive into that last one a little more.

In my opinion, every reason against group by 1 literally doesn't matter if you write good SQL. In fact, using group by 1 can even help you recognize when your SQL has room for improvement. So let's go through some of the most common objections to group by 1. But first, let's change up our example to something that highlights this point a little better.

These days, as an analytics engineer, it's much less common that I'm writing a query to aggregate to a reporting period β€” I let my BI tool do that aggregation for me. Instead, I am often writing queries to build dbt models, where I group by a foreign key to calculate some facts about an entity. Here's an example:

select
  order_id,

  count(payment_id) as n_payments,
  sum(case when payment_method = 'credit_card' then amount else 0 end) as credit_card_amount
  sum(amount) as total_amount

from staging.payments

group by 1
order_payments.sql

Let's use this as the basis of our examples for the rest of the article as we wade through the most common objections to group by 1.

"But then people write group by 3, 4, 7, which is awful." β€” a luddite

Agreed! That is awful. But in my opinion, the columns you are grouping by should always be listed first in your query.

In this case, group by 3, 4, 7 would help you recognize that you haven't followed this convention. If you had instead grouped by a column name, you might not realize that the order of your columns can be improved.

"But how will I know the grain of my model? 1 doesn't tell me that!" β€” probably a flat-earther

I always use a line break to visually break up the columns I'm grouping by, and the aggregates (see the examples above). So my brain uses that to understand the grain.

Further, I also tend to put aggregating queries in a CTE, and use the CTE names to also call out the grain of the model, like so:

with payments as (
  select * from staging.payments
),

-- this model is at the grain of one per order
-- you can tell, because it starts with `order_`
order_payments as (
  select
    order_id,

    count(payment_id) as n_payments,
    sum(case when payment_method = 'credit_card' then amount else 0 end) as credit_card_amount
    sum(amount) as total_amount

  from payments

  group by 1
)

select * from order_payments
order_payments.sql

Side note: This "all logic in a CTE" pattern is one we use a lot β€” it's especially helpful in queries with lots of CTEs, as it helps us chain CTEs together, and is quicker to debug. If you're not yet using it, give it a go and you'll see what we mean.

"But group by 1 ends up in group by 1, 2, 3, 4, 5, 6, 7, which is hard to read!" β€” actually my mortal enemy

I mean, is it really any easier to read group by order_id, customer_id, order_date, status, ...?

But why is this happening in the first place? Why is a query grouping by so many columns? In my experience, there's two reasons why people group by more than one or two columns:

  1. The query is both joining and aggregating
  2. The query is using a group by to deduplicate

And neither of them are actually very good patterns!

1. The query is both joining and aggregating

Consider this SQL, where some fields on fct_orders are being calculated from Β payments :

with orders as (
  select * from staging.orders
),

payments as (
  select * from staging.payments
),

joined as (
  select
    orders.order_id,
    orders.customer_id,
    orders.order_date,
    orders.status,

    count(payments.payment_id) as n_payments,
    sum(case when payments.payment_method = 'credit_card' then payments.amount else 0 end) as credit_card_amount
    sum(payments.amount) as total_amount

  from orders

  left join payments using (order_id)

  group by 1, 2, 3, 4

)

select * from joined
fct_orders.sql

I mean it looks tidy enough! But, it's hard to extend. What happens if you want to re-order the columns and put your dates at the end? Suddenly, you're breaking the convention of having the columns you want to group by listed first. What happens when you have more columns, or want to join another table to this? It gets really hard to manage the complexity.

Here, there's actually a better pattern β€” group first in a CTE, then join it to your table.

with orders as (
  select * from staging.orders
),

payments as (
  select * from staging.payments
),

order_payments as (
  select
    order_id,
  
    count(payment_id) as n_payments,
    sum(case when payment_method = 'credit_card' then amount else 0 end) as credit_card_amount
    sum(amount) as total_amount
  
  from payments
  
  group by 1
),

joined as (
  select
    orders.order_id,
    orders.customer_id,
    orders.order_date,
    orders.status,

    order_payments.n_payments,
    order_payments.credit_card_amount,
    order_payments.total_amount
  
  from orders
  left join order_payments using (order_id)
)

select * from joined
fct_orders.sql

It's a little longer, but much more maintainable, and in my opinion, easier to understand what's going on.

(Bonus: this pattern is also far more performant!)

(Bonus bonus: if you're using dbt, consider breaking this CTE out into a separate model so you can test that order_payments only has one record per order!)

2. The query is using a group by to deduplicate

Consider this group by that's being used to deduplicate source data to be one record per order_id.

with orders as (
  select * from staging.orders
),

deduplicated as (
  select
    order_id,
    customer_id,
    status,

    max(order_date) as order_date
  
  from orders

  group by 1, 2, 3
)

select * from deduplicated
stg_orders.sql

OK, sure that's not so bad. But what if your table has a dozen columns? That's a lot of columns to group by. And what if the status changed values? You'd end up with two records for the same order_id, so you won't have effectively deduplicated!

Generally, group by to deduplicate is a pretty limited pattern. Here are two other patterns we prefer β€” heads up: they both use window functions.

Option 1: Create a row_number and use a where

This row_number pattern is really useful for wide tables as it extends for extra columns really well!

with orders as (
  select * from staging.orders
),

orders_with_row_number as (
  select
    *,

    row_number() over (
      partition by order_id
      order by order_date desc
    ) as row_number
  
  from orders
),

deduplicated as (
  select * from orders_with_row_number
  where row_number = 1
)

select * from deduplicated
stg_orders.sql

Option 2: Use distinct with window functions

This pattern is a little tricky, but it can come in really handy for narrow tables, like creating a two column table to map one user identifier to another (the example below is still for the orders table though). It's one I only learned recently!

with orders as (
  select * from staging.orders
),

deduplicated as (
  select distinct
    order_id,

    last_value(customer_id) over (
      partition by order_id
      order by order_date
      rows between unbounded preceding and unbounded following
    ) as customer_id,

    max(order_date) over (
      partition by order_id
    ) as order_date

  from orders
)

select * from deduplicated
stg_orders.sql

We deviated a little from talking about group by here, but it's important to note that in my opinion, the concern about grouping by too many columns is more a symptom of a poorly structured query than a problem with the group by 1 syntax. As a rough guess, I'd say that 95% of my group by statements are group by 1, and most of the rest are group by 1, 2. It's very rare that I group by more than that.

Final thoughts

OK, I've intentionally written this to ruffle a few feathers, and I'm fully aware that some SQL dialects don't support group by 1 (e.g. mssql).

Opinions can change β€” for example, I used to be very pro-UPPER-CASE-FOR-SQL-KEYWORDS
, and even
, pro-leading commas.

In writing all of this out, I think I've even softened my opinion on group by column_name a little (heck, maybe it is better? When the from and group by are next to each other, maybe you can parse it a little more quickly. After all, brain time is the most expensive resource of all! Though, it is fewer characters...)

At the end of the day, be opinionated in your SQL style, put it in a git repo and stick with it consistently. Here's ours, and here's another one I love β€” yes the latter recommends group by column_name and I still love it.

In all honesty, I can't wait for the day that someone builds a really good SQL linter based on a definable set of conventions, so that no matter which conventions you adopt, it's the job of a computer (and not me) to rant about breaking them.

{% endrant %}