Step 1: Switch to Snowflake or Bigquery.
I joke, I joke. While Snowflake and Bigquery do have much more sensible approaches to dealing with semi-structured data, my guess is that you probably don’t have the luxury of making that switch in your current frenetic search to figure out how to deal with that nasty JSON array living in the varchar(max) field you’re staring at. Sometimes we have to use the tools we have!
What makes Redshift’s lack of an unnest, or flatten, function all the more frustrating is that Amazon’s other columnar SQL products, Athena and Spectrum, both have the ability to deal with arrays natively. Why no love for Redshift?
Enough griping. Let’s write some SQL.
Setting up the environment
Imagine we have an ecommerce database where the orders table, with one record per order, contains a nested array of items: the individual products purchased in a given order. Here’s the setup data so that you can run it yourself:
Simple, right? Plenty for what we need to do. Let’s see what we can do with it.
The easy stuff
Here’s something that’s easy to do: grab the contents of the items array out of the JSON object:
select order_id, json_extract_path_text(json_text, 'items', true ) as items from flatten_test
This uses one of Redshift’s core JSON functions, json_extract_path_text. Look at the docs, they’re good. Here’s your result:
It’s hard to see in the narrow table above, but order_id = 1 actually contains the entire JSON array with two nested objects.
So far we’ve just pulled out a single item of the object, items. What else can we do? Let’s try grabbing the first item out of the array:
select order_id, json_extract_array_element_text( json_extract_path_text(json_text, 'items', true ), 0, true ) from flatten_test
We’ve added a json_extract_array_element_text (docs) call on top of the previous query to get the first element of the array (which is zero-indexed). This returns the following table:
This time we only get back the first element of the order_id = 1 record.
Finally, we’re able to get the total number of array items in the items array:
select order_id, json_array_length( json_extract_path_text(json_text, 'items', true ) , true) as number_of_items from flatten_test
This uses another critical Redshift JSON-parsing tool, json_array_length (docs). That query returns this rather humble result:
Ok all of this is great. We’ve been able to grab a specific key out of an object, we’ve been able to grab a particular item out of an array, and we’ve been able to count the items in an array, and it’s all been pretty easy (if a bit verbose!). But we haven’t actually flattened the table. It turns out, that’s rather harder.
Putting it all together: pulling off the unnest
In order to actually execute our unnest, we’re going to have to fan out the number of rows in the source data: we don’t want one record per order, we want one record per item. In order to get there we’ll cross join against a numbers table.
Generating a numbers table is a whole topic unto itself. While Redshift does now have a generate_series function, that function doesn’t seem to allow its outputs to be used by other downstream SQL functions, and the results can’t be saved into a table either. If you’re using dbt to write this SQL (which I highly recommend), you can use the dbt-utils generate_series() macro to build yourself numbers table. If you’re not using dbt, I leave it as an exercise for the reader to create a table called numbers with a single column called ordinal. It should have 10 rows; the numbers 0 to 9.
Once we have our numbers table, we need to cross join to it and then parse out the results. Here’s the entire SQL:
And here’s the resulting table:
I can’t emphasize enough how annoying this pattern is. Writing the same SQL on Snowflake or Bigquery feels idiomatic: you simply use the flatten function on Snowflake (docs) or the unnest function on Bigquery (docs). Both platforms support this type of nested data in a first-class way, and it significantly improves the experience of data analysts.
If you are on Redshift, you should prefer ingestion processes that automatically normalize any JSON data that they run into for Redshift destinations (we’re fans of Stitch and Fivetran). But if you do find yourself with a JSON array to deal with on Redshift, this is a very solvable problem.
Once you’ve transformed your raw array into a clean dataset, make sure you save that result as a table or view back into your database using dbt. You don’t want your fellow analysts to go through the same headache that you did, right?