Picture this: You're analyzing customer data from your e-commerce platform, and instead of neat, normalized tables, you're staring at JSON documents containing nested product information, arrays of purchase history, and complex user preferences. A few years ago, this would have meant exporting everything to Python or writing complex ETL processes. Today's SQL databases can handle this complexity directly.
Modern SQL has evolved far beyond traditional relational operations. Most major database systems now include powerful functions for parsing JSON documents, extracting array elements, and transforming semi-structured data without leaving your SQL environment. Whether you're working with API responses, NoSQL-style documents, or complex nested data structures, you can query them using familiar SQL syntax with specialized functions.
By the end of this lesson, you'll be comfortable working with JSON and arrays as first-class data types in SQL, enabling you to analyze complex, real-world datasets without the overhead of additional processing steps.
What you'll learn:
To follow along with this lesson, you should have:
The examples use PostgreSQL syntax, but we'll note differences for other major database systems where relevant.
JSON (JavaScript Object Notation) has become the standard for storing semi-structured data. Unlike traditional relational data that fits neatly into rows and columns, JSON allows for nested structures, arrays, and flexible schemas within a single field.
Here's what makes JSON powerful in modern applications:
{
"customer_id": 12345,
"profile": {
"name": "Sarah Chen",
"email": "sarah.chen@email.com",
"preferences": ["electronics", "books", "home-garden"]
},
"orders": [
{
"order_id": "ORD-2024-001",
"date": "2024-01-15",
"items": [
{"product": "Wireless Headphones", "price": 129.99, "quantity": 1},
{"product": "USB-C Cable", "price": 19.99, "quantity": 2}
],
"total": 169.97
},
{
"order_id": "ORD-2024-012",
"date": "2024-02-03",
"items": [
{"product": "Programming Book", "price": 49.99, "quantity": 1}
],
"total": 49.99
}
]
}
This single JSON document contains information that would traditionally require multiple normalized tables: customers, orders, order_items, and customer_preferences. Modern SQL databases can store this as a native JSON type and query it efficiently.
Let's start with a practical example. First, we'll create a table to store customer data:
CREATE TABLE customers (
id SERIAL PRIMARY KEY,
data JSONB -- JSONB is PostgreSQL's binary JSON type for better performance
);
Tip: PostgreSQL offers both JSON and JSONB types. JSONB is generally preferred because it supports indexing and provides better query performance, though it uses slightly more storage space.
Now let's insert our sample data:
INSERT INTO customers (data) VALUES (
'{
"customer_id": 12345,
"profile": {
"name": "Sarah Chen",
"email": "sarah.chen@email.com",
"preferences": ["electronics", "books", "home-garden"]
},
"orders": [
{
"order_id": "ORD-2024-001",
"date": "2024-01-15",
"items": [
{"product": "Wireless Headphones", "price": 129.99, "quantity": 1},
{"product": "USB-C Cable", "price": 19.99, "quantity": 2}
],
"total": 169.97
},
{
"order_id": "ORD-2024-012",
"date": "2024-02-03",
"items": [
{"product": "Programming Book", "price": 49.99, "quantity": 1}
],
"total": 49.99
}
]
}'
);
The most basic JSON operation is extracting a value from a JSON document. PostgreSQL uses the -> operator to access JSON keys and ->>' to get the result as text:
-- Get the customer name (returns JSON)
SELECT data -> 'profile' -> 'name' as customer_name_json
FROM customers;
-- Get the customer name (returns text)
SELECT data -> 'profile' ->> 'name' as customer_name
FROM customers;
-- Get the customer ID as a number
SELECT (data ->> 'customer_id')::integer as customer_id
FROM customers;
The difference between -> and ->>' is crucial:
-> returns JSON, preserving quotes and structure->>' returns text, removing JSON formattingModern SQL provides specialized functions for JSON operations. Here are the essential ones:
-- Extract specific keys into columns
SELECT
JSON_EXTRACT_PATH_TEXT(data, 'profile', 'name') as name,
JSON_EXTRACT_PATH_TEXT(data, 'profile', 'email') as email,
JSON_EXTRACT_PATH_TEXT(data, 'customer_id') as customer_id
FROM customers;
-- Check if a key exists
SELECT
data -> 'profile' ->> 'name' as name,
CASE
WHEN data -> 'profile' ? 'phone' THEN 'Has Phone'
ELSE 'No Phone'
END as phone_status
FROM customers;
-- Get all keys at the top level
SELECT jsonb_object_keys(data) as top_level_keys
FROM customers;
Database Differences: MySQL uses
JSON_EXTRACT(json_doc, path)andJSON_UNQUOTE(JSON_EXTRACT(json_doc, path)), while SQL Server usesJSON_VALUE(json_string, path)for similar operations.
Arrays within JSON documents require special handling. Let's explore how to query and manipulate them effectively.
Arrays in JSON are zero-indexed, just like in most programming languages:
-- Get the first preference
SELECT data -> 'profile' -> 'preferences' -> 0 as first_preference
FROM customers;
-- Get the second order
SELECT data -> 'orders' -> 1 as second_order
FROM customers;
-- Get the first item from the first order
SELECT data -> 'orders' -> 0 -> 'items' -> 0 as first_item_first_order
FROM customers;
Understanding array size and checking for specific values is crucial for analysis:
-- Count preferences
SELECT
data -> 'profile' ->> 'name' as customer_name,
jsonb_array_length(data -> 'profile' -> 'preferences') as num_preferences
FROM customers;
-- Check if customer likes electronics
SELECT
data -> 'profile' ->> 'name' as customer_name,
CASE
WHEN data -> 'profile' -> 'preferences' ? 'electronics'
THEN 'Electronics Fan'
ELSE 'Not Interested in Electronics'
END as electronics_interest
FROM customers;
-- Count total orders
SELECT
data -> 'profile' ->> 'name' as customer_name,
jsonb_array_length(data -> 'orders') as total_orders
FROM customers;
One of the most powerful features of modern JSON support is the ability to "unnest" or expand arrays into separate rows. This lets you apply traditional SQL operations to array elements.
The jsonb_array_elements() function transforms each array element into a separate row:
-- Expand all orders into separate rows
SELECT
data -> 'profile' ->> 'name' as customer_name,
order_data ->> 'order_id' as order_id,
order_data ->> 'date' as order_date,
(order_data ->> 'total')::decimal as order_total
FROM customers,
jsonb_array_elements(data -> 'orders') as order_data;
This query takes our single customer record and creates one row for each order, allowing us to analyze orders using standard SQL aggregations:
-- Calculate average order value
SELECT
data -> 'profile' ->> 'name' as customer_name,
AVG((order_data ->> 'total')::decimal) as avg_order_value,
COUNT(*) as total_orders
FROM customers,
jsonb_array_elements(data -> 'orders') as order_data
GROUP BY data -> 'profile' ->> 'name';
For more complex structures like items within orders, we can chain array expansions:
-- Expand all items from all orders
SELECT
data -> 'profile' ->> 'name' as customer_name,
order_data ->> 'order_id' as order_id,
item_data ->> 'product' as product_name,
(item_data ->> 'price')::decimal as price,
(item_data ->> 'quantity')::integer as quantity
FROM customers,
jsonb_array_elements(data -> 'orders') as order_data,
jsonb_array_elements(order_data -> 'items') as item_data;
This gives us a fully normalized view of every product purchased by every customer, derived entirely from our JSON structure.
Sometimes you need to filter array elements based on conditions. PostgreSQL provides powerful path-based queries:
-- Find orders over $100
SELECT
data -> 'profile' ->> 'name' as customer_name,
order_data ->> 'order_id' as order_id,
(order_data ->> 'total')::decimal as total
FROM customers,
jsonb_array_elements(data -> 'orders') as order_data
WHERE (order_data ->> 'total')::decimal > 100;
You can perform calculations across array elements:
-- Calculate total spend per customer
SELECT
data -> 'profile' ->> 'name' as customer_name,
SUM((order_data ->> 'total')::decimal) as total_lifetime_value,
COUNT(*) as order_count,
AVG((order_data ->> 'total')::decimal) as avg_order_value
FROM customers,
jsonb_array_elements(data -> 'orders') as order_data
GROUP BY data -> 'profile' ->> 'name';
-- Count unique products purchased
SELECT
data -> 'profile' ->> 'name' as customer_name,
COUNT(DISTINCT item_data ->> 'product') as unique_products
FROM customers,
jsonb_array_elements(data -> 'orders') as order_data,
jsonb_array_elements(order_data -> 'items') as item_data
GROUP BY data -> 'profile' ->> 'name';
For complex queries, JSON path expressions provide a more readable and powerful syntax. PostgreSQL supports JSONPath, a standardized query language for JSON:
-- Find all orders with items over $50
SELECT
data -> 'profile' ->> 'name' as customer_name,
jsonb_path_query_array(
data,
'$.orders[*] ? (@.items[*].price > 50)'
) as expensive_orders;
-- Extract all product names
SELECT
data -> 'profile' ->> 'name' as customer_name,
jsonb_path_query_array(
data,
'$.orders[*].items[*].product'
) as all_products;
-- Find customers who bought electronics (based on product names)
SELECT
data -> 'profile' ->> 'name' as customer_name
FROM customers
WHERE jsonb_path_exists(
data,
'$.orders[*].items[*] ? (@.product like_regex "(?i)electronic|headphone|cable")'
);
Understanding JSONPath: The
$represents the root,[*]means all array elements, and?()applies a filter condition. The@symbol refers to the current element being tested.
Modern SQL also supports updating JSON documents in place:
-- Add a new preference
UPDATE customers
SET data = jsonb_set(
data,
'{profile,preferences}',
data -> 'profile' -> 'preferences' || '"sports"'::jsonb
);
-- Add a phone number to profile
UPDATE customers
SET data = jsonb_set(
data,
'{profile,phone}',
'"555-0123"'
);
-- Update customer email
UPDATE customers
SET data = jsonb_set(
data,
'{profile,email}',
'"sarah.chen.updated@email.com"'
);
Working with JSON and arrays can be computationally expensive. Here are key optimization strategies:
PostgreSQL allows you to create indexes on JSON paths:
-- Index on customer email for fast lookups
CREATE INDEX idx_customer_email
ON customers USING gin ((data -> 'profile' ->> 'email'));
-- Index on customer preferences array
CREATE INDEX idx_customer_preferences
ON customers USING gin ((data -> 'profile' -> 'preferences'));
-- Functional index for order totals
CREATE INDEX idx_order_totals
ON customers USING btree (
(jsonb_path_query_first(data, '$.orders[*].total')::text::decimal)
);
ALTER TABLE customers
ADD COLUMN customer_name TEXT GENERATED ALWAYS AS (data -> 'profile' ->> 'name') STORED;
CREATE INDEX idx_customer_name ON customers (customer_name);
Use appropriate operators:
-> for intermediate navigation, ->>' only for final text extraction? for key existence checks rather than extracting and comparing to NULLLimit array expansions:
-- Good: Filter before expansion
SELECT customer_name, item_data ->> 'product'
FROM customers,
jsonb_array_elements(data -> 'orders') as order_data,
jsonb_array_elements(order_data -> 'items') as item_data
WHERE (order_data ->> 'total')::decimal > 100;
-- Better: Use path expressions to filter first
SELECT
data -> 'profile' ->> 'name',
jsonb_path_query_array(data, '$.orders[*] ? (@.total > 100).items[*].product')
FROM customers;
Let's practice with a more complex dataset. You'll work with social media analytics data containing user posts with nested engagement metrics.
First, create and populate the exercise table:
CREATE TABLE social_posts (
id SERIAL PRIMARY KEY,
post_data JSONB
);
INSERT INTO social_posts (post_data) VALUES
('{
"post_id": "post_001",
"user": {
"id": "user_sarah",
"name": "Sarah Chen",
"followers": 1250
},
"content": {
"text": "Excited about the new SQL features!",
"hashtags": ["#SQL", "#database", "#tech"],
"mentions": ["@techcompany", "@sqlexpert"]
},
"engagement": {
"likes": 45,
"comments": [
{"user": "john_dev", "text": "Great insights!", "timestamp": "2024-01-15T10:30:00Z"},
{"user": "data_analyst", "text": "Thanks for sharing", "timestamp": "2024-01-15T11:15:00Z"},
{"user": "sarah_sql", "text": "Very helpful", "timestamp": "2024-01-15T14:20:00Z"}
],
"shares": 12
},
"metrics": {
"impressions": 2300,
"click_through_rate": 0.045,
"engagement_rate": 0.025
}
}'),
('{
"post_id": "post_002",
"user": {
"id": "user_mike",
"name": "Mike Rodriguez",
"followers": 890
},
"content": {
"text": "JSON queries are powerful!",
"hashtags": ["#JSON", "#queries", "#data"],
"mentions": ["@database_pro"]
},
"engagement": {
"likes": 23,
"comments": [
{"user": "json_fan", "text": "Absolutely!", "timestamp": "2024-01-16T09:45:00Z"},
{"user": "query_master", "text": "Love this approach", "timestamp": "2024-01-16T13:30:00Z"}
],
"shares": 5
},
"metrics": {
"impressions": 1100,
"click_through_rate": 0.032,
"engagement_rate": 0.028
}
}');
Now complete these tasks:
Task 1: Extract basic post information
-- Your query should return: post_id, user_name, follower_count, like_count
Task 2: Find posts with high engagement
-- Find posts where engagement_rate > 0.026
-- Include post_id, user_name, and engagement_rate
Task 3: Analyze hashtag usage
-- Expand all hashtags and count their frequency
-- Show hashtag and usage_count, ordered by popularity
Task 4: Comment analysis
-- Show each comment as a separate row
-- Include post_id, commenter_username, comment_text, and comment_timestamp
Task 5: Calculate engagement metrics
-- For each user, calculate:
-- - total_posts, total_likes, total_comments, avg_engagement_rate
Here are the solutions to practice with:
-- Task 1: Basic post information
SELECT
post_data ->> 'post_id' as post_id,
post_data -> 'user' ->> 'name' as user_name,
(post_data -> 'user' ->> 'followers')::integer as follower_count,
(post_data -> 'engagement' ->> 'likes')::integer as like_count
FROM social_posts;
-- Task 2: High engagement posts
SELECT
post_data ->> 'post_id' as post_id,
post_data -> 'user' ->> 'name' as user_name,
(post_data -> 'metrics' ->> 'engagement_rate')::decimal as engagement_rate
FROM social_posts
WHERE (post_data -> 'metrics' ->> 'engagement_rate')::decimal > 0.026;
-- Task 3: Hashtag analysis
SELECT
hashtag_data ->> 0 as hashtag,
COUNT(*) as usage_count
FROM social_posts,
jsonb_array_elements(post_data -> 'content' -> 'hashtags') as hashtag_data
GROUP BY hashtag_data ->> 0
ORDER BY usage_count DESC;
-- Task 4: Comment expansion
SELECT
post_data ->> 'post_id' as post_id,
comment_data ->> 'user' as commenter_username,
comment_data ->> 'text' as comment_text,
comment_data ->> 'timestamp' as comment_timestamp
FROM social_posts,
jsonb_array_elements(post_data -> 'engagement' -> 'comments') as comment_data;
-- Task 5: User engagement metrics
SELECT
post_data -> 'user' ->> 'name' as user_name,
COUNT(*) as total_posts,
SUM((post_data -> 'engagement' ->> 'likes')::integer) as total_likes,
SUM(jsonb_array_length(post_data -> 'engagement' -> 'comments')) as total_comments,
AVG((post_data -> 'metrics' ->> 'engagement_rate')::decimal) as avg_engagement_rate
FROM social_posts
GROUP BY post_data -> 'user' ->> 'name';
Problem: Getting text when you need numbers
-- Wrong: This returns text, not a number
SELECT data -> 'orders' -> 0 ->> 'total' + 100 FROM customers;
-- Error: operator does not exist: text + integer
-- Right: Cast to appropriate type
SELECT (data -> 'orders' -> 0 ->> 'total')::decimal + 100 FROM customers;
Problem: Accessing array elements that don't exist
-- Dangerous: What if there's no third order?
SELECT data -> 'orders' -> 2 ->> 'total' FROM customers;
-- Returns NULL, which might not be what you want
-- Safer: Check array length first
SELECT
CASE
WHEN jsonb_array_length(data -> 'orders') > 2
THEN data -> 'orders' -> 2 ->> 'total'
ELSE 'No third order'
END as third_order_total
FROM customers;
Problem: Incorrect JSONPath syntax
-- Wrong: Using dot notation incorrectly
SELECT jsonb_path_query(data, '.orders[0].total') FROM customers;
-- Error: syntax error in jsonpath
-- Right: Start with $ for root
SELECT jsonb_path_query(data, '$.orders[0].total') FROM customers;
Problem: Expanding huge arrays without filtering
-- Problematic with large datasets
SELECT * FROM customers, jsonb_array_elements(data -> 'orders') as order_data;
-- Better: Add WHERE conditions to limit results
SELECT *
FROM customers,
jsonb_array_elements(data -> 'orders') as order_data
WHERE (order_data ->> 'total')::decimal > 50
LIMIT 100;
Problem: Not accounting for missing JSON keys
-- This fails if 'phone' key doesn't exist
SELECT (data -> 'profile' ->> 'phone') FROM customers;
-- Better: Use COALESCE for defaults
SELECT COALESCE(data -> 'profile' ->> 'phone', 'No phone') as phone
FROM customers;
Debugging Tip: When working with complex JSON paths, build your query step by step. Start with simple key access, then add nesting and array operations incrementally.
You've learned to work with JSON and arrays as native SQL data types, transforming how you handle complex, nested data structures. You can now extract values from JSON documents, expand arrays into queryable rows, perform aggregations on semi-structured data, and optimize performance through proper indexing.
Key takeaways:
-> for JSON navigation and ->>' for text extractionjsonb_array_elements() transforms arrays into rows for SQL operations Next steps to expand your skills:
jsonb_agg() and jsonb_object_agg() for creating JSON from relational dataModern SQL's JSON capabilities bridge the gap between relational and document databases, giving you the flexibility to handle diverse data formats while maintaining the power and familiarity of SQL. This foundation prepares you for increasingly complex data scenarios in modern applications.
Learning Path: Advanced SQL Queries