Working with JSON and Arrays in Modern SQL

Picture this: You're analyzing customer data from your e-commerce platform, and instead of neat, normalized tables, you're staring at JSON documents containing nested product information, arrays of purchase history, and complex user preferences. A few years ago, this would have meant exporting everything to Python or writing complex ETL processes. Today's SQL databases can handle this complexity directly.

Modern SQL has evolved far beyond traditional relational operations. Most major database systems now include powerful functions for parsing JSON documents, extracting array elements, and transforming semi-structured data without leaving your SQL environment. Whether you're working with API responses, NoSQL-style documents, or complex nested data structures, you can query them using familiar SQL syntax with specialized functions.

By the end of this lesson, you'll be comfortable working with JSON and arrays as first-class data types in SQL, enabling you to analyze complex, real-world datasets without the overhead of additional processing steps.

What you'll learn:

How to store and query JSON data directly in SQL databases
Essential JSON functions for extracting values, arrays, and nested objects
Array manipulation techniques including filtering, aggregation, and transformation
How to combine JSON/array operations with traditional SQL queries
Best practices for performance when working with semi-structured data

Prerequisites

To follow along with this lesson, you should have:

Basic SQL knowledge (SELECT, WHERE, JOIN operations)
Access to a modern SQL database (PostgreSQL 12+, MySQL 8.0+, SQL Server 2016+, or similar)
Understanding of JSON structure (objects, arrays, key-value pairs)

The examples use PostgreSQL syntax, but we'll note differences for other major database systems where relevant.

Understanding JSON as a SQL Data Type

JSON (JavaScript Object Notation) has become the standard for storing semi-structured data. Unlike traditional relational data that fits neatly into rows and columns, JSON allows for nested structures, arrays, and flexible schemas within a single field.

Here's what makes JSON powerful in modern applications:

{
  "customer_id": 12345,
  "profile": {
    "name": "Sarah Chen",
    "email": "sarah.chen@email.com",
    "preferences": ["electronics", "books", "home-garden"]
  },
  "orders": [
    {
      "order_id": "ORD-2024-001",
      "date": "2024-01-15",
      "items": [
        {"product": "Wireless Headphones", "price": 129.99, "quantity": 1},
        {"product": "USB-C Cable", "price": 19.99, "quantity": 2}
      ],
      "total": 169.97
    },
    {
      "order_id": "ORD-2024-012", 
      "date": "2024-02-03",
      "items": [
        {"product": "Programming Book", "price": 49.99, "quantity": 1}
      ],
      "total": 49.99
    }
  ]
}

This single JSON document contains information that would traditionally require multiple normalized tables: customers, orders, order_items, and customer_preferences. Modern SQL databases can store this as a native JSON type and query it efficiently.

Let's start with a practical example. First, we'll create a table to store customer data:

CREATE TABLE customers (
    id SERIAL PRIMARY KEY,
    data JSONB  -- JSONB is PostgreSQL's binary JSON type for better performance
);

Tip: PostgreSQL offers both JSON and JSONB types. JSONB is generally preferred because it supports indexing and provides better query performance, though it uses slightly more storage space.

Now let's insert our sample data:

INSERT INTO customers (data) VALUES (
'{
  "customer_id": 12345,
  "profile": {
    "name": "Sarah Chen",
    "email": "sarah.chen@email.com", 
    "preferences": ["electronics", "books", "home-garden"]
  },
  "orders": [
    {
      "order_id": "ORD-2024-001",
      "date": "2024-01-15",
      "items": [
        {"product": "Wireless Headphones", "price": 129.99, "quantity": 1},
        {"product": "USB-C Cable", "price": 19.99, "quantity": 2}
      ],
      "total": 169.97
    },
    {
      "order_id": "ORD-2024-012",
      "date": "2024-02-03", 
      "items": [
        {"product": "Programming Book", "price": 49.99, "quantity": 1}
      ],
      "total": 49.99
    }
  ]
}'
);

Essential JSON Query Operations

Extracting Simple Values

The most basic JSON operation is extracting a value from a JSON document. PostgreSQL uses the -> operator to access JSON keys and ->>' to get the result as text:

-- Get the customer name (returns JSON)
SELECT data -> 'profile' -> 'name' as customer_name_json
FROM customers;

-- Get the customer name (returns text)  
SELECT data -> 'profile' ->> 'name' as customer_name
FROM customers;

-- Get the customer ID as a number
SELECT (data ->> 'customer_id')::integer as customer_id
FROM customers;

The difference between -> and ->>' is crucial:

-> returns JSON, preserving quotes and structure
->>' returns text, removing JSON formatting

Working with JSON Functions

Modern SQL provides specialized functions for JSON operations. Here are the essential ones:

-- Extract specific keys into columns
SELECT 
    JSON_EXTRACT_PATH_TEXT(data, 'profile', 'name') as name,
    JSON_EXTRACT_PATH_TEXT(data, 'profile', 'email') as email,
    JSON_EXTRACT_PATH_TEXT(data, 'customer_id') as customer_id
FROM customers;

-- Check if a key exists
SELECT 
    data -> 'profile' ->> 'name' as name,
    CASE 
        WHEN data -> 'profile' ? 'phone' THEN 'Has Phone'
        ELSE 'No Phone'
    END as phone_status
FROM customers;

-- Get all keys at the top level
SELECT jsonb_object_keys(data) as top_level_keys
FROM customers;

Database Differences: MySQL uses JSON_EXTRACT(json_doc, path) and JSON_UNQUOTE(JSON_EXTRACT(json_doc, path)), while SQL Server uses JSON_VALUE(json_string, path) for similar operations.

Working with JSON Arrays

Arrays within JSON documents require special handling. Let's explore how to query and manipulate them effectively.

Accessing Array Elements

Arrays in JSON are zero-indexed, just like in most programming languages:

-- Get the first preference
SELECT data -> 'profile' -> 'preferences' -> 0 as first_preference
FROM customers;

-- Get the second order
SELECT data -> 'orders' -> 1 as second_order  
FROM customers;

-- Get the first item from the first order
SELECT data -> 'orders' -> 0 -> 'items' -> 0 as first_item_first_order
FROM customers;

Array Length and Membership

Understanding array size and checking for specific values is crucial for analysis:

-- Count preferences
SELECT 
    data -> 'profile' ->> 'name' as customer_name,
    jsonb_array_length(data -> 'profile' -> 'preferences') as num_preferences
FROM customers;

-- Check if customer likes electronics
SELECT 
    data -> 'profile' ->> 'name' as customer_name,
    CASE 
        WHEN data -> 'profile' -> 'preferences' ? 'electronics' 
        THEN 'Electronics Fan'
        ELSE 'Not Interested in Electronics'
    END as electronics_interest
FROM customers;

-- Count total orders
SELECT 
    data -> 'profile' ->> 'name' as customer_name,
    jsonb_array_length(data -> 'orders') as total_orders
FROM customers;

Expanding Arrays into Rows

One of the most powerful features of modern JSON support is the ability to "unnest" or expand arrays into separate rows. This lets you apply traditional SQL operations to array elements.

Using JSONB_ARRAY_ELEMENTS

The jsonb_array_elements() function transforms each array element into a separate row:

-- Expand all orders into separate rows
SELECT 
    data -> 'profile' ->> 'name' as customer_name,
    order_data ->> 'order_id' as order_id,
    order_data ->> 'date' as order_date,
    (order_data ->> 'total')::decimal as order_total
FROM customers,
jsonb_array_elements(data -> 'orders') as order_data;

This query takes our single customer record and creates one row for each order, allowing us to analyze orders using standard SQL aggregations:

-- Calculate average order value
SELECT 
    data -> 'profile' ->> 'name' as customer_name,
    AVG((order_data ->> 'total')::decimal) as avg_order_value,
    COUNT(*) as total_orders
FROM customers,
jsonb_array_elements(data -> 'orders') as order_data
GROUP BY data -> 'profile' ->> 'name';

Expanding Nested Arrays

For more complex structures like items within orders, we can chain array expansions:

-- Expand all items from all orders
SELECT 
    data -> 'profile' ->> 'name' as customer_name,
    order_data ->> 'order_id' as order_id,
    item_data ->> 'product' as product_name,
    (item_data ->> 'price')::decimal as price,
    (item_data ->> 'quantity')::integer as quantity
FROM customers,
jsonb_array_elements(data -> 'orders') as order_data,
jsonb_array_elements(order_data -> 'items') as item_data;

This gives us a fully normalized view of every product purchased by every customer, derived entirely from our JSON structure.

Advanced Array Operations

Filtering Arrays

Sometimes you need to filter array elements based on conditions. PostgreSQL provides powerful path-based queries:

-- Find orders over $100
SELECT 
    data -> 'profile' ->> 'name' as customer_name,
    order_data ->> 'order_id' as order_id,
    (order_data ->> 'total')::decimal as total
FROM customers,
jsonb_array_elements(data -> 'orders') as order_data
WHERE (order_data ->> 'total')::decimal > 100;

Aggregating Array Contents

You can perform calculations across array elements:

-- Calculate total spend per customer
SELECT 
    data -> 'profile' ->> 'name' as customer_name,
    SUM((order_data ->> 'total')::decimal) as total_lifetime_value,
    COUNT(*) as order_count,
    AVG((order_data ->> 'total')::decimal) as avg_order_value
FROM customers,
jsonb_array_elements(data -> 'orders') as order_data
GROUP BY data -> 'profile' ->> 'name';

-- Count unique products purchased
SELECT 
    data -> 'profile' ->> 'name' as customer_name,
    COUNT(DISTINCT item_data ->> 'product') as unique_products
FROM customers,
jsonb_array_elements(data -> 'orders') as order_data,
jsonb_array_elements(order_data -> 'items') as item_data
GROUP BY data -> 'profile' ->> 'name';

JSON Path Expressions

For complex queries, JSON path expressions provide a more readable and powerful syntax. PostgreSQL supports JSONPath, a standardized query language for JSON:

-- Find all orders with items over $50
SELECT 
    data -> 'profile' ->> 'name' as customer_name,
    jsonb_path_query_array(
        data, 
        '$.orders[*] ? (@.items[*].price > 50)'
    ) as expensive_orders;

-- Extract all product names
SELECT 
    data -> 'profile' ->> 'name' as customer_name,
    jsonb_path_query_array(
        data, 
        '$.orders[*].items[*].product'
    ) as all_products;

-- Find customers who bought electronics (based on product names)
SELECT 
    data -> 'profile' ->> 'name' as customer_name
FROM customers
WHERE jsonb_path_exists(
    data, 
    '$.orders[*].items[*] ? (@.product like_regex "(?i)electronic|headphone|cable")'
);

Understanding JSONPath: The $ represents the root, [*] means all array elements, and ?() applies a filter condition. The @ symbol refers to the current element being tested.

Modifying JSON Data

Modern SQL also supports updating JSON documents in place:

-- Add a new preference
UPDATE customers 
SET data = jsonb_set(
    data, 
    '{profile,preferences}', 
    data -> 'profile' -> 'preferences' || '"sports"'::jsonb
);

-- Add a phone number to profile
UPDATE customers
SET data = jsonb_set(
    data,
    '{profile,phone}',
    '"555-0123"'
);

-- Update customer email
UPDATE customers
SET data = jsonb_set(
    data,
    '{profile,email}',
    '"sarah.chen.updated@email.com"'
);

Performance Considerations

Working with JSON and arrays can be computationally expensive. Here are key optimization strategies:

Indexing JSON Data

PostgreSQL allows you to create indexes on JSON paths:

-- Index on customer email for fast lookups
CREATE INDEX idx_customer_email 
ON customers USING gin ((data -> 'profile' ->> 'email'));

-- Index on customer preferences array
CREATE INDEX idx_customer_preferences 
ON customers USING gin ((data -> 'profile' -> 'preferences'));

-- Functional index for order totals
CREATE INDEX idx_order_totals 
ON customers USING btree (
    (jsonb_path_query_first(data, '$.orders[*].total')::text::decimal)
);

Query Optimization Tips

Extract frequently-used values into regular columns:

ALTER TABLE customers 
ADD COLUMN customer_name TEXT GENERATED ALWAYS AS (data -> 'profile' ->> 'name') STORED;

CREATE INDEX idx_customer_name ON customers (customer_name);

Use appropriate operators:
- Use -> for intermediate navigation, ->>' only for final text extraction
- Use ? for key existence checks rather than extracting and comparing to NULL
Limit array expansions:

-- Good: Filter before expansion
SELECT customer_name, item_data ->> 'product'
FROM customers,
jsonb_array_elements(data -> 'orders') as order_data,
jsonb_array_elements(order_data -> 'items') as item_data
WHERE (order_data ->> 'total')::decimal > 100;

-- Better: Use path expressions to filter first
SELECT 
    data -> 'profile' ->> 'name',
    jsonb_path_query_array(data, '$.orders[*] ? (@.total > 100).items[*].product')
FROM customers;

Hands-On Exercise

Let's practice with a more complex dataset. You'll work with social media analytics data containing user posts with nested engagement metrics.

First, create and populate the exercise table:

CREATE TABLE social_posts (
    id SERIAL PRIMARY KEY,
    post_data JSONB
);

INSERT INTO social_posts (post_data) VALUES 
('{
  "post_id": "post_001",
  "user": {
    "id": "user_sarah",
    "name": "Sarah Chen", 
    "followers": 1250
  },
  "content": {
    "text": "Excited about the new SQL features!",
    "hashtags": ["#SQL", "#database", "#tech"],
    "mentions": ["@techcompany", "@sqlexpert"]
  },
  "engagement": {
    "likes": 45,
    "comments": [
      {"user": "john_dev", "text": "Great insights!", "timestamp": "2024-01-15T10:30:00Z"},
      {"user": "data_analyst", "text": "Thanks for sharing", "timestamp": "2024-01-15T11:15:00Z"},
      {"user": "sarah_sql", "text": "Very helpful", "timestamp": "2024-01-15T14:20:00Z"}
    ],
    "shares": 12
  },
  "metrics": {
    "impressions": 2300,
    "click_through_rate": 0.045,
    "engagement_rate": 0.025
  }
}'),
('{
  "post_id": "post_002", 
  "user": {
    "id": "user_mike",
    "name": "Mike Rodriguez",
    "followers": 890
  },
  "content": {
    "text": "JSON queries are powerful!",
    "hashtags": ["#JSON", "#queries", "#data"],
    "mentions": ["@database_pro"]
  },
  "engagement": {
    "likes": 23,
    "comments": [
      {"user": "json_fan", "text": "Absolutely!", "timestamp": "2024-01-16T09:45:00Z"},
      {"user": "query_master", "text": "Love this approach", "timestamp": "2024-01-16T13:30:00Z"}
    ],
    "shares": 5
  },
  "metrics": {
    "impressions": 1100,
    "click_through_rate": 0.032,
    "engagement_rate": 0.028  
  }
}');

Now complete these tasks:

Task 1: Extract basic post information

-- Your query should return: post_id, user_name, follower_count, like_count

Task 2: Find posts with high engagement

-- Find posts where engagement_rate > 0.026
-- Include post_id, user_name, and engagement_rate

Task 3: Analyze hashtag usage

-- Expand all hashtags and count their frequency
-- Show hashtag and usage_count, ordered by popularity

Task 4: Comment analysis

-- Show each comment as a separate row
-- Include post_id, commenter_username, comment_text, and comment_timestamp

Task 5: Calculate engagement metrics

-- For each user, calculate:
-- - total_posts, total_likes, total_comments, avg_engagement_rate

Solution

Here are the solutions to practice with:

-- Task 1: Basic post information
SELECT 
    post_data ->> 'post_id' as post_id,
    post_data -> 'user' ->> 'name' as user_name,
    (post_data -> 'user' ->> 'followers')::integer as follower_count,
    (post_data -> 'engagement' ->> 'likes')::integer as like_count
FROM social_posts;

-- Task 2: High engagement posts
SELECT 
    post_data ->> 'post_id' as post_id,
    post_data -> 'user' ->> 'name' as user_name,
    (post_data -> 'metrics' ->> 'engagement_rate')::decimal as engagement_rate
FROM social_posts
WHERE (post_data -> 'metrics' ->> 'engagement_rate')::decimal > 0.026;

-- Task 3: Hashtag analysis
SELECT 
    hashtag_data ->> 0 as hashtag,
    COUNT(*) as usage_count
FROM social_posts,
jsonb_array_elements(post_data -> 'content' -> 'hashtags') as hashtag_data
GROUP BY hashtag_data ->> 0
ORDER BY usage_count DESC;

-- Task 4: Comment expansion
SELECT 
    post_data ->> 'post_id' as post_id,
    comment_data ->> 'user' as commenter_username,
    comment_data ->> 'text' as comment_text,
    comment_data ->> 'timestamp' as comment_timestamp
FROM social_posts,
jsonb_array_elements(post_data -> 'engagement' -> 'comments') as comment_data;

-- Task 5: User engagement metrics  
SELECT 
    post_data -> 'user' ->> 'name' as user_name,
    COUNT(*) as total_posts,
    SUM((post_data -> 'engagement' ->> 'likes')::integer) as total_likes,
    SUM(jsonb_array_length(post_data -> 'engagement' -> 'comments')) as total_comments,
    AVG((post_data -> 'metrics' ->> 'engagement_rate')::decimal) as avg_engagement_rate
FROM social_posts
GROUP BY post_data -> 'user' ->> 'name';

Common Mistakes & Troubleshooting

Type Conversion Errors

Problem: Getting text when you need numbers

-- Wrong: This returns text, not a number
SELECT data -> 'orders' -> 0 ->> 'total' + 100 FROM customers;
-- Error: operator does not exist: text + integer

-- Right: Cast to appropriate type  
SELECT (data -> 'orders' -> 0 ->> 'total')::decimal + 100 FROM customers;

Array Index Out of Bounds

Problem: Accessing array elements that don't exist

-- Dangerous: What if there's no third order?
SELECT data -> 'orders' -> 2 ->> 'total' FROM customers;
-- Returns NULL, which might not be what you want

-- Safer: Check array length first
SELECT 
    CASE 
        WHEN jsonb_array_length(data -> 'orders') > 2 
        THEN data -> 'orders' -> 2 ->> 'total'
        ELSE 'No third order'
    END as third_order_total
FROM customers;

Path Expression Syntax Errors

Problem: Incorrect JSONPath syntax

-- Wrong: Using dot notation incorrectly
SELECT jsonb_path_query(data, '.orders[0].total') FROM customers;
-- Error: syntax error in jsonpath

-- Right: Start with $ for root
SELECT jsonb_path_query(data, '$.orders[0].total') FROM customers;

Performance Issues with Large Arrays

Problem: Expanding huge arrays without filtering

-- Problematic with large datasets
SELECT * FROM customers, jsonb_array_elements(data -> 'orders') as order_data;

-- Better: Add WHERE conditions to limit results
SELECT * 
FROM customers, 
jsonb_array_elements(data -> 'orders') as order_data
WHERE (order_data ->> 'total')::decimal > 50
LIMIT 100;

NULL Handling

Problem: Not accounting for missing JSON keys

-- This fails if 'phone' key doesn't exist
SELECT (data -> 'profile' ->> 'phone') FROM customers;

-- Better: Use COALESCE for defaults
SELECT COALESCE(data -> 'profile' ->> 'phone', 'No phone') as phone 
FROM customers;

Debugging Tip: When working with complex JSON paths, build your query step by step. Start with simple key access, then add nesting and array operations incrementally.

Summary & Next Steps

You've learned to work with JSON and arrays as native SQL data types, transforming how you handle complex, nested data structures. You can now extract values from JSON documents, expand arrays into queryable rows, perform aggregations on semi-structured data, and optimize performance through proper indexing.

Key takeaways:

Use -> for JSON navigation and ->>' for text extraction
jsonb_array_elements() transforms arrays into rows for SQL operations
JSONPath expressions provide powerful filtering and extraction capabilities
Proper indexing is crucial for JSON query performance
Always cast extracted values to appropriate data types for calculations

Next steps to expand your skills:

Advanced JSON Functions: Explore JSON aggregation functions like jsonb_agg() and jsonb_object_agg() for creating JSON from relational data
Full-Text Search: Learn to combine JSON queries with full-text search for content analysis
Time-Series JSON: Handle JSON with timestamp arrays for analytics on temporal data
JSON Schema Validation: Implement data quality checks using JSON schema validation functions
Cross-Database Compatibility: Practice translating JSON operations between PostgreSQL, MySQL, and SQL Server

Modern SQL's JSON capabilities bridge the gap between relational and document databases, giving you the flexibility to handle diverse data formats while maintaining the power and familiarity of SQL. This foundation prepares you for increasingly complex data scenarios in modern applications.

Working with JSON and Arrays in Modern SQL: Complete Guide