Master Subqueries and CTEs: Write Complex SQL Analysis in Single Queries

Imagine you're analyzing customer data for an e-commerce company, and your manager asks: "Show me all customers who spent more than the average order amount in the last quarter." Your first instinct might be to calculate the average separately, write it down, then use that number in another query. But what if that average changes daily? You'd be constantly updating your queries with new hardcoded values.

This is exactly the kind of problem that subqueries and Common Table Expressions (CTEs) solve elegantly. Instead of breaking your analysis into multiple manual steps, you can write a single, dynamic query that calculates the average and uses it immediately—all in one seamless operation.

By the end of this lesson, you'll master two of SQL's most powerful techniques for building complex, multi-step analyses within a single query. You'll understand when to use each approach, how they work under the hood, and how to troubleshoot common issues that trip up even experienced developers.

What you'll learn:

How subqueries work and the three main types you'll encounter daily
When subqueries perform well and when they create performance problems
How to write and structure Common Table Expressions (CTEs) for readable, maintainable code
The key differences between subqueries and CTEs, and when to choose each approach
Advanced techniques like recursive CTEs for hierarchical data analysis

Prerequisites

You should be comfortable with basic SQL operations including SELECT statements, WHERE clauses, JOINs, and aggregate functions like COUNT, SUM, and AVG. We'll build on these fundamentals to create more sophisticated queries.

Understanding Subqueries: Queries Within Queries

A subquery is simply a query nested inside another query. Think of it like a function call in programming—the inner query executes first, returns a result, and that result gets used by the outer query.

Let's start with a practical example using a fictional e-commerce database:

-- Find all products that cost more than the average product price
SELECT product_name, price
FROM products
WHERE price > (
    SELECT AVG(price)
    FROM products
);

Here, the subquery SELECT AVG(price) FROM products calculates the average price of all products. The outer query then uses this average to filter products. The subquery runs first, returns a single value (let's say $45.67), and then the outer query becomes:

SELECT product_name, price
FROM products
WHERE price > 45.67;

But the beauty is that this average updates automatically as your data changes—no manual intervention required.

The Three Types of Subqueries

Subqueries come in three main flavors, each serving different analytical needs:

1. Scalar Subqueries (Single Value)

Scalar subqueries return exactly one value—one row with one column. They're perfect for comparisons against calculated values:

-- Find customers who placed orders larger than the average order
SELECT customer_id, order_total
FROM orders
WHERE order_total > (
    SELECT AVG(order_total)
    FROM orders
);

The key characteristic: the subquery must return exactly one value, or you'll get an error.

2. Row Subqueries (Single Row, Multiple Columns)

Less common but useful when you need to compare against multiple related values:

-- Find products with the same category and supplier as the most expensive product
SELECT product_name, category, supplier_id, price
FROM products
WHERE (category, supplier_id) = (
    SELECT category, supplier_id
    FROM products
    ORDER BY price DESC
    LIMIT 1
);

3. Table Subqueries (Multiple Rows)

These return multiple rows and are often used with IN, EXISTS, or comparison operators:

-- Find customers who have placed orders in the last 30 days
SELECT customer_id, customer_name
FROM customers
WHERE customer_id IN (
    SELECT DISTINCT customer_id
    FROM orders
    WHERE order_date >= CURRENT_DATE - INTERVAL 30 DAY
);

Subqueries in Different SQL Clauses

Subqueries aren't limited to WHERE clauses. They're incredibly versatile:

In SELECT Clauses (Computed Columns)

-- Show each customer with their total lifetime value
SELECT 
    customer_name,
    email,
    (SELECT SUM(order_total) 
     FROM orders 
     WHERE orders.customer_id = customers.customer_id) as lifetime_value
FROM customers;

This creates a computed column that calculates each customer's total spending.

In FROM Clauses (Derived Tables)

-- Analyze monthly sales trends
SELECT month_year, avg_monthly_sales
FROM (
    SELECT 
        DATE_FORMAT(order_date, '%Y-%m') as month_year,
        AVG(order_total) as avg_monthly_sales
    FROM orders
    GROUP BY DATE_FORMAT(order_date, '%Y-%m')
) as monthly_averages
WHERE avg_monthly_sales > 1000;

The subquery creates a temporary result set that the outer query can filter and manipulate.

In HAVING Clauses

-- Find product categories with above-average sales
SELECT category, SUM(quantity_sold * price) as category_revenue
FROM products p
JOIN order_items oi ON p.product_id = oi.product_id
GROUP BY category
HAVING SUM(quantity_sold * price) > (
    SELECT AVG(category_total)
    FROM (
        SELECT SUM(quantity_sold * price) as category_total
        FROM products p2
        JOIN order_items oi2 ON p2.product_id = oi2.product_id
        GROUP BY category
    ) category_averages
);

Performance Tip: Subqueries in SELECT clauses run once for each row in the result set. With large datasets, this can become very slow. Consider JOINs or window functions for better performance.

Introducing Common Table Expressions (CTEs)

While subqueries are powerful, they can make complex queries hard to read and maintain. Enter Common Table Expressions (CTEs)—a way to define temporary, named result sets that exist only for the duration of your query.

Think of CTEs as creating temporary "views" that you can reference throughout your main query. Here's the basic syntax:

WITH cte_name AS (
    -- Your query here
    SELECT column1, column2
    FROM some_table
    WHERE some_condition
)
SELECT *
FROM cte_name
WHERE another_condition;

Let's rewrite our earlier customer lifetime value example using a CTE:

-- Calculate customer lifetime value with a CTE
WITH customer_totals AS (
    SELECT 
        customer_id,
        SUM(order_total) as lifetime_value
    FROM orders
    GROUP BY customer_id
)
SELECT 
    c.customer_name,
    c.email,
    ct.lifetime_value
FROM customers c
JOIN customer_totals ct ON c.customer_id = ct.customer_id
WHERE ct.lifetime_value > 1000;

This is much more readable than the equivalent subquery version. The CTE clearly separates the calculation logic from the final result formatting.

Multiple CTEs: Building Complex Analysis Step by Step

You can define multiple CTEs in a single query, with later CTEs referencing earlier ones:

-- Complex customer segmentation analysis
WITH monthly_sales AS (
    -- First, calculate monthly sales per customer
    SELECT 
        customer_id,
        DATE_FORMAT(order_date, '%Y-%m') as month_year,
        SUM(order_total) as monthly_total
    FROM orders
    GROUP BY customer_id, DATE_FORMAT(order_date, '%Y-%m')
),
customer_metrics AS (
    -- Then, calculate customer-level metrics
    SELECT 
        customer_id,
        AVG(monthly_total) as avg_monthly_spend,
        COUNT(DISTINCT month_year) as active_months,
        MAX(monthly_total) as highest_month
    FROM monthly_sales
    GROUP BY customer_id
),
customer_segments AS (
    -- Finally, assign customer segments
    SELECT 
        customer_id,
        avg_monthly_spend,
        active_months,
        CASE 
            WHEN avg_monthly_spend > 500 AND active_months > 6 THEN 'VIP'
            WHEN avg_monthly_spend > 200 THEN 'Regular'
            ELSE 'Occasional'
        END as segment
    FROM customer_metrics
)
SELECT 
    segment,
    COUNT(*) as customer_count,
    AVG(avg_monthly_spend) as avg_segment_spend
FROM customer_segments
GROUP BY segment
ORDER BY avg_segment_spend DESC;

This query builds analysis in logical steps: raw data aggregation, metric calculation, segmentation logic, and final summary. Each CTE has a clear, single purpose.

Recursive CTEs: Handling Hierarchical Data

One of the most powerful CTE features is recursion—the ability for a CTE to reference itself. This is perfect for hierarchical data like organizational charts, product categories, or geographic regions.

Here's how to traverse an employee hierarchy:

-- Find all employees in a management chain
WITH RECURSIVE management_chain AS (
    -- Base case: start with the CEO
    SELECT 
        employee_id,
        employee_name,
        manager_id,
        1 as level
    FROM employees
    WHERE manager_id IS NULL
    
    UNION ALL
    
    -- Recursive case: find direct reports
    SELECT 
        e.employee_id,
        e.employee_name,
        e.manager_id,
        mc.level + 1
    FROM employees e
    INNER JOIN management_chain mc ON e.manager_id = mc.employee_id
)
SELECT 
    REPEAT('  ', level - 1) || employee_name as org_chart,
    level
FROM management_chain
ORDER BY level, employee_name;

The recursive CTE works in two parts:

Base case: Find the starting point (CEO with no manager)
Recursive case: Find the next level down by joining employees whose manager_id matches the current level

Warning: Always include a termination condition in recursive CTEs to prevent infinite loops. Most databases have built-in limits, but it's better to be explicit.

CTEs vs. Subqueries: When to Use Which

Understanding when to choose CTEs versus subqueries is crucial for writing maintainable SQL:

Use CTEs When:

Your query has multiple complex steps that build on each other
You need to reference the same subquery multiple times
Readability and maintainability are priorities
You're working with hierarchical or recursive data
You're prototyping and want to test intermediate results easily

Use Subqueries When:

You need a simple, one-off calculation
The subquery is used only once
You're working with legacy systems that don't support CTEs
Performance testing shows subqueries are faster for your specific use case

Here's a side-by-side comparison:

-- Subquery approach: harder to read, logic mixed together
SELECT 
    customer_name,
    total_orders,
    avg_order_value
FROM customers c
JOIN (
    SELECT 
        customer_id,
        COUNT(*) as total_orders,
        AVG(order_total) as avg_order_value
    FROM orders
    WHERE order_date >= '2024-01-01'
    GROUP BY customer_id
    HAVING COUNT(*) > (
        SELECT AVG(order_count)
        FROM (
            SELECT COUNT(*) as order_count
            FROM orders
            WHERE order_date >= '2024-01-01'
            GROUP BY customer_id
        ) customer_counts
    )
) order_stats ON c.customer_id = order_stats.customer_id;

-- CTE approach: clear, logical flow
WITH recent_orders AS (
    SELECT customer_id, order_total
    FROM orders
    WHERE order_date >= '2024-01-01'
),
customer_order_stats AS (
    SELECT 
        customer_id,
        COUNT(*) as total_orders,
        AVG(order_total) as avg_order_value
    FROM recent_orders
    GROUP BY customer_id
),
avg_orders_per_customer AS (
    SELECT AVG(total_orders) as avg_order_count
    FROM customer_order_stats
)
SELECT 
    c.customer_name,
    cos.total_orders,
    cos.avg_order_value
FROM customers c
JOIN customer_order_stats cos ON c.customer_id = cos.customer_id
CROSS JOIN avg_orders_per_customer aopc
WHERE cos.total_orders > aopc.avg_order_count;

The CTE version is longer but much clearer about what each step does.

Hands-On Exercise

Let's practice with a realistic scenario. You're analyzing an online learning platform's data to understand course performance and student engagement.

Database Schema:

courses (course_id, course_name, category, price)
enrollments (enrollment_id, student_id, course_id, enrollment_date)
course_completions (completion_id, student_id, course_id, completion_date, rating)

Challenge: Write a query to find courses that have both:

Above-average completion rates
Above-average student ratings

Try to solve this first using subqueries, then rewrite it using CTEs.

Solution with Subqueries:

SELECT 
    c.course_name,
    c.category,
    completion_stats.completion_rate,
    completion_stats.avg_rating
FROM courses c
JOIN (
    SELECT 
        course_id,
        COUNT(DISTINCT cc.student_id) * 1.0 / COUNT(DISTINCT e.student_id) as completion_rate,
        AVG(cc.rating) as avg_rating
    FROM enrollments e
    LEFT JOIN course_completions cc ON e.course_id = cc.course_id 
        AND e.student_id = cc.student_id
    GROUP BY course_id
) completion_stats ON c.course_id = completion_stats.course_id
WHERE completion_stats.completion_rate > (
    SELECT AVG(course_completion_rate)
    FROM (
        SELECT 
            COUNT(DISTINCT cc.student_id) * 1.0 / COUNT(DISTINCT e.student_id) as course_completion_rate
        FROM enrollments e
        LEFT JOIN course_completions cc ON e.course_id = cc.course_id 
            AND e.student_id = cc.student_id
        GROUP BY e.course_id
    ) rates
)
AND completion_stats.avg_rating > (
    SELECT AVG(rating)
    FROM course_completions
);

Solution with CTEs:

WITH course_stats AS (
    SELECT 
        e.course_id,
        COUNT(DISTINCT e.student_id) as total_enrollments,
        COUNT(DISTINCT cc.student_id) as total_completions,
        COUNT(DISTINCT cc.student_id) * 1.0 / COUNT(DISTINCT e.student_id) as completion_rate
    FROM enrollments e
    LEFT JOIN course_completions cc ON e.course_id = cc.course_id 
        AND e.student_id = cc.student_id
    GROUP BY e.course_id
),
course_ratings AS (
    SELECT 
        course_id,
        AVG(rating) as avg_rating
    FROM course_completions
    GROUP BY course_id
),
benchmarks AS (
    SELECT 
        AVG(completion_rate) as avg_completion_rate,
        (SELECT AVG(rating) FROM course_completions) as avg_rating
    FROM course_stats
)
SELECT 
    c.course_name,
    c.category,
    cs.completion_rate,
    cr.avg_rating
FROM courses c
JOIN course_stats cs ON c.course_id = cs.course_id
JOIN course_ratings cr ON c.course_id = cr.course_id
CROSS JOIN benchmarks b
WHERE cs.completion_rate > b.avg_completion_rate
    AND cr.avg_rating > b.avg_rating
ORDER BY cs.completion_rate DESC, cr.avg_rating DESC;

Notice how the CTE version clearly separates the different calculations and makes the final filtering logic much more readable.

Common Mistakes & Troubleshooting

Mistake 1: Subquery Returns Multiple Values

-- This will error if multiple customers have the same max order total
SELECT customer_name
FROM customers
WHERE customer_id = (
    SELECT customer_id
    FROM orders
    ORDER BY order_total DESC
);

Fix: Use IN instead of = for potentially multiple results:

SELECT customer_name
FROM customers
WHERE customer_id IN (
    SELECT customer_id
    FROM orders
    WHERE order_total = (
        SELECT MAX(order_total)
        FROM orders
    )
);

Mistake 2: Correlated Subqueries in Wrong Context

-- This won't work - can't reference outer query in a CTE definition
WITH expensive_orders AS (
    SELECT order_id, order_total
    FROM orders o1
    WHERE order_total > (
        SELECT AVG(order_total)
        FROM orders o2
        WHERE o2.customer_id = customers.customer_id  -- Error: customers not in scope
    )
)
SELECT * FROM expensive_orders;

Fix: Structure your query so all references are properly scoped:

WITH customer_avg_orders AS (
    SELECT 
        customer_id,
        AVG(order_total) as avg_order_total
    FROM orders
    GROUP BY customer_id
),
expensive_orders AS (
    SELECT 
        o.order_id,
        o.customer_id,
        o.order_total
    FROM orders o
    JOIN customer_avg_orders cao ON o.customer_id = cao.customer_id
    WHERE o.order_total > cao.avg_order_total
)
SELECT * FROM expensive_orders;

Mistake 3: Performance Issues with Correlated Subqueries

Correlated subqueries (ones that reference the outer query) can be extremely slow because they execute once for each row in the outer query:

-- Slow: subquery runs once per customer
SELECT 
    customer_name,
    (SELECT COUNT(*) FROM orders WHERE customer_id = c.customer_id) as order_count
FROM customers c;

Fix: Use JOINs or window functions instead:

-- Much faster: single scan of orders table
SELECT 
    c.customer_name,
    COALESCE(o.order_count, 0) as order_count
FROM customers c
LEFT JOIN (
    SELECT 
        customer_id,
        COUNT(*) as order_count
    FROM orders
    GROUP BY customer_id
) o ON c.customer_id = o.customer_id;

Mistake 4: Forgetting RECURSIVE Keyword

-- This won't work for recursive CTEs
WITH employee_hierarchy AS (
    SELECT employee_id, manager_id, 1 as level
    FROM employees
    WHERE manager_id IS NULL
    
    UNION ALL
    
    SELECT e.employee_id, e.manager_id, eh.level + 1
    FROM employees e
    JOIN employee_hierarchy eh ON e.manager_id = eh.employee_id
)
SELECT * FROM employee_hierarchy;

Fix: Always include RECURSIVE keyword:

WITH RECURSIVE employee_hierarchy AS (
    -- rest of query unchanged
)

Debugging Tip: When troubleshooting complex CTEs, test each CTE individually by selecting from it directly. This helps isolate where problems occur.

Performance Considerations

Understanding the performance characteristics of subqueries and CTEs is crucial for production systems:

Subquery Performance

Scalar subqueries in WHERE clauses often perform well because they're evaluated once
Correlated subqueries can be performance killers—they execute once per outer row
EXISTS vs IN: EXISTS often performs better than IN for large datasets

-- Usually faster
SELECT customer_name
FROM customers c
WHERE EXISTS (
    SELECT 1
    FROM orders o
    WHERE o.customer_id = c.customer_id
    AND o.order_date >= '2024-01-01'
);

-- Can be slower with large subquery results
SELECT customer_name
FROM customers
WHERE customer_id IN (
    SELECT customer_id
    FROM orders
    WHERE order_date >= '2024-01-01'
);

CTE Performance

CTEs are generally materialized (calculated once and stored temporarily)
This can be great for reuse, but uses memory for large result sets
Some databases optimize simple CTEs into regular subqueries
Recursive CTEs can be expensive—always include proper termination conditions

Summary & Next Steps

You've now mastered two essential SQL techniques that will dramatically expand what you can accomplish in a single query:

Subqueries excel at:

Simple calculations used once
Filtering based on aggregated values
Breaking complex logic into digestible pieces

CTEs are perfect for:

Multi-step analyses that build on each other
Improving code readability and maintainability
Recursive operations on hierarchical data
Queries where you need to reference intermediate results multiple times

The key insight is that both techniques let you think in terms of data transformation pipelines—taking raw data through a series of logical steps to arrive at your final answer. This approach makes complex analysis much more manageable and your SQL much more powerful.

Practice suggestions:

Take any complex query you've written and see if you can restructure it using CTEs for better readability
Find datasets with hierarchical structures (like organizational charts or product categories) and practice recursive CTEs
Experiment with query performance by comparing subquery vs CTE vs JOIN approaches on your actual data

What's next: Now that you can handle complex single-table and multi-step analyses, you're ready to dive into advanced JOIN techniques, window functions, and query optimization strategies. These skills will let you tackle virtually any analytical challenge SQL can solve.