Master SQL Subqueries and CTEs: Write Complex Queries with Confidence

Imagine you're analyzing sales data and need to find the top-performing sales representatives — but only those who exceeded the company average. With basic SQL, you'd need to run one query to calculate the average, memorize that number, then write another query using that hardcoded value. That's clunky and error-prone.

This is where subqueries and Common Table Expressions (CTEs) become game-changers. They let you write queries that reference the results of other queries, all within a single SQL statement. Instead of juggling multiple queries and manual calculations, you can build sophisticated, self-contained analyses that adapt automatically when your data changes.

By the end of this lesson, you'll transform from writing simple, flat queries to crafting complex, multi-layered SQL that solves real business problems efficiently and elegantly.

What you'll learn:

How to write subqueries that embed one query inside another
When to use subqueries in SELECT, WHERE, and FROM clauses
How to create and use Common Table Expressions (CTEs)
The differences between correlated and uncorrelated subqueries
Best practices for choosing between subqueries and CTEs

Prerequisites

You should be comfortable with basic SQL SELECT statements, including WHERE clauses, JOINs, and aggregate functions like COUNT, SUM, and AVG. If you can write a query to find the average salary in a table or join two tables together, you're ready for this lesson.

Understanding Subqueries: Queries Within Queries

A subquery is simply a query nested inside another query. Think of it as asking a question that depends on the answer to another question. For example: "Show me all products that cost more than the average price" requires first calculating the average price, then finding products above that threshold.

Let's work with a realistic scenario. You're analyzing data for an e-commerce company with these tables:

-- Products table
CREATE TABLE products (
    product_id INT PRIMARY KEY,
    product_name VARCHAR(100),
    category VARCHAR(50),
    price DECIMAL(10,2),
    supplier_id INT
);

-- Sales table  
CREATE TABLE sales (
    sale_id INT PRIMARY KEY,
    product_id INT,
    customer_id INT,
    quantity INT,
    sale_date DATE,
    total_amount DECIMAL(10,2)
);

-- Customers table
CREATE TABLE customers (
    customer_id INT PRIMARY KEY,
    customer_name VARCHAR(100),
    city VARCHAR(50),
    registration_date DATE
);

Here's your first subquery — finding products priced above average:

SELECT product_name, price
FROM products
WHERE price > (
    SELECT AVG(price) 
    FROM products
);

The inner query SELECT AVG(price) FROM products runs first, calculating the average price. Let's say that returns $45.50. The outer query then becomes WHERE price > 45.50, showing all products above that threshold.

This is called an uncorrelated subquery because the inner query doesn't depend on any values from the outer query. It runs once and provides a single result.

Subqueries in Different Clauses

Subqueries in the SELECT Clause

You can use subqueries to add calculated columns. Here's how to show each product's price compared to the category average:

SELECT 
    product_name,
    category,
    price,
    (SELECT AVG(price) 
     FROM products p2 
     WHERE p2.category = products.category) AS category_avg_price
FROM products;

This creates a column showing the average price for each product's category. Notice how the subquery references products.category from the outer query — this makes it a correlated subquery. For each row in the outer query, the subquery runs again with that row's category value.

Subqueries in the WHERE Clause

WHERE clause subqueries are perfect for filtering based on calculated values. Find customers who made purchases above the average order amount:

SELECT DISTINCT c.customer_name, c.city
FROM customers c
JOIN sales s ON c.customer_id = s.customer_id
WHERE s.total_amount > (
    SELECT AVG(total_amount) 
    FROM sales
);

You can also use subqueries with IN, EXISTS, and comparison operators. Here's how to find products that have never been sold:

SELECT product_name, price
FROM products
WHERE product_id NOT IN (
    SELECT DISTINCT product_id 
    FROM sales 
    WHERE product_id IS NOT NULL
);

Warning: Be careful with NOT IN when the subquery might return NULL values. If any value in the subquery is NULL, NOT IN returns no results at all. Use NOT EXISTS instead for safer null handling.

Subqueries in the FROM Clause

FROM clause subqueries let you treat query results as temporary tables. This is useful for pre-aggregating data:

SELECT 
    category,
    AVG(monthly_sales) AS avg_monthly_sales
FROM (
    SELECT 
        p.category,
        DATE_FORMAT(s.sale_date, '%Y-%m') AS month,
        SUM(s.total_amount) AS monthly_sales
    FROM sales s
    JOIN products p ON s.product_id = p.product_id
    GROUP BY p.category, DATE_FORMAT(s.sale_date, '%Y-%m')
) AS monthly_category_sales
GROUP BY category;

This query first calculates monthly sales by category, then averages those monthly totals. Without the subquery, this would require multiple steps or temporary tables.

Introduction to Common Table Expressions (CTEs)

Common Table Expressions provide a cleaner, more readable way to write complex queries with multiple levels. A CTE creates a named temporary result set that exists only for the duration of your query.

Here's the basic syntax:

WITH cte_name AS (
    -- Your query here
)
SELECT * FROM cte_name;

Let's rewrite our category average example using a CTE:

WITH category_averages AS (
    SELECT 
        category,
        AVG(price) AS avg_price
    FROM products
    GROUP BY category
)
SELECT 
    p.product_name,
    p.category,
    p.price,
    ca.avg_price,
    p.price - ca.avg_price AS price_difference
FROM products p
JOIN category_averages ca ON p.category = ca.category;

The CTE category_averages calculates average prices by category, then we join it with the products table. This is much easier to read and understand than nested subqueries.

Multiple CTEs and Complex Analysis

You can define multiple CTEs in a single query, each building on previous results. Let's analyze customer purchasing patterns:

WITH customer_stats AS (
    SELECT 
        c.customer_id,
        c.customer_name,
        COUNT(s.sale_id) AS total_orders,
        SUM(s.total_amount) AS total_spent,
        AVG(s.total_amount) AS avg_order_value
    FROM customers c
    LEFT JOIN sales s ON c.customer_id = s.customer_id
    GROUP BY c.customer_id, c.customer_name
),
spending_tiers AS (
    SELECT 
        *,
        CASE 
            WHEN total_spent >= 1000 THEN 'High Value'
            WHEN total_spent >= 500 THEN 'Medium Value'
            WHEN total_spent > 0 THEN 'Low Value'
            ELSE 'No Purchases'
        END AS customer_tier
    FROM customer_stats
)
SELECT 
    customer_tier,
    COUNT(*) AS customer_count,
    AVG(total_spent) AS avg_spent_in_tier,
    AVG(avg_order_value) AS avg_order_value_in_tier
FROM spending_tiers
GROUP BY customer_tier
ORDER BY avg_spent_in_tier DESC;

This query flows logically: first calculate customer statistics, then assign tiers, finally summarize by tier. Each CTE has a clear purpose and builds on the previous one.

Recursive CTEs: Advanced Pattern

CTEs can reference themselves, creating recursive queries perfect for hierarchical data. Suppose you have an organizational chart stored in a table:

CREATE TABLE employees (
    employee_id INT PRIMARY KEY,
    employee_name VARCHAR(100),
    manager_id INT,
    department VARCHAR(50)
);

Find all employees in a management hierarchy starting from a specific manager:

WITH RECURSIVE management_tree AS (
    -- Base case: start with the top manager
    SELECT 
        employee_id, 
        employee_name, 
        manager_id,
        0 AS level
    FROM employees 
    WHERE employee_id = 101  -- Starting manager
    
    UNION ALL
    
    -- Recursive case: find direct reports
    SELECT 
        e.employee_id,
        e.employee_name,
        e.manager_id,
        mt.level + 1
    FROM employees e
    JOIN management_tree mt ON e.manager_id = mt.employee_id
)
SELECT 
    REPEAT('  ', level) || employee_name AS indented_name,
    level
FROM management_tree
ORDER BY level, employee_name;

The recursive CTE starts with one manager and repeatedly finds their direct reports, building the complete hierarchy.

Correlated vs Uncorrelated Subqueries

Understanding the difference between correlated and uncorrelated subqueries is crucial for performance and logic.

Uncorrelated subqueries run once and return the same result for every row in the outer query:

-- This subquery runs once
SELECT product_name, price
FROM products
WHERE price > (SELECT AVG(price) FROM products);

Correlated subqueries run once for each row in the outer query, using values from that row:

-- This subquery runs once per product
SELECT 
    product_name,
    price,
    (SELECT COUNT(*) 
     FROM sales s 
     WHERE s.product_id = products.product_id) AS times_sold
FROM products;

Correlated subqueries are more flexible but can be slower with large datasets since they execute repeatedly.

When to Use Subqueries vs CTEs

Choose subqueries when:

You need a simple, one-time calculation
The logic fits naturally in a WHERE or HAVING clause
You're working with a single level of nesting

Choose CTEs when:

You have complex, multi-step logic
You need to reference the same subquery multiple times
You want to improve query readability
You're building hierarchical or recursive queries

Here's the same logic implemented both ways:

-- Subquery approach
SELECT 
    c.customer_name,
    (SELECT SUM(total_amount) 
     FROM sales s 
     WHERE s.customer_id = c.customer_id) AS total_spent
FROM customers c
WHERE (SELECT COUNT(*) 
       FROM sales s 
       WHERE s.customer_id = c.customer_id) > 5;

-- CTE approach  
WITH customer_summary AS (
    SELECT 
        c.customer_id,
        c.customer_name,
        COALESCE(SUM(s.total_amount), 0) AS total_spent,
        COUNT(s.sale_id) AS order_count
    FROM customers c
    LEFT JOIN sales s ON c.customer_id = s.customer_id
    GROUP BY c.customer_id, c.customer_name
)
SELECT customer_name, total_spent
FROM customer_summary
WHERE order_count > 5;

The CTE version is clearer, more efficient (no repeated subqueries), and easier to modify.

Hands-On Exercise

Let's put your new skills to work with a comprehensive exercise. You're analyzing data for a subscription-based software company. Work through these increasingly complex challenges:

-- Sample data setup
CREATE TABLE subscriptions (
    subscription_id INT PRIMARY KEY,
    customer_id INT,
    plan_type VARCHAR(20), -- 'Basic', 'Pro', 'Enterprise'
    monthly_fee DECIMAL(8,2),
    start_date DATE,
    end_date DATE
);

CREATE TABLE usage_logs (
    log_id INT PRIMARY KEY,
    customer_id INT,
    feature_used VARCHAR(50),
    usage_date DATE,
    session_minutes INT
);

Challenge 1: Find all customers paying above the average monthly fee for their plan type.

-- Your solution here
WITH plan_averages AS (
    SELECT 
        plan_type,
        AVG(monthly_fee) AS avg_fee
    FROM subscriptions
    WHERE end_date IS NULL  -- Active subscriptions only
    GROUP BY plan_type
)
SELECT 
    s.customer_id,
    s.plan_type,
    s.monthly_fee,
    pa.avg_fee
FROM subscriptions s
JOIN plan_averages pa ON s.plan_type = pa.plan_type
WHERE s.monthly_fee > pa.avg_fee
  AND s.end_date IS NULL;

Challenge 2: Identify customers who use the product more than the average for their subscription tier, but pay less than the tier average.

-- Your solution here
WITH tier_stats AS (
    SELECT 
        s.plan_type,
        AVG(s.monthly_fee) AS avg_fee,
        AVG(ul.total_minutes) AS avg_usage
    FROM subscriptions s
    JOIN (
        SELECT 
            customer_id, 
            SUM(session_minutes) AS total_minutes
        FROM usage_logs 
        WHERE usage_date >= DATE_SUB(CURRENT_DATE, INTERVAL 30 DAY)
        GROUP BY customer_id
    ) ul ON s.customer_id = ul.customer_id
    WHERE s.end_date IS NULL
    GROUP BY s.plan_type
),
customer_analysis AS (
    SELECT 
        s.customer_id,
        s.plan_type,
        s.monthly_fee,
        COALESCE(SUM(ul.session_minutes), 0) AS monthly_usage
    FROM subscriptions s
    LEFT JOIN usage_logs ul ON s.customer_id = ul.customer_id 
        AND ul.usage_date >= DATE_SUB(CURRENT_DATE, INTERVAL 30 DAY)
    WHERE s.end_date IS NULL
    GROUP BY s.customer_id, s.plan_type, s.monthly_fee
)
SELECT 
    ca.customer_id,
    ca.plan_type,
    ca.monthly_fee,
    ts.avg_fee,
    ca.monthly_usage,
    ts.avg_usage
FROM customer_analysis ca
JOIN tier_stats ts ON ca.plan_type = ts.plan_type
WHERE ca.monthly_usage > ts.avg_usage 
  AND ca.monthly_fee < ts.avg_fee;

Work through these challenges step by step. Start with the inner CTEs to understand what data you're working with, then build the outer query logic.

Common Mistakes & Troubleshooting

Mistake 1: Forgetting column aliases in CTEs

-- Wrong - no alias for calculated column
WITH sales_summary AS (
    SELECT customer_id, SUM(total_amount)
    FROM sales
    GROUP BY customer_id
)
SELECT * FROM sales_summary;  -- Column has no name!

-- Correct
WITH sales_summary AS (
    SELECT customer_id, SUM(total_amount) AS total_spent
    FROM sales
    GROUP BY customer_id
)
SELECT * FROM sales_summary;

Mistake 2: Using correlated subqueries when CTEs would be clearer

-- Hard to read and potentially slow
SELECT 
    customer_name,
    (SELECT AVG(total_amount) FROM sales s WHERE s.customer_id = c.customer_id),
    (SELECT COUNT(*) FROM sales s WHERE s.customer_id = c.customer_id),
    (SELECT MAX(total_amount) FROM sales s WHERE s.customer_id = c.customer_id)
FROM customers c;

-- Much clearer with CTE
WITH customer_stats AS (
    SELECT 
        customer_id,
        AVG(total_amount) AS avg_order,
        COUNT(*) AS order_count,
        MAX(total_amount) AS largest_order
    FROM sales
    GROUP BY customer_id
)
SELECT 
    c.customer_name,
    cs.avg_order,
    cs.order_count,
    cs.largest_order
FROM customers c
JOIN customer_stats cs ON c.customer_id = cs.customer_id;

Mistake 3: NULL handling in subqueries When using NOT IN with subqueries, NULL values cause unexpected results:

-- Dangerous - returns no rows if any product_id is NULL
SELECT product_name
FROM products
WHERE product_id NOT IN (SELECT product_id FROM sales);

-- Safe alternative
SELECT product_name
FROM products
WHERE product_id NOT IN (
    SELECT product_id 
    FROM sales 
    WHERE product_id IS NOT NULL
);

-- Even better with EXISTS
SELECT product_name
FROM products p
WHERE NOT EXISTS (
    SELECT 1 
    FROM sales s 
    WHERE s.product_id = p.product_id
);

Tip: When debugging complex queries, build them incrementally. Start with the innermost subquery or first CTE, verify its results, then add the next layer.

Performance Considerations

Subquery Performance:

Uncorrelated subqueries often perform well since they run once
Correlated subqueries can be slow with large datasets
Consider converting correlated subqueries to JOINs when possible

CTE Performance:

CTEs are generally optimized well by modern databases
Recursive CTEs can be resource-intensive with deep hierarchies
Some databases materialize CTE results, others treat them as views

General Tips:

Use EXPLAIN or EXPLAIN PLAN to understand query execution
Index columns used in subquery WHERE clauses
Consider breaking very complex queries into temporary tables for better performance

Summary & Next Steps

You've now mastered two powerful SQL techniques that transform how you approach complex data analysis. Subqueries let you embed calculations and filters directly into your queries, while CTEs provide a clean, readable way to build multi-step logic.

The key insights to remember:

Subqueries excel for simple, focused calculations embedded in WHERE, SELECT, or FROM clauses
CTEs shine when you need multiple steps, complex logic, or want to improve readability
Correlated subqueries run once per outer row and can access outer query values
Uncorrelated subqueries run once and provide the same result for all outer rows

You're now equipped to tackle sophisticated business questions like customer segmentation, comparative analysis, and hierarchical reporting. These techniques form the foundation for advanced SQL patterns like window functions and complex analytical queries.

Next steps in your SQL journey:

Explore window functions for advanced analytics
Learn about query optimization and execution plans
Practice with increasingly complex real-world datasets
Study database-specific extensions to CTE functionality

Start applying these concepts to your own datasets immediately. The more you practice building layered queries, the more natural this problem-solving approach will become.