Imagine you're analyzing customer data for an e-commerce company, and your manager asks: "Show me all customers who spent more than the average order amount in the last quarter." Your first instinct might be to calculate the average separately, write it down, then use that number in another query. But what if that average changes daily? You'd be constantly updating your queries with new hardcoded values.
This is exactly the kind of problem that subqueries and Common Table Expressions (CTEs) solve elegantly. Instead of breaking your analysis into multiple manual steps, you can write a single, dynamic query that calculates the average and uses it immediately—all in one seamless operation.
By the end of this lesson, you'll master two of SQL's most powerful techniques for building complex, multi-step analyses within a single query. You'll understand when to use each approach, how they work under the hood, and how to troubleshoot common issues that trip up even experienced developers.
What you'll learn:
You should be comfortable with basic SQL operations including SELECT statements, WHERE clauses, JOINs, and aggregate functions like COUNT, SUM, and AVG. We'll build on these fundamentals to create more sophisticated queries.
A subquery is simply a query nested inside another query. Think of it like a function call in programming—the inner query executes first, returns a result, and that result gets used by the outer query.
Let's start with a practical example using a fictional e-commerce database:
-- Find all products that cost more than the average product price
SELECT product_name, price
FROM products
WHERE price > (
SELECT AVG(price)
FROM products
);
Here, the subquery SELECT AVG(price) FROM products calculates the average price of all products. The outer query then uses this average to filter products. The subquery runs first, returns a single value (let's say $45.67), and then the outer query becomes:
SELECT product_name, price
FROM products
WHERE price > 45.67;
But the beauty is that this average updates automatically as your data changes—no manual intervention required.
Subqueries come in three main flavors, each serving different analytical needs:
Scalar subqueries return exactly one value—one row with one column. They're perfect for comparisons against calculated values:
-- Find customers who placed orders larger than the average order
SELECT customer_id, order_total
FROM orders
WHERE order_total > (
SELECT AVG(order_total)
FROM orders
);
The key characteristic: the subquery must return exactly one value, or you'll get an error.
Less common but useful when you need to compare against multiple related values:
-- Find products with the same category and supplier as the most expensive product
SELECT product_name, category, supplier_id, price
FROM products
WHERE (category, supplier_id) = (
SELECT category, supplier_id
FROM products
ORDER BY price DESC
LIMIT 1
);
These return multiple rows and are often used with IN, EXISTS, or comparison operators:
-- Find customers who have placed orders in the last 30 days
SELECT customer_id, customer_name
FROM customers
WHERE customer_id IN (
SELECT DISTINCT customer_id
FROM orders
WHERE order_date >= CURRENT_DATE - INTERVAL 30 DAY
);
Subqueries aren't limited to WHERE clauses. They're incredibly versatile:
-- Show each customer with their total lifetime value
SELECT
customer_name,
email,
(SELECT SUM(order_total)
FROM orders
WHERE orders.customer_id = customers.customer_id) as lifetime_value
FROM customers;
This creates a computed column that calculates each customer's total spending.
-- Analyze monthly sales trends
SELECT month_year, avg_monthly_sales
FROM (
SELECT
DATE_FORMAT(order_date, '%Y-%m') as month_year,
AVG(order_total) as avg_monthly_sales
FROM orders
GROUP BY DATE_FORMAT(order_date, '%Y-%m')
) as monthly_averages
WHERE avg_monthly_sales > 1000;
The subquery creates a temporary result set that the outer query can filter and manipulate.
-- Find product categories with above-average sales
SELECT category, SUM(quantity_sold * price) as category_revenue
FROM products p
JOIN order_items oi ON p.product_id = oi.product_id
GROUP BY category
HAVING SUM(quantity_sold * price) > (
SELECT AVG(category_total)
FROM (
SELECT SUM(quantity_sold * price) as category_total
FROM products p2
JOIN order_items oi2 ON p2.product_id = oi2.product_id
GROUP BY category
) category_averages
);
Performance Tip: Subqueries in SELECT clauses run once for each row in the result set. With large datasets, this can become very slow. Consider JOINs or window functions for better performance.
While subqueries are powerful, they can make complex queries hard to read and maintain. Enter Common Table Expressions (CTEs)—a way to define temporary, named result sets that exist only for the duration of your query.
Think of CTEs as creating temporary "views" that you can reference throughout your main query. Here's the basic syntax:
WITH cte_name AS (
-- Your query here
SELECT column1, column2
FROM some_table
WHERE some_condition
)
SELECT *
FROM cte_name
WHERE another_condition;
Let's rewrite our earlier customer lifetime value example using a CTE:
-- Calculate customer lifetime value with a CTE
WITH customer_totals AS (
SELECT
customer_id,
SUM(order_total) as lifetime_value
FROM orders
GROUP BY customer_id
)
SELECT
c.customer_name,
c.email,
ct.lifetime_value
FROM customers c
JOIN customer_totals ct ON c.customer_id = ct.customer_id
WHERE ct.lifetime_value > 1000;
This is much more readable than the equivalent subquery version. The CTE clearly separates the calculation logic from the final result formatting.
You can define multiple CTEs in a single query, with later CTEs referencing earlier ones:
-- Complex customer segmentation analysis
WITH monthly_sales AS (
-- First, calculate monthly sales per customer
SELECT
customer_id,
DATE_FORMAT(order_date, '%Y-%m') as month_year,
SUM(order_total) as monthly_total
FROM orders
GROUP BY customer_id, DATE_FORMAT(order_date, '%Y-%m')
),
customer_metrics AS (
-- Then, calculate customer-level metrics
SELECT
customer_id,
AVG(monthly_total) as avg_monthly_spend,
COUNT(DISTINCT month_year) as active_months,
MAX(monthly_total) as highest_month
FROM monthly_sales
GROUP BY customer_id
),
customer_segments AS (
-- Finally, assign customer segments
SELECT
customer_id,
avg_monthly_spend,
active_months,
CASE
WHEN avg_monthly_spend > 500 AND active_months > 6 THEN 'VIP'
WHEN avg_monthly_spend > 200 THEN 'Regular'
ELSE 'Occasional'
END as segment
FROM customer_metrics
)
SELECT
segment,
COUNT(*) as customer_count,
AVG(avg_monthly_spend) as avg_segment_spend
FROM customer_segments
GROUP BY segment
ORDER BY avg_segment_spend DESC;
This query builds analysis in logical steps: raw data aggregation, metric calculation, segmentation logic, and final summary. Each CTE has a clear, single purpose.
One of the most powerful CTE features is recursion—the ability for a CTE to reference itself. This is perfect for hierarchical data like organizational charts, product categories, or geographic regions.
Here's how to traverse an employee hierarchy:
-- Find all employees in a management chain
WITH RECURSIVE management_chain AS (
-- Base case: start with the CEO
SELECT
employee_id,
employee_name,
manager_id,
1 as level
FROM employees
WHERE manager_id IS NULL
UNION ALL
-- Recursive case: find direct reports
SELECT
e.employee_id,
e.employee_name,
e.manager_id,
mc.level + 1
FROM employees e
INNER JOIN management_chain mc ON e.manager_id = mc.employee_id
)
SELECT
REPEAT(' ', level - 1) || employee_name as org_chart,
level
FROM management_chain
ORDER BY level, employee_name;
The recursive CTE works in two parts:
Warning: Always include a termination condition in recursive CTEs to prevent infinite loops. Most databases have built-in limits, but it's better to be explicit.
Understanding when to choose CTEs versus subqueries is crucial for writing maintainable SQL:
Here's a side-by-side comparison:
-- Subquery approach: harder to read, logic mixed together
SELECT
customer_name,
total_orders,
avg_order_value
FROM customers c
JOIN (
SELECT
customer_id,
COUNT(*) as total_orders,
AVG(order_total) as avg_order_value
FROM orders
WHERE order_date >= '2024-01-01'
GROUP BY customer_id
HAVING COUNT(*) > (
SELECT AVG(order_count)
FROM (
SELECT COUNT(*) as order_count
FROM orders
WHERE order_date >= '2024-01-01'
GROUP BY customer_id
) customer_counts
)
) order_stats ON c.customer_id = order_stats.customer_id;
-- CTE approach: clear, logical flow
WITH recent_orders AS (
SELECT customer_id, order_total
FROM orders
WHERE order_date >= '2024-01-01'
),
customer_order_stats AS (
SELECT
customer_id,
COUNT(*) as total_orders,
AVG(order_total) as avg_order_value
FROM recent_orders
GROUP BY customer_id
),
avg_orders_per_customer AS (
SELECT AVG(total_orders) as avg_order_count
FROM customer_order_stats
)
SELECT
c.customer_name,
cos.total_orders,
cos.avg_order_value
FROM customers c
JOIN customer_order_stats cos ON c.customer_id = cos.customer_id
CROSS JOIN avg_orders_per_customer aopc
WHERE cos.total_orders > aopc.avg_order_count;
The CTE version is longer but much clearer about what each step does.
Let's practice with a realistic scenario. You're analyzing an online learning platform's data to understand course performance and student engagement.
Database Schema:
courses (course_id, course_name, category, price)enrollments (enrollment_id, student_id, course_id, enrollment_date)course_completions (completion_id, student_id, course_id, completion_date, rating)Challenge: Write a query to find courses that have both:
Try to solve this first using subqueries, then rewrite it using CTEs.
Solution with Subqueries:
SELECT
c.course_name,
c.category,
completion_stats.completion_rate,
completion_stats.avg_rating
FROM courses c
JOIN (
SELECT
course_id,
COUNT(DISTINCT cc.student_id) * 1.0 / COUNT(DISTINCT e.student_id) as completion_rate,
AVG(cc.rating) as avg_rating
FROM enrollments e
LEFT JOIN course_completions cc ON e.course_id = cc.course_id
AND e.student_id = cc.student_id
GROUP BY course_id
) completion_stats ON c.course_id = completion_stats.course_id
WHERE completion_stats.completion_rate > (
SELECT AVG(course_completion_rate)
FROM (
SELECT
COUNT(DISTINCT cc.student_id) * 1.0 / COUNT(DISTINCT e.student_id) as course_completion_rate
FROM enrollments e
LEFT JOIN course_completions cc ON e.course_id = cc.course_id
AND e.student_id = cc.student_id
GROUP BY e.course_id
) rates
)
AND completion_stats.avg_rating > (
SELECT AVG(rating)
FROM course_completions
);
Solution with CTEs:
WITH course_stats AS (
SELECT
e.course_id,
COUNT(DISTINCT e.student_id) as total_enrollments,
COUNT(DISTINCT cc.student_id) as total_completions,
COUNT(DISTINCT cc.student_id) * 1.0 / COUNT(DISTINCT e.student_id) as completion_rate
FROM enrollments e
LEFT JOIN course_completions cc ON e.course_id = cc.course_id
AND e.student_id = cc.student_id
GROUP BY e.course_id
),
course_ratings AS (
SELECT
course_id,
AVG(rating) as avg_rating
FROM course_completions
GROUP BY course_id
),
benchmarks AS (
SELECT
AVG(completion_rate) as avg_completion_rate,
(SELECT AVG(rating) FROM course_completions) as avg_rating
FROM course_stats
)
SELECT
c.course_name,
c.category,
cs.completion_rate,
cr.avg_rating
FROM courses c
JOIN course_stats cs ON c.course_id = cs.course_id
JOIN course_ratings cr ON c.course_id = cr.course_id
CROSS JOIN benchmarks b
WHERE cs.completion_rate > b.avg_completion_rate
AND cr.avg_rating > b.avg_rating
ORDER BY cs.completion_rate DESC, cr.avg_rating DESC;
Notice how the CTE version clearly separates the different calculations and makes the final filtering logic much more readable.
-- This will error if multiple customers have the same max order total
SELECT customer_name
FROM customers
WHERE customer_id = (
SELECT customer_id
FROM orders
ORDER BY order_total DESC
);
Fix: Use IN instead of = for potentially multiple results:
SELECT customer_name
FROM customers
WHERE customer_id IN (
SELECT customer_id
FROM orders
WHERE order_total = (
SELECT MAX(order_total)
FROM orders
)
);
-- This won't work - can't reference outer query in a CTE definition
WITH expensive_orders AS (
SELECT order_id, order_total
FROM orders o1
WHERE order_total > (
SELECT AVG(order_total)
FROM orders o2
WHERE o2.customer_id = customers.customer_id -- Error: customers not in scope
)
)
SELECT * FROM expensive_orders;
Fix: Structure your query so all references are properly scoped:
WITH customer_avg_orders AS (
SELECT
customer_id,
AVG(order_total) as avg_order_total
FROM orders
GROUP BY customer_id
),
expensive_orders AS (
SELECT
o.order_id,
o.customer_id,
o.order_total
FROM orders o
JOIN customer_avg_orders cao ON o.customer_id = cao.customer_id
WHERE o.order_total > cao.avg_order_total
)
SELECT * FROM expensive_orders;
Correlated subqueries (ones that reference the outer query) can be extremely slow because they execute once for each row in the outer query:
-- Slow: subquery runs once per customer
SELECT
customer_name,
(SELECT COUNT(*) FROM orders WHERE customer_id = c.customer_id) as order_count
FROM customers c;
Fix: Use JOINs or window functions instead:
-- Much faster: single scan of orders table
SELECT
c.customer_name,
COALESCE(o.order_count, 0) as order_count
FROM customers c
LEFT JOIN (
SELECT
customer_id,
COUNT(*) as order_count
FROM orders
GROUP BY customer_id
) o ON c.customer_id = o.customer_id;
-- This won't work for recursive CTEs
WITH employee_hierarchy AS (
SELECT employee_id, manager_id, 1 as level
FROM employees
WHERE manager_id IS NULL
UNION ALL
SELECT e.employee_id, e.manager_id, eh.level + 1
FROM employees e
JOIN employee_hierarchy eh ON e.manager_id = eh.employee_id
)
SELECT * FROM employee_hierarchy;
Fix: Always include RECURSIVE keyword:
WITH RECURSIVE employee_hierarchy AS (
-- rest of query unchanged
)
Debugging Tip: When troubleshooting complex CTEs, test each CTE individually by selecting from it directly. This helps isolate where problems occur.
Understanding the performance characteristics of subqueries and CTEs is crucial for production systems:
-- Usually faster
SELECT customer_name
FROM customers c
WHERE EXISTS (
SELECT 1
FROM orders o
WHERE o.customer_id = c.customer_id
AND o.order_date >= '2024-01-01'
);
-- Can be slower with large subquery results
SELECT customer_name
FROM customers
WHERE customer_id IN (
SELECT customer_id
FROM orders
WHERE order_date >= '2024-01-01'
);
You've now mastered two essential SQL techniques that will dramatically expand what you can accomplish in a single query:
Subqueries excel at:
CTEs are perfect for:
The key insight is that both techniques let you think in terms of data transformation pipelines—taking raw data through a series of logical steps to arrive at your final answer. This approach makes complex analysis much more manageable and your SQL much more powerful.
Practice suggestions:
What's next: Now that you can handle complex single-table and multi-step analyses, you're ready to dive into advanced JOIN techniques, window functions, and query optimization strategies. These skills will let you tackle virtually any analytical challenge SQL can solve.
Learning Path: SQL Fundamentals