
Imagine you're analyzing sales data and need to find the top-performing sales representatives — but only those who exceeded the company average. With basic SQL, you'd need to run one query to calculate the average, memorize that number, then write another query using that hardcoded value. That's clunky and error-prone.
This is where subqueries and Common Table Expressions (CTEs) become game-changers. They let you write queries that reference the results of other queries, all within a single SQL statement. Instead of juggling multiple queries and manual calculations, you can build sophisticated, self-contained analyses that adapt automatically when your data changes.
By the end of this lesson, you'll transform from writing simple, flat queries to crafting complex, multi-layered SQL that solves real business problems efficiently and elegantly.
What you'll learn:
You should be comfortable with basic SQL SELECT statements, including WHERE clauses, JOINs, and aggregate functions like COUNT, SUM, and AVG. If you can write a query to find the average salary in a table or join two tables together, you're ready for this lesson.
A subquery is simply a query nested inside another query. Think of it as asking a question that depends on the answer to another question. For example: "Show me all products that cost more than the average price" requires first calculating the average price, then finding products above that threshold.
Let's work with a realistic scenario. You're analyzing data for an e-commerce company with these tables:
-- Products table
CREATE TABLE products (
product_id INT PRIMARY KEY,
product_name VARCHAR(100),
category VARCHAR(50),
price DECIMAL(10,2),
supplier_id INT
);
-- Sales table
CREATE TABLE sales (
sale_id INT PRIMARY KEY,
product_id INT,
customer_id INT,
quantity INT,
sale_date DATE,
total_amount DECIMAL(10,2)
);
-- Customers table
CREATE TABLE customers (
customer_id INT PRIMARY KEY,
customer_name VARCHAR(100),
city VARCHAR(50),
registration_date DATE
);
Here's your first subquery — finding products priced above average:
SELECT product_name, price
FROM products
WHERE price > (
SELECT AVG(price)
FROM products
);
The inner query SELECT AVG(price) FROM products runs first, calculating the average price. Let's say that returns $45.50. The outer query then becomes WHERE price > 45.50, showing all products above that threshold.
This is called an uncorrelated subquery because the inner query doesn't depend on any values from the outer query. It runs once and provides a single result.
You can use subqueries to add calculated columns. Here's how to show each product's price compared to the category average:
SELECT
product_name,
category,
price,
(SELECT AVG(price)
FROM products p2
WHERE p2.category = products.category) AS category_avg_price
FROM products;
This creates a column showing the average price for each product's category. Notice how the subquery references products.category from the outer query — this makes it a correlated subquery. For each row in the outer query, the subquery runs again with that row's category value.
WHERE clause subqueries are perfect for filtering based on calculated values. Find customers who made purchases above the average order amount:
SELECT DISTINCT c.customer_name, c.city
FROM customers c
JOIN sales s ON c.customer_id = s.customer_id
WHERE s.total_amount > (
SELECT AVG(total_amount)
FROM sales
);
You can also use subqueries with IN, EXISTS, and comparison operators. Here's how to find products that have never been sold:
SELECT product_name, price
FROM products
WHERE product_id NOT IN (
SELECT DISTINCT product_id
FROM sales
WHERE product_id IS NOT NULL
);
Warning: Be careful with NOT IN when the subquery might return NULL values. If any value in the subquery is NULL, NOT IN returns no results at all. Use NOT EXISTS instead for safer null handling.
FROM clause subqueries let you treat query results as temporary tables. This is useful for pre-aggregating data:
SELECT
category,
AVG(monthly_sales) AS avg_monthly_sales
FROM (
SELECT
p.category,
DATE_FORMAT(s.sale_date, '%Y-%m') AS month,
SUM(s.total_amount) AS monthly_sales
FROM sales s
JOIN products p ON s.product_id = p.product_id
GROUP BY p.category, DATE_FORMAT(s.sale_date, '%Y-%m')
) AS monthly_category_sales
GROUP BY category;
This query first calculates monthly sales by category, then averages those monthly totals. Without the subquery, this would require multiple steps or temporary tables.
Common Table Expressions provide a cleaner, more readable way to write complex queries with multiple levels. A CTE creates a named temporary result set that exists only for the duration of your query.
Here's the basic syntax:
WITH cte_name AS (
-- Your query here
)
SELECT * FROM cte_name;
Let's rewrite our category average example using a CTE:
WITH category_averages AS (
SELECT
category,
AVG(price) AS avg_price
FROM products
GROUP BY category
)
SELECT
p.product_name,
p.category,
p.price,
ca.avg_price,
p.price - ca.avg_price AS price_difference
FROM products p
JOIN category_averages ca ON p.category = ca.category;
The CTE category_averages calculates average prices by category, then we join it with the products table. This is much easier to read and understand than nested subqueries.
You can define multiple CTEs in a single query, each building on previous results. Let's analyze customer purchasing patterns:
WITH customer_stats AS (
SELECT
c.customer_id,
c.customer_name,
COUNT(s.sale_id) AS total_orders,
SUM(s.total_amount) AS total_spent,
AVG(s.total_amount) AS avg_order_value
FROM customers c
LEFT JOIN sales s ON c.customer_id = s.customer_id
GROUP BY c.customer_id, c.customer_name
),
spending_tiers AS (
SELECT
*,
CASE
WHEN total_spent >= 1000 THEN 'High Value'
WHEN total_spent >= 500 THEN 'Medium Value'
WHEN total_spent > 0 THEN 'Low Value'
ELSE 'No Purchases'
END AS customer_tier
FROM customer_stats
)
SELECT
customer_tier,
COUNT(*) AS customer_count,
AVG(total_spent) AS avg_spent_in_tier,
AVG(avg_order_value) AS avg_order_value_in_tier
FROM spending_tiers
GROUP BY customer_tier
ORDER BY avg_spent_in_tier DESC;
This query flows logically: first calculate customer statistics, then assign tiers, finally summarize by tier. Each CTE has a clear purpose and builds on the previous one.
CTEs can reference themselves, creating recursive queries perfect for hierarchical data. Suppose you have an organizational chart stored in a table:
CREATE TABLE employees (
employee_id INT PRIMARY KEY,
employee_name VARCHAR(100),
manager_id INT,
department VARCHAR(50)
);
Find all employees in a management hierarchy starting from a specific manager:
WITH RECURSIVE management_tree AS (
-- Base case: start with the top manager
SELECT
employee_id,
employee_name,
manager_id,
0 AS level
FROM employees
WHERE employee_id = 101 -- Starting manager
UNION ALL
-- Recursive case: find direct reports
SELECT
e.employee_id,
e.employee_name,
e.manager_id,
mt.level + 1
FROM employees e
JOIN management_tree mt ON e.manager_id = mt.employee_id
)
SELECT
REPEAT(' ', level) || employee_name AS indented_name,
level
FROM management_tree
ORDER BY level, employee_name;
The recursive CTE starts with one manager and repeatedly finds their direct reports, building the complete hierarchy.
Understanding the difference between correlated and uncorrelated subqueries is crucial for performance and logic.
Uncorrelated subqueries run once and return the same result for every row in the outer query:
-- This subquery runs once
SELECT product_name, price
FROM products
WHERE price > (SELECT AVG(price) FROM products);
Correlated subqueries run once for each row in the outer query, using values from that row:
-- This subquery runs once per product
SELECT
product_name,
price,
(SELECT COUNT(*)
FROM sales s
WHERE s.product_id = products.product_id) AS times_sold
FROM products;
Correlated subqueries are more flexible but can be slower with large datasets since they execute repeatedly.
Choose subqueries when:
Choose CTEs when:
Here's the same logic implemented both ways:
-- Subquery approach
SELECT
c.customer_name,
(SELECT SUM(total_amount)
FROM sales s
WHERE s.customer_id = c.customer_id) AS total_spent
FROM customers c
WHERE (SELECT COUNT(*)
FROM sales s
WHERE s.customer_id = c.customer_id) > 5;
-- CTE approach
WITH customer_summary AS (
SELECT
c.customer_id,
c.customer_name,
COALESCE(SUM(s.total_amount), 0) AS total_spent,
COUNT(s.sale_id) AS order_count
FROM customers c
LEFT JOIN sales s ON c.customer_id = s.customer_id
GROUP BY c.customer_id, c.customer_name
)
SELECT customer_name, total_spent
FROM customer_summary
WHERE order_count > 5;
The CTE version is clearer, more efficient (no repeated subqueries), and easier to modify.
Let's put your new skills to work with a comprehensive exercise. You're analyzing data for a subscription-based software company. Work through these increasingly complex challenges:
-- Sample data setup
CREATE TABLE subscriptions (
subscription_id INT PRIMARY KEY,
customer_id INT,
plan_type VARCHAR(20), -- 'Basic', 'Pro', 'Enterprise'
monthly_fee DECIMAL(8,2),
start_date DATE,
end_date DATE
);
CREATE TABLE usage_logs (
log_id INT PRIMARY KEY,
customer_id INT,
feature_used VARCHAR(50),
usage_date DATE,
session_minutes INT
);
Challenge 1: Find all customers paying above the average monthly fee for their plan type.
-- Your solution here
WITH plan_averages AS (
SELECT
plan_type,
AVG(monthly_fee) AS avg_fee
FROM subscriptions
WHERE end_date IS NULL -- Active subscriptions only
GROUP BY plan_type
)
SELECT
s.customer_id,
s.plan_type,
s.monthly_fee,
pa.avg_fee
FROM subscriptions s
JOIN plan_averages pa ON s.plan_type = pa.plan_type
WHERE s.monthly_fee > pa.avg_fee
AND s.end_date IS NULL;
Challenge 2: Identify customers who use the product more than the average for their subscription tier, but pay less than the tier average.
-- Your solution here
WITH tier_stats AS (
SELECT
s.plan_type,
AVG(s.monthly_fee) AS avg_fee,
AVG(ul.total_minutes) AS avg_usage
FROM subscriptions s
JOIN (
SELECT
customer_id,
SUM(session_minutes) AS total_minutes
FROM usage_logs
WHERE usage_date >= DATE_SUB(CURRENT_DATE, INTERVAL 30 DAY)
GROUP BY customer_id
) ul ON s.customer_id = ul.customer_id
WHERE s.end_date IS NULL
GROUP BY s.plan_type
),
customer_analysis AS (
SELECT
s.customer_id,
s.plan_type,
s.monthly_fee,
COALESCE(SUM(ul.session_minutes), 0) AS monthly_usage
FROM subscriptions s
LEFT JOIN usage_logs ul ON s.customer_id = ul.customer_id
AND ul.usage_date >= DATE_SUB(CURRENT_DATE, INTERVAL 30 DAY)
WHERE s.end_date IS NULL
GROUP BY s.customer_id, s.plan_type, s.monthly_fee
)
SELECT
ca.customer_id,
ca.plan_type,
ca.monthly_fee,
ts.avg_fee,
ca.monthly_usage,
ts.avg_usage
FROM customer_analysis ca
JOIN tier_stats ts ON ca.plan_type = ts.plan_type
WHERE ca.monthly_usage > ts.avg_usage
AND ca.monthly_fee < ts.avg_fee;
Work through these challenges step by step. Start with the inner CTEs to understand what data you're working with, then build the outer query logic.
Mistake 1: Forgetting column aliases in CTEs
-- Wrong - no alias for calculated column
WITH sales_summary AS (
SELECT customer_id, SUM(total_amount)
FROM sales
GROUP BY customer_id
)
SELECT * FROM sales_summary; -- Column has no name!
-- Correct
WITH sales_summary AS (
SELECT customer_id, SUM(total_amount) AS total_spent
FROM sales
GROUP BY customer_id
)
SELECT * FROM sales_summary;
Mistake 2: Using correlated subqueries when CTEs would be clearer
-- Hard to read and potentially slow
SELECT
customer_name,
(SELECT AVG(total_amount) FROM sales s WHERE s.customer_id = c.customer_id),
(SELECT COUNT(*) FROM sales s WHERE s.customer_id = c.customer_id),
(SELECT MAX(total_amount) FROM sales s WHERE s.customer_id = c.customer_id)
FROM customers c;
-- Much clearer with CTE
WITH customer_stats AS (
SELECT
customer_id,
AVG(total_amount) AS avg_order,
COUNT(*) AS order_count,
MAX(total_amount) AS largest_order
FROM sales
GROUP BY customer_id
)
SELECT
c.customer_name,
cs.avg_order,
cs.order_count,
cs.largest_order
FROM customers c
JOIN customer_stats cs ON c.customer_id = cs.customer_id;
Mistake 3: NULL handling in subqueries When using NOT IN with subqueries, NULL values cause unexpected results:
-- Dangerous - returns no rows if any product_id is NULL
SELECT product_name
FROM products
WHERE product_id NOT IN (SELECT product_id FROM sales);
-- Safe alternative
SELECT product_name
FROM products
WHERE product_id NOT IN (
SELECT product_id
FROM sales
WHERE product_id IS NOT NULL
);
-- Even better with EXISTS
SELECT product_name
FROM products p
WHERE NOT EXISTS (
SELECT 1
FROM sales s
WHERE s.product_id = p.product_id
);
Tip: When debugging complex queries, build them incrementally. Start with the innermost subquery or first CTE, verify its results, then add the next layer.
Subquery Performance:
CTE Performance:
General Tips:
You've now mastered two powerful SQL techniques that transform how you approach complex data analysis. Subqueries let you embed calculations and filters directly into your queries, while CTEs provide a clean, readable way to build multi-step logic.
The key insights to remember:
You're now equipped to tackle sophisticated business questions like customer segmentation, comparative analysis, and hierarchical reporting. These techniques form the foundation for advanced SQL patterns like window functions and complex analytical queries.
Next steps in your SQL journey:
Start applying these concepts to your own datasets immediately. The more you practice building layered queries, the more natural this problem-solving approach will become.
Learning Path: SQL Fundamentals