You're staring at a complex data problem that seems to require multiple steps: first, you need to identify high-value customers from the past year, then calculate their average order value, and finally compare that against regional benchmarks. Your first instinct might be to create temporary tables or run multiple queries, but there's a more elegant solution hiding in plain sight. Advanced SQL practitioners know that subqueries and Common Table Expressions (CTEs) can transform unwieldy multi-step problems into readable, maintainable single queries.
While basic SQL gets you far with simple SELECT statements, real-world analytics demands the ability to break complex logic into manageable pieces while maintaining query performance. Subqueries and CTEs are your tools for building sophisticated data transformations that remain comprehensible to your future self and your colleagues. They're not just syntactic sugar—they're fundamental to writing production-quality SQL that scales.
By the end of this lesson, you'll wield subqueries and CTEs with the precision of a surgeon and the vision of an architect. You'll understand not just how to write them, but when to choose one over the other, how to optimize their performance, and how to avoid the subtle traps that catch even experienced developers.
What you'll learn:
You should be comfortable with intermediate SQL concepts including JOINs, aggregate functions, and window functions. Experience with query optimization and execution plans will enhance your understanding, though we'll cover the essentials as we go.
Subqueries are queries nested within other queries, but this simple definition masks their true power. Think of them as functions in programming—they encapsulate logic, return values, and can be composed to build complex operations. However, unlike functions, subqueries execute within the database engine's query optimizer, which opens both opportunities and pitfalls.
Let's start with a realistic scenario. You're analyzing e-commerce data and need to identify customers who placed orders above the average order value in each product category. Here's our sample data structure:
-- Sample data setup
CREATE TABLE customers (
customer_id INT PRIMARY KEY,
name VARCHAR(100),
region VARCHAR(50),
signup_date DATE
);
CREATE TABLE orders (
order_id INT PRIMARY KEY,
customer_id INT,
order_date DATE,
total_amount DECIMAL(10,2)
);
CREATE TABLE order_items (
item_id INT PRIMARY KEY,
order_id INT,
product_id INT,
quantity INT,
unit_price DECIMAL(8,2)
);
CREATE TABLE products (
product_id INT PRIMARY KEY,
name VARCHAR(200),
category VARCHAR(100),
cost DECIMAL(8,2)
);
Scalar subqueries return exactly one value and can be used anywhere you'd use a literal value or column reference. They're deceptively powerful for comparative analysis:
-- Find customers with above-average order totals
SELECT c.name, c.region, o.total_amount
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE o.total_amount > (
SELECT AVG(total_amount)
FROM orders
)
ORDER BY o.total_amount DESC;
This looks straightforward, but the execution story is more complex. The database engine must decide whether to execute the subquery once (if it's uncorrelated) or multiple times (if it references outer query columns). Most modern optimizers are smart enough to cache scalar subquery results, but understanding this distinction is crucial for performance tuning.
Here's a more sophisticated example that reveals the power of scalar subqueries:
-- Calculate each customer's lifetime value as a percentage of total company revenue
SELECT
c.name,
c.region,
SUM(o.total_amount) as lifetime_value,
ROUND(
(SUM(o.total_amount) * 100.0) / (
SELECT SUM(total_amount) FROM orders
), 2
) as pct_of_total_revenue
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.name, c.region
HAVING SUM(o.total_amount) > (
SELECT AVG(customer_total)
FROM (
SELECT SUM(total_amount) as customer_total
FROM orders
GROUP BY customer_id
) customer_averages
)
ORDER BY lifetime_value DESC;
Notice how we nest a subquery within a HAVING clause that itself contains a subquery. This is where query readability starts to suffer, foreshadowing why CTEs become essential.
Row subqueries return a single row with multiple columns. They're particularly useful for complex comparisons and existence checks:
-- Find orders that match the highest single-day sales volume
SELECT o.order_id, o.order_date, o.total_amount
FROM orders o
WHERE (o.order_date, o.total_amount) = (
SELECT order_date, MAX(total_amount)
FROM orders
GROUP BY order_date
ORDER BY MAX(total_amount) DESC
LIMIT 1
);
Row subqueries excel in scenarios where you need to match multiple criteria simultaneously. However, they're often overlooked because their syntax feels unfamiliar to developers coming from procedural programming backgrounds.
Table subqueries return multiple rows and columns, effectively creating temporary result sets. They appear in FROM clauses and are essential for multi-step analytics:
-- Analyze customer purchasing patterns by cohort
SELECT
cohort.signup_month,
cohort.customer_count,
cohort.avg_first_order_value,
recent_performance.avg_recent_order_value,
recent_performance.retention_rate
FROM (
SELECT
DATE_TRUNC('month', c.signup_date) as signup_month,
COUNT(*) as customer_count,
AVG(first_order.amount) as avg_first_order_value
FROM customers c
JOIN (
SELECT
customer_id,
MIN(order_date) as first_order_date,
MIN(total_amount) as amount
FROM orders
GROUP BY customer_id
) first_order ON c.customer_id = first_order.customer_id
GROUP BY DATE_TRUNC('month', c.signup_date)
) cohort
JOIN (
SELECT
DATE_TRUNC('month', c.signup_date) as signup_month,
AVG(o.total_amount) as avg_recent_order_value,
COUNT(DISTINCT o.customer_id) * 100.0 / cohort_size.total as retention_rate
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
JOIN (
SELECT
DATE_TRUNC('month', signup_date) as month,
COUNT(*) as total
FROM customers
GROUP BY DATE_TRUNC('month', signup_date)
) cohort_size ON DATE_TRUNC('month', c.signup_date) = cohort_size.month
WHERE o.order_date >= CURRENT_DATE - INTERVAL '90 days'
GROUP BY DATE_TRUNC('month', c.signup_date), cohort_size.total
) recent_performance ON cohort.signup_month = recent_performance.signup_month
ORDER BY cohort.signup_month;
This query demonstrates how table subqueries can create sophisticated analytics, but it also shows the readability cliff we face with deeply nested subqueries. The logic is sound, but the structure is becoming unwieldy.
Correlated subqueries reference columns from the outer query, creating a dependency that fundamentally changes execution behavior. Instead of running once, correlated subqueries execute once for each row processed by the outer query. This can be powerful for row-by-row comparisons but dangerous for performance.
-- Find each customer's most expensive order
SELECT c.name, o.order_date, o.total_amount
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE o.total_amount = (
SELECT MAX(total_amount)
FROM orders o2
WHERE o2.customer_id = c.customer_id
);
The inner query runs once per customer, using the c.customer_id from the outer query. Modern optimizers often transform these into more efficient JOINs, but understanding the logical execution helps you predict performance characteristics.
Correlated subqueries shine in existence and ranking scenarios:
-- Find customers who haven't ordered in their preferred category recently
SELECT DISTINCT c.name, c.region
FROM customers c
WHERE NOT EXISTS (
SELECT 1
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
WHERE o.customer_id = c.customer_id
AND p.category = (
-- Find each customer's most-ordered category
SELECT p2.category
FROM orders o2
JOIN order_items oi2 ON o2.order_id = oi2.order_id
JOIN products p2 ON oi2.product_id = p2.product_id
WHERE o2.customer_id = c.customer_id
GROUP BY p2.category
ORDER BY SUM(oi2.quantity) DESC
LIMIT 1
)
AND o.order_date >= CURRENT_DATE - INTERVAL '60 days'
);
This query finds customers who haven't recently ordered from their historically preferred product category. It's a complex business requirement that would be difficult to express without correlated subqueries.
Performance Warning: Correlated subqueries can create N+1 query problems. If your outer query returns 10,000 rows, your inner query might execute 10,000 times. Always examine execution plans for correlated subqueries in production environments.
CTEs transform the nested complexity we've seen into readable, debuggable code. Think of them as named subqueries that exist for the duration of a single SQL statement. They're particularly powerful for breaking complex logic into digestible steps.
A CTE follows this pattern:
WITH cte_name AS (
SELECT ...
)
SELECT ...
FROM cte_name
Let's refactor our earlier cohort analysis using CTEs:
-- Customer cohort analysis with CTEs
WITH customer_cohorts AS (
SELECT
c.customer_id,
c.name,
c.region,
DATE_TRUNC('month', c.signup_date) as cohort_month
FROM customers c
),
first_orders AS (
SELECT
customer_id,
MIN(order_date) as first_order_date,
AVG(total_amount) as first_order_amount
FROM orders
GROUP BY customer_id
),
recent_orders AS (
SELECT
customer_id,
AVG(total_amount) as recent_avg_amount,
COUNT(*) as recent_order_count
FROM orders
WHERE order_date >= CURRENT_DATE - INTERVAL '90 days'
GROUP BY customer_id
),
cohort_summary AS (
SELECT
cc.cohort_month,
COUNT(*) as total_customers,
AVG(fo.first_order_amount) as avg_first_order,
AVG(ro.recent_avg_amount) as avg_recent_order,
COUNT(ro.customer_id) * 100.0 / COUNT(*) as retention_rate
FROM customer_cohorts cc
LEFT JOIN first_orders fo ON cc.customer_id = fo.customer_id
LEFT JOIN recent_orders ro ON cc.customer_id = ro.customer_id
GROUP BY cc.cohort_month
)
SELECT
cohort_month,
total_customers,
ROUND(avg_first_order, 2) as avg_first_order,
ROUND(avg_recent_order, 2) as avg_recent_order,
ROUND(retention_rate, 2) as retention_pct
FROM cohort_summary
ORDER BY cohort_month;
The logic is identical to our nested subquery version, but the structure is dramatically more readable. Each CTE represents a clear step in our analytical process, making it easy to debug, modify, or explain to stakeholders.
CTEs can reference earlier CTEs in the same WITH clause, creating dependency chains that mirror your analytical thinking:
-- Advanced customer segmentation with chained CTEs
WITH order_metrics AS (
SELECT
customer_id,
COUNT(*) as total_orders,
SUM(total_amount) as lifetime_value,
AVG(total_amount) as avg_order_value,
MIN(order_date) as first_order_date,
MAX(order_date) as last_order_date
FROM orders
GROUP BY customer_id
),
customer_scores AS (
SELECT
customer_id,
total_orders,
lifetime_value,
avg_order_value,
EXTRACT(days FROM CURRENT_DATE - last_order_date) as days_since_last_order,
CASE
WHEN total_orders >= 10 AND lifetime_value >= 1000 THEN 'VIP'
WHEN total_orders >= 5 OR lifetime_value >= 500 THEN 'Premium'
WHEN EXTRACT(days FROM CURRENT_DATE - last_order_date) <= 30 THEN 'Active'
WHEN EXTRACT(days FROM CURRENT_DATE - last_order_date) <= 90 THEN 'At_Risk'
ELSE 'Inactive'
END as customer_segment
FROM order_metrics
),
segment_analysis AS (
SELECT
customer_segment,
COUNT(*) as customer_count,
AVG(lifetime_value) as avg_lifetime_value,
AVG(avg_order_value) as avg_order_value,
AVG(days_since_last_order) as avg_days_since_last_order
FROM customer_scores
GROUP BY customer_segment
),
total_customers AS (
SELECT COUNT(*) as total FROM customer_scores
)
SELECT
sa.customer_segment,
sa.customer_count,
ROUND(sa.customer_count * 100.0 / tc.total, 2) as pct_of_customers,
ROUND(sa.avg_lifetime_value, 2) as avg_lifetime_value,
ROUND(sa.avg_order_value, 2) as avg_order_value,
ROUND(sa.avg_days_since_last_order, 0) as avg_days_since_last_order
FROM segment_analysis sa
CROSS JOIN total_customers tc
ORDER BY sa.avg_lifetime_value DESC;
Each CTE builds on previous ones, creating a clear pipeline from raw order data to business intelligence insights. This approach makes complex analytics maintainable and allows for easy modification of individual steps without rewriting the entire query.
Recursive CTEs handle hierarchical data, graph traversal, and iterative calculations. They consist of two parts: an anchor query (the starting point) and a recursive query that references the CTE itself.
Let's add organizational hierarchy to our data model:
CREATE TABLE employees (
employee_id INT PRIMARY KEY,
name VARCHAR(100),
manager_id INT,
department VARCHAR(50),
salary DECIMAL(10,2)
);
A recursive CTE can traverse this hierarchy:
-- Find all employees in a management chain
WITH management_hierarchy AS (
-- Anchor: start with CEO (no manager)
SELECT
employee_id,
name,
manager_id,
department,
salary,
0 as level,
CAST(name AS VARCHAR(500)) as hierarchy_path
FROM employees
WHERE manager_id IS NULL
UNION ALL
-- Recursive: find direct reports
SELECT
e.employee_id,
e.name,
e.manager_id,
e.department,
e.salary,
mh.level + 1,
CAST(mh.hierarchy_path + ' -> ' + e.name AS VARCHAR(500))
FROM employees e
JOIN management_hierarchy mh ON e.manager_id = mh.employee_id
)
SELECT
employee_id,
name,
department,
salary,
level,
hierarchy_path
FROM management_hierarchy
ORDER BY level, department, name;
The anchor query finds top-level employees (those without managers), while the recursive portion finds their direct reports, then their reports' reports, continuing until no more levels exist.
Recursive CTEs can solve sophisticated problems like finding connected components in graphs or calculating running totals with complex business rules:
-- Calculate departmental budget rollups with approval chains
WITH budget_rollup AS (
-- Anchor: leaf departments (no subordinate departments)
SELECT
department,
manager_id,
SUM(salary) as department_budget,
COUNT(*) as employee_count,
0 as rollup_level,
department as budget_path
FROM employees
WHERE department NOT IN (
SELECT DISTINCT e2.department
FROM employees e1
JOIN employees e2 ON e1.employee_id = e2.manager_id
WHERE e1.department != e2.department
)
GROUP BY department, manager_id
UNION ALL
-- Recursive: roll up to parent departments
SELECT
e.department,
e.manager_id,
br.department_budget + COALESCE(parent_budget.direct_budget, 0),
br.employee_count + COALESCE(parent_budget.direct_count, 0),
br.rollup_level + 1,
e.department + ' <- ' + br.budget_path
FROM budget_rollup br
JOIN employees e ON br.manager_id = e.employee_id
LEFT JOIN (
SELECT
department,
SUM(salary) as direct_budget,
COUNT(*) as direct_count
FROM employees
GROUP BY department
) parent_budget ON e.department = parent_budget.department
)
SELECT
department,
department_budget,
employee_count,
rollup_level,
budget_path
FROM budget_rollup
WHERE rollup_level = (SELECT MAX(rollup_level) FROM budget_rollup)
ORDER BY department_budget DESC;
Recursion Limits: Most databases impose recursion limits (typically 100 levels) to prevent infinite loops. Use
OPTION (MAXRECURSION n)in SQL Server or equivalent settings in other databases for deeper hierarchies.
Understanding how databases execute subqueries and CTEs is crucial for production performance. The query optimizer has several strategies, and knowing when to expect each helps you write efficient code.
Modern optimizers typically transform subqueries using these strategies:
Let's examine these with execution plan analysis:
-- Potentially inefficient correlated subquery
SELECT c.name, c.region
FROM customers c
WHERE c.customer_id IN (
SELECT o.customer_id
FROM orders o
WHERE o.total_amount > (
SELECT AVG(total_amount) * 1.5
FROM orders o2
WHERE o2.customer_id = o.customer_id
)
);
This query has a correlated subquery nested within another subquery. The execution plan reveals whether the optimizer successfully transforms this into efficient joins or if it's executing the inner query repeatedly.
A more efficient approach uses window functions:
-- Optimized version using window functions and CTEs
WITH customer_order_stats AS (
SELECT
customer_id,
total_amount,
AVG(total_amount) OVER (PARTITION BY customer_id) as customer_avg
FROM orders
),
high_value_customers AS (
SELECT DISTINCT customer_id
FROM customer_order_stats
WHERE total_amount > customer_avg * 1.5
)
SELECT c.name, c.region
FROM customers c
JOIN high_value_customers hvc ON c.customer_id = hvc.customer_id;
CTEs in most databases are materialized once and reused, but some optimizers inline simple CTEs. Understanding your database's behavior helps predict performance:
-- CTE that benefits from materialization
WITH expensive_calculation AS (
SELECT
customer_id,
SUM(oi.quantity * oi.unit_price) as total_product_revenue,
COUNT(DISTINCT o.order_date) as shopping_days,
AVG(p.cost) as avg_product_cost
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
GROUP BY customer_id
)
SELECT
'High Value' as segment,
COUNT(*) as customer_count,
AVG(total_product_revenue) as avg_revenue
FROM expensive_calculation
WHERE total_product_revenue > 1000
UNION ALL
SELECT
'Frequent Shoppers' as segment,
COUNT(*) as customer_count,
AVG(total_product_revenue) as avg_revenue
FROM expensive_calculation
WHERE shopping_days > 10;
The CTE is referenced twice in the UNION, so materialization prevents duplicate calculation of the expensive aggregation.
Effective indexing dramatically improves subquery performance:
-- Indexes that support our common subquery patterns
CREATE INDEX idx_orders_customer_date ON orders (customer_id, order_date);
CREATE INDEX idx_orders_amount_date ON orders (total_amount, order_date);
CREATE INDEX idx_order_items_covering ON order_items (order_id, product_id)
INCLUDE (quantity, unit_price);
CREATE INDEX idx_products_category ON products (category) INCLUDE (cost);
These indexes support the filtering, joining, and aggregation patterns we've used throughout our examples.
CTEs excel at data shape transformations that are awkward with traditional SQL:
-- Dynamic pivot of product sales by month
WITH monthly_sales AS (
SELECT
p.category,
DATE_TRUNC('month', o.order_date) as sale_month,
SUM(oi.quantity * oi.unit_price) as monthly_revenue
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
WHERE o.order_date >= CURRENT_DATE - INTERVAL '12 months'
GROUP BY p.category, DATE_TRUNC('month', o.order_date)
),
pivot_data AS (
SELECT
category,
SUM(CASE WHEN sale_month = DATE_TRUNC('month', CURRENT_DATE - INTERVAL '11 months')
THEN monthly_revenue END) as month_01,
SUM(CASE WHEN sale_month = DATE_TRUNC('month', CURRENT_DATE - INTERVAL '10 months')
THEN monthly_revenue END) as month_02,
SUM(CASE WHEN sale_month = DATE_TRUNC('month', CURRENT_DATE - INTERVAL '9 months')
THEN monthly_revenue END) as month_03,
-- ... continue for all months
SUM(CASE WHEN sale_month = DATE_TRUNC('month', CURRENT_DATE)
THEN monthly_revenue END) as month_12
FROM monthly_sales
GROUP BY category
)
SELECT
category,
COALESCE(month_01, 0) as month_01,
COALESCE(month_02, 0) as month_02,
COALESCE(month_03, 0) as month_03,
-- Calculate growth rates
CASE
WHEN month_01 > 0 THEN ROUND((month_02 - month_01) * 100.0 / month_01, 2)
ELSE NULL
END as m2_growth_pct
FROM pivot_data
ORDER BY category;
Combining CTEs with window functions creates powerful analytical patterns:
-- Customer lifecycle analysis with advanced windowing
WITH customer_orders AS (
SELECT
c.customer_id,
c.name,
c.signup_date,
o.order_date,
o.total_amount,
ROW_NUMBER() OVER (PARTITION BY c.customer_id ORDER BY o.order_date) as order_sequence,
LAG(o.order_date) OVER (PARTITION BY c.customer_id ORDER BY o.order_date) as prev_order_date,
SUM(o.total_amount) OVER (PARTITION BY c.customer_id ORDER BY o.order_date
ROWS UNBOUNDED PRECEDING) as cumulative_value
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
),
order_intervals AS (
SELECT
*,
CASE
WHEN prev_order_date IS NOT NULL
THEN EXTRACT(days FROM order_date - prev_order_date)
ELSE EXTRACT(days FROM order_date - signup_date)
END as days_between_orders,
CASE
WHEN order_sequence = 1 THEN 'First Order'
WHEN order_sequence = 2 THEN 'Second Order'
WHEN order_sequence <= 5 THEN 'Early Orders'
ELSE 'Repeat Customer'
END as order_stage
FROM customer_orders
),
customer_patterns AS (
SELECT
customer_id,
name,
COUNT(*) as total_orders,
AVG(days_between_orders) as avg_days_between_orders,
STDDEV(days_between_orders) as order_frequency_volatility,
MAX(cumulative_value) as lifetime_value,
CASE
WHEN AVG(days_between_orders) <= 30 THEN 'Frequent'
WHEN AVG(days_between_orders) <= 90 THEN 'Regular'
WHEN AVG(days_between_orders) <= 180 THEN 'Occasional'
ELSE 'Infrequent'
END as purchase_frequency_segment
FROM order_intervals
GROUP BY customer_id, name
)
SELECT
purchase_frequency_segment,
COUNT(*) as customer_count,
ROUND(AVG(total_orders), 1) as avg_orders_per_customer,
ROUND(AVG(avg_days_between_orders), 1) as avg_days_between_orders,
ROUND(AVG(lifetime_value), 2) as avg_lifetime_value,
ROUND(AVG(order_frequency_volatility), 1) as avg_frequency_volatility
FROM customer_patterns
GROUP BY purchase_frequency_segment
ORDER BY avg_lifetime_value DESC;
This analysis reveals customer purchasing patterns through sophisticated windowing that would be nearly impossible without CTEs to organize the logic.
Let's apply these concepts to a realistic business scenario. You're analyzing subscription data for a SaaS company and need to create a comprehensive churn analysis report.
-- Setup sample subscription data
CREATE TABLE subscriptions (
subscription_id INT PRIMARY KEY,
customer_id INT,
plan_type VARCHAR(50),
start_date DATE,
end_date DATE,
monthly_fee DECIMAL(8,2)
);
CREATE TABLE usage_events (
event_id INT PRIMARY KEY,
customer_id INT,
event_date DATE,
feature_used VARCHAR(100),
usage_count INT
);
-- Sample data
INSERT INTO subscriptions VALUES
(1, 101, 'Basic', '2023-01-15', '2023-06-15', 29.99),
(2, 102, 'Premium', '2023-02-01', NULL, 79.99),
(3, 103, 'Basic', '2023-03-10', '2023-05-10', 29.99),
(4, 104, 'Enterprise', '2023-01-01', NULL, 199.99);
Your task: Create a query using both subqueries and CTEs that analyzes:
Try to solve this before looking at the solution:
-- Solution: Comprehensive churn analysis
WITH customer_cohorts AS (
SELECT
customer_id,
plan_type,
monthly_fee,
start_date,
end_date,
DATE_TRUNC('month', start_date) as signup_cohort,
CASE
WHEN end_date IS NULL THEN 'Active'
ELSE 'Churned'
END as status,
CASE
WHEN end_date IS NOT NULL
THEN EXTRACT(days FROM end_date - start_date)
ELSE EXTRACT(days FROM CURRENT_DATE - start_date)
END as tenure_days
FROM subscriptions
),
usage_summary AS (
SELECT
ue.customer_id,
COUNT(*) as total_events,
COUNT(DISTINCT ue.feature_used) as features_used,
MAX(ue.event_date) as last_usage_date,
AVG(ue.usage_count) as avg_usage_intensity
FROM usage_events ue
GROUP BY ue.customer_id
),
pre_churn_usage AS (
SELECT
ue.customer_id,
COUNT(*) as pre_churn_events,
AVG(ue.usage_count) as pre_churn_intensity
FROM usage_events ue
JOIN customer_cohorts cc ON ue.customer_id = cc.customer_id
WHERE cc.status = 'Churned'
AND ue.event_date BETWEEN cc.end_date - INTERVAL '30 days' AND cc.end_date
GROUP BY ue.customer_id
),
churn_analysis AS (
SELECT
cc.signup_cohort,
cc.plan_type,
COUNT(*) as total_customers,
COUNT(CASE WHEN cc.status = 'Churned' THEN 1 END) as churned_customers,
AVG(cc.tenure_days) as avg_tenure_days,
SUM(CASE WHEN cc.status = 'Churned'
THEN cc.monthly_fee * (cc.tenure_days / 30.0)
END) as lost_revenue,
AVG(CASE WHEN cc.status = 'Active' THEN us.total_events END) as avg_active_usage,
AVG(pcu.pre_churn_events) as avg_pre_churn_usage
FROM customer_cohorts cc
LEFT JOIN usage_summary us ON cc.customer_id = us.customer_id
LEFT JOIN pre_churn_usage pcu ON cc.customer_id = pcu.customer_id
GROUP BY cc.signup_cohort, cc.plan_type
),
at_risk_customers AS (
SELECT
cc.customer_id,
cc.plan_type,
cc.monthly_fee,
us.last_usage_date,
us.total_events,
EXTRACT(days FROM CURRENT_DATE - us.last_usage_date) as days_since_usage,
-- Compare to average usage of churned customers
CASE
WHEN us.total_events < (
SELECT AVG(pre_churn_events)
FROM pre_churn_usage
) AND EXTRACT(days FROM CURRENT_DATE - us.last_usage_date) > 14
THEN 'High Risk'
WHEN EXTRACT(days FROM CURRENT_DATE - us.last_usage_date) > 7
THEN 'Medium Risk'
ELSE 'Low Risk'
END as risk_level
FROM customer_cohorts cc
JOIN usage_summary us ON cc.customer_id = us.customer_id
WHERE cc.status = 'Active'
)
-- Final report combining all analyses
SELECT
'Churn by Cohort' as analysis_type,
signup_cohort::TEXT as dimension,
plan_type as subdimension,
ROUND(churned_customers * 100.0 / total_customers, 2) as churn_rate,
ROUND(lost_revenue, 2) as impact_value
FROM churn_analysis
UNION ALL
SELECT
'At Risk Customers' as analysis_type,
risk_level as dimension,
plan_type as subdimension,
COUNT(*)::DECIMAL as churn_rate,
ROUND(SUM(monthly_fee), 2) as impact_value
FROM at_risk_customers
GROUP BY risk_level, plan_type
ORDER BY analysis_type, impact_value DESC;
This exercise demonstrates how CTEs can organize complex analytical logic while subqueries handle specific comparative calculations.
Mistake #1: Overusing Correlated Subqueries
-- Inefficient: Multiple correlated subqueries
SELECT
c.name,
(SELECT COUNT(*) FROM orders o WHERE o.customer_id = c.customer_id) as order_count,
(SELECT AVG(total_amount) FROM orders o WHERE o.customer_id = c.customer_id) as avg_order,
(SELECT MAX(order_date) FROM orders o WHERE o.customer_id = c.customer_id) as last_order
FROM customers c;
Better: Single aggregation with window functions
-- Efficient: One join with multiple aggregations
WITH customer_metrics AS (
SELECT
customer_id,
COUNT(*) as order_count,
AVG(total_amount) as avg_order,
MAX(order_date) as last_order
FROM orders
GROUP BY customer_id
)
SELECT
c.name,
cm.order_count,
cm.avg_order,
cm.last_order
FROM customers c
LEFT JOIN customer_metrics cm ON c.customer_id = cm.customer_id;
Mistake #2: CTE Scope Confusion
-- Wrong: Trying to reference CTE outside its scope
WITH monthly_sales AS (
SELECT DATE_TRUNC('month', order_date) as month, SUM(total_amount) as revenue
FROM orders
GROUP BY DATE_TRUNC('month', order_date)
)
SELECT * FROM monthly_sales;
-- This would fail - CTEs don't persist across statements
SELECT month FROM monthly_sales WHERE revenue > 1000;
Mistake #3: Recursive CTE Infinite Loops
-- Dangerous: No termination condition
WITH hierarchy AS (
SELECT employee_id, manager_id, 0 as level
FROM employees
WHERE employee_id = 1
UNION ALL
SELECT e.employee_id, e.manager_id, h.level + 1
FROM employees e
JOIN hierarchy h ON e.manager_id = h.employee_id
-- Missing: WHERE h.level < some_limit
)
SELECT * FROM hierarchy;
When subqueries or CTEs aren't returning expected results:
-- Test individual CTEs by selecting from them directly
WITH customer_metrics AS (
SELECT customer_id, COUNT(*) as order_count
FROM orders
GROUP BY customer_id
)
SELECT * FROM customer_metrics WHERE customer_id = 101;
WITH debug_customers AS (
SELECT
customer_id,
name,
COUNT(*) as order_count,
'Debug: ' || COUNT(*) as debug_info
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY customer_id, name
)
SELECT * FROM debug_customers;
-- Most databases support EXPLAIN or similar
EXPLAIN (ANALYZE, BUFFERS)
WITH complex_cte AS (...)
SELECT * FROM complex_cte;
Subqueries can introduce subtle NULL handling issues:
-- Potential issue: NULL comparisons in subqueries
SELECT customer_id
FROM customers c
WHERE customer_id NOT IN (
SELECT customer_id FROM orders -- If any customer_id is NULL, returns no rows!
);
-- Safe version: Handle NULLs explicitly
SELECT customer_id
FROM customers c
WHERE customer_id NOT IN (
SELECT customer_id FROM orders WHERE customer_id IS NOT NULL
);
-- Even better: Use EXISTS
SELECT customer_id
FROM customers c
WHERE NOT EXISTS (
SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id
);
You've now mastered the sophisticated art of subqueries and CTEs, transforming from someone who writes simple SELECT statements into a practitioner who can tackle complex analytical challenges with readable, maintainable SQL. The techniques we've covered—from basic scalar subqueries to recursive CTEs with advanced windowing—represent the backbone of production data analysis.
The key insights to remember:
Your next steps should focus on applying these patterns to real production scenarios in your environment. Start by identifying complex queries in your current work that could benefit from CTE refactoring. Practice writing recursive CTEs for any hierarchical data you encounter. Most importantly, develop the habit of examining execution plans for your more complex queries.
Advanced topics to explore next include materialized views (which can cache CTE-like logic), stored procedures that use these patterns, and database-specific extensions like SQL Server's MERGE statements or PostgreSQL's lateral joins. The foundation you've built here will make those advanced topics much more accessible.
Learning Path: SQL Fundamentals