Picture this: You're analyzing customer behavior for an e-commerce platform, and your stakeholder asks a seemingly simple question: "Which customers made purchases above our average order value, and what was the total revenue from those high-value customers by region?" Your first instinct might be to write multiple queries, export data to Excel, and manually combine the results. But seasoned SQL practitioners know there's a more elegant path—one that leverages the full power of subqueries and Common Table Expressions (CTEs) to solve complex analytical problems in a single, readable query.
Subqueries and CTEs represent the bridge between basic SQL operations and advanced analytical thinking. They allow you to break down complex business logic into manageable, composable pieces while maintaining the performance benefits of database-native processing. More importantly, they enable you to express sophisticated data transformations that mirror how you naturally think about business problems—in layers of logic that build upon each other.
What you'll learn:
- How scalar, correlated, and uncorrelated subqueries behave, and when each fits
- How EXISTS and NOT EXISTS express existence checks safely and efficiently
- How to structure multi-step analyses with Common Table Expressions, including recursive CTEs for hierarchical data
- How subqueries, CTEs, and temporary tables compare in performance, and how to choose between them
This lesson assumes you're comfortable with intermediate SQL concepts including JOINs, GROUP BY operations, window functions, and aggregate functions. You should understand query execution order and have experience writing multi-table queries. Familiarity with query optimization concepts will help you grasp the performance implications we'll discuss.
A subquery is simply a query nested inside another query, but this simple definition belies their power and complexity. Let's start with a realistic scenario using an e-commerce database with customers, orders, and products tables.
-- Sample data structure we'll work with
CREATE TABLE customers (
customer_id INT PRIMARY KEY,
customer_name VARCHAR(100),
region VARCHAR(50),
registration_date DATE,
customer_tier VARCHAR(20)
);
CREATE TABLE orders (
order_id INT PRIMARY KEY,
customer_id INT,
order_date DATE,
order_total DECIMAL(10,2),
order_status VARCHAR(20),
shipping_method VARCHAR(30)
);
CREATE TABLE order_items (
order_item_id INT PRIMARY KEY,
order_id INT,
product_id INT,
quantity INT,
unit_price DECIMAL(10,2)
);
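Later examples also join a products table that isn't defined above; a minimal definition consistent with the columns those queries reference (p.category, p.brand) might look like this:
CREATE TABLE products (
    product_id INT PRIMARY KEY,
    product_name VARCHAR(100),
    category VARCHAR(50), -- e.g. 'Premium', 'Electronics', 'Clothing'
    brand VARCHAR(50)
);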
The most straightforward subquery returns a single value that you can use anywhere you'd use a literal value. Here's how we might identify customers whose total spending exceeds the average order value:
SELECT
customer_name,
region,
(SELECT SUM(order_total)
FROM orders o
WHERE o.customer_id = c.customer_id) as total_spent
FROM customers c
WHERE (SELECT SUM(order_total)
FROM orders o
WHERE o.customer_id = c.customer_id) >
(SELECT AVG(order_total) FROM orders);
Notice how we're using the same subquery twice—once in the SELECT clause and once in the WHERE clause. This redundancy is both inefficient and hard to maintain, which foreshadows why we'll eventually prefer CTEs for complex logic.
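As a preview of where we're headed, the same logic reads more cleanly when the per-customer total is computed once in a CTE—a sketch using the schema above:
-- The repeated subquery factored into a single named step
WITH customer_spend AS (
    SELECT customer_id, SUM(order_total) AS total_spent
    FROM orders
    GROUP BY customer_id
)
SELECT c.customer_name, c.region, cs.total_spent
FROM customers c
JOIN customer_spend cs ON cs.customer_id = c.customer_id
WHERE cs.total_spent > (SELECT AVG(order_total) FROM orders);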
The example above demonstrates a correlated subquery—the inner query references columns from the outer query (c.customer_id). This creates a dependency that affects both performance and execution strategy. The database must execute the subquery once for each row in the outer query.
Compare this to an uncorrelated subquery:
-- Find customers in regions with above-average customer counts
SELECT customer_name, region
FROM customers
WHERE region IN (
SELECT region
FROM customers
GROUP BY region
HAVING COUNT(*) > (SELECT COUNT(*)/COUNT(DISTINCT region) FROM customers)
);
Here, the inner subquery can be executed once and cached, making it significantly more efficient for large datasets.
When you need to test whether related data exists without actually retrieving it, EXISTS clauses provide elegant and performant solutions:
-- Find customers who have made orders but never purchased premium products
SELECT DISTINCT c.customer_name, c.region
FROM customers c
WHERE EXISTS (
SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id
)
AND NOT EXISTS (
SELECT 1
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
WHERE o.customer_id = c.customer_id
AND p.category = 'Premium'
);
The SELECT 1 is a SQL idiom—EXISTS only tests whether rows exist and never reads their contents, so any constant works; SELECT 1 simply signals that intent to readers.
Performance Tip: EXISTS often outperforms IN on large datasets because it can short-circuit—stop searching as soon as it finds a matching row. Equally important, NOT EXISTS avoids the NULL trap that makes NOT IN return no rows whenever the subquery yields a NULL. Many modern optimizers rewrite both forms into the same semi-join plan, so verify with your execution plans rather than assuming.
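To see why NOT IN in particular deserves caution, here's a minimal sketch against the sample schema (assuming orders.customer_id can be NULL, e.g. for guest checkouts):
-- Returns ZERO rows if any orders.customer_id is NULL:
-- 'x NOT IN (1, 2, NULL)' evaluates to UNKNOWN, never TRUE
SELECT customer_name
FROM customers c
WHERE c.customer_id NOT IN (SELECT customer_id FROM orders);

-- NOT EXISTS treats NULLs sanely and can short-circuit
SELECT customer_name
FROM customers c
WHERE NOT EXISTS (
    SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id
);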
CTEs transform how we think about complex queries by allowing us to name and reuse intermediate results. They're like creating temporary views that exist only for the duration of a single query.
WITH average_order_value AS (
SELECT AVG(order_total) as avg_total
FROM orders
WHERE order_status = 'completed'
),
customer_totals AS (
SELECT
c.customer_id,
c.customer_name,
c.region,
SUM(o.order_total) as total_spent,
COUNT(o.order_id) as order_count
FROM customers c
LEFT JOIN orders o
    ON c.customer_id = o.customer_id
    AND o.order_status = 'completed' -- filter in ON, not WHERE, or the LEFT JOIN silently becomes an INNER JOIN
GROUP BY c.customer_id, c.customer_name, c.region
)
SELECT
ct.customer_name,
ct.region,
ct.total_spent,
ct.order_count,
CASE
WHEN ct.total_spent > av.avg_total * ct.order_count THEN 'High Value'
WHEN ct.total_spent > av.avg_total * ct.order_count * 0.7 THEN 'Medium Value'
ELSE 'Low Value'
END as customer_segment
FROM customer_totals ct
CROSS JOIN average_order_value av
WHERE ct.order_count >= 2
ORDER BY ct.total_spent DESC;
This query demonstrates the CTE's power to break complex logic into named, reusable components. Each CTE acts like a stepping stone, building toward the final result in a way that mirrors human problem-solving.
Real analytical work often requires multiple transformation steps. CTEs excel at creating readable data pipelines:
WITH monthly_sales AS (
-- Step 1: Aggregate sales by customer and month
SELECT
c.customer_id,
c.region,
DATE_TRUNC('month', o.order_date) as order_month,
SUM(o.order_total) as monthly_total,
COUNT(o.order_id) as monthly_orders
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE o.order_status = 'completed'
GROUP BY c.customer_id, c.region, DATE_TRUNC('month', o.order_date)
),
customer_trends AS (
-- Step 2: Calculate month-over-month growth
SELECT
customer_id,
region,
order_month,
monthly_total,
monthly_orders,
LAG(monthly_total) OVER (
PARTITION BY customer_id
ORDER BY order_month
) as previous_month_total,
LAG(monthly_orders) OVER (
PARTITION BY customer_id
ORDER BY order_month
) as previous_month_orders
FROM monthly_sales
),
growth_metrics AS (
-- Step 3: Calculate growth percentages and categorize trends
SELECT
customer_id,
region,
order_month,
monthly_total,
monthly_orders,
CASE
WHEN previous_month_total IS NULL THEN NULL
WHEN previous_month_total = 0 THEN 999.99 -- Handle division by zero
ELSE ROUND(
((monthly_total - previous_month_total) / previous_month_total) * 100, 2
)
END as revenue_growth_pct,
CASE
WHEN previous_month_orders IS NULL THEN NULL
WHEN previous_month_orders = 0 THEN 999.99
ELSE ROUND(
((monthly_orders::DECIMAL - previous_month_orders) / previous_month_orders) * 100, 2
)
END as order_growth_pct
FROM customer_trends
)
-- Step 4: Final analysis with trend categorization
SELECT
region,
COUNT(*) as total_customer_months,
AVG(revenue_growth_pct) as avg_revenue_growth,
AVG(order_growth_pct) as avg_order_growth,
COUNT(CASE WHEN revenue_growth_pct > 20 THEN 1 END) as high_growth_months,
COUNT(CASE WHEN revenue_growth_pct < -10 THEN 1 END) as declining_months
FROM growth_metrics
WHERE revenue_growth_pct IS NOT NULL
GROUP BY region
ORDER BY avg_revenue_growth DESC;
This pipeline approach makes complex analytics maintainable and debuggable. You can test each CTE independently and build confidence in your logic step by step.
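For example, while building the pipeline above you might temporarily point the final SELECT at an intermediate stage:
-- Probe the first stage before trusting downstream logic
WITH monthly_sales AS (
    ... -- same definition as in the pipeline above
)
SELECT *
FROM monthly_sales
ORDER BY customer_id, order_month
LIMIT 10;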
Recursive CTEs solve problems that would otherwise require application-level loops or specialized graph databases. They're essential for working with organizational hierarchies, bill of materials, or any tree-like data structure.
-- Sample hierarchical data: employee reporting structure
CREATE TABLE employees (
employee_id INT PRIMARY KEY,
employee_name VARCHAR(100),
manager_id INT,
department VARCHAR(50),
salary DECIMAL(10,2),
hire_date DATE
);
-- The recursive CTE has two parts: anchor and recursive
WITH RECURSIVE org_hierarchy AS (
-- Anchor: Start with top-level managers (no manager_id)
SELECT
employee_id,
employee_name,
manager_id,
department,
salary,
0 as level,
CAST(employee_name AS VARCHAR(1000)) as hierarchy_path,
ARRAY[employee_id] as path_array
FROM employees
WHERE manager_id IS NULL
UNION ALL
-- Recursive: Find direct reports of current level
SELECT
e.employee_id,
e.employee_name,
e.manager_id,
e.department,
e.salary,
oh.level + 1,
oh.hierarchy_path || ' -> ' || e.employee_name,
oh.path_array || e.employee_id
FROM employees e
JOIN org_hierarchy oh ON e.manager_id = oh.employee_id
WHERE NOT (e.employee_id = ANY(oh.path_array)) -- Prevent infinite loops
)
SELECT
level,
REPEAT(' ', level) || employee_name as indented_name,
department,
salary,
hierarchy_path
FROM org_hierarchy
ORDER BY hierarchy_path;
The path_array and cycle detection (NOT (e.employee_id = ANY(oh.path_array))) prevent infinite loops if your data has circular references—a critical consideration for production systems.
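If the anchor/recursive split still feels abstract, this minimal counter shows the mechanics in isolation—each pass feeds on the rows produced by the previous one until the recursive member returns nothing:
WITH RECURSIVE counter(i) AS (
    SELECT 1          -- anchor: the starting row
    UNION ALL
    SELECT i + 1      -- recursive: derived from the previous row
    FROM counter
    WHERE i < 5       -- termination condition
)
SELECT i FROM counter; -- returns 1 through 5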
Here's a second example that walks the same hierarchy—with the same cycle guard—and then aggregates headcount and salary by department and level:
WITH RECURSIVE team_analysis AS (
    -- Anchor: top-level managers
    SELECT
        employee_id,
        employee_name,
        department,
        salary,
        0 as level,
        ARRAY[employee_id] as path_array
    FROM employees
    WHERE manager_id IS NULL
    UNION ALL
    -- Recursive: walk down to each level of direct reports
    SELECT
        e.employee_id,
        e.employee_name,
        e.department,
        e.salary,
        ta.level + 1,
        ta.path_array || e.employee_id
    FROM employees e
    JOIN team_analysis ta ON e.manager_id = ta.employee_id
    WHERE NOT (e.employee_id = ANY(ta.path_array)) -- Same cycle guard as above
),
aggregated_teams AS (
SELECT
department,
level,
COUNT(*) as employees_at_level,
SUM(salary) as salary_at_level,
AVG(salary) as avg_salary_at_level
FROM team_analysis
GROUP BY department, level
)
SELECT
department,
SUM(employees_at_level) as total_employees,
SUM(salary_at_level) as total_department_salary,
MAX(level) as max_hierarchy_depth,
ROUND(AVG(avg_salary_at_level), 2) as overall_avg_salary
FROM aggregated_teams
GROUP BY department
ORDER BY total_department_salary DESC;
CTEs shine when combined with window functions for sophisticated analytical queries:
WITH daily_sales AS (
SELECT
o.order_date,
c.region,
SUM(o.order_total) as daily_revenue,
COUNT(DISTINCT o.customer_id) as daily_customers,
COUNT(o.order_id) as daily_orders
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE o.order_status = 'completed'
GROUP BY o.order_date, c.region
),
sales_with_trends AS (
SELECT
order_date,
region,
daily_revenue,
daily_customers,
daily_orders,
-- Rolling 7-day averages
AVG(daily_revenue) OVER (
PARTITION BY region
ORDER BY order_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) as revenue_7day_avg,
AVG(daily_customers::DECIMAL) OVER (
PARTITION BY region
ORDER BY order_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) as customers_7day_avg,
-- Month-to-date comparisons
SUM(daily_revenue) OVER (
PARTITION BY region, DATE_TRUNC('month', order_date)
ORDER BY order_date
) as revenue_mtd,
-- Previous period comparisons
LAG(daily_revenue, 7) OVER (
PARTITION BY region
ORDER BY order_date
) as revenue_same_day_last_week
FROM daily_sales
),
anomaly_detection AS (
SELECT
*,
CASE
    WHEN revenue_same_day_last_week IS NOT NULL
     AND revenue_same_day_last_week <> 0 -- guard against division by zero
    THEN ROUND(
        ((daily_revenue - revenue_same_day_last_week) / revenue_same_day_last_week) * 100, 2
    )
END as wow_growth_pct,
CASE
WHEN daily_revenue > revenue_7day_avg * 1.5 THEN 'High'
WHEN daily_revenue < revenue_7day_avg * 0.7 THEN 'Low'
ELSE 'Normal'
END as revenue_anomaly_flag
FROM sales_with_trends
)
SELECT
region,
COUNT(*) as total_days,
COUNT(CASE WHEN revenue_anomaly_flag = 'High' THEN 1 END) as high_days,
COUNT(CASE WHEN revenue_anomaly_flag = 'Low' THEN 1 END) as low_days,
AVG(wow_growth_pct) as avg_weekly_growth,
STDDEV(daily_revenue) as revenue_volatility
FROM anomaly_detection
WHERE order_date >= CURRENT_DATE - INTERVAL '90 days'
GROUP BY region
ORDER BY avg_weekly_growth DESC;
CTEs excel at multi-step data validation workflows:
WITH data_quality_checks AS (
-- Step 1: Identify potential data quality issues
SELECT
customer_id,
customer_name,
region,
registration_date,
CASE
WHEN customer_name IS NULL OR TRIM(customer_name) = '' THEN 'missing_name'
WHEN region IS NULL OR TRIM(region) = '' THEN 'missing_region'
WHEN registration_date > CURRENT_DATE THEN 'future_registration'
WHEN registration_date < '2000-01-01' THEN 'invalid_date'
ELSE 'valid'
END as quality_status
FROM customers
),
order_validations AS (
-- Step 2: Cross-reference with order data
SELECT
o.customer_id,
COUNT(o.order_id) as order_count,
SUM(o.order_total) as total_spent,
MAX(o.order_date) as last_order_date,
CASE
    WHEN SUM(o.order_total) <= 0 THEN 'invalid_totals'
    WHEN MAX(o.order_date) < CURRENT_DATE - INTERVAL '2 years' THEN 'inactive'
    ELSE 'active'
END as order_status -- customers with no orders never appear in this grouped result; COALESCE below labels them 'no_orders'
FROM orders o
GROUP BY o.customer_id
),
combined_assessment AS (
-- Step 3: Combine assessments
SELECT
dq.customer_id,
dq.customer_name,
dq.region,
dq.quality_status,
COALESCE(ov.order_status, 'no_orders') as order_status,
COALESCE(ov.order_count, 0) as order_count,
COALESCE(ov.total_spent, 0) as total_spent
FROM data_quality_checks dq
LEFT JOIN order_validations ov ON dq.customer_id = ov.customer_id
)
-- Step 4: Generate data quality report
SELECT
quality_status,
order_status,
COUNT(*) as customer_count,
SUM(total_spent) as total_revenue,
AVG(order_count) as avg_orders_per_customer,
ROUND(
(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER ()), 2
) as percentage_of_total
FROM combined_assessment
GROUP BY quality_status, order_status
ORDER BY customer_count DESC;
Understanding the performance implications of subqueries and CTEs is crucial for production systems handling large datasets.
Different database systems handle CTEs differently. PostgreSQL versions before 12 always materialize CTE results (computing and storing them temporarily); PostgreSQL 12+ inlines single-reference CTEs like subqueries unless you add the MATERIALIZED keyword. SQL Server and MySQL 8 generally treat CTEs as inline views that the optimizer can merge into the outer query. Knowing your system's behavior is crucial:
-- PostgreSQL: Use EXPLAIN ANALYZE to see execution plans
EXPLAIN (ANALYZE, BUFFERS)
WITH customer_summary AS (
SELECT
customer_id,
COUNT(*) as order_count,
SUM(order_total) as total_spent
FROM orders
WHERE order_date >= '2023-01-01'
GROUP BY customer_id
)
SELECT c.customer_name, cs.order_count, cs.total_spent
FROM customers c
JOIN customer_summary cs ON c.customer_id = cs.customer_id
WHERE cs.order_count >= 10;
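In PostgreSQL 12 and later you can also override the optimizer's choice explicitly—a quick sketch:
-- Force materialization: compute once, reuse the stored result
WITH customer_summary AS MATERIALIZED (
    SELECT customer_id, SUM(order_total) AS total_spent
    FROM orders
    GROUP BY customer_id
)
SELECT * FROM customer_summary WHERE total_spent > 10000;

-- Force inlining: let the outer predicate push down into the CTE
WITH customer_summary AS NOT MATERIALIZED (
    SELECT customer_id, SUM(order_total) AS total_spent
    FROM orders
    GROUP BY customer_id
)
SELECT * FROM customer_summary WHERE customer_id = 42;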
Use subqueries when:
- The logic is a simple, one-off lookup or filter used in exactly one place
- You need a scalar comparison value, such as a global average
- Inlining gives the optimizer the best chance to push predicates down

Use CTEs when:
- The query involves multiple logical steps that benefit from descriptive names
- The same intermediate result is referenced more than once
- You need recursion, or you want each step to be independently testable

Use temporary tables when:
- An intermediate result is reused across several separate queries
- The intermediate result is large enough to benefit from its own indexes
- You want the planner to gather statistics on the intermediate data
-- Sometimes temporary tables outperform CTEs for large datasets
CREATE TEMPORARY TABLE customer_aggregates AS
SELECT
customer_id,
COUNT(*) as order_count,
SUM(order_total) as total_spent,
AVG(order_total) as avg_order_value
FROM orders
WHERE order_date >= '2022-01-01'
GROUP BY customer_id;
-- Add index for subsequent joins
CREATE INDEX idx_customer_agg_id ON customer_aggregates(customer_id);
-- Now use in multiple queries efficiently
SELECT COUNT(*) FROM customer_aggregates WHERE total_spent > 10000;
SELECT COUNT(*) FROM customer_aggregates WHERE order_count > 20;
CTEs can consume significant memory if they materialize large result sets. Monitor resource usage and consider alternatives:
-- This CTE might materialize millions of rows
WITH all_order_details AS (
SELECT
o.order_id,
o.customer_id,
oi.product_id,
oi.quantity,
oi.unit_price,
p.category,
p.brand
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
WHERE o.order_date >= '2020-01-01' -- Potentially millions of rows
)
SELECT category, COUNT(*)
FROM all_order_details
GROUP BY category;
-- Better: Push filtering down to reduce intermediate result size
WITH filtered_order_details AS (
SELECT
p.category,
oi.quantity
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
WHERE o.order_date >= '2023-01-01' -- More selective filter
AND p.category IN ('Electronics', 'Clothing') -- Further filtering
)
SELECT category, COUNT(*)
FROM filtered_order_details
GROUP BY category;
CTEs integrate seamlessly with stored procedures for complex business logic:
-- PostgreSQL function using CTEs for customer lifecycle analysis
CREATE OR REPLACE FUNCTION analyze_customer_lifecycle(
p_start_date DATE DEFAULT NULL,
p_end_date DATE DEFAULT NULL
)
RETURNS TABLE (
lifecycle_stage VARCHAR(20),
customer_count INT,
avg_days_in_stage NUMERIC,
total_revenue NUMERIC
) AS $$
BEGIN
RETURN QUERY
WITH customer_metrics AS (
SELECT
c.customer_id,
c.registration_date,
MIN(o.order_date) as first_order_date,
MAX(o.order_date) as last_order_date,
COUNT(o.order_id) as total_orders,
SUM(o.order_total) as total_spent,
CURRENT_DATE - MAX(o.order_date) as days_since_last_order
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
WHERE (p_start_date IS NULL OR c.registration_date >= p_start_date)
AND (p_end_date IS NULL OR c.registration_date <= p_end_date)
GROUP BY c.customer_id, c.registration_date
),
lifecycle_classification AS (
SELECT
*,
CASE
WHEN total_orders = 0 THEN 'prospect'
WHEN total_orders = 1 AND days_since_last_order > 90 THEN 'one_time'
WHEN days_since_last_order <= 30 THEN 'active'
WHEN days_since_last_order <= 90 THEN 'at_risk'
WHEN days_since_last_order <= 365 THEN 'inactive'
ELSE 'churned'
END as lifecycle_stage,
CASE
WHEN first_order_date IS NOT NULL
THEN last_order_date - first_order_date
ELSE CURRENT_DATE - registration_date
END as days_in_lifecycle
FROM customer_metrics
)
SELECT
lc.lifecycle_stage,
COUNT(*)::INT,
AVG(lc.days_in_lifecycle)::NUMERIC(10,2),
SUM(lc.total_spent)::NUMERIC(12,2)
FROM lifecycle_classification lc
GROUP BY lc.lifecycle_stage
ORDER BY
CASE lc.lifecycle_stage
WHEN 'prospect' THEN 1
WHEN 'active' THEN 2
WHEN 'at_risk' THEN 3
WHEN 'one_time' THEN 4
WHEN 'inactive' THEN 5
WHEN 'churned' THEN 6
END;
END;
$$ LANGUAGE plpgsql;
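Calling the function is then an ordinary SELECT (the dates here are illustrative):
-- Analyze customers who registered in 2023
SELECT * FROM analyze_customer_lifecycle('2023-01-01', '2023-12-31');

-- Or rely on the NULL defaults to analyze every registration date
SELECT * FROM analyze_customer_lifecycle();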
CTEs are excellent for complex transformations in ETL processes:
-- Data pipeline stage using CTEs for customer segmentation
WITH raw_customer_data AS (
-- Step 1: Extract raw data with basic cleansing
SELECT
customer_id,
UPPER(TRIM(customer_name)) as customer_name,
LOWER(TRIM(email)) as email,
CASE
WHEN region IN ('', 'NULL', 'null') THEN 'Unknown'
ELSE INITCAP(TRIM(region))
END as region,
registration_date
FROM staging.customer_imports
WHERE customer_id IS NOT NULL
),
order_aggregations AS (
-- Step 2: Calculate customer metrics
SELECT
customer_id,
COUNT(*) as lifetime_orders,
SUM(order_total) as lifetime_value,
AVG(order_total) as avg_order_value,
MAX(order_date) as last_order_date,
(CURRENT_DATE - MAX(order_date)) as days_since_last_order -- date subtraction yields an integer day count in PostgreSQL
FROM orders
WHERE order_status = 'completed'
GROUP BY customer_id
),
rfm_scores AS (
-- Step 3: Calculate RFM scores for segmentation
SELECT
customer_id,
NTILE(5) OVER (ORDER BY days_since_last_order DESC) as recency_score, -- most recent activity (fewest days) gets the highest score
NTILE(5) OVER (ORDER BY lifetime_orders) as frequency_score,
NTILE(5) OVER (ORDER BY lifetime_value) as monetary_score
FROM order_aggregations
),
customer_segments AS (
-- Step 4: Assign business segments
SELECT
customer_id,
recency_score,
frequency_score,
monetary_score,
CASE
WHEN recency_score >= 4 AND frequency_score >= 4 AND monetary_score >= 4
THEN 'Champions'
WHEN recency_score >= 3 AND frequency_score >= 3 AND monetary_score >= 3
THEN 'Loyal Customers'
WHEN recency_score >= 4 AND frequency_score <= 2
THEN 'New Customers'
WHEN recency_score <= 2 AND frequency_score >= 3
THEN 'At Risk'
WHEN recency_score <= 2 AND frequency_score <= 2
THEN 'Lost Customers'
ELSE 'Potential Loyalists'
END as customer_segment
FROM rfm_scores
)
-- Step 5: Final insert into production table
INSERT INTO production.customer_analytics (
customer_id,
customer_name,
email,
region,
customer_segment,
recency_score,
frequency_score,
monetary_score,
last_updated
)
SELECT
rcd.customer_id,
rcd.customer_name,
rcd.email,
rcd.region,
cs.customer_segment,
cs.recency_score,
cs.frequency_score,
cs.monetary_score,
CURRENT_TIMESTAMP
FROM raw_customer_data rcd
JOIN customer_segments cs ON rcd.customer_id = cs.customer_id
ON CONFLICT (customer_id)
DO UPDATE SET
customer_segment = EXCLUDED.customer_segment,
recency_score = EXCLUDED.recency_score,
frequency_score = EXCLUDED.frequency_score,
monetary_score = EXCLUDED.monetary_score,
last_updated = EXCLUDED.last_updated;
Let's put together everything you've learned with a comprehensive exercise that simulates real-world analytical challenges.
Scenario: You're the lead data analyst for a SaaS company. The executive team wants a comprehensive customer health analysis that includes:
- Current monthly revenue per customer
- Usage trends (growing, stable, or declining)
- A churn risk score and days since last activity
- Customer lifetime value
- A recommended action for each account
Here's your sample dataset:
-- Set up the exercise database
CREATE TABLE saas_customers (
customer_id INT PRIMARY KEY,
company_name VARCHAR(100),
industry VARCHAR(50),
plan_type VARCHAR(20), -- 'starter', 'professional', 'enterprise'
signup_date DATE,
is_active BOOLEAN DEFAULT true
);
CREATE TABLE usage_events (
event_id SERIAL PRIMARY KEY,
customer_id INT,
event_date DATE,
feature_used VARCHAR(50),
usage_count INT,
session_duration_minutes INT
);
CREATE TABLE subscriptions (
subscription_id SERIAL PRIMARY KEY,
customer_id INT,
start_date DATE,
end_date DATE, -- NULL if active
monthly_revenue DECIMAL(10,2),
plan_type VARCHAR(20)
);
-- Your challenge: Write a comprehensive query using CTEs that produces:
-- A customer health dashboard with the following metrics per customer:
-- - Current monthly revenue
-- - Usage trend (growing, stable, declining)
-- - Churn risk score (1-10 scale)
-- - Days since last activity
-- - Customer lifetime value
-- - Recommended action (retain, upsell, at_risk, churn_likely)
Solution approach:
WITH customer_base AS (
-- Step 1: Current customer status and basic metrics
SELECT DISTINCT
c.customer_id,
c.company_name,
c.industry,
c.signup_date,
c.is_active,
s.monthly_revenue,
s.plan_type,
s.start_date as current_subscription_start
FROM saas_customers c
LEFT JOIN subscriptions s ON c.customer_id = s.customer_id AND s.end_date IS NULL
),
usage_trends AS (
-- Step 2: Calculate usage patterns and trends
SELECT
customer_id,
COUNT(*) as total_events_90d,
AVG(usage_count) as avg_usage_per_session,
AVG(session_duration_minutes) as avg_session_duration,
MAX(event_date) as last_activity_date,
(CURRENT_DATE - MAX(event_date)) as days_since_last_activity, -- date subtraction yields integer days in PostgreSQL
-- Approximate the usage trend with a linear regression slope
-- (the slope is per second of epoch time, so treat these thresholds as illustrative)
CASE
WHEN COUNT(*) >= 8 THEN
CASE
WHEN REGR_SLOPE(usage_count, EXTRACT(EPOCH FROM event_date)) > 0 THEN 'growing'
WHEN REGR_SLOPE(usage_count, EXTRACT(EPOCH FROM event_date)) < -0.1 THEN 'declining'
ELSE 'stable'
END
ELSE 'insufficient_data'
END as usage_trend
FROM usage_events
WHERE event_date >= CURRENT_DATE - INTERVAL '90 days'
GROUP BY customer_id
),
churn_risk_factors AS (
-- Step 3: Calculate churn risk components
SELECT
cb.customer_id,
cb.monthly_revenue,
COALESCE(ut.days_since_last_activity, 999) as days_since_last_activity,
COALESCE(ut.usage_trend, 'no_usage') as usage_trend,
COALESCE(ut.total_events_90d, 0) as usage_events_90d,
(CURRENT_DATE - cb.signup_date) as customer_age_days,
-- Churn risk scoring (1-10 scale)
LEAST(10, GREATEST(1,
CASE
WHEN COALESCE(ut.days_since_last_activity, 999) > 30 THEN 8
WHEN COALESCE(ut.days_since_last_activity, 999) > 14 THEN 6
WHEN COALESCE(ut.days_since_last_activity, 999) > 7 THEN 4
ELSE 2
END +
CASE ut.usage_trend
WHEN 'declining' THEN 3
WHEN 'stable' THEN 1
WHEN 'growing' THEN -1
ELSE 2
END +
CASE
WHEN COALESCE(ut.total_events_90d, 0) = 0 THEN 3
WHEN COALESCE(ut.total_events_90d, 0) < 10 THEN 2
ELSE 0
END
)) as churn_risk_score
FROM customer_base cb
LEFT JOIN usage_trends ut ON cb.customer_id = ut.customer_id
WHERE cb.is_active = true
),
lifetime_value AS (
-- Step 4: Calculate customer lifetime value
SELECT
customer_id,
SUM(
    monthly_revenue *
    (COALESCE(end_date, CURRENT_DATE) - start_date) / 30.0 -- date subtraction yields days; divide for approximate months
) as historical_clv,
COUNT(*) as subscription_count
FROM subscriptions
GROUP BY customer_id
),
final_dashboard AS (
-- Step 5: Combine all metrics and add recommendations
SELECT
crf.customer_id,
cb.company_name,
cb.industry,
cb.plan_type,
crf.monthly_revenue,
crf.usage_trend,
crf.churn_risk_score,
crf.days_since_last_activity,
COALESCE(lv.historical_clv, 0) as customer_lifetime_value,
crf.customer_age_days,
-- Recommendation engine
CASE
WHEN crf.churn_risk_score >= 8 THEN 'immediate_intervention'
WHEN crf.churn_risk_score >= 6 THEN 'at_risk_outreach'
WHEN crf.usage_trend = 'growing' AND cb.plan_type = 'starter' THEN 'upsell_opportunity'
WHEN crf.usage_trend = 'stable' AND crf.monthly_revenue > 500 THEN 'retain_focus'
ELSE 'monitor'
END as recommended_action,
-- Health score (inverse of churn risk)
11 - crf.churn_risk_score as health_score
FROM churn_risk_factors crf
LEFT JOIN customer_base cb ON crf.customer_id = cb.customer_id
LEFT JOIN lifetime_value lv ON crf.customer_id = lv.customer_id
)
SELECT
recommended_action,
COUNT(*) as customer_count,
SUM(monthly_revenue) as total_monthly_revenue,
AVG(health_score) as avg_health_score,
SUM(customer_lifetime_value) as total_clv,
AVG(days_since_last_activity) as avg_days_since_activity
FROM final_dashboard
GROUP BY recommended_action
ORDER BY total_monthly_revenue DESC;
This exercise demonstrates how CTEs enable you to build complex analytical workflows that would be nearly impossible to write as a single query without intermediate steps.
Problem: Writing correlated subqueries that execute once per outer row, causing performance issues.
-- Problematic: Correlated subquery executes for each customer
SELECT
customer_name,
(SELECT AVG(order_total) FROM orders o WHERE o.customer_id = c.customer_id) as avg_order
FROM customers c;
-- Better: Use window functions or CTEs
WITH customer_averages AS (
SELECT
customer_id,
AVG(order_total) as avg_order
FROM orders
GROUP BY customer_id
)
SELECT
c.customer_name,
COALESCE(ca.avg_order, 0) as avg_order
FROM customers c
LEFT JOIN customer_averages ca ON c.customer_id = ca.customer_id;
Problem: Forgetting cycle detection in recursive CTEs.
-- Dangerous: No cycle detection
WITH RECURSIVE bad_hierarchy AS (
SELECT employee_id, manager_id, 1 as level
FROM employees WHERE manager_id IS NULL
UNION ALL
SELECT e.employee_id, e.manager_id, bh.level + 1
FROM employees e
JOIN bad_hierarchy bh ON e.manager_id = bh.employee_id -- Can loop forever!
)
SELECT * FROM bad_hierarchy;
-- Safe: Include cycle detection
WITH RECURSIVE safe_hierarchy AS (
SELECT
employee_id,
manager_id,
1 as level,
ARRAY[employee_id] as path
FROM employees WHERE manager_id IS NULL
UNION ALL
SELECT
e.employee_id,
e.manager_id,
sh.level + 1,
sh.path || e.employee_id
FROM employees e
JOIN safe_hierarchy sh ON e.manager_id = sh.employee_id
WHERE NOT (e.employee_id = ANY(sh.path)) -- Prevents cycles
AND sh.level < 10 -- Additional safety limit
)
SELECT * FROM safe_hierarchy;
Problem: Assuming CTEs always provide better performance than subqueries.
-- This CTE might be materialized unnecessarily
WITH large_result AS (
SELECT * FROM huge_table WHERE condition = 'rare_value' -- even a 'rare' value can match millions of rows in a huge table
)
SELECT COUNT(*) FROM large_result WHERE another_condition = 'specific';
-- Sometimes a subquery is more efficient
SELECT COUNT(*)
FROM (SELECT * FROM huge_table WHERE condition = 'rare_value') subq
WHERE another_condition = 'specific';
-- Or push all filters down
SELECT COUNT(*)
FROM huge_table
WHERE condition = 'rare_value' AND another_condition = 'specific';
Problem: Trying to reference CTEs from different query levels.
-- This won't work - CTE scope is limited to its query
WITH customer_totals AS (
SELECT customer_id, SUM(order_total) as total
FROM orders GROUP BY customer_id
)
SELECT customer_name FROM customers
WHERE customer_id IN (
SELECT customer_id FROM customer_totals WHERE total > 1000 -- Error!
);
-- Correct: Include CTE in the same query level
WITH customer_totals AS (
SELECT customer_id, SUM(order_total) as total
FROM orders GROUP BY customer_id
)
SELECT c.customer_name
FROM customers c
JOIN customer_totals ct ON c.customer_id = ct.customer_id
WHERE ct.total > 1000;
When debugging complex CTE chains, test each CTE independently:
-- Test each CTE step by step
WITH step1 AS (...),
step2 AS (...),
step3 AS (...)
-- Comment out the final SELECT and test intermediate steps:
SELECT * FROM step1 LIMIT 10; -- Test first step
-- SELECT * FROM step2 LIMIT 10; -- Then second step
-- SELECT * FROM step3 LIMIT 10; -- Then third step
Use EXPLAIN ANALYZE to understand query execution:
EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
WITH complex_cte AS (...)
SELECT * FROM complex_cte;
Performance Monitoring Tip: Set up monitoring for long-running queries in production. CTEs that materialize large intermediate results can consume significant memory and processing time, especially during peak usage periods.
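As a starting point, PostgreSQL's pg_stat_activity view exposes currently running statements; a simple check (the five-minute threshold is illustrative) might look like:
-- Flag active queries that have been running longer than five minutes
SELECT
    pid,
    now() - query_start AS runtime,
    LEFT(query, 80) AS query_preview
FROM pg_stat_activity
WHERE state = 'active'
  AND now() - query_start > INTERVAL '5 minutes'
ORDER BY runtime DESC;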
Subqueries and CTEs represent a fundamental shift in how you approach complex data problems. Instead of thinking in terms of single-table operations, you can now architect solutions that mirror natural problem-solving patterns—breaking complex questions into manageable, logical steps.
Key takeaways:
- Correlated subqueries run once per outer row; uncorrelated subqueries can be evaluated once and cached
- EXISTS and NOT EXISTS express existence checks safely, can short-circuit, and sidestep the NULL pitfalls of NOT IN
- CTEs turn complex analyses into named, independently testable steps; recursive CTEs handle hierarchies that would otherwise require application-level loops
- Materialization behavior varies by database and version—read the execution plan before assuming CTEs are faster (or slower) than subqueries
Recommended next steps:
- Practice reading EXPLAIN (ANALYZE) output for queries that mix CTEs and subqueries
- Learn exactly how your database version handles CTE materialization and inlining
- Rebuild one of your existing multi-query reports as a single CTE pipeline and compare maintainability and performance
The patterns you've learned here form the foundation for advanced SQL techniques like pivoting data, creating analytical functions, and building complex data transformation pipelines. You're now equipped to tackle sophisticated analytical challenges that previously required multiple tools or programming languages.
Remember: the goal isn't to write the most complex query possible, but to write queries that clearly express your analytical intent while performing efficiently at scale. CTEs give you the tools to achieve both clarity and performance—use them wisely.