Mastering Subqueries and Common Table Expressions (CTEs) in SQL

You're staring at a complex data problem that seems to require multiple steps: first, you need to identify high-value customers from the past year, then calculate their average order value, and finally compare that against regional benchmarks. Your first instinct might be to create temporary tables or run multiple queries, but there's a more elegant solution hiding in plain sight. Advanced SQL practitioners know that subqueries and Common Table Expressions (CTEs) can transform unwieldy multi-step problems into readable, maintainable single queries.

While basic SQL gets you far with simple SELECT statements, real-world analytics demands the ability to break complex logic into manageable pieces while maintaining query performance. Subqueries and CTEs are your tools for building sophisticated data transformations that remain comprehensible to your future self and your colleagues. They're not just syntactic sugar—they're fundamental to writing production-quality SQL that scales.

By the end of this lesson, you'll wield subqueries and CTEs with the precision of a surgeon and the vision of an architect. You'll understand not just how to write them, but when to choose one over the other, how to optimize their performance, and how to avoid the subtle traps that catch even experienced developers.

What you'll learn:

Master scalar, row, and table subqueries with deep understanding of execution patterns
Build complex CTEs including recursive operations and advanced windowing
Optimize subquery and CTE performance through execution plan analysis
Navigate correlated subqueries and understand their performance implications
Apply advanced patterns like pivoting, unpivoting, and hierarchical data processing
Troubleshoot common anti-patterns and performance bottlenecks

Prerequisites

You should be comfortable with intermediate SQL concepts including JOINs, aggregate functions, and window functions. Experience with query optimization and execution plans will enhance your understanding, though we'll cover the essentials as we go.

Understanding Subqueries: Beyond the Basics

Subqueries are queries nested within other queries, but this simple definition masks their true power. Think of them as functions in programming—they encapsulate logic, return values, and can be composed to build complex operations. However, unlike functions, subqueries execute within the database engine's query optimizer, which opens both opportunities and pitfalls.

Let's start with a realistic scenario. You're analyzing e-commerce data and need to identify customers who placed orders above the average order value in each product category. Here's our sample data structure:

-- Sample data setup
CREATE TABLE customers (
    customer_id INT PRIMARY KEY,
    name VARCHAR(100),
    region VARCHAR(50),
    signup_date DATE
);

CREATE TABLE orders (
    order_id INT PRIMARY KEY,
    customer_id INT,
    order_date DATE,
    total_amount DECIMAL(10,2)
);

CREATE TABLE order_items (
    item_id INT PRIMARY KEY,
    order_id INT,
    product_id INT,
    quantity INT,
    unit_price DECIMAL(8,2)
);

CREATE TABLE products (
    product_id INT PRIMARY KEY,
    name VARCHAR(200),
    category VARCHAR(100),
    cost DECIMAL(8,2)
);

Scalar Subqueries: The Single Value Answer

Scalar subqueries return exactly one value and can be used anywhere you'd use a literal value or column reference. They're deceptively powerful for comparative analysis:

-- Find customers with above-average order totals
SELECT c.name, c.region, o.total_amount
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE o.total_amount > (
    SELECT AVG(total_amount) 
    FROM orders
)
ORDER BY o.total_amount DESC;

This looks straightforward, but the execution story is more complex. The database engine must decide whether to execute the subquery once (if it's uncorrelated) or multiple times (if it references outer query columns). Most modern optimizers are smart enough to cache scalar subquery results, but understanding this distinction is crucial for performance tuning.

Here's a more sophisticated example that reveals the power of scalar subqueries:

-- Calculate each customer's lifetime value as a percentage of total company revenue
SELECT 
    c.name,
    c.region,
    SUM(o.total_amount) as lifetime_value,
    ROUND(
        (SUM(o.total_amount) * 100.0) / (
            SELECT SUM(total_amount) FROM orders
        ), 2
    ) as pct_of_total_revenue
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.name, c.region
HAVING SUM(o.total_amount) > (
    SELECT AVG(customer_total)
    FROM (
        SELECT SUM(total_amount) as customer_total
        FROM orders
        GROUP BY customer_id
    ) customer_averages
)
ORDER BY lifetime_value DESC;

Notice how we nest a subquery within a HAVING clause that itself contains a subquery. This is where query readability starts to suffer, foreshadowing why CTEs become essential.

Row Subqueries: Working with Structured Results

Row subqueries return a single row with multiple columns. They're particularly useful for complex comparisons and existence checks:

-- Find orders that match the highest single-day sales volume
SELECT o.order_id, o.order_date, o.total_amount
FROM orders o
WHERE (o.order_date, o.total_amount) = (
    SELECT order_date, MAX(total_amount)
    FROM orders
    GROUP BY order_date
    ORDER BY MAX(total_amount) DESC
    LIMIT 1
);

Row subqueries excel in scenarios where you need to match multiple criteria simultaneously. However, they're often overlooked because their syntax feels unfamiliar to developers coming from procedural programming backgrounds.

Table Subqueries: The Foundation of Complex Analysis

Table subqueries return multiple rows and columns, effectively creating temporary result sets. They appear in FROM clauses and are essential for multi-step analytics:

-- Analyze customer purchasing patterns by cohort
SELECT 
    cohort.signup_month,
    cohort.customer_count,
    cohort.avg_first_order_value,
    recent_performance.avg_recent_order_value,
    recent_performance.retention_rate
FROM (
    SELECT 
        DATE_TRUNC('month', c.signup_date) as signup_month,
        COUNT(*) as customer_count,
        AVG(first_order.amount) as avg_first_order_value
    FROM customers c
    JOIN (
        SELECT 
            customer_id,
            MIN(order_date) as first_order_date,
            MIN(total_amount) as amount
        FROM orders
        GROUP BY customer_id
    ) first_order ON c.customer_id = first_order.customer_id
    GROUP BY DATE_TRUNC('month', c.signup_date)
) cohort
JOIN (
    SELECT 
        DATE_TRUNC('month', c.signup_date) as signup_month,
        AVG(o.total_amount) as avg_recent_order_value,
        COUNT(DISTINCT o.customer_id) * 100.0 / cohort_size.total as retention_rate
    FROM customers c
    JOIN orders o ON c.customer_id = o.customer_id
    JOIN (
        SELECT 
            DATE_TRUNC('month', signup_date) as month,
            COUNT(*) as total
        FROM customers
        GROUP BY DATE_TRUNC('month', signup_date)
    ) cohort_size ON DATE_TRUNC('month', c.signup_date) = cohort_size.month
    WHERE o.order_date >= CURRENT_DATE - INTERVAL '90 days'
    GROUP BY DATE_TRUNC('month', c.signup_date), cohort_size.total
) recent_performance ON cohort.signup_month = recent_performance.signup_month
ORDER BY cohort.signup_month;

This query demonstrates how table subqueries can create sophisticated analytics, but it also shows the readability cliff we face with deeply nested subqueries. The logic is sound, but the structure is becoming unwieldy.

Correlated Subqueries: When Context Matters

Correlated subqueries reference columns from the outer query, creating a dependency that fundamentally changes execution behavior. Instead of running once, correlated subqueries execute once for each row processed by the outer query. This can be powerful for row-by-row comparisons but dangerous for performance.

-- Find each customer's most expensive order
SELECT c.name, o.order_date, o.total_amount
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE o.total_amount = (
    SELECT MAX(total_amount)
    FROM orders o2
    WHERE o2.customer_id = c.customer_id
);

The inner query runs once per customer, using the c.customer_id from the outer query. Modern optimizers often transform these into more efficient JOINs, but understanding the logical execution helps you predict performance characteristics.

Correlated subqueries shine in existence and ranking scenarios:

-- Find customers who haven't ordered in their preferred category recently
SELECT DISTINCT c.name, c.region
FROM customers c
WHERE NOT EXISTS (
    SELECT 1
    FROM orders o
    JOIN order_items oi ON o.order_id = oi.order_id
    JOIN products p ON oi.product_id = p.product_id
    WHERE o.customer_id = c.customer_id
    AND p.category = (
        -- Find each customer's most-ordered category
        SELECT p2.category
        FROM orders o2
        JOIN order_items oi2 ON o2.order_id = oi2.order_id
        JOIN products p2 ON oi2.product_id = p2.product_id
        WHERE o2.customer_id = c.customer_id
        GROUP BY p2.category
        ORDER BY SUM(oi2.quantity) DESC
        LIMIT 1
    )
    AND o.order_date >= CURRENT_DATE - INTERVAL '60 days'
);

This query finds customers who haven't recently ordered from their historically preferred product category. It's a complex business requirement that would be difficult to express without correlated subqueries.

Performance Warning: Correlated subqueries can create N+1 query problems. If your outer query returns 10,000 rows, your inner query might execute 10,000 times. Always examine execution plans for correlated subqueries in production environments.

Common Table Expressions: Readable Complexity

CTEs transform the nested complexity we've seen into readable, debuggable code. Think of them as named subqueries that exist for the duration of a single SQL statement. They're particularly powerful for breaking complex logic into digestible steps.

Basic CTE Syntax and Mental Models

A CTE follows this pattern:

WITH cte_name AS (
    SELECT ...
)
SELECT ...
FROM cte_name

Let's refactor our earlier cohort analysis using CTEs:

-- Customer cohort analysis with CTEs
WITH customer_cohorts AS (
    SELECT 
        c.customer_id,
        c.name,
        c.region,
        DATE_TRUNC('month', c.signup_date) as cohort_month
    FROM customers c
),

first_orders AS (
    SELECT 
        customer_id,
        MIN(order_date) as first_order_date,
        AVG(total_amount) as first_order_amount
    FROM orders
    GROUP BY customer_id
),

recent_orders AS (
    SELECT 
        customer_id,
        AVG(total_amount) as recent_avg_amount,
        COUNT(*) as recent_order_count
    FROM orders
    WHERE order_date >= CURRENT_DATE - INTERVAL '90 days'
    GROUP BY customer_id
),

cohort_summary AS (
    SELECT 
        cc.cohort_month,
        COUNT(*) as total_customers,
        AVG(fo.first_order_amount) as avg_first_order,
        AVG(ro.recent_avg_amount) as avg_recent_order,
        COUNT(ro.customer_id) * 100.0 / COUNT(*) as retention_rate
    FROM customer_cohorts cc
    LEFT JOIN first_orders fo ON cc.customer_id = fo.customer_id
    LEFT JOIN recent_orders ro ON cc.customer_id = ro.customer_id
    GROUP BY cc.cohort_month
)

SELECT 
    cohort_month,
    total_customers,
    ROUND(avg_first_order, 2) as avg_first_order,
    ROUND(avg_recent_order, 2) as avg_recent_order,
    ROUND(retention_rate, 2) as retention_pct
FROM cohort_summary
ORDER BY cohort_month;

The logic is identical to our nested subquery version, but the structure is dramatically more readable. Each CTE represents a clear step in our analytical process, making it easy to debug, modify, or explain to stakeholders.

Multiple CTEs and Dependency Chains

CTEs can reference earlier CTEs in the same WITH clause, creating dependency chains that mirror your analytical thinking:

-- Advanced customer segmentation with chained CTEs
WITH order_metrics AS (
    SELECT 
        customer_id,
        COUNT(*) as total_orders,
        SUM(total_amount) as lifetime_value,
        AVG(total_amount) as avg_order_value,
        MIN(order_date) as first_order_date,
        MAX(order_date) as last_order_date
    FROM orders
    GROUP BY customer_id
),

customer_scores AS (
    SELECT 
        customer_id,
        total_orders,
        lifetime_value,
        avg_order_value,
        EXTRACT(days FROM CURRENT_DATE - last_order_date) as days_since_last_order,
        CASE 
            WHEN total_orders >= 10 AND lifetime_value >= 1000 THEN 'VIP'
            WHEN total_orders >= 5 OR lifetime_value >= 500 THEN 'Premium'
            WHEN EXTRACT(days FROM CURRENT_DATE - last_order_date) <= 30 THEN 'Active'
            WHEN EXTRACT(days FROM CURRENT_DATE - last_order_date) <= 90 THEN 'At_Risk'
            ELSE 'Inactive'
        END as customer_segment
    FROM order_metrics
),

segment_analysis AS (
    SELECT 
        customer_segment,
        COUNT(*) as customer_count,
        AVG(lifetime_value) as avg_lifetime_value,
        AVG(avg_order_value) as avg_order_value,
        AVG(days_since_last_order) as avg_days_since_last_order
    FROM customer_scores
    GROUP BY customer_segment
),

total_customers AS (
    SELECT COUNT(*) as total FROM customer_scores
)

SELECT 
    sa.customer_segment,
    sa.customer_count,
    ROUND(sa.customer_count * 100.0 / tc.total, 2) as pct_of_customers,
    ROUND(sa.avg_lifetime_value, 2) as avg_lifetime_value,
    ROUND(sa.avg_order_value, 2) as avg_order_value,
    ROUND(sa.avg_days_since_last_order, 0) as avg_days_since_last_order
FROM segment_analysis sa
CROSS JOIN total_customers tc
ORDER BY sa.avg_lifetime_value DESC;

Each CTE builds on previous ones, creating a clear pipeline from raw order data to business intelligence insights. This approach makes complex analytics maintainable and allows for easy modification of individual steps without rewriting the entire query.

Recursive CTEs: Hierarchical Data and Beyond

Recursive CTEs handle hierarchical data, graph traversal, and iterative calculations. They consist of two parts: an anchor query (the starting point) and a recursive query that references the CTE itself.

Basic Hierarchical Traversal

Let's add organizational hierarchy to our data model:

CREATE TABLE employees (
    employee_id INT PRIMARY KEY,
    name VARCHAR(100),
    manager_id INT,
    department VARCHAR(50),
    salary DECIMAL(10,2)
);

A recursive CTE can traverse this hierarchy:

-- Find all employees in a management chain
WITH management_hierarchy AS (
    -- Anchor: start with CEO (no manager)
    SELECT 
        employee_id,
        name,
        manager_id,
        department,
        salary,
        0 as level,
        CAST(name AS VARCHAR(500)) as hierarchy_path
    FROM employees
    WHERE manager_id IS NULL
    
    UNION ALL
    
    -- Recursive: find direct reports
    SELECT 
        e.employee_id,
        e.name,
        e.manager_id,
        e.department,
        e.salary,
        mh.level + 1,
        CAST(mh.hierarchy_path + ' -> ' + e.name AS VARCHAR(500))
    FROM employees e
    JOIN management_hierarchy mh ON e.manager_id = mh.employee_id
)

SELECT 
    employee_id,
    name,
    department,
    salary,
    level,
    hierarchy_path
FROM management_hierarchy
ORDER BY level, department, name;

The anchor query finds top-level employees (those without managers), while the recursive portion finds their direct reports, then their reports' reports, continuing until no more levels exist.

Advanced Recursive Patterns

Recursive CTEs can solve sophisticated problems like finding connected components in graphs or calculating running totals with complex business rules:

-- Calculate departmental budget rollups with approval chains
WITH budget_rollup AS (
    -- Anchor: leaf departments (no subordinate departments)
    SELECT 
        department,
        manager_id,
        SUM(salary) as department_budget,
        COUNT(*) as employee_count,
        0 as rollup_level,
        department as budget_path
    FROM employees
    WHERE department NOT IN (
        SELECT DISTINCT e2.department
        FROM employees e1
        JOIN employees e2 ON e1.employee_id = e2.manager_id
        WHERE e1.department != e2.department
    )
    GROUP BY department, manager_id
    
    UNION ALL
    
    -- Recursive: roll up to parent departments
    SELECT 
        e.department,
        e.manager_id,
        br.department_budget + COALESCE(parent_budget.direct_budget, 0),
        br.employee_count + COALESCE(parent_budget.direct_count, 0),
        br.rollup_level + 1,
        e.department + ' <- ' + br.budget_path
    FROM budget_rollup br
    JOIN employees e ON br.manager_id = e.employee_id
    LEFT JOIN (
        SELECT 
            department,
            SUM(salary) as direct_budget,
            COUNT(*) as direct_count
        FROM employees
        GROUP BY department
    ) parent_budget ON e.department = parent_budget.department
)

SELECT 
    department,
    department_budget,
    employee_count,
    rollup_level,
    budget_path
FROM budget_rollup
WHERE rollup_level = (SELECT MAX(rollup_level) FROM budget_rollup)
ORDER BY department_budget DESC;

Recursion Limits: Most databases impose recursion limits (typically 100 levels) to prevent infinite loops. Use OPTION (MAXRECURSION n) in SQL Server or equivalent settings in other databases for deeper hierarchies.

Performance Considerations and Optimization

Understanding how databases execute subqueries and CTEs is crucial for production performance. The query optimizer has several strategies, and knowing when to expect each helps you write efficient code.

Subquery Execution Strategies

Modern optimizers typically transform subqueries using these strategies:

Subquery Caching: Uncorrelated scalar subqueries execute once and cache results
Semi-join Transformation: EXISTS subqueries become semi-joins
Anti-join Transformation: NOT EXISTS subqueries become anti-joins
Materialization: Complex subqueries may be materialized into temporary tables

Let's examine these with execution plan analysis:

-- Potentially inefficient correlated subquery
SELECT c.name, c.region
FROM customers c
WHERE c.customer_id IN (
    SELECT o.customer_id
    FROM orders o
    WHERE o.total_amount > (
        SELECT AVG(total_amount) * 1.5
        FROM orders o2
        WHERE o2.customer_id = o.customer_id
    )
);

This query has a correlated subquery nested within another subquery. The execution plan reveals whether the optimizer successfully transforms this into efficient joins or if it's executing the inner query repeatedly.

A more efficient approach uses window functions:

-- Optimized version using window functions and CTEs
WITH customer_order_stats AS (
    SELECT 
        customer_id,
        total_amount,
        AVG(total_amount) OVER (PARTITION BY customer_id) as customer_avg
    FROM orders
),

high_value_customers AS (
    SELECT DISTINCT customer_id
    FROM customer_order_stats
    WHERE total_amount > customer_avg * 1.5
)

SELECT c.name, c.region
FROM customers c
JOIN high_value_customers hvc ON c.customer_id = hvc.customer_id;

CTE Materialization and Optimization

CTEs in most databases are materialized once and reused, but some optimizers inline simple CTEs. Understanding your database's behavior helps predict performance:

-- CTE that benefits from materialization
WITH expensive_calculation AS (
    SELECT 
        customer_id,
        SUM(oi.quantity * oi.unit_price) as total_product_revenue,
        COUNT(DISTINCT o.order_date) as shopping_days,
        AVG(p.cost) as avg_product_cost
    FROM orders o
    JOIN order_items oi ON o.order_id = oi.order_id
    JOIN products p ON oi.product_id = p.product_id
    GROUP BY customer_id
)

SELECT 
    'High Value' as segment,
    COUNT(*) as customer_count,
    AVG(total_product_revenue) as avg_revenue
FROM expensive_calculation
WHERE total_product_revenue > 1000

UNION ALL

SELECT 
    'Frequent Shoppers' as segment,
    COUNT(*) as customer_count,
    AVG(total_product_revenue) as avg_revenue
FROM expensive_calculation
WHERE shopping_days > 10;

The CTE is referenced twice in the UNION, so materialization prevents duplicate calculation of the expensive aggregation.

Indexing Strategies for Subqueries

Effective indexing dramatically improves subquery performance:

-- Indexes that support our common subquery patterns
CREATE INDEX idx_orders_customer_date ON orders (customer_id, order_date);
CREATE INDEX idx_orders_amount_date ON orders (total_amount, order_date);
CREATE INDEX idx_order_items_covering ON order_items (order_id, product_id) 
    INCLUDE (quantity, unit_price);
CREATE INDEX idx_products_category ON products (category) INCLUDE (cost);

These indexes support the filtering, joining, and aggregation patterns we've used throughout our examples.

Advanced Patterns and Techniques

Pivoting and Unpivoting with CTEs

CTEs excel at data shape transformations that are awkward with traditional SQL:

-- Dynamic pivot of product sales by month
WITH monthly_sales AS (
    SELECT 
        p.category,
        DATE_TRUNC('month', o.order_date) as sale_month,
        SUM(oi.quantity * oi.unit_price) as monthly_revenue
    FROM orders o
    JOIN order_items oi ON o.order_id = oi.order_id
    JOIN products p ON oi.product_id = p.product_id
    WHERE o.order_date >= CURRENT_DATE - INTERVAL '12 months'
    GROUP BY p.category, DATE_TRUNC('month', o.order_date)
),

pivot_data AS (
    SELECT 
        category,
        SUM(CASE WHEN sale_month = DATE_TRUNC('month', CURRENT_DATE - INTERVAL '11 months') 
                 THEN monthly_revenue END) as month_01,
        SUM(CASE WHEN sale_month = DATE_TRUNC('month', CURRENT_DATE - INTERVAL '10 months') 
                 THEN monthly_revenue END) as month_02,
        SUM(CASE WHEN sale_month = DATE_TRUNC('month', CURRENT_DATE - INTERVAL '9 months') 
                 THEN monthly_revenue END) as month_03,
        -- ... continue for all months
        SUM(CASE WHEN sale_month = DATE_TRUNC('month', CURRENT_DATE) 
                 THEN monthly_revenue END) as month_12
    FROM monthly_sales
    GROUP BY category
)

SELECT 
    category,
    COALESCE(month_01, 0) as month_01,
    COALESCE(month_02, 0) as month_02,
    COALESCE(month_03, 0) as month_03,
    -- Calculate growth rates
    CASE 
        WHEN month_01 > 0 THEN ROUND((month_02 - month_01) * 100.0 / month_01, 2)
        ELSE NULL 
    END as m2_growth_pct
FROM pivot_data
ORDER BY category;

Window Functions with CTEs for Complex Analytics

Combining CTEs with window functions creates powerful analytical patterns:

-- Customer lifecycle analysis with advanced windowing
WITH customer_orders AS (
    SELECT 
        c.customer_id,
        c.name,
        c.signup_date,
        o.order_date,
        o.total_amount,
        ROW_NUMBER() OVER (PARTITION BY c.customer_id ORDER BY o.order_date) as order_sequence,
        LAG(o.order_date) OVER (PARTITION BY c.customer_id ORDER BY o.order_date) as prev_order_date,
        SUM(o.total_amount) OVER (PARTITION BY c.customer_id ORDER BY o.order_date 
                                  ROWS UNBOUNDED PRECEDING) as cumulative_value
    FROM customers c
    JOIN orders o ON c.customer_id = o.customer_id
),

order_intervals AS (
    SELECT 
        *,
        CASE 
            WHEN prev_order_date IS NOT NULL 
            THEN EXTRACT(days FROM order_date - prev_order_date)
            ELSE EXTRACT(days FROM order_date - signup_date)
        END as days_between_orders,
        CASE 
            WHEN order_sequence = 1 THEN 'First Order'
            WHEN order_sequence = 2 THEN 'Second Order'
            WHEN order_sequence <= 5 THEN 'Early Orders'
            ELSE 'Repeat Customer'
        END as order_stage
    FROM customer_orders
),

customer_patterns AS (
    SELECT 
        customer_id,
        name,
        COUNT(*) as total_orders,
        AVG(days_between_orders) as avg_days_between_orders,
        STDDEV(days_between_orders) as order_frequency_volatility,
        MAX(cumulative_value) as lifetime_value,
        CASE 
            WHEN AVG(days_between_orders) <= 30 THEN 'Frequent'
            WHEN AVG(days_between_orders) <= 90 THEN 'Regular'
            WHEN AVG(days_between_orders) <= 180 THEN 'Occasional'
            ELSE 'Infrequent'
        END as purchase_frequency_segment
    FROM order_intervals
    GROUP BY customer_id, name
)

SELECT 
    purchase_frequency_segment,
    COUNT(*) as customer_count,
    ROUND(AVG(total_orders), 1) as avg_orders_per_customer,
    ROUND(AVG(avg_days_between_orders), 1) as avg_days_between_orders,
    ROUND(AVG(lifetime_value), 2) as avg_lifetime_value,
    ROUND(AVG(order_frequency_volatility), 1) as avg_frequency_volatility
FROM customer_patterns
GROUP BY purchase_frequency_segment
ORDER BY avg_lifetime_value DESC;

This analysis reveals customer purchasing patterns through sophisticated windowing that would be nearly impossible without CTEs to organize the logic.

Hands-On Exercise

Let's apply these concepts to a realistic business scenario. You're analyzing subscription data for a SaaS company and need to create a comprehensive churn analysis report.

-- Setup sample subscription data
CREATE TABLE subscriptions (
    subscription_id INT PRIMARY KEY,
    customer_id INT,
    plan_type VARCHAR(50),
    start_date DATE,
    end_date DATE,
    monthly_fee DECIMAL(8,2)
);

CREATE TABLE usage_events (
    event_id INT PRIMARY KEY,
    customer_id INT,
    event_date DATE,
    feature_used VARCHAR(100),
    usage_count INT
);

-- Sample data
INSERT INTO subscriptions VALUES
(1, 101, 'Basic', '2023-01-15', '2023-06-15', 29.99),
(2, 102, 'Premium', '2023-02-01', NULL, 79.99),
(3, 103, 'Basic', '2023-03-10', '2023-05-10', 29.99),
(4, 104, 'Enterprise', '2023-01-01', NULL, 199.99);

Your task: Create a query using both subqueries and CTEs that analyzes:

Customer churn rate by plan type and signup cohort
Usage patterns in the 30 days before churn vs. active customers
Revenue impact of churned customers
Identification of at-risk customers based on usage decline

Try to solve this before looking at the solution:

-- Solution: Comprehensive churn analysis
WITH customer_cohorts AS (
    SELECT 
        customer_id,
        plan_type,
        monthly_fee,
        start_date,
        end_date,
        DATE_TRUNC('month', start_date) as signup_cohort,
        CASE 
            WHEN end_date IS NULL THEN 'Active'
            ELSE 'Churned'
        END as status,
        CASE 
            WHEN end_date IS NOT NULL 
            THEN EXTRACT(days FROM end_date - start_date)
            ELSE EXTRACT(days FROM CURRENT_DATE - start_date)
        END as tenure_days
    FROM subscriptions
),

usage_summary AS (
    SELECT 
        ue.customer_id,
        COUNT(*) as total_events,
        COUNT(DISTINCT ue.feature_used) as features_used,
        MAX(ue.event_date) as last_usage_date,
        AVG(ue.usage_count) as avg_usage_intensity
    FROM usage_events ue
    GROUP BY ue.customer_id
),

pre_churn_usage AS (
    SELECT 
        ue.customer_id,
        COUNT(*) as pre_churn_events,
        AVG(ue.usage_count) as pre_churn_intensity
    FROM usage_events ue
    JOIN customer_cohorts cc ON ue.customer_id = cc.customer_id
    WHERE cc.status = 'Churned'
    AND ue.event_date BETWEEN cc.end_date - INTERVAL '30 days' AND cc.end_date
    GROUP BY ue.customer_id
),

churn_analysis AS (
    SELECT 
        cc.signup_cohort,
        cc.plan_type,
        COUNT(*) as total_customers,
        COUNT(CASE WHEN cc.status = 'Churned' THEN 1 END) as churned_customers,
        AVG(cc.tenure_days) as avg_tenure_days,
        SUM(CASE WHEN cc.status = 'Churned' 
                 THEN cc.monthly_fee * (cc.tenure_days / 30.0) 
                 END) as lost_revenue,
        AVG(CASE WHEN cc.status = 'Active' THEN us.total_events END) as avg_active_usage,
        AVG(pcu.pre_churn_events) as avg_pre_churn_usage
    FROM customer_cohorts cc
    LEFT JOIN usage_summary us ON cc.customer_id = us.customer_id
    LEFT JOIN pre_churn_usage pcu ON cc.customer_id = pcu.customer_id
    GROUP BY cc.signup_cohort, cc.plan_type
),

at_risk_customers AS (
    SELECT 
        cc.customer_id,
        cc.plan_type,
        cc.monthly_fee,
        us.last_usage_date,
        us.total_events,
        EXTRACT(days FROM CURRENT_DATE - us.last_usage_date) as days_since_usage,
        -- Compare to average usage of churned customers
        CASE 
            WHEN us.total_events < (
                SELECT AVG(pre_churn_events) 
                FROM pre_churn_usage
            ) AND EXTRACT(days FROM CURRENT_DATE - us.last_usage_date) > 14
            THEN 'High Risk'
            WHEN EXTRACT(days FROM CURRENT_DATE - us.last_usage_date) > 7
            THEN 'Medium Risk'
            ELSE 'Low Risk'
        END as risk_level
    FROM customer_cohorts cc
    JOIN usage_summary us ON cc.customer_id = us.customer_id
    WHERE cc.status = 'Active'
)

-- Final report combining all analyses
SELECT 
    'Churn by Cohort' as analysis_type,
    signup_cohort::TEXT as dimension,
    plan_type as subdimension,
    ROUND(churned_customers * 100.0 / total_customers, 2) as churn_rate,
    ROUND(lost_revenue, 2) as impact_value
FROM churn_analysis

UNION ALL

SELECT 
    'At Risk Customers' as analysis_type,
    risk_level as dimension,
    plan_type as subdimension,
    COUNT(*)::DECIMAL as churn_rate,
    ROUND(SUM(monthly_fee), 2) as impact_value
FROM at_risk_customers
GROUP BY risk_level, plan_type

ORDER BY analysis_type, impact_value DESC;

This exercise demonstrates how CTEs can organize complex analytical logic while subqueries handle specific comparative calculations.

Common Mistakes & Troubleshooting

Performance Anti-patterns

Mistake #1: Overusing Correlated Subqueries

-- Inefficient: Multiple correlated subqueries
SELECT 
    c.name,
    (SELECT COUNT(*) FROM orders o WHERE o.customer_id = c.customer_id) as order_count,
    (SELECT AVG(total_amount) FROM orders o WHERE o.customer_id = c.customer_id) as avg_order,
    (SELECT MAX(order_date) FROM orders o WHERE o.customer_id = c.customer_id) as last_order
FROM customers c;

Better: Single aggregation with window functions

-- Efficient: One join with multiple aggregations
WITH customer_metrics AS (
    SELECT 
        customer_id,
        COUNT(*) as order_count,
        AVG(total_amount) as avg_order,
        MAX(order_date) as last_order
    FROM orders
    GROUP BY customer_id
)

SELECT 
    c.name,
    cm.order_count,
    cm.avg_order,
    cm.last_order
FROM customers c
LEFT JOIN customer_metrics cm ON c.customer_id = cm.customer_id;

Mistake #2: CTE Scope Confusion

-- Wrong: Trying to reference CTE outside its scope
WITH monthly_sales AS (
    SELECT DATE_TRUNC('month', order_date) as month, SUM(total_amount) as revenue
    FROM orders
    GROUP BY DATE_TRUNC('month', order_date)
)
SELECT * FROM monthly_sales;

-- This would fail - CTEs don't persist across statements
SELECT month FROM monthly_sales WHERE revenue > 1000;

Mistake #3: Recursive CTE Infinite Loops

-- Dangerous: No termination condition
WITH hierarchy AS (
    SELECT employee_id, manager_id, 0 as level
    FROM employees
    WHERE employee_id = 1
    
    UNION ALL
    
    SELECT e.employee_id, e.manager_id, h.level + 1
    FROM employees e
    JOIN hierarchy h ON e.manager_id = h.employee_id
    -- Missing: WHERE h.level < some_limit
)
SELECT * FROM hierarchy;

Debugging Complex Queries

When subqueries or CTEs aren't returning expected results:

Test each CTE independently:

-- Test individual CTEs by selecting from them directly
WITH customer_metrics AS (
    SELECT customer_id, COUNT(*) as order_count
    FROM orders
    GROUP BY customer_id
)
SELECT * FROM customer_metrics WHERE customer_id = 101;

Add debugging columns:

WITH debug_customers AS (
    SELECT 
        customer_id,
        name,
        COUNT(*) as order_count,
        'Debug: ' || COUNT(*) as debug_info
    FROM customers c
    LEFT JOIN orders o ON c.customer_id = o.customer_id
    GROUP BY customer_id, name
)
SELECT * FROM debug_customers;

Use execution plans to identify bottlenecks:

-- Most databases support EXPLAIN or similar
EXPLAIN (ANALYZE, BUFFERS) 
WITH complex_cte AS (...)
SELECT * FROM complex_cte;

Data Type and NULL Handling

Subqueries can introduce subtle NULL handling issues:

-- Potential issue: NULL comparisons in subqueries
SELECT customer_id 
FROM customers c
WHERE customer_id NOT IN (
    SELECT customer_id FROM orders  -- If any customer_id is NULL, returns no rows!
);

-- Safe version: Handle NULLs explicitly
SELECT customer_id 
FROM customers c
WHERE customer_id NOT IN (
    SELECT customer_id FROM orders WHERE customer_id IS NOT NULL
);

-- Even better: Use EXISTS
SELECT customer_id 
FROM customers c
WHERE NOT EXISTS (
    SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id
);

Summary & Next Steps

You've now mastered the sophisticated art of subqueries and CTEs, transforming from someone who writes simple SELECT statements into a practitioner who can tackle complex analytical challenges with readable, maintainable SQL. The techniques we've covered—from basic scalar subqueries to recursive CTEs with advanced windowing—represent the backbone of production data analysis.

The key insights to remember:

Subqueries excel at encapsulating logic but can become unwieldy when nested deeply
CTEs prioritize readability and are essential for complex multi-step analytics
Correlated subqueries are powerful but potentially expensive—always examine execution plans
Recursive CTEs unlock hierarchical and graph-based analysis that's impossible with traditional SQL
Performance optimization requires understanding how your database's query optimizer handles these constructs

Your next steps should focus on applying these patterns to real production scenarios in your environment. Start by identifying complex queries in your current work that could benefit from CTE refactoring. Practice writing recursive CTEs for any hierarchical data you encounter. Most importantly, develop the habit of examining execution plans for your more complex queries.

Advanced topics to explore next include materialized views (which can cache CTE-like logic), stored procedures that use these patterns, and database-specific extensions like SQL Server's MERGE statements or PostgreSQL's lateral joins. The foundation you've built here will make those advanced topics much more accessible.