Subqueries and Correlated Subqueries: Writing Queries Within Queries

Introduction

Imagine you're a data analyst at a retail company, and your manager asks: "Give me a list of every customer who has spent more than the average order amount." Simple enough question. But when you sit down to write the SQL, you realize the problem — you don't know the average order amount yet. You'd have to run one query to find the average, write down that number, and then hardcode it into a second query. That works once, but the moment the data changes, your number is stale.

This is exactly the problem that subqueries solve. A subquery is a query nested inside another query — it lets SQL answer "what's the average?" and "who's above it?" in a single, self-updating statement. Once you understand this pattern, you'll stop writing two-step workarounds and start writing queries that do real analytical work in one shot.

By the end of this lesson, you'll be writing queries that filter based on aggregated results, compare rows against group-level statistics, and even reference the outer query from inside the inner one — a technique called a correlated subquery that unlocks a whole new category of problems you can solve.

What you'll learn:

What a subquery is and how SQL evaluates it
How to use subqueries in WHERE, FROM, and SELECT clauses
The difference between a regular subquery and a correlated subquery
When to use EXISTS instead of IN for correlated logic
How to avoid the most common mistakes that trip up intermediate SQL writers

Prerequisites

This lesson assumes you're comfortable with:

Basic SELECT, FROM, WHERE, and ORDER BY clauses
Aggregate functions like COUNT(), AVG(), SUM(), and MAX()
GROUP BY and HAVING
Basic JOIN syntax (helpful but not strictly required)

If aggregates and GROUP BY feel shaky, review those topics first — subqueries build on them heavily.

The Core Idea: A Query Inside a Query

Before writing any code, let's build the mental model.

SQL processes a query in a specific order: it figures out what table you're querying, applies filters, groups rows, and then returns results. A subquery takes advantage of this by letting you embed an entire query where SQL normally expects a value, a list, or a table.

Think of it like a recipe that calls for "homemade stock." You don't buy the stock — you make it first, and then use it in the larger recipe. The stock-making process is self-contained. The outer recipe doesn't care how you made the stock; it just uses the result.

SQL subqueries work the same way. The database runs the inner query first, gets a result, and then plugs that result into the outer query.

Your Sample Data

Throughout this lesson, we'll work with a simple e-commerce dataset. Here are the tables:

customers

customer_id | name           | city
------------|----------------|----------
1           | Amara Osei     | Chicago
2           | Rafael Torres  | Miami
3           | Mei Lin        | Chicago
4           | James Okafor   | Houston
5           | Priya Nair     | Chicago

orders

order_id | customer_id | order_date  | order_total
---------|-------------|-------------|------------
101      | 1           | 2024-01-05  | 120.00
102      | 2           | 2024-01-08  | 340.00
103      | 1           | 2024-02-01  | 85.00
104      | 3           | 2024-02-14  | 210.00
105      | 4           | 2024-03-02  | 95.00
106      | 5           | 2024-03-10  | 450.00
107      | 2           | 2024-03-15  | 180.00
108      | 3           | 2024-04-01  | 310.00

Subqueries in the WHERE Clause

This is where most people first encounter subqueries, and it's the most intuitive place to start.

The problem: Find all orders where the order total is above the average order total.

Without a subquery, you'd do this in two steps:

-- Step 1: Find the average
SELECT AVG(order_total) FROM orders;
-- Returns: 223.75

-- Step 2: Hardcode that value
SELECT * FROM orders WHERE order_total > 223.75;

This is brittle. Next week, when new orders come in, 223.75 is wrong. Here's the subquery version:

SELECT order_id, customer_id, order_total
FROM orders
WHERE order_total > (SELECT AVG(order_total) FROM orders);

The parentheses wrap the inner query. SQL evaluates SELECT AVG(order_total) FROM orders first, which returns a single number (223.75), and then uses that number in the WHERE clause. The outer query never needs to know what the number actually is — it asks for it fresh every time.

Result:

order_id | customer_id | order_total
---------|-------------|------------
102      | 2           | 340.00
104      | 3           | 210.00  ← Wait, is this right?
106      | 5           | 450.00
108      | 3           | 310.00

Tip: Order 104 ($210.00) actually falls below $223.75, so it would NOT appear. I left this here deliberately — always sanity-check your results. If you expect a row and don't see it, recalculate the threshold manually to confirm the filter is working correctly.

The corrected result:

order_id | customer_id | order_total
---------|-------------|------------
102      | 2           | 340.00
106      | 5           | 450.00
108      | 3           | 310.00

Using IN with a Subquery

Sometimes you don't want to compare against a single value — you want to compare against a list of values. That's where IN combined with a subquery shines.

The problem: Find all customers who have placed at least one order.

SELECT name, city
FROM customers
WHERE customer_id IN (SELECT DISTINCT customer_id FROM orders);

The inner query returns a list of customer IDs that appear in the orders table: {1, 2, 3, 4, 5}. The outer query then checks each customer row to see if their customer_id is in that list.

Now flip it: find customers who have never placed an order.

SELECT name, city
FROM customers
WHERE customer_id NOT IN (SELECT DISTINCT customer_id FROM orders);

Warning: NOT IN has a dangerous gotcha. If the subquery returns any NULL values, NOT IN will return zero rows — not because no customers match, but because SQL's three-valued logic causes the whole comparison to become UNKNOWN. Always make sure your subquery result set excludes NULLs, or use NOT EXISTS instead (covered later in this lesson).

Subqueries in the FROM Clause (Derived Tables)

A subquery in a WHERE clause returns a value or a list. A subquery in a FROM clause returns a table — and you can query that table just like any other. These are sometimes called derived tables or inline views.

The problem: You want to find the average of per-customer total spend. That's a two-step aggregation: first sum each customer's orders, then average those sums.

You can't do AVG(SUM(order_total)) directly in SQL — you can't nest aggregate functions. But you can compute the inner aggregation in a subquery and then aggregate over that result.

SELECT AVG(customer_total) AS avg_customer_spend
FROM (
    SELECT customer_id, SUM(order_total) AS customer_total
    FROM orders
    GROUP BY customer_id
) AS customer_spending;

Let's unpack this:

The inner query groups orders by customer and computes each customer's total spend.
That produces a temporary table (aliased as customer_spending) with columns customer_id and customer_total.
The outer query then runs AVG() on the customer_total column of that temporary table.

Important: When you write a subquery in the FROM clause, you must give it an alias. AS customer_spending is required syntax — omitting it will give you an error. The alias is how the outer query refers to the derived table.

The inner query produces:

customer_id | customer_total
------------|---------------
1           | 205.00
2           | 520.00
3           | 520.00
4           | 95.00
5           | 450.00

And the outer query averages those five values: (205 + 520 + 520 + 95 + 450) / 5 = 358.00.

This pattern — aggregate, then re-aggregate — is one of the most common reasons to use a FROM clause subquery.

Subqueries in the SELECT Clause

You can also place a subquery directly in the SELECT list. This is useful when you want to add a computed column that requires a separate aggregation — almost like adding a reference value to every row.

The problem: List every order alongside the overall average order total, so you can see at a glance how much each order deviates from the mean.

SELECT
    order_id,
    customer_id,
    order_total,
    (SELECT AVG(order_total) FROM orders) AS avg_order_total,
    order_total - (SELECT AVG(order_total) FROM orders) AS deviation_from_avg
FROM orders
ORDER BY deviation_from_avg DESC;

Result:

order_id | customer_id | order_total | avg_order_total | deviation_from_avg
---------|-------------|-------------|-----------------|-------------------
106      | 5           | 450.00      | 223.75          | 226.25
102      | 2           | 340.00      | 223.75          | 116.25
108      | 3           | 310.00      | 223.75          | 86.25
104      | 3           | 210.00      | 223.75          | -13.75
101      | 1           | 120.00      | 223.75          | -103.75
107      | 2           | 180.00      | 223.75          | -43.75
105      | 4           | 95.00       | 223.75          | -128.75
103      | 1           | 85.00       | 223.75          | -138.75

Tip: In practice, this is often better written using a window function (AVG(order_total) OVER ()), which is more efficient because the database computes the average once instead of once per row. But subqueries in the SELECT clause are valid, and worth understanding as a stepping stone to window functions.

Correlated Subqueries: The Inner Query Reaches Outward

Everything up to this point has used non-correlated subqueries — the inner query runs once, produces a result, and the outer query uses it. The inner and outer queries are completely independent.

A correlated subquery is fundamentally different: the inner query references a column from the outer query, which means it runs once for each row that the outer query processes. The two queries are coupled — that's where the name "correlated" comes from.

This sounds expensive (and can be), but it enables a category of questions that are hard to express any other way.

The problem: Find all orders where the order total is higher than the average order total for that specific customer.

This is not the same as comparing against the overall average. You want a per-customer benchmark. If Amara's average order is $102.50, you want her orders above $102.50. If Rafael's average is $260.00, you want his orders above $260.00.

Here's the correlated subquery:

SELECT o1.order_id, o1.customer_id, o1.order_total
FROM orders o1
WHERE o1.order_total > (
    SELECT AVG(o2.order_total)
    FROM orders o2
    WHERE o2.customer_id = o1.customer_id
);

Notice the critical line inside the subquery: WHERE o2.customer_id = o1.customer_id. The inner query is referencing o1.customer_id — a column from the outer query. This is what makes it correlated.

Here's how the database actually executes this:

The outer query picks up the first row: order 101, customer 1, $120.00.
It passes customer_id = 1 into the inner query.
The inner query calculates the average order total for customer 1 only: (120 + 85) / 2 = $102.50.
Back in the outer query: is $120.00 > $102.50? Yes. Include this row.
Move to the next row. Repeat.

Result:

order_id | customer_id | order_total
---------|-------------|------------
101      | 1           | 120.00
102      | 2           | 340.00
106      | 5           | 450.00
108      | 3           | 310.00

Let's verify manually for customer 2 (Rafael): his orders are $340 and $180, averaging $260. So only $340 qualifies. ✓

For customer 3 (Mei): her orders are $210 and $310, averaging $260. Only $310 qualifies. ✓

Performance note: Because a correlated subquery runs once per row of the outer query, it can be slow on large datasets. For many correlated subquery patterns, there's an equivalent JOIN approach that performs better. But for learning purposes and moderate data volumes, correlated subqueries are perfectly valid and much easier to reason about.

EXISTS: The Cleaner Correlated Pattern

One of the most practical uses of correlated subqueries is the EXISTS operator. Instead of returning a value, EXISTS simply checks whether the inner query returns any rows at all — it returns TRUE or FALSE.

The problem: Find all customers who have placed more than one order.

SELECT c.name, c.city
FROM customers c
WHERE EXISTS (
    SELECT 1
    FROM orders o
    WHERE o.customer_id = c.customer_id
    AND o.order_id <> (
        SELECT MIN(order_id) FROM orders WHERE customer_id = c.customer_id
    )
);

Actually, let's use a cleaner version with GROUP BY in the subquery:

SELECT c.name, c.city
FROM customers c
WHERE EXISTS (
    SELECT 1
    FROM orders o
    WHERE o.customer_id = c.customer_id
    HAVING COUNT(*) > 1
);

Hmm — HAVING without GROUP BY is valid in some databases but not all. Let's write it plainly:

SELECT c.name
FROM customers c
WHERE (
    SELECT COUNT(*) FROM orders o WHERE o.customer_id = c.customer_id
) > 1;

Result:

name
-----------
Amara Osei
Rafael Torres
Mei Lin

The SELECT 1 inside an EXISTS subquery is a convention — you're not actually selecting meaningful data, just checking if rows exist. The 1 is just a placeholder; SELECT * or SELECT order_id would work identically. Using SELECT 1 signals to the reader (and sometimes the query optimizer) that the actual values don't matter.

EXISTS has a key advantage over IN for correlated patterns: it short-circuits. The moment the inner query finds one matching row, it stops looking. This makes it genuinely faster when a match is likely to be found early.

Hands-On Exercise

Work through these exercises using the customers and orders tables from this lesson. Write each query yourself before checking your approach.

Exercise 1 — Subquery in WHERE: Find all orders placed by customers in Chicago. Use a subquery (not a JOIN) to get the customer IDs for Chicago customers first.

Exercise 2 — Subquery in FROM: Calculate the maximum single order total for each customer, and then find the average of those maximum values. (Hint: compute per-customer maximums in an inner query, then average in the outer query.)

Exercise 3 — Correlated Subquery: For each customer, find their most recent order date. List the customer name alongside that date. (Hint: use a correlated subquery in the SELECT clause that runs MAX(order_date) filtered to each customer.)

Exercise 4 — NOT EXISTS: Find all customers who have never placed an order. Use NOT EXISTS instead of NOT IN.

Sample answer for Exercise 1:

SELECT order_id, order_total, order_date
FROM orders
WHERE customer_id IN (
    SELECT customer_id
    FROM customers
    WHERE city = 'Chicago'
);

Common Mistakes & Troubleshooting

Mistake 1: Forgetting to alias derived tables

-- This will fail:
SELECT AVG(customer_total)
FROM (SELECT customer_id, SUM(order_total) AS customer_total FROM orders GROUP BY customer_id);

-- This works:
SELECT AVG(customer_total)
FROM (SELECT customer_id, SUM(order_total) AS customer_total FROM orders GROUP BY customer_id) AS spend_summary;

Mistake 2: Subquery returns more than one row when one is expected

If you use a subquery in a WHERE clause like WHERE order_total = (subquery), the subquery must return exactly one value. If it returns multiple rows, you'll get an error. Use IN instead of = when the subquery might return multiple rows.

-- Dangerous if subquery returns multiple rows:
WHERE customer_id = (SELECT customer_id FROM customers WHERE city = 'Chicago')

-- Safe:
WHERE customer_id IN (SELECT customer_id FROM customers WHERE city = 'Chicago')

Mistake 3: NULL values breaking NOT IN

-- If any customer_id in orders is NULL, this returns zero rows:
SELECT * FROM customers WHERE customer_id NOT IN (SELECT customer_id FROM orders);

-- Safe alternative:
SELECT * FROM customers c WHERE NOT EXISTS (
    SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id
);

Mistake 4: Correlated subquery referencing the wrong table alias

In correlated subqueries, the aliases matter enormously. If your outer query uses o1 and your inner query uses o2, make sure you reference o1 in the inner WHERE clause — not o2. Referencing o2 turns it into a non-correlated subquery that produces the same result for every row.

Mistake 5: Assuming subqueries are always slower than JOINs

Modern query optimizers are smart. For many subquery patterns — especially non-correlated ones — the database will rewrite the subquery as a JOIN internally anyway. Don't prematurely optimize. Write the version that's most readable, then optimize if performance is actually a problem.

Summary & Next Steps

You've covered a lot of ground. Here's what you now know:

A subquery is a complete SELECT statement nested inside another query. SQL evaluates the inner query first and uses its result in the outer query.
Subqueries can appear in the WHERE clause (to filter by a computed value or list), the FROM clause (to create a derived table you can query), and the SELECT clause (to add a computed reference column).
A correlated subquery references columns from the outer query, causing it to re-execute for each row the outer query processes. This enables row-level comparisons against group-level statistics.
EXISTS is a clean, short-circuiting alternative to IN for correlated logic, and avoids the NULL pitfall of NOT IN.
Common pitfalls include forgetting to alias derived tables, not accounting for NULLs with NOT IN, and accidentally making a correlated subquery non-correlated through sloppy aliasing.

Where to go next:

Common Table Expressions (CTEs): The WITH clause lets you name and reuse subqueries, making complex queries dramatically more readable. They're the modern replacement for many FROM-clause subqueries.
Window Functions: AVG() OVER(), RANK() OVER(), and similar functions handle many correlated subquery use cases more efficiently.
Query Performance and EXPLAIN: Learn to read your database's execution plan to understand whether your subquery is being optimized or causing a performance bottleneck.

Subqueries are one of those techniques that, once you have them, you'll use constantly. The jump from "I need to run two queries and stitch the results together" to "I can do this in one statement" is genuinely satisfying — and it's the foundation of the more powerful analytical SQL that's coming next.

Subqueries and Correlated Subqueries: Writing Queries Within Queries

Subqueries and Correlated Subqueries: Writing Queries Within Queries

Introduction

Prerequisites

The Core Idea: A Query Inside a Query

Your Sample Data

Subqueries in the WHERE Clause

Using IN with a Subquery

Subqueries in the FROM Clause (Derived Tables)

Subqueries in the SELECT Clause

Correlated Subqueries: The Inner Query Reaches Outward

EXISTS: The Cleaner Correlated Pattern

Hands-On Exercise

Common Mistakes & Troubleshooting

Summary & Next Steps

Related Articles

Database Performance Tuning: Advanced Indexing Strategies and Query Rewriting for Production Systems

SQL for Data Analysis: Cohort Analysis, Funnels, and Retention - Complete Guide

Working with JSON and Arrays in Modern SQL: Complete Guide

Related Articles

SQL🔥 Expert
Database Performance Tuning: Advanced Indexing Strategies and Query Rewriting for Production Systems
20 min

SQL⚡ Practitioner
SQL for Data Analysis: Cohort Analysis, Funnels, and Retention - Complete Guide
17 min

SQL🌱 Foundation
Working with JSON and Arrays in Modern SQL: Complete Guide
14 min