
Imagine you work at a retail company, and your data warehouse tracks customer information. A customer named Maria Santos moves from Austin, Texas to Denver, Colorado. Simple enough — you update her address. But three months later, your marketing team asks: "What were our sales to customers living in Texas during Q1?" You check your data. Maria bought $2,400 worth of product in January and February, when she was still in Austin. But now her record shows Denver. Your report says Denver. That's wrong.
This is the core problem that Slowly Changing Dimensions (SCDs) solve. Real-world data changes — customers move, employees change job titles, products get reclassified, prices shift. A warehouse that can't track when those changes happened can't answer historical questions accurately. And in data work, answering historical questions accurately is most of the job.
By the end of this lesson, you'll know how to design tables that preserve the full history of changes in your dimensional data, which strategy to use for different business scenarios, and how to implement those strategies with real SQL. You'll be able to look at a changing-data problem and make an informed decision about how to handle it — not just guess.
What you'll learn:
SELECT with a WHERE clause and understand JOINBefore we talk about slowly changing dimensions, let's nail down what a dimension is.
In a data warehouse, you typically organize data into two kinds of tables:
When you run an analysis like "total revenue by customer region," you're joining the sales fact table to the customer dimension table and grouping by region. The dimension provides the context that makes the fact meaningful.
The phrase "slowly changing" just means that the data in dimension tables isn't static — it changes, but not constantly the way transaction data does. A customer address changes occasionally. An employee's job title changes a few times in their career. A product's category might be reorganized once every few years. These are slow, irregular changes — and your warehouse needs a strategy for each one.
There are six or seven SCD types in the academic literature, but in practice, three of them do 95% of the work. Let's understand each one by working through the same scenario: a customer, David Kim, who gets promoted and moves to a new city.
His record today looks like this:
customer_id | name | city | job_title
------------|------------|---------|------------------
1001 | David Kim | Chicago | Account Executive
Now David gets promoted to Senior Account Executive and relocates to Seattle.
Type 1 simply overwrites the old value with the new one. No history is kept. After the update:
customer_id | name | city | job_title
------------|------------|---------|----------------------
1001 | David Kim | Seattle | Senior Account Executive
Chicago and "Account Executive" are gone. Forever.
When is this appropriate? When the change is a correction rather than a business event. If David's city was misspelled as "Chigaco" and you fix it to "Chicago," nobody needs to know that the misspelling existed. Similarly, if a product's SKU format gets standardized across the catalog, that's a correction, not a meaningful business change.
Type 1 is dangerous when the change reflects something real about the business that affects historical analysis. If you Type 1 David's city, any past sales attributed to him will suddenly appear as Seattle sales, which is factually wrong.
Implementation: Just a plain UPDATE statement. There's nothing clever here.
UPDATE dim_customer
SET city = 'Seattle',
job_title = 'Senior Account Executive',
updated_at = CURRENT_TIMESTAMP
WHERE customer_id = 1001;
Type 2 is the workhorse of historical data management. Instead of overwriting, you close out the old record by marking it as expired, and insert a brand new row for the updated information.
After David's move, your table looks like this:
surrogate_key | customer_id | name | city | job_title | effective_from | effective_to | is_current
--------------|-------------|------------|---------|--------------------------|----------------|--------------|-----------
5001 | 1001 | David Kim | Chicago | Account Executive | 2022-03-15 | 2024-11-30 | FALSE
5002 | 1001 | David Kim | Seattle | Senior Account Executive | 2024-12-01 | 9999-12-31 | TRUE
Let's unpack the new columns here, because they're doing all the work:
customer_id alone no longer uniquely identifies a row, because there are now two rows for customer 1001.effective_to is set to the day before the new row becomes active.The sentinel value 9999-12-31 for effective_to on current rows is a widely-used convention. It means "this record has no end date yet." Some teams use NULL instead, which also works — just make sure your team is consistent.
When is this appropriate? Whenever you need to answer historical questions like "What was true about this customer at the time of this transaction?" For most business dimensions — customers, employees, products — Type 2 is the right default.
Type 3 adds a new column to store the previous value, rather than adding a new row.
customer_id | name | current_city | previous_city | job_title
------------|------------|--------------|---------------|----------------------
1001 | David Kim | Seattle | Chicago | Senior Account Executive
You can now see both the current and the immediately preceding city. But you've lost the date of the change, and you can only store one previous value. If David moves again — say, from Seattle to Portland — you'd overwrite "Chicago" with "Seattle," and the Chicago history is gone.
When is this appropriate? When you need to support a "before and after" comparison for a specific, planned change. For example, if your company reorganizes its sales territories and you want to report on both the old and new territory assignment for a transition period, Type 3 makes sense. It's a narrow use case.
The Practical Rule of Thumb: Default to Type 2. Use Type 1 only for corrections. Consider Type 3 only when you have a single, well-understood transition you need to track alongside the current state, and you know it won't happen repeatedly.
Let's build this properly. We'll use a realistic scenario: a dim_employee table at a software company. Employees change job titles, departments, and managers over time.
CREATE TABLE dim_employee (
employee_sk BIGINT PRIMARY KEY, -- surrogate key
employee_id VARCHAR(20) NOT NULL, -- natural key from HR system
full_name VARCHAR(100) NOT NULL,
department VARCHAR(50) NOT NULL,
job_title VARCHAR(50) NOT NULL,
manager_id VARCHAR(20),
location VARCHAR(50) NOT NULL,
effective_from DATE NOT NULL,
effective_to DATE NOT NULL DEFAULT '9999-12-31',
is_current BOOLEAN NOT NULL DEFAULT TRUE
);
When an employee first enters the system, they get one row:
INSERT INTO dim_employee (
employee_sk, employee_id, full_name, department,
job_title, manager_id, location, effective_from
)
VALUES
(1, 'E-0042', 'Priya Nair', 'Engineering', 'Software Engineer', 'E-0011', 'Boston', '2021-06-01'),
(2, 'E-0055', 'Marcus Webb', 'Sales', 'Account Executive', 'E-0018', 'Chicago', '2020-09-15'),
(3, 'E-0061', 'Fatima Al-Rashid', 'Engineering', 'Staff Engineer', 'E-0011', 'Remote', '2019-03-04');
Priya gets promoted to Senior Software Engineer and moves from Boston to the Austin office, effective January 15, 2025. Here's the two-step SCD Type 2 update:
First, expire the old row:
UPDATE dim_employee
SET effective_to = '2025-01-14',
is_current = FALSE
WHERE employee_id = 'E-0042'
AND is_current = TRUE;
Then, insert the new row:
INSERT INTO dim_employee (
employee_sk, employee_id, full_name, department,
job_title, manager_id, location, effective_from
)
VALUES
(4, 'E-0042', 'Priya Nair', 'Engineering', 'Senior Software Engineer', 'E-0011', 'Austin', '2025-01-15');
Notice that only the columns that changed are different between the two rows. Everything else — full_name, department, manager_id — carries forward unchanged.
Warning: In production, these two statements should be wrapped in a transaction so they succeed or fail together. If the INSERT fails after the UPDATE, you've closed out Priya's old record without creating a new one, which is data corruption.
BEGIN;
UPDATE dim_employee
SET effective_to = '2025-01-14',
is_current = FALSE
WHERE employee_id = 'E-0042'
AND is_current = TRUE;
INSERT INTO dim_employee (
employee_sk, employee_id, full_name, department,
job_title, manager_id, location, effective_from
)
VALUES
(4, 'E-0042', 'Priya Nair', 'Engineering', 'Senior Software Engineer', 'E-0011', 'Austin', '2025-01-15');
COMMIT;
This is where many analysts trip up. Now that your dimension has multiple rows per employee, naive queries will produce wrong results.
Wrong — this will double-count Priya:
SELECT department, COUNT(DISTINCT employee_id) AS headcount
FROM dim_employee
GROUP BY department;
If you run this, Priya appears twice in Engineering because she has two rows.
Correct — filter for current records:
SELECT department, COUNT(DISTINCT employee_id) AS headcount
FROM dim_employee
WHERE is_current = TRUE
GROUP BY department;
Historical query — what did our Engineering headcount look like on January 1, 2025?
SELECT department, COUNT(DISTINCT employee_id) AS headcount
FROM dim_employee
WHERE department = 'Engineering'
AND '2025-01-01' BETWEEN effective_from AND effective_to
GROUP BY department;
On January 1, 2025, Priya was still a Software Engineer in Boston. The query will pick up her old row correctly, because 2025-01-01 falls within 2021-06-01 to 2025-01-14.
Joining to a fact table:
Here's the real payoff. Suppose you have a fact_project_assignments table that records when employees were assigned to projects, with an assignment_date column. You want to know the job title each employee held at the time of the assignment:
SELECT
fa.project_id,
de.full_name,
de.job_title,
de.department,
fa.hours_logged
FROM fact_project_assignments fa
JOIN dim_employee de
ON fa.employee_id = de.employee_id
AND fa.assignment_date BETWEEN de.effective_from AND de.effective_to;
This join condition is what makes SCD Type 2 so powerful. You're not just matching on employee_id — you're asking the dimension to tell you which version of that employee existed on the date of each specific fact row. Priya's January 10th assignment will join to her "Software Engineer in Boston" row. Her February 1st assignment will join to her "Senior Software Engineer in Austin" row.
Manual INSERT/UPDATE pairs don't scale to thousands of records. In a real warehouse, you'll typically load changes from a staging table and use a MERGE statement (available in most modern databases and cloud warehouses).
-- Assume stg_employee contains today's employee records from the HR source system
-- Step 1: Expire rows where tracked columns have changed
UPDATE dim_employee d
SET effective_to = CURRENT_DATE - INTERVAL '1 day',
is_current = FALSE
FROM stg_employee s
WHERE d.employee_id = s.employee_id
AND d.is_current = TRUE
AND (
d.job_title <> s.job_title OR
d.department <> s.department OR
d.location <> s.location OR
d.manager_id <> s.manager_id
);
-- Step 2: Insert new rows for changed and new employees
INSERT INTO dim_employee (
employee_sk, employee_id, full_name, department,
job_title, manager_id, location, effective_from
)
SELECT
nextval('dim_employee_sk_seq'),
s.employee_id, s.full_name, s.department,
s.job_title, s.manager_id, s.location,
CURRENT_DATE
FROM stg_employee s
WHERE NOT EXISTS (
SELECT 1
FROM dim_employee d
WHERE d.employee_id = s.employee_id
AND d.is_current = TRUE
);
Tip: Tools like dbt (data build tool) have built-in SCD Type 2 snapshots via the
dbt snapshotfeature. If you're working in a modern data stack, that's worth learning — it handles the surrogate key generation, effective date management, and hashing of tracked columns for you. But understanding the underlying SQL is essential for debugging when something goes wrong.
Work through this exercise to solidify what you've learned.
Setup: Create a dim_product table with the following columns: product_sk, product_id, product_name, category, unit_price, supplier_id, effective_from, effective_to, is_current.
Initial load: Insert three products:
P-100: "Mountain Trail Boots", category "Footwear", price $129.99, supplier SUP-07, effective 2023-01-01P-101: "Lightweight Tent 2P", category "Camping", price $299.00, supplier SUP-12, effective 2023-01-01P-102: "Merino Base Layer", category "Apparel", price $74.99, supplier SUP-03, effective 2023-01-01Changes to process (effective 2025-02-01):
SUP-09Questions to answer:
dim_product after processing the changes?Mistake 1: Joining on just the natural key
When you join a fact table to an SCD Type 2 dimension on only employee_id (or product_id, etc.), you'll get a fan-out — multiple rows returned per fact row. Always include the date range condition in your join.
Mistake 2: Forgetting the is_current filter on headcount queries
Any query that's trying to describe the current state of the world needs WHERE is_current = TRUE. New analysts on the team will frequently miss this, especially if they're used to source system tables where each entity has exactly one row.
Mistake 3: Choosing the wrong surrogate key strategy
Using the source system's natural key (employee_id) as your primary key breaks the moment you add a second version of the same employee. Your surrogate key must be truly unique per row, not per entity. Use an auto-incrementing sequence or a UUID.
Mistake 4: Applying Type 2 to everything Not every column in a dimension needs Type 2 tracking. If a customer's email address is corrected for a typo, that's a Type 1 fix. If their billing tier changes, that's a Type 2 event. Consider tracking Type 1 and Type 2 columns in the same dimension — some fields get overwritten, others create new rows.
Mistake 5: Off-by-one errors in effective dates
Notice that when we expire Priya's old row, we set effective_to = '2025-01-14', and the new row starts effective_from = '2025-01-15'. There's no gap, and no overlap. If you use effective_to = '2025-01-15' for the old row, then a fact record on January 15th will match both rows simultaneously — and your joins will duplicate. Be precise about whether your effective dates are inclusive or exclusive.
Slowly Changing Dimensions are one of the most practical concepts in data engineering. Here's what you've built:
The most important shift in your thinking should be this: rows in a dimension don't represent entities, they represent versions of entities at a point in time. Once that clicks, the rest of SCD design becomes intuitive.
What to explore next:
Learning Path: Modern Data Stack