Slowly Changing Dimensions in Practice: Handling Historical Data Changes in Your Warehouse

Introduction

Imagine you work at a retail company, and your data warehouse tracks customer information. A customer named Maria Santos moves from Austin, Texas to Denver, Colorado. Simple enough — you update her address. But three months later, your marketing team asks: "What were our sales to customers living in Texas during Q1?" You check your data. Maria bought $2,400 worth of product in January and February, when she was still in Austin. But now her record shows Denver. Your report says Denver. That's wrong.

This is the core problem that Slowly Changing Dimensions (SCDs) solve. Real-world data changes — customers move, employees change job titles, products get reclassified, prices shift. A warehouse that can't track when those changes happened can't answer historical questions accurately. And in data work, answering historical questions accurately is most of the job.

By the end of this lesson, you'll know how to design tables that preserve the full history of changes in your dimensional data, which strategy to use for different business scenarios, and how to implement those strategies with real SQL. You'll be able to look at a changing-data problem and make an informed decision about how to handle it — not just guess.

What you'll learn:

What dimensions are and why they "change slowly" in a warehouse context
The three most common SCD strategies (Type 1, Type 2, and Type 3) and when to use each
How to implement SCD Type 2 — the industry workhorse — with practical SQL
How to query an SCD Type 2 table correctly so your historical reports are accurate
How to choose the right SCD type for a given business requirement

Prerequisites

Basic SQL comfort: you should be able to write a SELECT with a WHERE clause and understand JOIN
A general idea of what a data warehouse does (stores data for analytics, not transactions)
No prior knowledge of dimensional modeling or SCD concepts required — we'll build from scratch

What Is a Dimension, Anyway?

Before we talk about slowly changing dimensions, let's nail down what a dimension is.

In a data warehouse, you typically organize data into two kinds of tables:

Fact tables record things that happened — a sale, a click, a shipment, a payment. They tend to be tall (millions of rows) and contain numbers you want to measure: quantities, amounts, durations.
Dimension tables describe the context around those events — the customer who bought something, the product that was sold, the store where it happened, the date it occurred. They tend to be wider (lots of descriptive columns) and shorter (thousands to hundreds of thousands of rows).

When you run an analysis like "total revenue by customer region," you're joining the sales fact table to the customer dimension table and grouping by region. The dimension provides the context that makes the fact meaningful.

The phrase "slowly changing" just means that the data in dimension tables isn't static — it changes, but not constantly the way transaction data does. A customer address changes occasionally. An employee's job title changes a few times in their career. A product's category might be reorganized once every few years. These are slow, irregular changes — and your warehouse needs a strategy for each one.

The Three Core Strategies

There are six or seven SCD types in the academic literature, but in practice, three of them do 95% of the work. Let's understand each one by working through the same scenario: a customer, David Kim, who gets promoted and moves to a new city.

His record today looks like this:

customer_id | name       | city    | job_title
------------|------------|---------|------------------
1001        | David Kim  | Chicago | Account Executive

Now David gets promoted to Senior Account Executive and relocates to Seattle.

SCD Type 1: Overwrite and Forget

Type 1 simply overwrites the old value with the new one. No history is kept. After the update:

customer_id | name       | city    | job_title
------------|------------|---------|----------------------
1001        | David Kim  | Seattle | Senior Account Executive

Chicago and "Account Executive" are gone. Forever.

When is this appropriate? When the change is a correction rather than a business event. If David's city was misspelled as "Chigaco" and you fix it to "Chicago," nobody needs to know that the misspelling existed. Similarly, if a product's SKU format gets standardized across the catalog, that's a correction, not a meaningful business change.

Type 1 is dangerous when the change reflects something real about the business that affects historical analysis. If you Type 1 David's city, any past sales attributed to him will suddenly appear as Seattle sales, which is factually wrong.

Implementation: Just a plain UPDATE statement. There's nothing clever here.

UPDATE dim_customer
SET city = 'Seattle',
    job_title = 'Senior Account Executive',
    updated_at = CURRENT_TIMESTAMP
WHERE customer_id = 1001;

SCD Type 2: Add a New Row, Keep the Old One

Type 2 is the workhorse of historical data management. Instead of overwriting, you close out the old record by marking it as expired, and insert a brand new row for the updated information.

After David's move, your table looks like this:

surrogate_key | customer_id | name       | city    | job_title                | effective_from | effective_to | is_current
--------------|-------------|------------|---------|--------------------------|----------------|--------------|-----------
5001          | 1001        | David Kim  | Chicago | Account Executive        | 2022-03-15     | 2024-11-30   | FALSE
5002          | 1001        | David Kim  | Seattle | Senior Account Executive | 2024-12-01     | 9999-12-31   | TRUE

Let's unpack the new columns here, because they're doing all the work:

surrogate_key: A new, warehouse-generated primary key (often called a surrogate key) that uniquely identifies this version of the record. This is critical — customer_id alone no longer uniquely identifies a row, because there are now two rows for customer 1001.
customer_id: The original identifier from the source system. Sometimes called the "natural key" or "business key." You keep this so you can link back to the source.
effective_from / effective_to: The date range during which this version of the record was true. The old row's effective_to is set to the day before the new row becomes active.
is_current: A convenience flag that makes it easy to query the current state of each customer without messing with dates.

The sentinel value 9999-12-31 for effective_to on current rows is a widely-used convention. It means "this record has no end date yet." Some teams use NULL instead, which also works — just make sure your team is consistent.

When is this appropriate? Whenever you need to answer historical questions like "What was true about this customer at the time of this transaction?" For most business dimensions — customers, employees, products — Type 2 is the right default.

SCD Type 3: Add a Column for the Previous Value

Type 3 adds a new column to store the previous value, rather than adding a new row.

customer_id | name       | current_city | previous_city | job_title
------------|------------|--------------|---------------|----------------------
1001        | David Kim  | Seattle      | Chicago       | Senior Account Executive

You can now see both the current and the immediately preceding city. But you've lost the date of the change, and you can only store one previous value. If David moves again — say, from Seattle to Portland — you'd overwrite "Chicago" with "Seattle," and the Chicago history is gone.

When is this appropriate? When you need to support a "before and after" comparison for a specific, planned change. For example, if your company reorganizes its sales territories and you want to report on both the old and new territory assignment for a transition period, Type 3 makes sense. It's a narrow use case.

The Practical Rule of Thumb: Default to Type 2. Use Type 1 only for corrections. Consider Type 3 only when you have a single, well-understood transition you need to track alongside the current state, and you know it won't happen repeatedly.

Implementing SCD Type 2 in SQL

Let's build this properly. We'll use a realistic scenario: a dim_employee table at a software company. Employees change job titles, departments, and managers over time.

Step 1: Create the Table Structure

CREATE TABLE dim_employee (
    employee_sk       BIGINT        PRIMARY KEY,  -- surrogate key
    employee_id       VARCHAR(20)   NOT NULL,     -- natural key from HR system
    full_name         VARCHAR(100)  NOT NULL,
    department        VARCHAR(50)   NOT NULL,
    job_title         VARCHAR(50)   NOT NULL,
    manager_id        VARCHAR(20),
    location          VARCHAR(50)   NOT NULL,
    effective_from    DATE          NOT NULL,
    effective_to      DATE          NOT NULL DEFAULT '9999-12-31',
    is_current        BOOLEAN       NOT NULL DEFAULT TRUE
);

Step 2: Load Initial Data

When an employee first enters the system, they get one row:

INSERT INTO dim_employee (
    employee_sk, employee_id, full_name, department,
    job_title, manager_id, location, effective_from
)
VALUES
    (1, 'E-0042', 'Priya Nair', 'Engineering', 'Software Engineer', 'E-0011', 'Boston', '2021-06-01'),
    (2, 'E-0055', 'Marcus Webb', 'Sales', 'Account Executive', 'E-0018', 'Chicago', '2020-09-15'),
    (3, 'E-0061', 'Fatima Al-Rashid', 'Engineering', 'Staff Engineer', 'E-0011', 'Remote', '2019-03-04');

Step 3: Process a Change

Priya gets promoted to Senior Software Engineer and moves from Boston to the Austin office, effective January 15, 2025. Here's the two-step SCD Type 2 update:

First, expire the old row:

UPDATE dim_employee
SET effective_to = '2025-01-14',
    is_current   = FALSE
WHERE employee_id = 'E-0042'
  AND is_current = TRUE;

Then, insert the new row:

INSERT INTO dim_employee (
    employee_sk, employee_id, full_name, department,
    job_title, manager_id, location, effective_from
)
VALUES
    (4, 'E-0042', 'Priya Nair', 'Engineering', 'Senior Software Engineer', 'E-0011', 'Austin', '2025-01-15');

Notice that only the columns that changed are different between the two rows. Everything else — full_name, department, manager_id — carries forward unchanged.

Warning: In production, these two statements should be wrapped in a transaction so they succeed or fail together. If the INSERT fails after the UPDATE, you've closed out Priya's old record without creating a new one, which is data corruption.

BEGIN;

UPDATE dim_employee
SET effective_to = '2025-01-14',
    is_current   = FALSE
WHERE employee_id = 'E-0042'
  AND is_current = TRUE;

INSERT INTO dim_employee (
    employee_sk, employee_id, full_name, department,
    job_title, manager_id, location, effective_from
)
VALUES
    (4, 'E-0042', 'Priya Nair', 'Engineering', 'Senior Software Engineer', 'E-0011', 'Austin', '2025-01-15');

COMMIT;

Step 4: Querying Correctly

This is where many analysts trip up. Now that your dimension has multiple rows per employee, naive queries will produce wrong results.

Wrong — this will double-count Priya:

SELECT department, COUNT(DISTINCT employee_id) AS headcount
FROM dim_employee
GROUP BY department;

If you run this, Priya appears twice in Engineering because she has two rows.

Correct — filter for current records:

SELECT department, COUNT(DISTINCT employee_id) AS headcount
FROM dim_employee
WHERE is_current = TRUE
GROUP BY department;

Historical query — what did our Engineering headcount look like on January 1, 2025?

SELECT department, COUNT(DISTINCT employee_id) AS headcount
FROM dim_employee
WHERE department = 'Engineering'
  AND '2025-01-01' BETWEEN effective_from AND effective_to
GROUP BY department;

On January 1, 2025, Priya was still a Software Engineer in Boston. The query will pick up her old row correctly, because 2025-01-01 falls within 2021-06-01 to 2025-01-14.

Joining to a fact table:

Here's the real payoff. Suppose you have a fact_project_assignments table that records when employees were assigned to projects, with an assignment_date column. You want to know the job title each employee held at the time of the assignment:

SELECT
    fa.project_id,
    de.full_name,
    de.job_title,
    de.department,
    fa.hours_logged
FROM fact_project_assignments fa
JOIN dim_employee de
    ON fa.employee_id = de.employee_id
    AND fa.assignment_date BETWEEN de.effective_from AND de.effective_to;

This join condition is what makes SCD Type 2 so powerful. You're not just matching on employee_id — you're asking the dimension to tell you which version of that employee existed on the date of each specific fact row. Priya's January 10th assignment will join to her "Software Engineer in Boston" row. Her February 1st assignment will join to her "Senior Software Engineer in Austin" row.

Automating SCD Type 2 with a Merge Pattern

Manual INSERT/UPDATE pairs don't scale to thousands of records. In a real warehouse, you'll typically load changes from a staging table and use a MERGE statement (available in most modern databases and cloud warehouses).

-- Assume stg_employee contains today's employee records from the HR source system
-- Step 1: Expire rows where tracked columns have changed
UPDATE dim_employee d
SET effective_to = CURRENT_DATE - INTERVAL '1 day',
    is_current   = FALSE
FROM stg_employee s
WHERE d.employee_id = s.employee_id
  AND d.is_current = TRUE
  AND (
      d.job_title    <> s.job_title    OR
      d.department   <> s.department   OR
      d.location     <> s.location     OR
      d.manager_id   <> s.manager_id
  );

-- Step 2: Insert new rows for changed and new employees
INSERT INTO dim_employee (
    employee_sk, employee_id, full_name, department,
    job_title, manager_id, location, effective_from
)
SELECT
    nextval('dim_employee_sk_seq'),
    s.employee_id, s.full_name, s.department,
    s.job_title, s.manager_id, s.location,
    CURRENT_DATE
FROM stg_employee s
WHERE NOT EXISTS (
    SELECT 1
    FROM dim_employee d
    WHERE d.employee_id = s.employee_id
      AND d.is_current = TRUE
);

Tip: Tools like dbt (data build tool) have built-in SCD Type 2 snapshots via the dbt snapshot feature. If you're working in a modern data stack, that's worth learning — it handles the surrogate key generation, effective date management, and hashing of tracked columns for you. But understanding the underlying SQL is essential for debugging when something goes wrong.

Hands-On Exercise

Work through this exercise to solidify what you've learned.

Setup: Create a dim_product table with the following columns: product_sk, product_id, product_name, category, unit_price, supplier_id, effective_from, effective_to, is_current.

Initial load: Insert three products:

Product P-100: "Mountain Trail Boots", category "Footwear", price $129.99, supplier SUP-07, effective 2023-01-01
Product P-101: "Lightweight Tent 2P", category "Camping", price $299.00, supplier SUP-12, effective 2023-01-01
Product P-102: "Merino Base Layer", category "Apparel", price $74.99, supplier SUP-03, effective 2023-01-01

Changes to process (effective 2025-02-01):

P-100's price increases to $139.99 and supplier changes to SUP-09
P-102's category changes from "Apparel" to "Layering Systems"
P-101 has no changes

Questions to answer:

Write the SQL to expire the changed rows and insert the new versions.
Query the table to show what each product's category was on January 15, 2025.
Query the table to show the current price and category for all products.
How many total rows are in dim_product after processing the changes?

Common Mistakes & Troubleshooting

Mistake 1: Joining on just the natural key When you join a fact table to an SCD Type 2 dimension on only employee_id (or product_id, etc.), you'll get a fan-out — multiple rows returned per fact row. Always include the date range condition in your join.

Mistake 2: Forgetting the is_current filter on headcount queries Any query that's trying to describe the current state of the world needs WHERE is_current = TRUE. New analysts on the team will frequently miss this, especially if they're used to source system tables where each entity has exactly one row.

Mistake 3: Choosing the wrong surrogate key strategy Using the source system's natural key (employee_id) as your primary key breaks the moment you add a second version of the same employee. Your surrogate key must be truly unique per row, not per entity. Use an auto-incrementing sequence or a UUID.

Mistake 4: Applying Type 2 to everything Not every column in a dimension needs Type 2 tracking. If a customer's email address is corrected for a typo, that's a Type 1 fix. If their billing tier changes, that's a Type 2 event. Consider tracking Type 1 and Type 2 columns in the same dimension — some fields get overwritten, others create new rows.

Mistake 5: Off-by-one errors in effective dates Notice that when we expire Priya's old row, we set effective_to = '2025-01-14', and the new row starts effective_from = '2025-01-15'. There's no gap, and no overlap. If you use effective_to = '2025-01-15' for the old row, then a fact record on January 15th will match both rows simultaneously — and your joins will duplicate. Be precise about whether your effective dates are inclusive or exclusive.

Summary & Next Steps

Slowly Changing Dimensions are one of the most practical concepts in data engineering. Here's what you've built:

A mental model of why dimensions change and why it matters for historical accuracy
Clear understanding of when to use Type 1 (overwrite corrections), Type 2 (full historical tracking), and Type 3 (before/after comparisons for a known transition)
A working SCD Type 2 implementation in SQL — table design, initial load, change processing, and correct querying patterns
Awareness of the key mistakes teams make and how to avoid them

The most important shift in your thinking should be this: rows in a dimension don't represent entities, they represent versions of entities at a point in time. Once that clicks, the rest of SCD design becomes intuitive.

What to explore next:

dbt Snapshots: dbt's built-in SCD Type 2 implementation uses row hashing to detect changes efficiently. It's worth seeing how a production tool implements what you've done by hand here.
SCD Type 4 and Type 6: Type 4 separates history into a separate "history table," and Type 6 is a hybrid that adds Type 1, 2, and 3 tracking simultaneously. These are less common but worth knowing exist.
Dimensional Modeling fundamentals: If SCD clicked for you, the next level is understanding the full dimensional model — star schemas, conformed dimensions, and slowly changing facts. Kimball's The Data Warehouse Toolkit is the canonical resource.
Change Data Capture (CDC): In practice, how do changes get from your source systems into a staging table in the first place? CDC tools like Debezium, Fivetran, or Airbyte solve this upstream problem.

Slowly Changing Dimensions in Practice: Handling Historical Data Changes in Your Warehouse

Slowly Changing Dimensions in Practice: Handling Historical Data Changes in Your Warehouse

Introduction

Prerequisites

What Is a Dimension, Anyway?

The Three Core Strategies

SCD Type 1: Overwrite and Forget

SCD Type 2: Add a New Row, Keep the Old One

SCD Type 3: Add a Column for the Previous Value

Implementing SCD Type 2 in SQL

Step 1: Create the Table Structure

Step 2: Load Initial Data

Step 3: Process a Change

Step 4: Querying Correctly

Automating SCD Type 2 with a Merge Pattern

Hands-On Exercise

Common Mistakes & Troubleshooting

Summary & Next Steps

Related Articles

Batch vs. Stream Processing: Choosing the Right Ingestion Pattern for Your Pipeline

Designing Idempotent Data Pipelines: Guaranteeing Exactly-Once Semantics in Production

Cloud Data Warehouses: Snowflake vs BigQuery vs Redshift - Complete Comparison Guide

Related Articles

Data Engineering🌱 Foundation
Batch vs. Stream Processing: Choosing the Right Ingestion Pattern for Your Pipeline
17 min

Data Engineering🔥 Expert
Designing Idempotent Data Pipelines: Guaranteeing Exactly-Once Semantics in Production
28 min

Data Engineering🌱 Foundation
Cloud Data Warehouses: Snowflake vs BigQuery vs Redshift - Complete Comparison Guide
13 min