Imagine you're a data engineer at a fast-growing e-commerce company. Every day, millions of rows of order data, clickstream events, customer records, and inventory updates pour in from a dozen different systems. Your job is to get all of that raw information into a form that analysts can actually use — clean tables, consistent formats, sensible joins. The question isn't whether to transform that data. It's when.
For decades, the industry answered that question the same way: transform the data first, then load it. Scrub it, reshape it, validate it, and only once it's clean and tidy do you let it near your data warehouse. That approach is called ETL — Extract, Transform, Load — and it powered business intelligence for an entire generation. But sometime in the early 2010s, the rules changed. Cloud data warehouses got dramatically cheaper and more powerful. Storage costs collapsed. And a new approach — ELT, Extract, Load, Transform — emerged that flips the traditional order and delivers some surprisingly significant advantages. Today, ELT is the dominant pattern in the modern data stack.
By the end of this lesson, you'll understand exactly why the industry made that shift, when each approach still makes sense, and how to reason about your own data architecture decisions. You won't just know the definition of ELT and ETL — you'll understand the forces that shaped them.
What you'll learn:
This lesson is written for beginners, so no prior data engineering experience is required. You'll benefit from having a general sense of what a database is and a loose familiarity with SQL (knowing that it's a language used to query and reshape data is enough). No hands-on coding environment is required to follow along, though you'll have the opportunity to write some SQL in the exercise section.
ETL stands for Extract, Transform, Load. The three words describe the exact sequence of operations:
The key detail is that transformation happens in the middle, in a processing layer that exists specifically for this purpose. Data never arrives at the warehouse in its raw, messy state. By the time it lands, it's already been groomed.
This made a lot of sense in its historical context. Data warehouses in the 1990s and early 2000s were expensive. We're talking about on-premises hardware — massive Oracle or Teradata installations that cost hundreds of thousands of dollars per terabyte of storage. Compute was precious. You couldn't afford to store data you didn't need, and you certainly couldn't afford to run expensive SQL transformations repeatedly on a large dataset inside the warehouse. So you did the heavy lifting outside the warehouse, in cheaper compute, and you only let clean, final data through the door.
Think of it like running a restaurant with a very small kitchen. You can't bring in whole cows and process them on-site — you work with a supplier who delivers exactly the cuts you need, already trimmed and portioned. The prep happens before the delivery.
ETL pipelines also served another function: they enforced a contract. Because every transformation happened before the data entered the warehouse, analysts were always working with data that had already been validated. The warehouse was a source of truth by definition.
The ETL model has real advantages, but it carries some painful constraints that only become obvious over time.
Pipelines are brittle. When a source system changes — say, Salesforce adds a new field to the Opportunity object — your ETL pipeline often breaks. Because transformation logic is baked into the pipeline code itself (written in Java, Python, or a proprietary tool's drag-and-drop interface), every change requires a pipeline engineer to go in and fix the logic. Then re-test it. Then re-deploy it. This feedback loop can take days.
You lose the raw data. Once the ETL job transforms and loads data, the intermediate state is often gone. If a business analyst comes back six months later and says "wait, why did you filter out cancelled orders?" — you might not be able to reconstruct the original input. You've permanently discarded it.
The transformation layer becomes a bottleneck. As the organization grows and more teams need more transformations, the ETL server becomes a shared resource with a queue. Your pipeline runs at 2 AM, then waits for three other pipelines to finish, and data isn't ready until 7 AM. Analysts come in at 9 AM to stale dashboards.
Schema-on-write is inflexible. ETL requires you to define the final shape of your data before you load it. This is called schema-on-write. If the business changes its mind about what it needs — and it always does — you have to redesign the pipeline, not just the query.
ELT stands for Extract, Load, Transform. You extract data from source systems, load it into the data warehouse in its raw form, and then transform it inside the warehouse using SQL.
The sequence shift sounds minor. It isn't.
In ELT, your pipeline does the minimum necessary work to get data from A to B. It extracts raw data, maybe does some lightweight type coercion or encoding normalization, and loads it directly into a raw schema in the data warehouse. The warehouse gets a raw copy of everything — messy nulls, inconsistent casing, duplicate rows, weird vendor-specific codes and all.
Then, transformation is a separate step that runs inside the warehouse itself. Tools like dbt (data build tool) allow analytics engineers to write SQL SELECT statements that read from the raw tables and produce clean, modeled tables. Those clean tables are what analysts and dashboards consume.
Here's a simple mental model: instead of trimming the meat before delivery, you now have a spacious walk-in refrigerator. You take delivery of everything and store it all. Then your kitchen staff — using modern, efficient tools — does the prep work right there on-site, and they can always go back to the original cut if they need to reprocess it.
Modern cloud data warehouses — Snowflake, BigQuery, Databricks, Redshift — are fundamentally different machines than the on-premises warehouses of 20 years ago. A few properties make them ideal transformation engines:
Massive parallel processing. These systems are built to run SQL queries across hundreds of compute nodes simultaneously. A transformation that might take hours on an ETL server might complete in minutes — or seconds — in BigQuery. The warehouse itself has become extremely good at the kind of set-based operations that data transformation requires.
Elastic compute, separated from storage. In Snowflake, you can spin up a larger "virtual warehouse" (their term for a compute cluster) just for transformation runs, then scale it back down. You're not fighting for shared resources. Storage and compute are billed independently, which changes the economics completely.
Cheap, cheap storage. BigQuery charges roughly $20 per terabyte per month for storage. Storing all your raw data — including the ugly intermediate states — costs almost nothing at reasonable business data volumes. This removes the old pressure to discard data early.
SQL is universal. Every analyst and analytics engineer already knows SQL. When transformations live inside the warehouse as SQL queries, the whole team can read, review, and modify them. When they lived in Informatica pipelines or custom Java code, only the ETL engineers could touch them.
A typical modern ELT stack has three distinct layers, and it's worth understanding how they connect.
Layer 1: Extraction and Loading (Ingestion)
Tools like Fivetran, Airbyte, or Stitch handle the extract-and-load step. They maintain connectors to hundreds of source systems — Salesforce, Stripe, Shopify, Google Analytics, PostgreSQL, you name it — and they continuously sync raw data into a raw schema in your data warehouse. Their entire job is to replicate data faithfully and handle all the connector maintenance when source APIs change.
Layer 2: Transformation (Modeling)
This is where dbt lives. dbt lets you define transformations as SQL SELECT statements in a project repository. Each model is a .sql file that reads from raw tables (or from other models) and defines a new, clean table. dbt handles dependency resolution, documentation, testing, and scheduling. It runs inside your data warehouse — it doesn't move data out.
Layer 3: Consumption (Analytics)
Tools like Looker, Tableau, Metabase, or Mode connect to the final, transformed tables in your warehouse. Analysts write queries, build dashboards, and explore data — all against clean, well-modeled tables that dbt produced.
The separation of concerns here is elegant. The ingestion layer doesn't need to know anything about business logic. The transformation layer doesn't need to know anything about source system APIs. Each layer can evolve independently.
Let's say you're working with e-commerce order data. Your production database has an orders table with a status field that contains values like "fulfilled", "FULFILLED", "Fulfilled", and sometimes a legacy code "F". Analysts want a clean is_fulfilled boolean flag.
In an ETL world, you'd write a transformation step in your pipeline — probably Python or a proprietary tool's scripting interface — that runs before the data reaches the warehouse:
# Running on an ETL server, before data reaches the warehouse
def transform_order_status(row):
raw_status = row.get("status", "").strip().upper()
if raw_status in ("FULFILLED", "F"):
row["is_fulfilled"] = True
else:
row["is_fulfilled"] = False
# The raw 'status' field might be discarded here
return row
This gets the job done. But notice what happens when the source system adds a new status code "SHIP" for partially-shipped orders. You have to find this pipeline, update the Python code, test it, and redeploy it. And if the original status field wasn't preserved, you can't go back and re-examine the raw values.
In an ELT world, the raw data loads into the warehouse exactly as it came — messy values and all — into a table called raw.orders. Then a dbt model handles the cleaning:
-- models/staging/stg_orders.sql
-- This runs INSIDE the data warehouse as a SQL SELECT
SELECT
order_id,
customer_id,
created_at,
status AS raw_status, -- We keep the original for auditing
CASE
WHEN UPPER(TRIM(status)) IN ('FULFILLED', 'F') THEN TRUE
ELSE FALSE
END AS is_fulfilled,
order_total_cents / 100.0 AS order_total_dollars
FROM raw.orders
When the new "SHIP" status appears, an analytics engineer updates this SQL file, commits it to git, and runs dbt run. The change is tracked in version control. The raw data is unchanged and still available. Other models that build on stg_orders get the updated logic automatically. The whole process is transparent and auditable.
Tip: Notice that in the ELT model, we keep
raw_statusalongside the derivedis_fulfilled. Preserving raw values is a best practice — it gives you an escape hatch when business rules change.
ELT is the right default for most modern analytics use cases. But ETL isn't dead — there are situations where transforming before loading is still the correct choice.
Data privacy and compliance. If you're handling data governed by GDPR, HIPAA, or CCPA, you may be legally required to anonymize or mask personal data before it lands anywhere it could be broadly accessed. Loading raw PII (personally identifiable information) into a data warehouse where many people have access is a compliance risk. In this case, the ETL pipeline acts as a security boundary — you mask or hash sensitive fields before the data ever reaches the warehouse.
Streaming and real-time pipelines. ELT works well for batch data loads. But if you need to process events in real time — fraud detection, live inventory updates, operational alerting — the latency of the "load raw, then transform" cycle may be too high. Stream processing systems like Apache Kafka or Apache Flink often apply transformations in-flight before data reaches any storage layer.
Extremely constrained target systems. If your destination isn't a modern cloud data warehouse — say, you're loading into a legacy system with strict schema requirements and limited storage — you may need to transform data to a specific shape before loading.
The key principle: use ELT when your target system is powerful and storage is cheap. Use ETL when you have specific constraints that require earlier intervention.
You don't need a cloud data warehouse to practice this thinking. This exercise uses a sample scenario and SQL you can run in any SQL environment — including free tools like DB Browser for SQLite or DuckDB (which you can install locally with pip install duckdb).
The scenario: You receive a raw dump of customer records from a CRM system. The data is loaded into a table called raw_customers. Your task is to write an ELT-style transformation that produces a clean stg_customers table.
Step 1: Create and populate the raw table.
CREATE TABLE raw_customers (
customer_id INTEGER,
first_name TEXT,
last_name TEXT,
email TEXT,
signup_date TEXT,
plan_type TEXT,
lifetime_value_cents INTEGER
);
INSERT INTO raw_customers VALUES
(1, 'Maria', 'Santos', 'maria@example.com', '2022-03-15', 'PRO', 4999),
(2, 'james', 'okafor', 'JAMES@EXAMPLE.COM', '2022-07-01', 'free', 0),
(3, 'Chen', 'Wei', 'chen.wei@example.com', '2023-01-10', 'pro', 12400),
(4, 'Anika', 'Patel', NULL, '2023-05-20', 'FREE', 0),
(5, 'Robert', 'Kim', 'robert.kim@example.com', '2021-11-30', 'ENTERPRISE', 89900);
Step 2: Write a staging transformation.
Notice the raw data has problems: inconsistent casing in plan_type, inconsistent casing in email, signup_date stored as text, and a null email for one customer.
-- This is your stg_customers transformation
-- In a real ELT stack, this would be a dbt model
CREATE TABLE stg_customers AS
SELECT
customer_id,
-- Standardize name casing
UPPER(SUBSTR(first_name, 1, 1)) || LOWER(SUBSTR(first_name, 2)) AS first_name,
UPPER(SUBSTR(last_name, 1, 1)) || LOWER(SUBSTR(last_name, 2)) AS last_name,
-- Normalize email to lowercase
LOWER(email) AS email,
-- Flag missing emails for downstream handling
CASE WHEN email IS NULL THEN TRUE ELSE FALSE END AS is_missing_email,
-- Cast date string to a proper date
DATE(signup_date) AS signup_date,
-- Normalize plan type to lowercase
LOWER(plan_type) AS plan_type,
-- Flag paid plans explicitly
CASE WHEN LOWER(plan_type) IN ('pro', 'enterprise') THEN TRUE ELSE FALSE END AS is_paid,
-- Convert cents to dollars
lifetime_value_cents / 100.0 AS lifetime_value_dollars
FROM raw_customers;
Step 3: Inspect the results.
SELECT * FROM stg_customers;
You should see clean, consistent records — even for customer 4 (Anika Patel), where is_missing_email is TRUE so downstream models can handle the null safely rather than silently propagating it.
Notice that
raw_customersstill exists with all its original messiness intact. In ELT, the raw layer is append-only and never modified. Your staging model is a clean view of the truth, not a replacement for it.
Mistake 1: Treating the staging layer as the final layer.
New practitioners sometimes transform data once in a staging model and then have analysts query that staging model directly. Staging models are meant to be close to the source — they clean and rename columns but don't apply business logic. Business logic (calculating customer lifetime value segments, defining "active" users, attributing revenue to channels) should live in a separate "mart" or "reporting" layer built on top of staging. Keep the layers separate.
Mistake 2: Doing too much in the ingestion step.
Some ingestion tools let you configure column mappings, filters, and type conversions right in the connector. It's tempting to use these features heavily — but every transformation you bake into the ingestion layer is transformation logic that lives outside your version-controlled SQL. Keep ingestion as close to raw as possible. Let dbt handle the rest.
Mistake 3: Loading raw PII without thinking about access controls.
ELT means everyone on the data team can potentially see raw data, including sensitive fields like email addresses, social security numbers, or payment card information. Before you set up ELT pipelines for sensitive sources, define who has access to the raw schema. Most cloud data warehouses have row-level and column-level security features — use them.
Mistake 4: Assuming ELT eliminates the need to understand your data.
A common misconception is that because ELT defers transformation, you can "just load everything and figure it out later." Loading garbage in still produces garbage out. You still need to understand your source data, document your models, write tests (dbt has built-in data testing), and monitor for anomalies. ELT makes transformation more flexible, not optional.
Let's recap what you've learned.
ETL was designed for a world where warehouses were expensive and compute was scarce. You transformed data in an external processing layer before loading a clean version into the warehouse. This kept the warehouse tidy but made pipelines brittle, discarded raw data, and created engineering bottlenecks.
ELT emerged when cloud data warehouses became powerful enough to do transformation work cheaply at scale. You load raw data first — preserving everything — and then transform it inside the warehouse using SQL. This pattern is more flexible, auditable, and collaborative because transformations live as version-controlled SQL files that any team member can read and modify.
The modern data stack typically uses a managed ingestion tool (Fivetran, Airbyte) for the extract-and-load step, dbt for the transform step, and a BI tool (Looker, Tableau, Metabase) for consumption. Each layer has a clear, single responsibility.
ETL still makes sense for data privacy requirements, real-time streaming use cases, and constrained legacy target systems. ELT is the right default for analytics workloads on cloud data warehouses.
Where to go next:
The shift from ETL to ELT isn't just a technical preference — it reflects a deeper change in who owns data transformation and how quickly organizations can respond to new analytical questions. Understanding why the field moved in this direction will make you a better data practitioner, regardless of which specific tools your team chooses.
Learning Path: Modern Data Stack