
Picture this: Your company just closed its biggest quarter ever, and your CEO wants a comprehensive dashboard showing sales performance, customer behavior, and inventory trends—all by tomorrow morning. You have data scattered across your CRM system, your e-commerce platform, three different databases, and a handful of CSV files from your marketing team. How do you transform this data chaos into actionable insights?
This scenario happens every day in modern organizations, and it highlights one of the most fundamental decisions in data engineering: how you move and transform your data. The approach you choose—ETL (Extract, Transform, Load) or ELT (Extract, Load, Transform)—shapes how fast you can deliver results, how much your infrastructure costs, and how easily you can adapt to changing business requirements.
By the end of this lesson, you'll understand both approaches deeply enough to make confident architectural decisions for your own data projects. More importantly, you'll know how to explain your choice to stakeholders and implement it successfully.
Prerequisites:
You should be familiar with basic database concepts (tables, queries, joins) and understand what a data warehouse is conceptually. If you've ever written a SQL query or moved data from one system to another, you're ready for this lesson. We'll use SQL examples throughout, but don't worry—I'll explain each query as we go.
Before diving into ETL and ELT, let's establish why we need these approaches at all. Modern organizations generate data everywhere: customer interactions on websites, sales transactions, sensor readings, social media mentions, financial records. This data lives in different systems, uses different formats, and updates at different frequencies.
Raw operational data rarely matches what you need for analysis. Consider a typical e-commerce transaction record:
-- Raw transaction data from your e-commerce system
{
  "transaction_id": "TXN_20240315_001847",
  "customer_email": "sarah.martinez@email.com",
  "items": [
    {"sku": "LAPTOP_DELL_15", "quantity": 1, "unit_price": 1299.99},
    {"sku": "MOUSE_LOGITECH", "quantity": 2, "unit_price": 29.99}
  ],
  "timestamp": "2024-03-15T14:18:47Z",
  "payment_method": "credit_card",
  "shipping_address": "123 Oak St, Portland, OR 97205"
}
For analysis, you might need this data transformed into a flat structure with calculated fields:
-- Transformed data ready for analysis
SELECT
  'TXN_20240315_001847' as transaction_id,
  'sarah.martinez@email.com' as customer_email,
  DATE(TIMESTAMP '2024-03-15T14:18:47Z') as transaction_date,
  EXTRACT(HOUR FROM TIMESTAMP '2024-03-15T14:18:47Z') as transaction_hour,
  1299.99 + (2 * 29.99) as total_amount,
  3 as total_items,
  'credit_card' as payment_method,
  'OR' as state_code,
  'Electronics' as primary_category;
This transformation—from nested JSON to analytical tables with derived fields—is where ETL and ELT strategies diverge.
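That flattening step can be sketched in plain Python to make the derived fields concrete. This is an illustrative sketch, not a production parser: field names come from the example payload above, and the state parse assumes a "street, city, ST zip" address format.

```python
import json
from datetime import datetime

def flatten_transaction(raw: str) -> dict:
    """Flatten a nested transaction record into one analytical row."""
    txn = json.loads(raw)
    # Python's fromisoformat doesn't accept a trailing "Z", so normalize it
    ts = datetime.fromisoformat(txn["timestamp"].replace("Z", "+00:00"))
    items = txn["items"]
    return {
        "transaction_id": txn["transaction_id"],
        "customer_email": txn["customer_email"],
        "transaction_date": ts.date().isoformat(),
        "transaction_hour": ts.hour,
        # Derived fields: totals are computed, not stored in the source
        "total_amount": round(sum(i["quantity"] * i["unit_price"] for i in items), 2),
        "total_items": sum(i["quantity"] for i in items),
        "payment_method": txn["payment_method"],
        # Naive parse: grabs the state code after the last comma
        "state_code": txn["shipping_address"].rsplit(",", 1)[-1].strip().split()[0],
    }

raw = json.dumps({
    "transaction_id": "TXN_20240315_001847",
    "customer_email": "sarah.martinez@email.com",
    "items": [
        {"sku": "LAPTOP_DELL_15", "quantity": 1, "unit_price": 1299.99},
        {"sku": "MOUSE_LOGITECH", "quantity": 2, "unit_price": 29.99},
    ],
    "timestamp": "2024-03-15T14:18:47Z",
    "payment_method": "credit_card",
    "shipping_address": "123 Oak St, Portland, OR 97205",
})
row = flatten_transaction(raw)
```

Whether this computation runs before loading or inside the warehouse is exactly the ETL vs ELT question.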
ETL stands for Extract, Transform, Load, and the order matters. In ETL, you pull data from source systems, transform it into the format you need, then load the cleaned, processed data into your destination system.
Think of ETL like a factory assembly line. Raw materials (data) come in one end, go through various processing stations (transformations), and finished products (cleaned data) come out the other end, ready for immediate use.
-- ETL Pipeline Example: Processing Daily Sales Data
-- Step 1: Extract (conceptual - actual extraction varies by source)
-- Pull data from multiple sources:
-- - transactions table from PostgreSQL
-- - product_catalog from MySQL
-- - customer_segments from CSV files
-- Step 2: Transform (happens in processing engine before loading)
WITH transformed_sales AS (
  SELECT
    t.transaction_id,
    t.customer_id,
    c.segment as customer_segment,
    t.transaction_date,
    t.total_amount,
    p.category as product_category,
    p.brand,
    -- Calculate business metrics during transformation
    CASE
      WHEN t.total_amount > 500 THEN 'High Value'
      WHEN t.total_amount > 100 THEN 'Medium Value'
      ELSE 'Low Value'
    END as transaction_tier,
    -- Clean and standardize data
    UPPER(TRIM(p.brand)) as standardized_brand
  FROM raw_transactions t
  JOIN customer_data c ON t.customer_id = c.customer_id
  JOIN product_catalog p ON t.product_id = p.product_id
  WHERE t.transaction_date = CURRENT_DATE - 1  -- Yesterday's data
)
-- Step 3: Load (insert transformed data into data warehouse)
INSERT INTO analytics_db.daily_sales_summary
SELECT * FROM transformed_sales;
In this ETL example, all the heavy lifting—joins, calculations, data cleaning—happens before the data reaches your analytics database. What gets loaded is clean, processed, ready-to-use data.
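The assembly-line idea can be made runnable in a few lines of Python, with in-memory sqlite3 databases standing in for the source system and the warehouse. Table names and the tier thresholds mirror the SQL example above; everything else is illustrative.

```python
import sqlite3

# "Source system": an operational database with raw transactions
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE raw_transactions (transaction_id TEXT, total_amount REAL)")
source.executemany("INSERT INTO raw_transactions VALUES (?, ?)",
                   [("t1", 650.0), ("t2", 120.0), ("t3", 40.0)])

def tier(amount: float) -> str:
    # Same tiering logic as the CASE expression in the SQL example
    if amount > 500:
        return "High Value"
    if amount > 100:
        return "Medium Value"
    return "Low Value"

# Extract
rows = source.execute(
    "SELECT transaction_id, total_amount FROM raw_transactions").fetchall()

# Transform -- outside the warehouse, which is the defining trait of ETL
transformed = [(tid, amt, tier(amt)) for tid, amt in rows]

# Load only the processed result into the "warehouse"
warehouse = sqlite3.connect(":memory:")
warehouse.execute("""CREATE TABLE daily_sales_summary (
    transaction_id TEXT, total_amount REAL, transaction_tier TEXT)""")
warehouse.executemany("INSERT INTO daily_sales_summary VALUES (?, ?, ?)", transformed)

tiers = [r[0] for r in warehouse.execute(
    "SELECT transaction_tier FROM daily_sales_summary ORDER BY transaction_id")]
```

Note that the warehouse never sees the raw rows: if you later want a different tiering scheme, you must rerun the pipeline.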
Traditional ETL tools include Informatica PowerCenter, Microsoft SSIS, Talend, and IBM DataStage.
Here's a typical ETL architecture:
Source Systems         ETL Processing Layer        Destination
      |                         |                       |
[CRM Database] ──┐              |               [Data Warehouse]
                 │              |                       |
[E-commerce API] ├──► [ETL Engine] ─────────►   [Clean Tables]
                 │    (Transforms               [Aggregated Views]
[CSV Files] ─────┘     data here)               [Business Metrics]
ETL works best when:
1. You have limited storage or compute in your destination system
-- Example: Loading into a smaller database with limited resources
-- Transform first to reduce the data volume
SELECT
DATE_TRUNC('month', order_date) as month,
product_category,
SUM(revenue) as total_revenue,
COUNT(*) as order_count
FROM raw_orders
WHERE order_date >= '2024-01-01'
GROUP BY DATE_TRUNC('month', order_date), product_category
-- Load only monthly summaries instead of individual transactions
2. You need consistent, validated data
-- ETL ensures data quality before loading
SELECT
customer_id,
order_date,
revenue
FROM raw_orders
WHERE customer_id IS NOT NULL -- Remove invalid records
AND revenue > 0 -- Remove negative/zero amounts
AND order_date <= CURRENT_DATE -- Remove future dates
AND LENGTH(customer_id) = 10 -- Validate ID format
3. Your analysts need data immediately available
Once ETL completes, your data warehouse contains processed data that's ready for immediate querying—no additional transformation required.
Pro tip: ETL is often the better choice when you're working with traditional data warehouses like Oracle, SQL Server, or older systems with limited processing power.
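The validation rules from the data-quality example above (non-null IDs, positive revenue, no future dates, a fixed-length ID format) translate directly into a pre-load filter. Here is a hedged Python sketch of how an ETL engine might apply them; the field names and sample records are illustrative.

```python
from datetime import date

def is_valid(record: dict, today: date) -> bool:
    """Apply the same checks as the SQL WHERE clause before loading."""
    return (
        record.get("customer_id") is not None            # remove invalid records
        and len(str(record["customer_id"])) == 10        # validate ID format
        and record.get("revenue", 0) > 0                 # remove negative/zero amounts
        and record.get("order_date") is not None
        and record["order_date"] <= today                # remove future dates
    )

rows = [
    {"customer_id": "CUST000001", "revenue": 49.99, "order_date": date(2024, 3, 1)},
    {"customer_id": None,         "revenue": 10.00, "order_date": date(2024, 3, 1)},
    {"customer_id": "CUST000002", "revenue": -5.00, "order_date": date(2024, 3, 1)},
    {"customer_id": "CUST000003", "revenue": 20.00, "order_date": date(2099, 1, 1)},
]
clean = [r for r in rows if is_valid(r, today=date(2024, 3, 15))]
```

Only the first record survives; the bad rows never reach the warehouse, which is the point of validating during the transform step.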
ELT flips the script: Extract, Load, Transform. You pull data from sources and load it into your destination system in its raw or minimally processed form, then transform it using the computational power of your destination system.
Think of ELT like a modern streaming service. Instead of processing everything upfront (like old-school TV production), you store everything and process it on-demand when someone requests it.
-- ELT Pipeline Example: Same sales data, different approach
-- Step 1: Extract and Load (minimal processing)
-- Raw data loaded directly into data lake/warehouse:
CREATE TABLE raw_transactions (
  transaction_id VARCHAR(50),
  customer_email VARCHAR(255),
  raw_items JSON,
  transaction_timestamp TIMESTAMP,
  payment_method VARCHAR(50),
  shipping_address TEXT
);
-- Load data as-is (JSON and all)
INSERT INTO raw_transactions VALUES
  ('TXN_20240315_001847', 'sarah.martinez@email.com',
   '[{"sku": "LAPTOP_DELL_15", "quantity": 1, "unit_price": 1299.99}]',
   '2024-03-15 14:18:47', 'credit_card', '123 Oak St, Portland, OR 97205');
-- Step 2: Transform using destination system's power
CREATE VIEW sales_analytics AS
SELECT
  transaction_id,
  customer_email,
  DATE(transaction_timestamp) as transaction_date,
  EXTRACT(HOUR FROM transaction_timestamp) as transaction_hour,
  -- Parse the JSON in the warehouse and aggregate per transaction
  SUM(CAST(JSON_EXTRACT_SCALAR(item, '$.quantity') AS INTEGER)) as total_items,
  SUM(CAST(JSON_EXTRACT_SCALAR(item, '$.unit_price') AS DECIMAL) *
      CAST(JSON_EXTRACT_SCALAR(item, '$.quantity') AS INTEGER)) as total_amount
FROM raw_transactions,
  UNNEST(JSON_EXTRACT_ARRAY(raw_items)) as item
GROUP BY transaction_id, customer_email, transaction_timestamp;
In ELT, the transformation happens inside your data warehouse or data lake, using its processing power rather than a separate ETL engine.
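The same load-raw-then-transform flow can be demonstrated with Python's sqlite3 module standing in for a warehouse, assuming a SQLite build with the JSON1 functions enabled (true for most modern Python distributions). Here `json_each` plays the role the example gives to `UNNEST`; names and values are illustrative.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE raw_transactions (transaction_id TEXT, raw_items TEXT)")

# Step 1: Load as-is -- the nested JSON goes straight into the table
db.execute("INSERT INTO raw_transactions VALUES (?, ?)",
           ("TXN_1", '[{"sku": "LAPTOP", "quantity": 1, "unit_price": 1299.99},'
                     ' {"sku": "MOUSE", "quantity": 2, "unit_price": 29.99}]'))

# Step 2: Transform on read, using the destination's own SQL engine
total = db.execute("""
    SELECT SUM(json_extract(item.value, '$.quantity') *
               json_extract(item.value, '$.unit_price'))
    FROM raw_transactions, json_each(raw_items) AS item
    WHERE transaction_id = 'TXN_1'
""").fetchone()[0]
```

Because the raw JSON is still sitting in the table, a new question tomorrow (say, revenue per SKU) needs only a new query, not a new pipeline.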
Modern ELT tools include Fivetran and Airbyte for extraction and loading, and dbt for transformation inside the warehouse.
Here's a typical ELT architecture:
Source Systems         Ingestion Layer           Destination System
      |                       |                          |
[CRM Database] ──┐            |                  [Data Lake/Warehouse]
                 │            |                          |
[E-commerce API] ├──► [Data Sync] ─────────►     [Raw Data Tables]
                 │    (Minimal                           |
[CSV Files] ─────┘     transformation)           [dbt Models]       ← Transformation
                                                 [Analytics Views]    happens here
ELT works best when:
1. You have powerful, scalable compute in your destination
-- Example: Using BigQuery's processing power for complex transformations
-- This would be expensive/slow in traditional ETL, but fast in BigQuery
WITH customer_behavior AS (
  SELECT
    customer_id,
    -- Complex window functions: rolling average over the current
    -- and six preceding orders per customer
    AVG(order_amount) OVER (
      PARTITION BY customer_id
      ORDER BY order_date
      ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) as rolling_avg_last_7_orders,
    -- Machine learning functions (shown conceptually: BigQuery ML's
    -- ML.PREDICT is actually invoked as a table function in the FROM clause)
    ML.PREDICT(MODEL customer_ltv_model,
               STRUCT(order_amount, days_since_last_order)) as predicted_ltv
  FROM raw_orders
)
SELECT * FROM customer_behavior;
2. You need to store data for unknown future use cases
-- Store everything now, figure out what you need later
CREATE TABLE event_stream (
timestamp TIMESTAMP,
user_id STRING,
event_type STRING,
raw_payload JSON -- Store complete event data
);
-- Future analyst can query for new insights
SELECT
JSON_EXTRACT_SCALAR(raw_payload, '$.page_url') as page,
COUNT(*) as page_views
FROM event_stream
WHERE event_type = 'page_view'
AND timestamp > CURRENT_TIMESTAMP - INTERVAL 7 DAY
GROUP BY JSON_EXTRACT_SCALAR(raw_payload, '$.page_url');
3. Your data sources change frequently
With ELT, you're not locked into transformation logic written months ago. You can always go back to raw data and create new transformations.
Pro tip: ELT is often the better choice when working with modern cloud data warehouses like Snowflake, BigQuery, or Redshift, which are designed to handle massive compute workloads.
Let's compare these approaches across the dimensions that matter most for your decision:
| Aspect | ETL | ELT |
|---|---|---|
| Transformation Location | Separate processing system | Inside destination system |
| Data Storage | Only processed data stored | Raw + processed data stored |
| Flexibility | Changes require pipeline updates | Query-time transformations possible |
| Performance | Fast queries (pre-processed) | Depends on destination compute power |
| Storage Costs | Lower (only final data) | Higher (raw + processed data) |
| Compute Costs | Dedicated ETL infrastructure | Use destination's compute |
| Data Quality | Enforced during processing | Can be enforced at query time |
| Time to Insights | Slower (wait for ETL) | Faster (query raw data immediately) |
Here's how the same analysis might perform differently:
-- ETL Approach: Pre-calculated daily summaries
-- Query time: ~50ms (data already summarized)
SELECT
sales_date,
total_revenue,
order_count,
avg_order_value
FROM daily_sales_summary
WHERE sales_date BETWEEN '2024-01-01' AND '2024-03-31';
-- ELT Approach: Calculate on-demand
-- Query time: ~2-5 seconds (but more flexible)
SELECT
DATE(order_timestamp) as sales_date,
SUM(order_amount) as total_revenue,
COUNT(*) as order_count,
AVG(order_amount) as avg_order_value
FROM raw_orders
WHERE DATE(order_timestamp) BETWEEN '2024-01-01' AND '2024-03-31'
GROUP BY DATE(order_timestamp);
-- ETL: Store only what you need
CREATE TABLE quarterly_metrics (
  quarter VARCHAR(10),
  product_category VARCHAR(50),
  revenue DECIMAL(15,2),
  customer_count INTEGER
);
-- Storage: ~1GB for 5 years of quarterly data
-- ELT: Store everything
CREATE TABLE all_transactions (
  transaction_id VARCHAR(50),
  customer_id VARCHAR(50),
  product_details JSON,
  transaction_timestamp TIMESTAMP
  -- ... all raw fields
);
-- Storage: ~500GB for the same 5 years, but can answer any question
Use this framework to choose between ETL and ELT for your specific situation:
Choose ETL when:
1. Limited destination compute power
2. Predictable, stable requirements
-- Example: Regulatory reporting with fixed requirements
-- You know exactly what data you need, format doesn't change
SELECT
report_month,
total_loans_issued,
average_credit_score,
default_rate
FROM monthly_lending_summary
WHERE report_month >= '2024-01-01';
3. Performance is critical
4. Data quality must be enforced
-- ETL ensures bad data never reaches analysts
CASE
  WHEN customer_email IS NULL OR LENGTH(customer_email) < 5 THEN NULL
  WHEN customer_email NOT LIKE '%@%' THEN NULL
  ELSE customer_email
END as validated_email
Choose ELT when:
1. Powerful, scalable destination system
2. Evolving or unknown requirements
-- Store everything, analyze later
-- Today: Basic sales reporting
SELECT product_id, SUM(revenue) FROM orders GROUP BY product_id;
-- Tomorrow: Complex customer journey analysis (same raw data)
SELECT
customer_id,
STRING_AGG(product_category ORDER BY order_timestamp) as journey
FROM orders
GROUP BY customer_id;
3. Multiple teams with different needs
4. Real-time or near-real-time requirements
-- Stream raw events, query immediately
CREATE STREAM events (
  user_id STRING,
  event_type STRING,
  timestamp TIMESTAMP
);
-- Immediate analysis without waiting for ETL
SELECT event_type, COUNT(*)
FROM events
WHERE timestamp > CURRENT_TIMESTAMP - INTERVAL 5 MINUTE
GROUP BY event_type;
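The five-minute window query above can be mimicked in plain Python to make the idea concrete: raw events are counted directly, with no transformation step in between. The event names and timestamps here are made up for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta

# A fixed "now" keeps the example deterministic
now = datetime(2024, 3, 15, 12, 0, 0)
events = [
    {"event_type": "page_view", "timestamp": now - timedelta(minutes=1)},
    {"event_type": "page_view", "timestamp": now - timedelta(minutes=3)},
    {"event_type": "click",     "timestamp": now - timedelta(minutes=2)},
    {"event_type": "page_view", "timestamp": now - timedelta(minutes=30)},  # outside window
]

# Count events per type over the last five minutes, straight from raw data
window_start = now - timedelta(minutes=5)
counts = Counter(e["event_type"] for e in events if e["timestamp"] > window_start)
```

The raw stream is never reshaped in advance; the aggregation is defined at query time, which is the ELT stance applied to streaming data.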
Let's put this knowledge into practice. You're building a data pipeline for an online bookstore that wants to analyze customer behavior and sales performance.
Data Sources: an orders system (order ID, customer, book, quantity, price, timestamp), a customer database (names, emails, registration dates, addresses), and a book catalog (titles, authors, categories, publication years).
Business Questions: monthly revenue by book category, customer segmentation (new, regular, and loyal readers), and customer lifetime value.
Your Task: Design both an ETL and an ELT solution for this scenario, sketching the table structures and the transformation logic for each approach.
-- ETL Approach: Pre-process everything
-- Dimensional model (star schema)
CREATE TABLE fact_sales (
  sale_id SERIAL PRIMARY KEY,
  customer_id INTEGER,
  book_id INTEGER,
  order_date DATE,
  quantity INTEGER,
  revenue DECIMAL(10,2),
  unit_price DECIMAL(8,2)
);

CREATE TABLE dim_customers (
  customer_id INTEGER PRIMARY KEY,
  customer_name VARCHAR(255),
  customer_email VARCHAR(255),
  registration_date DATE,
  customer_state VARCHAR(50),
  customer_segment VARCHAR(50)  -- Calculated during ETL
);

CREATE TABLE dim_books (
  book_id INTEGER PRIMARY KEY,
  title VARCHAR(500),
  author VARCHAR(255),
  category VARCHAR(100),
  publication_decade VARCHAR(10)  -- Calculated during ETL
);
-- ETL Transformation Logic
INSERT INTO fact_sales
SELECT
  o.order_id,
  o.customer_id,
  o.book_id,
  o.order_date,
  o.quantity,
  o.quantity * o.price as revenue,
  o.price
FROM raw_orders o
WHERE o.order_date = CURRENT_DATE - 1;  -- Process yesterday's data

INSERT INTO dim_customers
SELECT
  c.customer_id,
  c.name,
  c.email,
  c.registration_date,
  SUBSTRING(c.location FROM '.*, (..)') as customer_state,
  CASE
    WHEN CURRENT_DATE - c.registration_date > 365 THEN 'Loyal'
    WHEN CURRENT_DATE - c.registration_date > 90 THEN 'Regular'
    ELSE 'New'
  END as customer_segment
FROM raw_customers c;

-- Pre-calculated aggregates for fast querying
CREATE TABLE monthly_sales_summary AS
SELECT
  DATE_TRUNC('month', f.order_date) as sales_month,
  b.category,
  SUM(f.revenue) as total_revenue,
  COUNT(DISTINCT f.customer_id) as unique_customers,
  SUM(f.quantity) as books_sold
FROM fact_sales f
JOIN dim_books b ON f.book_id = b.book_id
GROUP BY DATE_TRUNC('month', f.order_date), b.category;
-- ELT Approach: Store raw, transform on-demand
-- Raw data storage (minimal processing)
CREATE TABLE raw_orders (
  order_id VARCHAR(50),
  customer_id VARCHAR(50),
  book_id VARCHAR(50),
  quantity INTEGER,
  price DECIMAL(8,2),
  order_timestamp TIMESTAMP
);

CREATE TABLE raw_customers (
  customer_id VARCHAR(50),
  full_name VARCHAR(255),
  email VARCHAR(255),
  registration_timestamp TIMESTAMP,
  full_address TEXT
);

CREATE TABLE raw_books (
  book_id VARCHAR(50),
  title TEXT,
  author_name VARCHAR(255),
  category VARCHAR(100),
  publication_year INTEGER
);
-- Transform using views (computed on-demand)
CREATE VIEW monthly_sales_analysis AS
WITH enriched_orders AS (
  SELECT
    o.order_id,
    o.customer_id,
    o.book_id,
    o.quantity,
    o.price,
    o.quantity * o.price as revenue,
    DATE_TRUNC('month', o.order_timestamp) as sales_month,
    b.category,
    b.author_name,
    -- Parse state from address during query
    REGEXP_EXTRACT(c.full_address, r', ([A-Z]{2}) \d') as customer_state,
    -- Calculate customer tenure during query
    DATE_DIFF(CURRENT_DATE(),
              DATE(c.registration_timestamp),
              DAY) as days_since_registration
  FROM raw_orders o
  JOIN raw_customers c ON o.customer_id = c.customer_id
  JOIN raw_books b ON o.book_id = b.book_id
)
SELECT
  sales_month,
  category,
  SUM(revenue) as total_revenue,
  COUNT(DISTINCT customer_id) as unique_customers,
  SUM(quantity) as books_sold,
  AVG(revenue) as avg_order_value
FROM enriched_orders
GROUP BY sales_month, category;
-- Customer segmentation view (calculated on-demand)
CREATE VIEW customer_segments AS
SELECT DISTINCT
  c.customer_id,
  c.full_name,
  CASE
    WHEN DATE_DIFF(CURRENT_DATE(), DATE(c.registration_timestamp), DAY) > 365 THEN 'Loyal'
    WHEN DATE_DIFF(CURRENT_DATE(), DATE(c.registration_timestamp), DAY) > 90 THEN 'Regular'
    ELSE 'New'
  END as segment,
  SUM(o.quantity * o.price) OVER (PARTITION BY c.customer_id) as lifetime_value
FROM raw_customers c
JOIN raw_orders o ON c.customer_id = o.customer_id;
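The segmentation thresholds used in both solutions (365 and 90 days) can be isolated as a small function; this is simply the CASE expression restated in Python for clarity, with sample tenures chosen for illustration.

```python
def segment(days_since_registration: int) -> str:
    """Mirror of the SQL CASE: Loyal > 365 days, Regular > 90, else New."""
    if days_since_registration > 365:
        return "Loyal"
    if days_since_registration > 90:
        return "Regular"
    return "New"

# One customer from each tier
labels = [segment(d) for d in (400, 180, 30)]
```

In the ETL design this label is computed once, during the load into dim_customers; in the ELT design the same logic runs every time the view is queried.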
For this bookstore scenario, I'd recommend ELT: the business questions are still evolving, a modern warehouse can compute the segmentation and revenue views on demand, and keeping the raw order, customer, and book data preserves the option to answer questions nobody has asked yet.
However, I'd add some pre-calculated tables for the most common queries (like daily sales summaries) to ensure dashboard performance.
The Problem:
-- Teams often default to what they know
-- "We've always done ETL, so let's stick with ETL"
-- Even when requirements have changed dramatically
The Fix: Regularly reassess your architecture against current requirements. Ask: Has our data volume changed? Have the analytical questions shifted? Has our destination system gained compute capacity that would make the other approach viable?
The Problem:
-- Looking only at obvious costs
-- ETL: "We need a dedicated ETL server" ✓
-- ELT: "Storage is cheap!" ✓
-- Missing: maintenance, development time, flexibility costs
The Fix: Calculate full costs: infrastructure, development time, ongoing maintenance, and the hidden cost of inflexibility when requirements change.
The Problem:
-- Building ELT for a 10GB dataset
-- "We might scale to petabytes someday!"
-- Spending months on infrastructure for data that fits in Excel
The Fix: Start with your actual current requirements. You can migrate approaches as you grow:
-- Start simple for small data
CREATE TABLE sales_summary AS
SELECT
DATE(order_date) as sale_date,
SUM(revenue) as daily_revenue
FROM orders
GROUP BY DATE(order_date);
-- Migrate to ELT when you outgrow this approach
The Problem:
-- ETL pipeline breaks when source adds a new field
CREATE TABLE processed_orders (
order_id INTEGER,
customer_id INTEGER,
total_amount DECIMAL(10,2)
-- What happens when source adds "discount_code"?
);
The Fix: Design for change:
-- ETL: Build flexible transformation logic
CASE
  WHEN source_version >= '2.0' THEN COALESCE(discount_amount, 0)
  ELSE 0
END as discount_amount

-- ELT: Store raw data, handle changes in transformation layer
CREATE TABLE raw_orders (
  raw_data JSON,  -- Can handle any schema changes
  ingestion_timestamp TIMESTAMP
);
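The raw-storage pattern's resilience can be shown in a few lines of Python: a hypothetical discount_code field appears only in newer records, and the transformation layer copes with both shapes without any pipeline change. Field names here are illustrative.

```python
import json

# Raw records as they arrived: the second one carries a field
# that didn't exist when the pipeline was first built
records = [
    '{"order_id": 1, "total_amount": 100.0}',
    '{"order_id": 2, "total_amount": 80.0, "discount_code": "SPRING10"}',
]

parsed = [json.loads(r) for r in records]
# .get() tolerates the missing field in old records -- no schema migration needed
discounted = [r["order_id"] for r in parsed if r.get("discount_code")]
```

A rigidly typed processed_orders table would have required a schema change and a backfill; the raw JSON absorbs the new field for free.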
The Problem:
-- Some tables use ETL (pre-processed)
SELECT * FROM clean_customer_data; -- ETL approach
-- Others use ELT (raw data)
SELECT JSON_EXTRACT(raw_payload, '$.customer_id') FROM raw_events; -- ELT approach
-- Creates confusion and maintenance overhead
The Fix: Choose one primary approach with clear, documented exceptions: for example, default to ELT and reserve pre-processed ETL tables for the dashboards that truly need guaranteed performance.
ETL and ELT represent fundamentally different philosophies for handling data:
ETL transforms data before storage, optimizing for performance and storage efficiency at the cost of flexibility. It works best when you have limited destination compute power, well-defined requirements, and need predictable query performance.
ELT stores raw data and transforms on-demand, optimizing for flexibility and speed-to-insight at the cost of storage and compute resources. It excels when you have powerful destination systems, evolving requirements, and multiple teams with different analytical needs.
The key insight is that this isn't just a technical decision—it's a business strategy decision. ETL assumes you know what questions you'll ask. ELT assumes you don't know what you don't know.
In practice, many modern data teams use a hybrid approach: ELT for exploration and ad-hoc analysis, with selective ETL for performance-critical applications. The rise of cloud data warehouses has tilted the scales toward ELT, but ETL remains valuable when you need guaranteed performance or have limited computational resources.
Next steps to continue your data engineering journey:
Learn dbt (data build tool) - The modern standard for ELT transformations in SQL. Understanding dbt will make you fluent in the ELT approach and teach you software engineering best practices for data work.
Explore data warehousing concepts - Understand dimensional modeling, star schemas, and data vault methodology. These concepts apply to both ETL and ELT but are implemented differently in each approach.
Study stream processing - Learn about Apache Kafka, Apache Spark Streaming, or cloud streaming services. Real-time data processing adds another dimension to the ETL vs ELT decision, and understanding streaming will make you a more complete data engineer.