
You're staring at a spreadsheet with 50,000 rows of e-commerce transaction data. Sales wants to know which products are trending, marketing needs conversion rates by campaign, and finance wants monthly revenue summaries. Everyone's asking different questions, but they're all hitting the same messy data dump. Sound familiar?
This is where analytical data modeling saves the day. Unlike operational databases that prioritize storing data efficiently, analytical models are designed for asking questions fast. Today, we'll explore two fundamental approaches: dimensional modeling (the time-tested warehouse approach) and One Big Table (the modern cloud-first method). By the end, you'll know when to use each approach and how to implement both.
What you'll learn: how dimensional modeling and One Big Table designs work, how to build each in SQL, the trade-offs between them, and the common mistakes to avoid.
You should be comfortable with basic SQL queries (SELECT, JOIN, GROUP BY) and understand fundamental database concepts like tables and relationships. We'll work with realistic e-commerce data throughout the examples.
Let's start with a typical problem. Here's what raw e-commerce data often looks like:
-- Raw orders table - how data usually arrives
CREATE TABLE raw_orders (
order_id INT,
customer_email VARCHAR(100),
customer_name VARCHAR(100),
customer_city VARCHAR(50),
customer_state VARCHAR(20),
product_id INT,
product_name VARCHAR(200),
product_category VARCHAR(50),
product_price DECIMAL(10,2),
order_date DATE,
quantity INT,
discount_percent DECIMAL(5,2),
shipping_cost DECIMAL(8,2)
);
This structure works fine for recording transactions, but it creates problems for analytics:

- Customer attributes (name, city, state) are repeated on every row, so a single typo creates two "different" customers
- Product details are duplicated the same way, with no richer hierarchy to roll up by
- The date is a raw value, so every query must re-derive things like month, quarter, or weekend and holiday flags
- Common business metrics get recalculated from scratch in every report
Let's solve these problems with two different modeling approaches.
Dimensional modeling separates your data into two types of tables:

- Fact tables hold the measurable events you want to analyze (orders, payments, clicks), each carrying numeric measures like quantity and revenue
- Dimension tables hold the descriptive context you filter and group by: customers, products, dates
Think of it like organizing a library. Facts are the books (the things you want to analyze), and dimensions are the catalog system (how you find and group the books).
Let's transform our messy e-commerce data into a clean dimensional model. We'll create a central fact table surrounded by dimension tables—this pattern is called a "star schema" because of its visual shape.
-- Dimension: Customers
CREATE TABLE dim_customers (
customer_key INT PRIMARY KEY,
customer_email VARCHAR(100),
customer_name VARCHAR(100),
customer_city VARCHAR(50),
customer_state VARCHAR(20),
customer_segment VARCHAR(20) -- added business logic
);
-- Dimension: Products
CREATE TABLE dim_products (
product_key INT PRIMARY KEY,
product_id INT,
product_name VARCHAR(200),
product_category VARCHAR(50),
product_subcategory VARCHAR(50), -- more granular grouping
product_brand VARCHAR(100)
);
-- Dimension: Time (crucial for analytics!)
CREATE TABLE dim_date (
date_key INT PRIMARY KEY, -- format: 20240315
date_actual DATE,
year_num INT,
quarter_num INT,
month_num INT,
month_name VARCHAR(20),
day_of_week INT,
day_name VARCHAR(20),
is_weekend BOOLEAN,
is_holiday BOOLEAN
);
-- Fact: Order Line Items
CREATE TABLE fact_order_lines (
order_line_key INT PRIMARY KEY,
order_id INT,
customer_key INT,
product_key INT,
date_key INT,
-- Measures (the numbers you want to analyze)
quantity INT,
unit_price DECIMAL(10,2),
discount_amount DECIMAL(8,2),
line_total DECIMAL(10,2),
-- Foreign keys to dimensions
FOREIGN KEY (customer_key) REFERENCES dim_customers(customer_key),
FOREIGN KEY (product_key) REFERENCES dim_products(product_key),
FOREIGN KEY (date_key) REFERENCES dim_date(date_key)
);
Notice how we've made several improvements:

- Surrogate keys: the fact table stores a compact customer_key instead of repeating email addresses, which joins faster and tolerates attribute changes
- Added business logic: customer_segment captures a classification the raw orders never had
- Product hierarchy: product_subcategory and product_brand enable more granular grouping
- A date dimension: attributes like is_weekend and is_holiday are computed once instead of in every query

Let's populate our dimensional model with realistic data:
-- Load customers (removing duplicates and adding segments)
INSERT INTO dim_customers VALUES
(1, 'sarah.chen@email.com', 'Sarah Chen', 'Seattle', 'WA', 'Premium'),
(2, 'mike.jones@email.com', 'Mike Jones', 'Austin', 'TX', 'Standard'),
(3, 'lisa.park@email.com', 'Lisa Park', 'Denver', 'CO', 'Premium');
-- Load products with hierarchy
INSERT INTO dim_products VALUES
(1, 101, 'MacBook Pro 14"', 'Electronics', 'Laptops', 'Apple'),
(2, 102, 'Wireless Mouse', 'Electronics', 'Accessories', 'Logitech'),
(3, 103, 'Standing Desk', 'Furniture', 'Office', 'IKEA');
-- Load some dates
INSERT INTO dim_date VALUES
(20240315, '2024-03-15', 2024, 1, 3, 'March', 5, 'Friday', FALSE, FALSE),
(20240316, '2024-03-16', 2024, 1, 3, 'March', 6, 'Saturday', TRUE, FALSE);
-- Load fact data
INSERT INTO fact_order_lines VALUES
(1, 1001, 1, 1, 20240315, 1, 1999.00, 100.00, 1899.00),
(2, 1001, 1, 2, 20240315, 2, 79.99, 0.00, 159.98),
(3, 1002, 2, 3, 20240316, 1, 299.99, 30.00, 269.99);
Now see how clean your analytical queries become:
-- Monthly sales by product category
SELECT
d.month_name,
p.product_category,
SUM(f.line_total) as total_sales,
COUNT(*) as order_lines,
AVG(f.line_total) as avg_line_value
FROM fact_order_lines f
JOIN dim_date d ON f.date_key = d.date_key
JOIN dim_products p ON f.product_key = p.product_key
WHERE d.year_num = 2024
GROUP BY d.month_name, p.product_category
ORDER BY total_sales DESC;
-- Customer performance by segment and state
SELECT
c.customer_state,
c.customer_segment,
COUNT(DISTINCT f.order_id) as total_orders,
SUM(f.line_total) as total_revenue,
AVG(f.line_total) as avg_order_value
FROM fact_order_lines f
JOIN dim_customers c ON f.customer_key = c.customer_key
GROUP BY c.customer_state, c.customer_segment
ORDER BY total_revenue DESC;
Pro tip: Notice how we can slice and dice by any dimension without complex subqueries. That's the power of dimensional modeling—it makes complex business questions feel simple.
While dimensional modeling is battle-tested, modern cloud platforms have changed the game. With columnar storage and massive parallel processing, a different approach has emerged: the One Big Table (OBT).
Instead of splitting data across multiple tables, OBT denormalizes everything into a single, wide table. This trades storage efficiency for query simplicity and performance.
Let's redesign our e-commerce data as an OBT:
-- One Big Table approach
CREATE TABLE orders_obt (
-- Order identifiers
order_line_id INT PRIMARY KEY,
order_id INT,
order_date DATE,
-- Customer dimensions (denormalized)
customer_id INT,
customer_email VARCHAR(100),
customer_name VARCHAR(100),
customer_city VARCHAR(50),
customer_state VARCHAR(20),
customer_segment VARCHAR(20),
customer_lifetime_orders INT, -- pre-calculated
customer_lifetime_value DECIMAL(12,2), -- pre-calculated
-- Product dimensions (denormalized)
product_id INT,
product_name VARCHAR(200),
product_category VARCHAR(50),
product_subcategory VARCHAR(50),
product_brand VARCHAR(100),
product_launch_date DATE,
-- Time dimensions (denormalized)
year_num INT,
quarter_num INT,
month_num INT,
month_name VARCHAR(20),
day_of_week INT,
day_name VARCHAR(20),
is_weekend BOOLEAN,
is_holiday BOOLEAN,
-- Fact measures
quantity INT,
unit_price DECIMAL(10,2),
discount_amount DECIMAL(8,2),
line_total DECIMAL(10,2),
shipping_cost DECIMAL(8,2),
-- Pre-calculated measures for common queries
order_total DECIMAL(12,2),
customer_order_number INT, -- 1st, 2nd, 3rd order for this customer
days_since_last_order INT
);
INSERT INTO orders_obt VALUES
(1, 1001, '2024-03-15',
1, 'sarah.chen@email.com', 'Sarah Chen', 'Seattle', 'WA', 'Premium', 5, 8500.00,
101, 'MacBook Pro 14"', 'Electronics', 'Laptops', 'Apple', '2023-10-01',
2024, 1, 3, 'March', 5, 'Friday', FALSE, FALSE,
1, 1999.00, 100.00, 1899.00, 25.00, 2083.98, 3, 45),
(2, 1001, '2024-03-15',
1, 'sarah.chen@email.com', 'Sarah Chen', 'Seattle', 'WA', 'Premium', 5, 8500.00,
102, 'Wireless Mouse', 'Electronics', 'Accessories', 'Logitech', '2023-05-15',
2024, 1, 3, 'March', 5, 'Friday', FALSE, FALSE,
2, 79.99, 0.00, 159.98, 0.00, 2083.98, 3, 45);
The beauty of OBT is query simplicity:
-- Same monthly sales by category - much simpler!
SELECT
month_name,
product_category,
SUM(line_total) as total_sales,
COUNT(*) as order_lines,
AVG(line_total) as avg_line_value
FROM orders_obt
WHERE year_num = 2024
GROUP BY month_name, product_category
ORDER BY total_sales DESC;
-- Customer cohort analysis - leveraging pre-calculated fields
SELECT
customer_segment,
customer_order_number,
COUNT(DISTINCT customer_id) as customers,
AVG(line_total) as avg_order_value,
AVG(days_since_last_order) as avg_days_between_orders
FROM orders_obt
WHERE customer_order_number <= 5 -- first 5 orders
GROUP BY customer_segment, customer_order_number
ORDER BY customer_segment, customer_order_number;
Key insight: OBT queries are often faster to write and run, especially for exploratory analytics. You eliminate joins and can pre-calculate common business metrics.
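Those pre-calculated fields are typically computed during loading with window functions. A sketch (simplified orders table; requires SQLite 3.25+ for window-function support) deriving customer_order_number and days_since_last_order:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE orders (order_id INT, customer_id INT, order_date TEXT);
INSERT INTO orders VALUES
    (1001, 1, '2024-01-30'),
    (1002, 1, '2024-03-15'),
    (1003, 2, '2024-03-16');
""")

rows = con.execute("""
    SELECT order_id, customer_id,
           ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date)
               AS customer_order_number,
           CAST(julianday(order_date)
                - julianday(LAG(order_date) OVER
                    (PARTITION BY customer_id ORDER BY order_date))
                AS INT) AS days_since_last_order
    FROM orders
    ORDER BY order_id
""").fetchall()

for r in rows:
    print(r)
```

Order 1002 comes out as customer 1's second order, 45 days after the first; each customer's first order has a NULL gap.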
Many organizations use both:

- A dimensional model as the governed core of the warehouse, where data integrity and consistent definitions matter most
- One Big Tables built on top of it as purpose-built marts for dashboards and exploratory analysis
Let's build both models for a subscription business scenario. You're analyzing a SaaS company with these requirements:

- Track monthly recurring revenue (MRR) and annual recurring revenue (ARR) by plan tier
- Measure churn by signup source, industry, and company size
- Monitor product engagement (active users, feature usage) and support load per account
Create tables for a subscription analytics warehouse:
-- Dimension: Customers with SaaS-specific attributes
CREATE TABLE dim_customers_saas (
customer_key INT PRIMARY KEY,
customer_id VARCHAR(50),
company_name VARCHAR(200),
industry VARCHAR(100),
company_size VARCHAR(20), -- Small, Medium, Large, Enterprise
signup_source VARCHAR(50), -- Organic, Paid Search, Referral, etc.
signup_date DATE,
customer_status VARCHAR(20) -- Active, Churned, Paused
);
-- Dimension: Subscription Plans
CREATE TABLE dim_plans (
plan_key INT PRIMARY KEY,
plan_id VARCHAR(50),
plan_name VARCHAR(100),
plan_tier VARCHAR(20), -- Starter, Pro, Enterprise
monthly_price DECIMAL(8,2),
annual_price DECIMAL(8,2),
max_users INT,
features_included TEXT
);
-- Fact: Monthly Subscription Metrics
CREATE TABLE fact_subscriptions (
subscription_key INT PRIMARY KEY,
customer_key INT,
plan_key INT,
date_key INT,
-- Subscription measures
mrr DECIMAL(10,2),
arr DECIMAL(12,2),
active_users INT,
feature_usage_score DECIMAL(5,2),
support_tickets INT,
is_churned BOOLEAN,
FOREIGN KEY (customer_key) REFERENCES dim_customers_saas(customer_key),
FOREIGN KEY (plan_key) REFERENCES dim_plans(plan_key),
FOREIGN KEY (date_key) REFERENCES dim_date(date_key)
);
CREATE TABLE subscriptions_obt (
-- Identifiers
subscription_record_id INT PRIMARY KEY,
customer_id VARCHAR(50),
subscription_id VARCHAR(50),
record_date DATE,
-- Customer attributes (denormalized)
company_name VARCHAR(200),
industry VARCHAR(100),
company_size VARCHAR(20),
signup_source VARCHAR(50),
signup_date DATE,
customer_status VARCHAR(20),
days_as_customer INT, -- calculated field
-- Plan attributes (denormalized)
plan_id VARCHAR(50),
plan_name VARCHAR(100),
plan_tier VARCHAR(20),
monthly_price DECIMAL(8,2),
annual_price DECIMAL(8,2),
-- Time attributes (denormalized)
year_num INT,
month_num INT,
month_name VARCHAR(20),
quarter_name VARCHAR(10),
-- Metrics
mrr DECIMAL(10,2),
arr DECIMAL(12,2),
active_users INT,
feature_usage_score DECIMAL(5,2),
support_tickets INT,
is_churned BOOLEAN,
-- Pre-calculated business metrics
months_subscribed INT,
total_lifetime_value DECIMAL(12,2),
churn_risk_score DECIMAL(3,2) -- ML model output
);
Write the same business question both ways:
-- Dimensional approach: Churn rate by signup source and plan tier
SELECT
c.signup_source,
p.plan_tier,
COUNT(*) as total_customers,
SUM(CASE WHEN f.is_churned THEN 1 ELSE 0 END) as churned_customers,
ROUND(100.0 * SUM(CASE WHEN f.is_churned THEN 1 ELSE 0 END) / COUNT(*), 2) as churn_rate_pct
FROM fact_subscriptions f
JOIN dim_customers_saas c ON f.customer_key = c.customer_key
JOIN dim_plans p ON f.plan_key = p.plan_key
JOIN dim_date d ON f.date_key = d.date_key
WHERE d.year_num = 2024 AND d.month_num <= 6
GROUP BY c.signup_source, p.plan_tier
ORDER BY churn_rate_pct DESC;
-- OBT approach: Same query, simpler syntax
SELECT
signup_source,
plan_tier,
COUNT(*) as total_customers,
SUM(CASE WHEN is_churned THEN 1 ELSE 0 END) as churned_customers,
ROUND(100.0 * SUM(CASE WHEN is_churned THEN 1 ELSE 0 END) / COUNT(*), 2) as churn_rate_pct
FROM subscriptions_obt
WHERE year_num = 2024 AND month_num <= 6
GROUP BY signup_source, plan_tier
ORDER BY churn_rate_pct DESC;
Notice how the OBT version eliminates three joins while providing the same insights.
Mistake: Creating too many dimension tables
-- Don't do this - over-normalization
CREATE TABLE dim_customer_emails (email_key INT, email VARCHAR(100));
CREATE TABLE dim_customer_names (name_key INT, first_name VARCHAR(50), last_name VARCHAR(50));
Better: Keep related attributes together
-- Do this - logical grouping
CREATE TABLE dim_customers (
customer_key INT,
email VARCHAR(100),
first_name VARCHAR(50),
last_name VARCHAR(50)
);
Mistake: Forgetting surrogate keys. Using natural keys like email addresses or product codes as primary keys causes problems when those values change or contain special characters.
Mistake: Ignoring slowly changing dimensions. Customer addresses change, product prices change. Plan for this:
-- Add effective dating for changing attributes
CREATE TABLE dim_products (
product_key INT PRIMARY KEY,
product_id VARCHAR(50),
product_name VARCHAR(200),
current_price DECIMAL(10,2),
effective_date DATE,
expiration_date DATE,
is_current BOOLEAN
);
Mistake: Including everything in one table. Not every piece of data belongs in your analytics table. Focus on what analysts actually query.
Mistake: Forgetting about data types
-- Bad - everything as text
CREATE TABLE bad_obt (
revenue VARCHAR(50), -- Should be DECIMAL
order_date VARCHAR(50), -- Should be DATE
is_weekend VARCHAR(10) -- Should be BOOLEAN
);
Mistake: Not pre-calculating common metrics. If analysts always calculate customer lifetime value, do it once during data loading:
-- Add during ETL process
UPDATE orders_obt SET customer_lifetime_value = (
SELECT SUM(line_total)
FROM orders_obt o2
WHERE o2.customer_id = orders_obt.customer_id
AND o2.order_date <= orders_obt.order_date
);
Slow dimensional queries: Check that your join keys are indexed
-- Add indexes on foreign keys
CREATE INDEX idx_fact_customer ON fact_order_lines(customer_key);
CREATE INDEX idx_fact_product ON fact_order_lines(product_key);
CREATE INDEX idx_fact_date ON fact_order_lines(date_key);
Slow OBT queries: Partition large tables by date
-- Most cloud platforms support date partitioning; syntax varies (BigQuery-style shown)
CREATE TABLE orders_obt (
...columns...,
order_date DATE
) PARTITION BY order_date;
You now understand two fundamental approaches to analytical data modeling:
Dimensional modeling organizes data into facts and dimensions, providing:

- A single, consistent home for each attribute, so updates happen in one place
- The flexibility to slice any measure by any dimension with straightforward joins
- A structure that scales well for governed, long-lived warehouses
One Big Table denormalizes everything into a single table, offering:

- Join-free queries that are faster to write and often faster to run on columnar engines
- Room for pre-calculated business metrics right next to the raw measures
- A low barrier to entry for analysts doing exploratory work

The trade-off is duplicated storage and more complex updates when an attribute changes.
Neither approach is always right. The best choice depends on your specific needs: data size, team structure, performance requirements, and platform capabilities.
The foundation you've built here will serve you well as data volumes grow and requirements evolve. Whether you choose stars or tables, you're now equipped to make that decision strategically rather than accidentally.
Learning Path: Modern Data Stack