You're staring at a CSV file with customer data from your company's e-commerce platform. There are 100,000 rows of information including customer names, email addresses, registration dates, purchase amounts, and product categories. Your task is to import this data into a SQL database, but you quickly realize that simply dumping everything into VARCHAR columns isn't going to cut it. The dates are storing inconsistently, the monetary values are losing precision, and your queries are crawling along at a snail's pace.
This scenario highlights a fundamental truth about working with databases: choosing the right data types and designing your schema properly from the start will save you countless hours of headaches down the road. Poor data type choices lead to data integrity issues, performance problems, and maintenance nightmares that only get worse as your database grows.
By the end of this lesson, you'll understand how to make informed decisions about data types and schema design that will set your database projects up for long-term success. You'll learn to think like a database architect, considering not just what data you need to store today, but how your schema will evolve and perform as your application scales.
What you'll learn:
You should be comfortable with basic SQL concepts including creating tables, inserting data, and writing simple SELECT statements. Familiarity with primary and foreign keys is helpful but we'll review these concepts as they relate to schema design.
Let's start by examining the most commonly used SQL data types and when to choose each one. Understanding these fundamentals is crucial because data type decisions ripple through every aspect of your database's performance and functionality.
When storing numbers, your choice of data type affects both storage efficiency and computational accuracy. Let's examine a real scenario using product pricing data:
-- Poor approach: storing prices as VARCHAR
CREATE TABLE products_bad (
product_id INT PRIMARY KEY,
product_name VARCHAR(100),
price VARCHAR(20), -- This will cause problems!
discount_percentage VARCHAR(10)
);
-- Better approach: appropriate numeric types
CREATE TABLE products_good (
product_id INT PRIMARY KEY,
product_name VARCHAR(100),
price DECIMAL(10,2), -- Up to 99,999,999.99
discount_percentage DECIMAL(5,2) -- Up to 999.99%
);
Why does this matter? Consider what happens when you try to calculate totals:
-- With VARCHAR prices, you get string concatenation instead of addition
SELECT product_name, price + 5.00 as price_with_tax
FROM products_bad;
-- Result: "29.99" + 5.00 = "29.995.00" (wrong!)
-- With DECIMAL prices, you get proper arithmetic
SELECT product_name, price + 5.00 as price_with_tax
FROM products_good;
-- Result: 29.99 + 5.00 = 34.99 (correct!)
Here's when to use each numeric type:
Pro tip: Always use DECIMAL for monetary values. FLOAT and DOUBLE can introduce rounding errors that compound over thousands of transactions.
Text storage decisions significantly impact both storage costs and query performance. Let's design a user profile table to illustrate these choices:
CREATE TABLE user_profiles (
user_id INT PRIMARY KEY,
username VARCHAR(50), -- Fixed maximum, indexed efficiently
email VARCHAR(255), -- Standard email length
first_name VARCHAR(100), -- Reasonable name length
last_name VARCHAR(100),
bio TEXT, -- Variable length, potentially long
country_code CHAR(2), -- Always exactly 2 characters (US, CA, etc.)
phone_number VARCHAR(20), -- International formats vary
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
Key principles for text type selection:
The performance difference is significant. Here's a comparison using a table with 1 million users:
-- Efficient: searching on a properly-sized VARCHAR with an index
SELECT * FROM user_profiles
WHERE username = 'john_doe';
-- Execution time: ~1ms with index
-- Less efficient: searching on TEXT without knowing length patterns
CREATE TABLE user_profiles_bad (
username TEXT, -- Overkill for usernames
bio TEXT
);
-- Similar query execution time: ~50ms+ without proper indexing
Temporal data is notoriously tricky to handle correctly. Let's look at an event tracking table that demonstrates proper date/time type usage:
CREATE TABLE user_events (
event_id BIGINT PRIMARY KEY AUTO_INCREMENT,
user_id INT,
event_type VARCHAR(50),
event_timestamp TIMESTAMP, -- Includes timezone info
event_date DATE, -- Just the date for daily rollups
processing_duration INTERVAL, -- MySQL: TIME, PostgreSQL: INTERVAL
scheduled_time DATETIME, -- No timezone assumptions
INDEX idx_user_events_timestamp (user_id, event_timestamp),
INDEX idx_event_date (event_date)
);
Each temporal type serves a specific purpose:
Here's a practical example showing why type choice matters:
-- Inserting event data with different temporal types
INSERT INTO user_events (user_id, event_type, event_timestamp, event_date, scheduled_time)
VALUES
(1001, 'login', '2024-01-15 14:30:00', '2024-01-15', '2024-01-15 14:30:00'),
(1002, 'purchase', '2024-01-15 09:15:30', '2024-01-15', '2024-01-15 09:15:30');
-- Efficient date range query using the DATE column
SELECT COUNT(*)
FROM user_events
WHERE event_date BETWEEN '2024-01-01' AND '2024-01-31';
-- Less efficient: extracting date from timestamp every time
SELECT COUNT(*)
FROM user_events
WHERE DATE(event_timestamp) BETWEEN '2024-01-01' AND '2024-01-31';
Good schema design is about more than just choosing data types. It's about creating a structure that supports your application's current needs while remaining flexible enough to evolve. Let's work through designing a complete schema for an e-commerce platform.
Normalization is the process of organizing data to reduce redundancy and improve data integrity. However, it's not always about achieving perfect normal forms—it's about finding the right balance for your use case.
Let's start with a denormalized approach and then improve it:
-- Denormalized approach (problematic)
CREATE TABLE orders_denormalized (
order_id INT PRIMARY KEY,
customer_name VARCHAR(100),
customer_email VARCHAR(255),
customer_address TEXT,
customer_phone VARCHAR(20),
product_name VARCHAR(200),
product_category VARCHAR(100),
product_price DECIMAL(10,2),
order_date TIMESTAMP,
quantity INT,
total_amount DECIMAL(10,2)
);
Problems with this approach:
Here's a properly normalized version:
-- Normalized schema design
CREATE TABLE customers (
customer_id INT PRIMARY KEY AUTO_INCREMENT,
customer_name VARCHAR(100) NOT NULL,
email VARCHAR(255) UNIQUE NOT NULL,
phone VARCHAR(20),
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
INDEX idx_customer_email (email)
);
CREATE TABLE addresses (
address_id INT PRIMARY KEY AUTO_INCREMENT,
customer_id INT NOT NULL,
address_type ENUM('billing', 'shipping') DEFAULT 'shipping',
street_address VARCHAR(255) NOT NULL,
city VARCHAR(100) NOT NULL,
state VARCHAR(50),
postal_code VARCHAR(20),
country VARCHAR(100) NOT NULL DEFAULT 'United States',
is_default BOOLEAN DEFAULT FALSE,
FOREIGN KEY (customer_id) REFERENCES customers(customer_id) ON DELETE CASCADE,
INDEX idx_customer_addresses (customer_id)
);
CREATE TABLE product_categories (
category_id INT PRIMARY KEY AUTO_INCREMENT,
category_name VARCHAR(100) UNIQUE NOT NULL,
parent_category_id INT,
FOREIGN KEY (parent_category_id) REFERENCES product_categories(category_id)
);
CREATE TABLE products (
product_id INT PRIMARY KEY AUTO_INCREMENT,
product_name VARCHAR(200) NOT NULL,
category_id INT NOT NULL,
base_price DECIMAL(10,2) NOT NULL,
description TEXT,
sku VARCHAR(50) UNIQUE,
is_active BOOLEAN DEFAULT TRUE,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
FOREIGN KEY (category_id) REFERENCES product_categories(category_id),
INDEX idx_product_category (category_id),
INDEX idx_product_sku (sku)
);
CREATE TABLE orders (
order_id INT PRIMARY KEY AUTO_INCREMENT,
customer_id INT NOT NULL,
order_status ENUM('pending', 'processing', 'shipped', 'delivered', 'cancelled') DEFAULT 'pending',
order_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
shipping_address_id INT,
billing_address_id INT,
subtotal DECIMAL(10,2) NOT NULL,
tax_amount DECIMAL(10,2) DEFAULT 0,
shipping_cost DECIMAL(10,2) DEFAULT 0,
total_amount DECIMAL(10,2) NOT NULL,
FOREIGN KEY (customer_id) REFERENCES customers(customer_id),
FOREIGN KEY (shipping_address_id) REFERENCES addresses(address_id),
FOREIGN KEY (billing_address_id) REFERENCES addresses(address_id),
INDEX idx_customer_orders (customer_id, order_date),
INDEX idx_order_status (order_status),
INDEX idx_order_date (order_date)
);
CREATE TABLE order_items (
order_item_id INT PRIMARY KEY AUTO_INCREMENT,
order_id INT NOT NULL,
product_id INT NOT NULL,
quantity INT NOT NULL CHECK (quantity > 0),
unit_price DECIMAL(10,2) NOT NULL, -- Price at time of order
line_total DECIMAL(10,2) NOT NULL,
FOREIGN KEY (order_id) REFERENCES orders(order_id) ON DELETE CASCADE,
FOREIGN KEY (product_id) REFERENCES products(product_id),
INDEX idx_order_items (order_id),
INDEX idx_product_orders (product_id)
);
Indexes are crucial for query performance, but they come with trade-offs. Each index speeds up reads but slows down writes and consumes storage space. Let's look at indexing strategy for our e-commerce schema:
-- Composite index for common query patterns
CREATE INDEX idx_customer_orders_status ON orders (customer_id, order_status, order_date);
-- This index supports multiple query patterns efficiently:
-- 1. Orders for a specific customer
SELECT * FROM orders WHERE customer_id = 1001;
-- 2. Orders for a customer with specific status
SELECT * FROM orders WHERE customer_id = 1001 AND order_status = 'shipped';
-- 3. Orders for a customer with status in date range
SELECT * FROM orders
WHERE customer_id = 1001
AND order_status = 'delivered'
AND order_date >= '2024-01-01';
Key indexing principles:
Constraints prevent bad data from entering your database. Here are the essential constraint types and when to use them:
-- Example showing various constraint types
CREATE TABLE inventory (
inventory_id INT PRIMARY KEY AUTO_INCREMENT,
product_id INT NOT NULL,
warehouse_location VARCHAR(10) NOT NULL,
quantity_available INT NOT NULL DEFAULT 0,
quantity_reserved INT NOT NULL DEFAULT 0,
reorder_point INT NOT NULL DEFAULT 10,
last_updated TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
-- Foreign key constraint
FOREIGN KEY (product_id) REFERENCES products(product_id) ON DELETE RESTRICT,
-- Check constraints for business rules
CONSTRAINT chk_positive_quantities CHECK (quantity_available >= 0 AND quantity_reserved >= 0),
CONSTRAINT chk_reserved_not_exceeds_available CHECK (quantity_reserved <= quantity_available),
CONSTRAINT chk_reasonable_reorder CHECK (reorder_point >= 0 AND reorder_point <= 10000),
-- Unique constraint to prevent duplicate entries
UNIQUE KEY uk_product_warehouse (product_id, warehouse_location)
);
As your application grows, you'll encounter scenarios that require more sophisticated schema design approaches. Let's explore some common patterns and when to use them.
Sometimes you need to store attributes that vary significantly between records. The Entity-Attribute-Value (EAV) pattern can be useful, but it should be used sparingly:
-- Traditional approach: lots of nullable columns
CREATE TABLE products_traditional (
product_id INT PRIMARY KEY,
name VARCHAR(200),
color VARCHAR(50), -- Only relevant for some products
size VARCHAR(20), -- Only relevant for clothing
weight DECIMAL(8,2), -- Only relevant for shipped items
isbn VARCHAR(13), -- Only relevant for books
author VARCHAR(100) -- Only relevant for books
-- ... many more potentially null columns
);
-- EAV approach for flexible attributes
CREATE TABLE product_attributes (
product_id INT NOT NULL,
attribute_name VARCHAR(50) NOT NULL,
attribute_value TEXT,
attribute_type ENUM('string', 'number', 'date', 'boolean') DEFAULT 'string',
PRIMARY KEY (product_id, attribute_name),
FOREIGN KEY (product_id) REFERENCES products(product_id) ON DELETE CASCADE,
INDEX idx_attribute_name (attribute_name)
);
-- Inserting flexible attributes
INSERT INTO product_attributes (product_id, attribute_name, attribute_value, attribute_type) VALUES
(1001, 'color', 'red', 'string'),
(1001, 'size', 'large', 'string'),
(1002, 'isbn', '978-0123456789', 'string'),
(1002, 'author', 'John Smith', 'string'),
(1003, 'weight_kg', '2.5', 'number');
The EAV pattern is useful when:
However, avoid EAV when:
Tracking changes over time is crucial for many applications. Here's how to implement effective audit trails:
-- Main products table with current data
CREATE TABLE products (
product_id INT PRIMARY KEY AUTO_INCREMENT,
product_name VARCHAR(200) NOT NULL,
category_id INT NOT NULL,
base_price DECIMAL(10,2) NOT NULL,
is_active BOOLEAN DEFAULT TRUE,
version INT DEFAULT 1,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
FOREIGN KEY (category_id) REFERENCES product_categories(category_id)
);
-- Audit table capturing all changes
CREATE TABLE products_audit (
audit_id BIGINT PRIMARY KEY AUTO_INCREMENT,
product_id INT NOT NULL,
product_name VARCHAR(200),
category_id INT,
base_price DECIMAL(10,2),
is_active BOOLEAN,
version INT,
change_type ENUM('INSERT', 'UPDATE', 'DELETE') NOT NULL,
changed_by VARCHAR(100),
changed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
INDEX idx_product_audit (product_id, changed_at),
INDEX idx_audit_date (changed_at)
);
-- Trigger to automatically populate audit table
DELIMITER //
CREATE TRIGGER products_audit_trigger
AFTER UPDATE ON products
FOR EACH ROW
BEGIN
INSERT INTO products_audit (
product_id, product_name, category_id, base_price, is_active,
version, change_type, changed_by
) VALUES (
NEW.product_id, NEW.product_name, NEW.category_id, NEW.base_price,
NEW.is_active, NEW.version, 'UPDATE', USER()
);
END//
DELIMITER ;
This approach gives you:
Let's put everything together by designing a schema for a project management application. This will include users, projects, tasks, and time tracking.
Start by identifying the core entities and their relationships:
-- User management
CREATE TABLE users (
user_id INT PRIMARY KEY AUTO_INCREMENT,
username VARCHAR(50) UNIQUE NOT NULL,
email VARCHAR(255) UNIQUE NOT NULL,
password_hash VARCHAR(255) NOT NULL,
first_name VARCHAR(100) NOT NULL,
last_name VARCHAR(100) NOT NULL,
role ENUM('admin', 'manager', 'developer', 'client') NOT NULL,
is_active BOOLEAN DEFAULT TRUE,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
last_login_at TIMESTAMP NULL,
INDEX idx_user_email (email),
INDEX idx_user_role (role)
);
-- Project organization
CREATE TABLE projects (
project_id INT PRIMARY KEY AUTO_INCREMENT,
project_name VARCHAR(200) NOT NULL,
description TEXT,
project_manager_id INT NOT NULL,
client_id INT,
status ENUM('planning', 'active', 'on_hold', 'completed', 'cancelled') DEFAULT 'planning',
start_date DATE,
target_end_date DATE,
actual_end_date DATE NULL,
budget DECIMAL(12,2),
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
FOREIGN KEY (project_manager_id) REFERENCES users(user_id),
FOREIGN KEY (client_id) REFERENCES users(user_id),
INDEX idx_project_manager (project_manager_id),
INDEX idx_project_status (status),
INDEX idx_project_dates (start_date, target_end_date)
);
-- Task management
CREATE TABLE tasks (
task_id INT PRIMARY KEY AUTO_INCREMENT,
project_id INT NOT NULL,
parent_task_id INT NULL, -- For subtasks
assigned_to_user_id INT,
task_title VARCHAR(200) NOT NULL,
description TEXT,
priority ENUM('low', 'medium', 'high', 'critical') DEFAULT 'medium',
status ENUM('todo', 'in_progress', 'review', 'done', 'blocked') DEFAULT 'todo',
estimated_hours DECIMAL(6,2),
actual_hours DECIMAL(6,2) DEFAULT 0,
due_date DATE,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
FOREIGN KEY (project_id) REFERENCES projects(project_id) ON DELETE CASCADE,
FOREIGN KEY (parent_task_id) REFERENCES tasks(task_id) ON DELETE SET NULL,
FOREIGN KEY (assigned_to_user_id) REFERENCES users(user_id) ON DELETE SET NULL,
INDEX idx_task_project (project_id),
INDEX idx_task_assignee (assigned_to_user_id),
INDEX idx_task_status (status),
INDEX idx_task_due_date (due_date)
);
-- Time tracking
CREATE TABLE time_entries (
entry_id BIGINT PRIMARY KEY AUTO_INCREMENT,
user_id INT NOT NULL,
task_id INT NOT NULL,
project_id INT NOT NULL, -- Denormalized for faster reporting
start_time TIMESTAMP NOT NULL,
end_time TIMESTAMP,
duration_minutes INT, -- Calculated field for easier queries
description TEXT,
is_billable BOOLEAN DEFAULT TRUE,
hourly_rate DECIMAL(8,2),
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
FOREIGN KEY (user_id) REFERENCES users(user_id),
FOREIGN KEY (task_id) REFERENCES tasks(task_id) ON DELETE CASCADE,
FOREIGN KEY (project_id) REFERENCES projects(project_id) ON DELETE CASCADE,
INDEX idx_time_entries_user_date (user_id, start_time),
INDEX idx_time_entries_project (project_id, start_time),
INDEX idx_time_entries_task (task_id)
);
-- Project team membership
CREATE TABLE project_team_members (
project_id INT NOT NULL,
user_id INT NOT NULL,
role ENUM('manager', 'developer', 'tester', 'designer') NOT NULL,
joined_date DATE DEFAULT (CURRENT_DATE),
left_date DATE NULL,
PRIMARY KEY (project_id, user_id),
FOREIGN KEY (project_id) REFERENCES projects(project_id) ON DELETE CASCADE,
FOREIGN KEY (user_id) REFERENCES users(user_id) ON DELETE CASCADE
);
Now let's add some sample data and test our schema:
-- Insert sample users
INSERT INTO users (username, email, password_hash, first_name, last_name, role) VALUES
('jmanager', 'john.manager@company.com', 'hashed_password_1', 'John', 'Manager', 'manager'),
('adev', 'alice.developer@company.com', 'hashed_password_2', 'Alice', 'Developer', 'developer'),
('bclient', 'bob.client@client.com', 'hashed_password_3', 'Bob', 'Client', 'client');
-- Insert a sample project
INSERT INTO projects (project_name, description, project_manager_id, client_id, status, start_date, target_end_date, budget) VALUES
('E-commerce Platform', 'Build a new e-commerce platform with advanced features', 1, 3, 'active', '2024-01-01', '2024-06-30', 150000.00);
-- Add team members to the project
INSERT INTO project_team_members (project_id, user_id, role) VALUES
(1, 1, 'manager'),
(1, 2, 'developer');
-- Create some tasks
INSERT INTO tasks (project_id, assigned_to_user_id, task_title, description, priority, estimated_hours, due_date) VALUES
(1, 2, 'Database Schema Design', 'Design and implement the database schema for the e-commerce platform', 'high', 16.0, '2024-02-15'),
(1, 2, 'User Authentication System', 'Implement secure user login and registration', 'high', 24.0, '2024-03-01');
-- Test with some queries
-- Get all active projects with their managers
SELECT p.project_name, p.status, CONCAT(u.first_name, ' ', u.last_name) as manager_name
FROM projects p
JOIN users u ON p.project_manager_id = u.user_id
WHERE p.status = 'active';
-- Get project progress summary
SELECT
p.project_name,
COUNT(t.task_id) as total_tasks,
COUNT(CASE WHEN t.status = 'done' THEN 1 END) as completed_tasks,
ROUND(COUNT(CASE WHEN t.status = 'done' THEN 1 END) * 100.0 / COUNT(t.task_id), 2) as completion_percentage
FROM projects p
LEFT JOIN tasks t ON p.project_id = t.project_id
GROUP BY p.project_id, p.project_name;
Even experienced developers make schema design mistakes. Here are the most common ones and how to avoid them:
-- Wrong: Everything is VARCHAR
CREATE TABLE orders_wrong (
order_id VARCHAR(50), -- Should be INT
order_date VARCHAR(50), -- Should be TIMESTAMP
total_amount VARCHAR(50), -- Should be DECIMAL
is_shipped VARCHAR(10) -- Should be BOOLEAN
);
-- Right: Appropriate data types
CREATE TABLE orders_right (
order_id INT PRIMARY KEY AUTO_INCREMENT,
order_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
total_amount DECIMAL(10,2) NOT NULL,
is_shipped BOOLEAN DEFAULT FALSE
);
Why it matters: Using inappropriate data types leads to:
-- Dangerous: No foreign key constraints
CREATE TABLE order_items_dangerous (
order_id INT, -- No constraint - orphaned records possible
product_id INT -- No constraint - references to deleted products
);
-- Safe: Proper foreign key constraints
CREATE TABLE order_items_safe (
order_id INT NOT NULL,
product_id INT NOT NULL,
FOREIGN KEY (order_id) REFERENCES orders(order_id) ON DELETE CASCADE,
FOREIGN KEY (product_id) REFERENCES products(product_id) ON DELETE RESTRICT
);
Debugging tip: If you're getting foreign key constraint errors, check:
-- Problem: No indexes on frequently queried columns
SELECT * FROM orders WHERE customer_id = 1001 AND order_date > '2024-01-01';
-- This query will be slow without proper indexes
-- Solution: Add appropriate indexes
CREATE INDEX idx_customer_orders ON orders (customer_id, order_date);
-- Also common mistake: Too many single-column indexes instead of composite ones
-- Less efficient:
CREATE INDEX idx_customer_id ON orders (customer_id);
CREATE INDEX idx_order_date ON orders (order_date);
-- More efficient for the above query:
CREATE INDEX idx_customer_orders ON orders (customer_id, order_date);
Your schema will change over time. Design it to evolve gracefully:
-- Brittle: Hard to change later
CREATE TABLE users_brittle (
id INT PRIMARY KEY,
name VARCHAR(100),
address VARCHAR(255) -- What if users have multiple addresses?
);
-- Flexible: Designed for evolution
CREATE TABLE users_flexible (
user_id INT PRIMARY KEY AUTO_INCREMENT,
first_name VARCHAR(100),
last_name VARCHAR(100),
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
version INT DEFAULT 1 -- For optimistic locking
);
-- Separate address table for flexibility
CREATE TABLE user_addresses (
address_id INT PRIMARY KEY AUTO_INCREMENT,
user_id INT NOT NULL,
address_type ENUM('home', 'work', 'billing', 'shipping'),
street VARCHAR(255),
city VARCHAR(100),
state VARCHAR(50),
postal_code VARCHAR(20),
country VARCHAR(100),
is_primary BOOLEAN DEFAULT FALSE,
FOREIGN KEY (user_id) REFERENCES users_flexible(user_id) ON DELETE CASCADE
);
As your database grows, performance becomes increasingly important. Here are key strategies for optimizing your schema for performance:
Before optimizing, understand how your database executes queries:
-- Use EXPLAIN to see query execution plans
EXPLAIN SELECT
p.project_name,
COUNT(t.task_id) as task_count
FROM projects p
LEFT JOIN tasks t ON p.project_id = t.project_id
WHERE p.status = 'active'
GROUP BY p.project_id, p.project_name;
-- Look for:
-- - Full table scans (bad)
-- - Index usage (good)
-- - Join types (nested loop vs hash join)
-- - Row estimates vs actual rows
Sometimes breaking normalization rules improves performance:
-- Normalized but potentially slow
SELECT
o.order_id,
o.order_date,
SUM(oi.quantity * oi.unit_price) as order_total
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
GROUP BY o.order_id, o.order_date;
-- Denormalized for speed: store calculated total
CREATE TABLE orders_optimized (
order_id INT PRIMARY KEY AUTO_INCREMENT,
customer_id INT NOT NULL,
order_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
total_amount DECIMAL(10,2) NOT NULL, -- Denormalized calculated field
item_count INT NOT NULL, -- Denormalized count
FOREIGN KEY (customer_id) REFERENCES customers(customer_id),
INDEX idx_customer_orders (customer_id, order_date)
);
-- Update triggers maintain consistency
DELIMITER //
CREATE TRIGGER update_order_totals
AFTER INSERT ON order_items
FOR EACH ROW
BEGIN
UPDATE orders_optimized
SET
total_amount = (
SELECT SUM(quantity * unit_price)
FROM order_items
WHERE order_id = NEW.order_id
),
item_count = (
SELECT COUNT(*)
FROM order_items
WHERE order_id = NEW.order_id
)
WHERE order_id = NEW.order_id;
END//
DELIMITER ;
For very large tables, consider partitioning:
-- Partition time_entries by month for better performance
CREATE TABLE time_entries_partitioned (
entry_id BIGINT PRIMARY KEY AUTO_INCREMENT,
user_id INT NOT NULL,
task_id INT NOT NULL,
start_time TIMESTAMP NOT NULL,
end_time TIMESTAMP,
duration_minutes INT,
INDEX idx_user_time (user_id, start_time)
) PARTITION BY RANGE (YEAR(start_time) * 100 + MONTH(start_time)) (
PARTITION p202401 VALUES LESS THAN (202402),
PARTITION p202402 VALUES LESS THAN (202403),
PARTITION p202403 VALUES LESS THAN (202404),
-- Add more partitions as needed
PARTITION p_future VALUES LESS THAN MAXVALUE
);
You've learned the fundamentals of SQL data types and schema design that will serve as the foundation for all your database projects. The key takeaways are:
The schema design principles you've learned here scale from small applications to enterprise systems. The main difference is complexity and volume, not the fundamental approaches.
Your next steps should be:
Remember that schema design is both an art and a science. The technical principles provide the framework, but good judgment comes from experience. Start applying these concepts to your own projects, and don't be afraid to iterate and improve your designs as you learn more about your application's usage patterns.
Learning Path: SQL Fundamentals