SQL Data Types and Schema Design: Building Databases That Scale

You're staring at a CSV file with customer data from your company's e-commerce platform. There are 100,000 rows of information including customer names, email addresses, registration dates, purchase amounts, and product categories. Your task is to import this data into a SQL database, but you quickly realize that simply dumping everything into VARCHAR columns isn't going to cut it. The dates are storing inconsistently, the monetary values are losing precision, and your queries are crawling along at a snail's pace.

This scenario highlights a fundamental truth about working with databases: choosing the right data types and designing your schema properly from the start will save you countless hours of headaches down the road. Poor data type choices lead to data integrity issues, performance problems, and maintenance nightmares that only get worse as your database grows.

By the end of this lesson, you'll understand how to make informed decisions about data types and schema design that will set your database projects up for long-term success. You'll learn to think like a database architect, considering not just what data you need to store today, but how your schema will evolve and perform as your application scales.

What you'll learn:

How to choose appropriate SQL data types based on your data characteristics and use cases
The performance and storage implications of different data type choices
Essential schema design principles including normalization, indexing strategy, and constraint planning
How to design schemas that can evolve gracefully as business requirements change
Common schema design mistakes and how to avoid them

Prerequisites

You should be comfortable with basic SQL concepts including creating tables, inserting data, and writing simple SELECT statements. Familiarity with primary and foreign keys is helpful but we'll review these concepts as they relate to schema design.

Understanding SQL Data Types: Beyond VARCHAR

Let's start by examining the most commonly used SQL data types and when to choose each one. Understanding these fundamentals is crucial because data type decisions ripple through every aspect of your database's performance and functionality.

Numeric Data Types: Precision Matters

When storing numbers, your choice of data type affects both storage efficiency and computational accuracy. Let's examine a real scenario using product pricing data:

-- Poor approach: storing prices as VARCHAR
CREATE TABLE products_bad (
    product_id INT PRIMARY KEY,
    product_name VARCHAR(100),
    price VARCHAR(20),  -- This will cause problems!
    discount_percentage VARCHAR(10)
);

-- Better approach: appropriate numeric types
CREATE TABLE products_good (
    product_id INT PRIMARY KEY,
    product_name VARCHAR(100),
    price DECIMAL(10,2),  -- Up to 99,999,999.99
    discount_percentage DECIMAL(5,2)  -- Up to 999.99%
);

Why does this matter? Consider what happens when you try to calculate totals:

-- With VARCHAR prices, you get string concatenation instead of addition
SELECT product_name, price + 5.00 as price_with_tax 
FROM products_bad;
-- Result: "29.99" + 5.00 = "29.995.00" (wrong!)

-- With DECIMAL prices, you get proper arithmetic
SELECT product_name, price + 5.00 as price_with_tax 
FROM products_good;
-- Result: 29.99 + 5.00 = 34.99 (correct!)

Here's when to use each numeric type:

INT: Use for whole numbers like IDs, counts, or ages. Range: -2,147,483,648 to 2,147,483,647
BIGINT: For large whole numbers like timestamps or high-volume counters
DECIMAL(precision, scale): For exact decimal values like money or percentages where precision is critical
FLOAT/DOUBLE: For scientific calculations where some precision loss is acceptable but you need a wide range

Pro tip: Always use DECIMAL for monetary values. FLOAT and DOUBLE can introduce rounding errors that compound over thousands of transactions.

Text Data Types: Size and Performance Trade-offs

Text storage decisions significantly impact both storage costs and query performance. Let's design a user profile table to illustrate these choices:

CREATE TABLE user_profiles (
    user_id INT PRIMARY KEY,
    username VARCHAR(50),        -- Fixed maximum, indexed efficiently
    email VARCHAR(255),          -- Standard email length
    first_name VARCHAR(100),     -- Reasonable name length
    last_name VARCHAR(100),
    bio TEXT,                    -- Variable length, potentially long
    country_code CHAR(2),        -- Always exactly 2 characters (US, CA, etc.)
    phone_number VARCHAR(20),    -- International formats vary
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

Key principles for text type selection:

CHAR(n): Use when the length is always the same (country codes, status flags, etc.)
VARCHAR(n): Use when you know the maximum length but actual values vary
TEXT: Use for long-form content where you can't predict the length

The performance difference is significant. Here's a comparison using a table with 1 million users:

-- Efficient: searching on a properly-sized VARCHAR with an index
SELECT * FROM user_profiles 
WHERE username = 'john_doe';
-- Execution time: ~1ms with index

-- Less efficient: searching on TEXT without knowing length patterns
CREATE TABLE user_profiles_bad (
    username TEXT,  -- Overkill for usernames
    bio TEXT
);
-- Similar query execution time: ~50ms+ without proper indexing

Date and Time Types: Handling Temporal Data Correctly

Temporal data is notoriously tricky to handle correctly. Let's look at an event tracking table that demonstrates proper date/time type usage:

CREATE TABLE user_events (
    event_id BIGINT PRIMARY KEY AUTO_INCREMENT,
    user_id INT,
    event_type VARCHAR(50),
    event_timestamp TIMESTAMP,    -- Includes timezone info
    event_date DATE,             -- Just the date for daily rollups
    processing_duration INTERVAL, -- MySQL: TIME, PostgreSQL: INTERVAL
    scheduled_time DATETIME,      -- No timezone assumptions
    INDEX idx_user_events_timestamp (user_id, event_timestamp),
    INDEX idx_event_date (event_date)
);

Each temporal type serves a specific purpose:

DATE: Use for birth dates, deadlines, or when time doesn't matter
TIME: Use for daily schedules, duration calculations
DATETIME: Use when you need precise time but timezone isn't relevant
TIMESTAMP: Use for audit trails, logging, or when timezone conversion matters

Here's a practical example showing why type choice matters:

-- Inserting event data with different temporal types
INSERT INTO user_events (user_id, event_type, event_timestamp, event_date, scheduled_time)
VALUES 
(1001, 'login', '2024-01-15 14:30:00', '2024-01-15', '2024-01-15 14:30:00'),
(1002, 'purchase', '2024-01-15 09:15:30', '2024-01-15', '2024-01-15 09:15:30');

-- Efficient date range query using the DATE column
SELECT COUNT(*) 
FROM user_events 
WHERE event_date BETWEEN '2024-01-01' AND '2024-01-31';

-- Less efficient: extracting date from timestamp every time
SELECT COUNT(*) 
FROM user_events 
WHERE DATE(event_timestamp) BETWEEN '2024-01-01' AND '2024-01-31';

Schema Design Fundamentals: Building for the Future

Good schema design is about more than just choosing data types. It's about creating a structure that supports your application's current needs while remaining flexible enough to evolve. Let's work through designing a complete schema for an e-commerce platform.

Normalization: Reducing Redundancy Without Sacrificing Performance

Normalization is the process of organizing data to reduce redundancy and improve data integrity. However, it's not always about achieving perfect normal forms—it's about finding the right balance for your use case.

Let's start with a denormalized approach and then improve it:

-- Denormalized approach (problematic)
CREATE TABLE orders_denormalized (
    order_id INT PRIMARY KEY,
    customer_name VARCHAR(100),
    customer_email VARCHAR(255),
    customer_address TEXT,
    customer_phone VARCHAR(20),
    product_name VARCHAR(200),
    product_category VARCHAR(100),
    product_price DECIMAL(10,2),
    order_date TIMESTAMP,
    quantity INT,
    total_amount DECIMAL(10,2)
);

Problems with this approach:

Customer information is repeated for every order
Product information is duplicated across orders
Updating a customer's email requires changing multiple rows
Inconsistencies can creep in (same customer with different email formats)

Here's a properly normalized version:

-- Normalized schema design
CREATE TABLE customers (
    customer_id INT PRIMARY KEY AUTO_INCREMENT,
    customer_name VARCHAR(100) NOT NULL,
    email VARCHAR(255) UNIQUE NOT NULL,
    phone VARCHAR(20),
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    
    INDEX idx_customer_email (email)
);

CREATE TABLE addresses (
    address_id INT PRIMARY KEY AUTO_INCREMENT,
    customer_id INT NOT NULL,
    address_type ENUM('billing', 'shipping') DEFAULT 'shipping',
    street_address VARCHAR(255) NOT NULL,
    city VARCHAR(100) NOT NULL,
    state VARCHAR(50),
    postal_code VARCHAR(20),
    country VARCHAR(100) NOT NULL DEFAULT 'United States',
    is_default BOOLEAN DEFAULT FALSE,
    
    FOREIGN KEY (customer_id) REFERENCES customers(customer_id) ON DELETE CASCADE,
    INDEX idx_customer_addresses (customer_id)
);

CREATE TABLE product_categories (
    category_id INT PRIMARY KEY AUTO_INCREMENT,
    category_name VARCHAR(100) UNIQUE NOT NULL,
    parent_category_id INT,
    
    FOREIGN KEY (parent_category_id) REFERENCES product_categories(category_id)
);

CREATE TABLE products (
    product_id INT PRIMARY KEY AUTO_INCREMENT,
    product_name VARCHAR(200) NOT NULL,
    category_id INT NOT NULL,
    base_price DECIMAL(10,2) NOT NULL,
    description TEXT,
    sku VARCHAR(50) UNIQUE,
    is_active BOOLEAN DEFAULT TRUE,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    
    FOREIGN KEY (category_id) REFERENCES product_categories(category_id),
    INDEX idx_product_category (category_id),
    INDEX idx_product_sku (sku)
);

CREATE TABLE orders (
    order_id INT PRIMARY KEY AUTO_INCREMENT,
    customer_id INT NOT NULL,
    order_status ENUM('pending', 'processing', 'shipped', 'delivered', 'cancelled') DEFAULT 'pending',
    order_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    shipping_address_id INT,
    billing_address_id INT,
    subtotal DECIMAL(10,2) NOT NULL,
    tax_amount DECIMAL(10,2) DEFAULT 0,
    shipping_cost DECIMAL(10,2) DEFAULT 0,
    total_amount DECIMAL(10,2) NOT NULL,
    
    FOREIGN KEY (customer_id) REFERENCES customers(customer_id),
    FOREIGN KEY (shipping_address_id) REFERENCES addresses(address_id),
    FOREIGN KEY (billing_address_id) REFERENCES addresses(address_id),
    
    INDEX idx_customer_orders (customer_id, order_date),
    INDEX idx_order_status (order_status),
    INDEX idx_order_date (order_date)
);

CREATE TABLE order_items (
    order_item_id INT PRIMARY KEY AUTO_INCREMENT,
    order_id INT NOT NULL,
    product_id INT NOT NULL,
    quantity INT NOT NULL CHECK (quantity > 0),
    unit_price DECIMAL(10,2) NOT NULL,  -- Price at time of order
    line_total DECIMAL(10,2) NOT NULL,
    
    FOREIGN KEY (order_id) REFERENCES orders(order_id) ON DELETE CASCADE,
    FOREIGN KEY (product_id) REFERENCES products(product_id),
    
    INDEX idx_order_items (order_id),
    INDEX idx_product_orders (product_id)
);

Strategic Indexing: Making Queries Fast

Indexes are crucial for query performance, but they come with trade-offs. Each index speeds up reads but slows down writes and consumes storage space. Let's look at indexing strategy for our e-commerce schema:

-- Composite index for common query patterns
CREATE INDEX idx_customer_orders_status ON orders (customer_id, order_status, order_date);

-- This index supports multiple query patterns efficiently:
-- 1. Orders for a specific customer
SELECT * FROM orders WHERE customer_id = 1001;

-- 2. Orders for a customer with specific status
SELECT * FROM orders WHERE customer_id = 1001 AND order_status = 'shipped';

-- 3. Orders for a customer with status in date range
SELECT * FROM orders 
WHERE customer_id = 1001 
AND order_status = 'delivered' 
AND order_date >= '2024-01-01';

Key indexing principles:

Index your foreign keys - This enables efficient joins
Create composite indexes for common query patterns - Put the most selective column first
Don't over-index - Each index has maintenance overhead
Monitor and adjust - Use query analysis tools to identify missing indexes

Constraints: Enforcing Data Integrity

Constraints prevent bad data from entering your database. Here are the essential constraint types and when to use them:

-- Example showing various constraint types
CREATE TABLE inventory (
    inventory_id INT PRIMARY KEY AUTO_INCREMENT,
    product_id INT NOT NULL,
    warehouse_location VARCHAR(10) NOT NULL,
    quantity_available INT NOT NULL DEFAULT 0,
    quantity_reserved INT NOT NULL DEFAULT 0,
    reorder_point INT NOT NULL DEFAULT 10,
    last_updated TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    
    -- Foreign key constraint
    FOREIGN KEY (product_id) REFERENCES products(product_id) ON DELETE RESTRICT,
    
    -- Check constraints for business rules
    CONSTRAINT chk_positive_quantities CHECK (quantity_available >= 0 AND quantity_reserved >= 0),
    CONSTRAINT chk_reserved_not_exceeds_available CHECK (quantity_reserved <= quantity_available),
    CONSTRAINT chk_reasonable_reorder CHECK (reorder_point >= 0 AND reorder_point <= 10000),
    
    -- Unique constraint to prevent duplicate entries
    UNIQUE KEY uk_product_warehouse (product_id, warehouse_location)
);

Advanced Schema Design Patterns

As your application grows, you'll encounter scenarios that require more sophisticated schema design approaches. Let's explore some common patterns and when to use them.

Handling Flexible Attributes: The EAV Pattern

Sometimes you need to store attributes that vary significantly between records. The Entity-Attribute-Value (EAV) pattern can be useful, but it should be used sparingly:

-- Traditional approach: lots of nullable columns
CREATE TABLE products_traditional (
    product_id INT PRIMARY KEY,
    name VARCHAR(200),
    color VARCHAR(50),      -- Only relevant for some products
    size VARCHAR(20),       -- Only relevant for clothing
    weight DECIMAL(8,2),    -- Only relevant for shipped items
    isbn VARCHAR(13),       -- Only relevant for books
    author VARCHAR(100)     -- Only relevant for books
    -- ... many more potentially null columns
);

-- EAV approach for flexible attributes
CREATE TABLE product_attributes (
    product_id INT NOT NULL,
    attribute_name VARCHAR(50) NOT NULL,
    attribute_value TEXT,
    attribute_type ENUM('string', 'number', 'date', 'boolean') DEFAULT 'string',
    
    PRIMARY KEY (product_id, attribute_name),
    FOREIGN KEY (product_id) REFERENCES products(product_id) ON DELETE CASCADE,
    INDEX idx_attribute_name (attribute_name)
);

-- Inserting flexible attributes
INSERT INTO product_attributes (product_id, attribute_name, attribute_value, attribute_type) VALUES
(1001, 'color', 'red', 'string'),
(1001, 'size', 'large', 'string'),
(1002, 'isbn', '978-0123456789', 'string'),
(1002, 'author', 'John Smith', 'string'),
(1003, 'weight_kg', '2.5', 'number');

The EAV pattern is useful when:

You have highly variable attributes across different product types
You need to add new attributes without schema changes
The number of null values in a traditional approach would be excessive

However, avoid EAV when:

Most records share the same attributes
You need strong typing and validation
Query performance is critical (EAV queries are more complex)

Audit Trails and Historical Data

Tracking changes over time is crucial for many applications. Here's how to implement effective audit trails:

-- Main products table with current data
CREATE TABLE products (
    product_id INT PRIMARY KEY AUTO_INCREMENT,
    product_name VARCHAR(200) NOT NULL,
    category_id INT NOT NULL,
    base_price DECIMAL(10,2) NOT NULL,
    is_active BOOLEAN DEFAULT TRUE,
    version INT DEFAULT 1,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    
    FOREIGN KEY (category_id) REFERENCES product_categories(category_id)
);

-- Audit table capturing all changes
CREATE TABLE products_audit (
    audit_id BIGINT PRIMARY KEY AUTO_INCREMENT,
    product_id INT NOT NULL,
    product_name VARCHAR(200),
    category_id INT,
    base_price DECIMAL(10,2),
    is_active BOOLEAN,
    version INT,
    change_type ENUM('INSERT', 'UPDATE', 'DELETE') NOT NULL,
    changed_by VARCHAR(100),
    changed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    
    INDEX idx_product_audit (product_id, changed_at),
    INDEX idx_audit_date (changed_at)
);

-- Trigger to automatically populate audit table
DELIMITER //
CREATE TRIGGER products_audit_trigger
    AFTER UPDATE ON products
    FOR EACH ROW
BEGIN
    INSERT INTO products_audit (
        product_id, product_name, category_id, base_price, is_active, 
        version, change_type, changed_by
    ) VALUES (
        NEW.product_id, NEW.product_name, NEW.category_id, NEW.base_price, 
        NEW.is_active, NEW.version, 'UPDATE', USER()
    );
END//
DELIMITER ;

This approach gives you:

Complete change history for compliance and debugging
Ability to see data at any point in time
Tracking of who made changes and when

Hands-On Exercise: Building a Complete Schema

Let's put everything together by designing a schema for a project management application. This will include users, projects, tasks, and time tracking.

Start by identifying the core entities and their relationships:

-- User management
CREATE TABLE users (
    user_id INT PRIMARY KEY AUTO_INCREMENT,
    username VARCHAR(50) UNIQUE NOT NULL,
    email VARCHAR(255) UNIQUE NOT NULL,
    password_hash VARCHAR(255) NOT NULL,
    first_name VARCHAR(100) NOT NULL,
    last_name VARCHAR(100) NOT NULL,
    role ENUM('admin', 'manager', 'developer', 'client') NOT NULL,
    is_active BOOLEAN DEFAULT TRUE,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    last_login_at TIMESTAMP NULL,
    
    INDEX idx_user_email (email),
    INDEX idx_user_role (role)
);

-- Project organization
CREATE TABLE projects (
    project_id INT PRIMARY KEY AUTO_INCREMENT,
    project_name VARCHAR(200) NOT NULL,
    description TEXT,
    project_manager_id INT NOT NULL,
    client_id INT,
    status ENUM('planning', 'active', 'on_hold', 'completed', 'cancelled') DEFAULT 'planning',
    start_date DATE,
    target_end_date DATE,
    actual_end_date DATE NULL,
    budget DECIMAL(12,2),
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    
    FOREIGN KEY (project_manager_id) REFERENCES users(user_id),
    FOREIGN KEY (client_id) REFERENCES users(user_id),
    INDEX idx_project_manager (project_manager_id),
    INDEX idx_project_status (status),
    INDEX idx_project_dates (start_date, target_end_date)
);

-- Task management
CREATE TABLE tasks (
    task_id INT PRIMARY KEY AUTO_INCREMENT,
    project_id INT NOT NULL,
    parent_task_id INT NULL,  -- For subtasks
    assigned_to_user_id INT,
    task_title VARCHAR(200) NOT NULL,
    description TEXT,
    priority ENUM('low', 'medium', 'high', 'critical') DEFAULT 'medium',
    status ENUM('todo', 'in_progress', 'review', 'done', 'blocked') DEFAULT 'todo',
    estimated_hours DECIMAL(6,2),
    actual_hours DECIMAL(6,2) DEFAULT 0,
    due_date DATE,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    
    FOREIGN KEY (project_id) REFERENCES projects(project_id) ON DELETE CASCADE,
    FOREIGN KEY (parent_task_id) REFERENCES tasks(task_id) ON DELETE SET NULL,
    FOREIGN KEY (assigned_to_user_id) REFERENCES users(user_id) ON DELETE SET NULL,
    
    INDEX idx_task_project (project_id),
    INDEX idx_task_assignee (assigned_to_user_id),
    INDEX idx_task_status (status),
    INDEX idx_task_due_date (due_date)
);

-- Time tracking
CREATE TABLE time_entries (
    entry_id BIGINT PRIMARY KEY AUTO_INCREMENT,
    user_id INT NOT NULL,
    task_id INT NOT NULL,
    project_id INT NOT NULL,  -- Denormalized for faster reporting
    start_time TIMESTAMP NOT NULL,
    end_time TIMESTAMP,
    duration_minutes INT,  -- Calculated field for easier queries
    description TEXT,
    is_billable BOOLEAN DEFAULT TRUE,
    hourly_rate DECIMAL(8,2),
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    
    FOREIGN KEY (user_id) REFERENCES users(user_id),
    FOREIGN KEY (task_id) REFERENCES tasks(task_id) ON DELETE CASCADE,
    FOREIGN KEY (project_id) REFERENCES projects(project_id) ON DELETE CASCADE,
    
    INDEX idx_time_entries_user_date (user_id, start_time),
    INDEX idx_time_entries_project (project_id, start_time),
    INDEX idx_time_entries_task (task_id)
);

-- Project team membership
CREATE TABLE project_team_members (
    project_id INT NOT NULL,
    user_id INT NOT NULL,
    role ENUM('manager', 'developer', 'tester', 'designer') NOT NULL,
    joined_date DATE DEFAULT (CURRENT_DATE),
    left_date DATE NULL,
    
    PRIMARY KEY (project_id, user_id),
    FOREIGN KEY (project_id) REFERENCES projects(project_id) ON DELETE CASCADE,
    FOREIGN KEY (user_id) REFERENCES users(user_id) ON DELETE CASCADE
);

Now let's add some sample data and test our schema:

-- Insert sample users
INSERT INTO users (username, email, password_hash, first_name, last_name, role) VALUES
('jmanager', 'john.manager@company.com', 'hashed_password_1', 'John', 'Manager', 'manager'),
('adev', 'alice.developer@company.com', 'hashed_password_2', 'Alice', 'Developer', 'developer'),
('bclient', 'bob.client@client.com', 'hashed_password_3', 'Bob', 'Client', 'client');

-- Insert a sample project
INSERT INTO projects (project_name, description, project_manager_id, client_id, status, start_date, target_end_date, budget) VALUES
('E-commerce Platform', 'Build a new e-commerce platform with advanced features', 1, 3, 'active', '2024-01-01', '2024-06-30', 150000.00);

-- Add team members to the project
INSERT INTO project_team_members (project_id, user_id, role) VALUES
(1, 1, 'manager'),
(1, 2, 'developer');

-- Create some tasks
INSERT INTO tasks (project_id, assigned_to_user_id, task_title, description, priority, estimated_hours, due_date) VALUES
(1, 2, 'Database Schema Design', 'Design and implement the database schema for the e-commerce platform', 'high', 16.0, '2024-02-15'),
(1, 2, 'User Authentication System', 'Implement secure user login and registration', 'high', 24.0, '2024-03-01');

-- Test with some queries
-- Get all active projects with their managers
SELECT p.project_name, p.status, CONCAT(u.first_name, ' ', u.last_name) as manager_name
FROM projects p
JOIN users u ON p.project_manager_id = u.user_id
WHERE p.status = 'active';

-- Get project progress summary
SELECT 
    p.project_name,
    COUNT(t.task_id) as total_tasks,
    COUNT(CASE WHEN t.status = 'done' THEN 1 END) as completed_tasks,
    ROUND(COUNT(CASE WHEN t.status = 'done' THEN 1 END) * 100.0 / COUNT(t.task_id), 2) as completion_percentage
FROM projects p
LEFT JOIN tasks t ON p.project_id = t.project_id
GROUP BY p.project_id, p.project_name;

Common Mistakes & Troubleshooting

Even experienced developers make schema design mistakes. Here are the most common ones and how to avoid them:

Mistake 1: Using VARCHAR for Everything

-- Wrong: Everything is VARCHAR
CREATE TABLE orders_wrong (
    order_id VARCHAR(50),      -- Should be INT
    order_date VARCHAR(50),    -- Should be TIMESTAMP
    total_amount VARCHAR(50),  -- Should be DECIMAL
    is_shipped VARCHAR(10)     -- Should be BOOLEAN
);

-- Right: Appropriate data types
CREATE TABLE orders_right (
    order_id INT PRIMARY KEY AUTO_INCREMENT,
    order_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    total_amount DECIMAL(10,2) NOT NULL,
    is_shipped BOOLEAN DEFAULT FALSE
);

Why it matters: Using inappropriate data types leads to:

Invalid data being stored (like "maybe" in a boolean field)
Poor query performance due to type conversions
Incorrect sorting and calculations
Wasted storage space

Mistake 2: Missing Foreign Key Constraints

-- Dangerous: No foreign key constraints
CREATE TABLE order_items_dangerous (
    order_id INT,  -- No constraint - orphaned records possible
    product_id INT -- No constraint - references to deleted products
);

-- Safe: Proper foreign key constraints
CREATE TABLE order_items_safe (
    order_id INT NOT NULL,
    product_id INT NOT NULL,
    FOREIGN KEY (order_id) REFERENCES orders(order_id) ON DELETE CASCADE,
    FOREIGN KEY (product_id) REFERENCES products(product_id) ON DELETE RESTRICT
);

Debugging tip: If you're getting foreign key constraint errors, check:

Does the referenced table exist?
Does the referenced column exist and have the same data type?
Are you trying to insert a value that doesn't exist in the parent table?

Mistake 3: Poor Indexing Strategy

-- Problem: No indexes on frequently queried columns
SELECT * FROM orders WHERE customer_id = 1001 AND order_date > '2024-01-01';
-- This query will be slow without proper indexes

-- Solution: Add appropriate indexes
CREATE INDEX idx_customer_orders ON orders (customer_id, order_date);

-- Also common mistake: Too many single-column indexes instead of composite ones
-- Less efficient:
CREATE INDEX idx_customer_id ON orders (customer_id);
CREATE INDEX idx_order_date ON orders (order_date);

-- More efficient for the above query:
CREATE INDEX idx_customer_orders ON orders (customer_id, order_date);

Mistake 4: Not Planning for Schema Evolution

Your schema will change over time. Design it to evolve gracefully:

-- Brittle: Hard to change later
CREATE TABLE users_brittle (
    id INT PRIMARY KEY,
    name VARCHAR(100),
    address VARCHAR(255)  -- What if users have multiple addresses?
);

-- Flexible: Designed for evolution
CREATE TABLE users_flexible (
    user_id INT PRIMARY KEY AUTO_INCREMENT,
    first_name VARCHAR(100),
    last_name VARCHAR(100),
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    version INT DEFAULT 1  -- For optimistic locking
);

-- Separate address table for flexibility
CREATE TABLE user_addresses (
    address_id INT PRIMARY KEY AUTO_INCREMENT,
    user_id INT NOT NULL,
    address_type ENUM('home', 'work', 'billing', 'shipping'),
    street VARCHAR(255),
    city VARCHAR(100),
    state VARCHAR(50),
    postal_code VARCHAR(20),
    country VARCHAR(100),
    is_primary BOOLEAN DEFAULT FALSE,
    
    FOREIGN KEY (user_id) REFERENCES users_flexible(user_id) ON DELETE CASCADE
);

Performance Considerations and Optimization

As your database grows, performance becomes increasingly important. Here are key strategies for optimizing your schema for performance:

Understanding Query Execution Plans

Before optimizing, understand how your database executes queries:

-- Use EXPLAIN to see query execution plans
EXPLAIN SELECT 
    p.project_name,
    COUNT(t.task_id) as task_count
FROM projects p
LEFT JOIN tasks t ON p.project_id = t.project_id
WHERE p.status = 'active'
GROUP BY p.project_id, p.project_name;

-- Look for:
-- - Full table scans (bad)
-- - Index usage (good)
-- - Join types (nested loop vs hash join)
-- - Row estimates vs actual rows

Denormalization for Performance

Sometimes breaking normalization rules improves performance:

-- Normalized but potentially slow
SELECT 
    o.order_id,
    o.order_date,
    SUM(oi.quantity * oi.unit_price) as order_total
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
GROUP BY o.order_id, o.order_date;

-- Denormalized for speed: store calculated total
CREATE TABLE orders_optimized (
    order_id INT PRIMARY KEY AUTO_INCREMENT,
    customer_id INT NOT NULL,
    order_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    total_amount DECIMAL(10,2) NOT NULL,  -- Denormalized calculated field
    item_count INT NOT NULL,              -- Denormalized count
    
    FOREIGN KEY (customer_id) REFERENCES customers(customer_id),
    INDEX idx_customer_orders (customer_id, order_date)
);

-- Update triggers maintain consistency
DELIMITER //
CREATE TRIGGER update_order_totals
    AFTER INSERT ON order_items
    FOR EACH ROW
BEGIN
    UPDATE orders_optimized 
    SET 
        total_amount = (
            SELECT SUM(quantity * unit_price) 
            FROM order_items 
            WHERE order_id = NEW.order_id
        ),
        item_count = (
            SELECT COUNT(*) 
            FROM order_items 
            WHERE order_id = NEW.order_id
        )
    WHERE order_id = NEW.order_id;
END//
DELIMITER ;

Partitioning for Large Tables

For very large tables, consider partitioning:

-- Partition time_entries by month for better performance
CREATE TABLE time_entries_partitioned (
    entry_id BIGINT PRIMARY KEY AUTO_INCREMENT,
    user_id INT NOT NULL,
    task_id INT NOT NULL,
    start_time TIMESTAMP NOT NULL,
    end_time TIMESTAMP,
    duration_minutes INT,
    
    INDEX idx_user_time (user_id, start_time)
) PARTITION BY RANGE (YEAR(start_time) * 100 + MONTH(start_time)) (
    PARTITION p202401 VALUES LESS THAN (202402),
    PARTITION p202402 VALUES LESS THAN (202403),
    PARTITION p202403 VALUES LESS THAN (202404),
    -- Add more partitions as needed
    PARTITION p_future VALUES LESS THAN MAXVALUE
);

Summary & Next Steps

You've learned the fundamentals of SQL data types and schema design that will serve as the foundation for all your database projects. The key takeaways are:

Choose data types deliberately - Consider storage efficiency, data integrity, and query performance implications
Design for the future - Your schema will evolve, so build flexibility into your design from the start
Balance normalization and performance - Perfect normalization isn't always the right answer for real-world applications
Index strategically - Focus on your actual query patterns rather than indexing everything
Use constraints to enforce business rules - Let the database help maintain data quality

The schema design principles you've learned here scale from small applications to enterprise systems. The main difference is complexity and volume, not the fundamental approaches.

Your next steps should be:

Practice with real data - Take a dataset you're familiar with and design a proper schema for it
Learn query optimization - Understanding how to write efficient queries is the natural next step
Explore advanced topics - Look into database-specific features like stored procedures, views, and advanced indexing options
Study existing schemas - Examine open-source applications to see how experienced teams structure their databases

Remember that schema design is both an art and a science. The technical principles provide the framework, but good judgment comes from experience. Start applying these concepts to your own projects, and don't be afraid to iterate and improve your designs as you learn more about your application's usage patterns.