Mastering String and Date Functions: Essential SQL Tools for Data Transformation

You're working with a customer database that's been accumulating data for years. Names are inconsistent—some all caps, others mixed case, some with extra spaces. Dates are scattered across different formats and time zones. Phone numbers follow various patterns. Your analytics team needs clean, standardized data, and your reporting dashboard is breaking because date comparisons aren't working as expected.

This isn't just about knowing function syntax—it's about understanding how to systematically clean, transform, and manipulate text and temporal data at scale. By the end of this lesson, you'll have the skills to handle the messy realities of production data systems with confidence.

What you'll learn:

Master essential string functions for data cleaning and standardization
Implement robust date arithmetic and formatting for reporting systems
Handle timezone conversions and date parsing challenges
Build reusable patterns for common data transformation tasks
Debug and troubleshoot function behavior across different SQL dialects

Prerequisites

You should be comfortable writing SELECT statements, understand basic data types (VARCHAR, DATE, TIMESTAMP), and have experience with WHERE clauses and basic joins. Familiarity with NULL handling and CASE statements will help, though we'll review key concepts as needed.

The String Function Arsenal: Beyond Basic Concatenation

Let's start with a realistic scenario. You've inherited a customer table where data entry standards were... flexible:

-- Sample customer data showing common real-world inconsistencies
SELECT 
    customer_id,
    first_name,
    last_name,
    email,
    phone,
    company_name
FROM customers
LIMIT 5;

/*
customer_id | first_name | last_name | email                    | phone          | company_name
1          | john       | SMITH     | John.Smith@GMAIL.com     | (555) 123-4567 | ACME corp.
2          | Mary       | johnson   | mary.j@company.co.uk     | 555.123.4567   |   TechStart Inc  
3          | ROBERT     |  Lee      | bob@personal-email.net   | +1-555-123-4567| Global Solutions LLC
4          | sarah      | O'Connor  | sarah.oconnor@work.org   | 5551234567     | null
5          |  Jennifer  | Brown     | j.brown+work@email.com   | (555)123-4567  | Data & Analytics Co
*/

String Cleaning and Standardization

The foundation of string manipulation starts with cleaning functions. Here's how to systematically standardize this data:

-- Comprehensive name standardization
SELECT 
    customer_id,
    -- Standardize names: proper case with trimmed whitespace
    INITCAP(TRIM(first_name)) AS clean_first_name,
    INITCAP(TRIM(last_name)) AS clean_last_name,
    
    -- Email standardization: lowercase and trimmed
    LOWER(TRIM(email)) AS clean_email,
    
    -- Company name cleaning with NULL handling
    CASE 
        WHEN TRIM(company_name) = '' OR UPPER(TRIM(company_name)) = 'NULL' 
        THEN NULL 
        ELSE INITCAP(TRIM(company_name))
    END AS clean_company_name
FROM customers;

Database Dialect Note: INITCAP isn't available in all SQL databases. In MySQL, use a combination of UPPER(LEFT(column, 1)) and LOWER(SUBSTRING(column, 2)). In SQL Server, consider creating a custom function or using multiple CASE statements.

Advanced String Parsing and Extraction

Real-world scenarios often require extracting meaningful information from unstructured text. Let's parse those inconsistent phone numbers:

-- Phone number standardization using string functions
SELECT 
    customer_id,
    phone AS original_phone,
    
    -- Remove all non-numeric characters
    REGEXP_REPLACE(phone, '[^0-9]', '') AS digits_only,
    
    -- Format as (XXX) XXX-XXXX for 10-digit numbers
    CASE 
        WHEN LENGTH(REGEXP_REPLACE(phone, '[^0-9]', '')) = 10 THEN
            CONCAT('(', 
                   SUBSTRING(REGEXP_REPLACE(phone, '[^0-9]', ''), 1, 3), 
                   ') ', 
                   SUBSTRING(REGEXP_REPLACE(phone, '[^0-9]', ''), 4, 3),
                   '-',
                   SUBSTRING(REGEXP_REPLACE(phone, '[^0-9]', ''), 7, 4))
        WHEN LENGTH(REGEXP_REPLACE(phone, '[^0-9]', '')) = 11 AND 
             LEFT(REGEXP_REPLACE(phone, '[^0-9]', ''), 1) = '1' THEN
            CONCAT('(', 
                   SUBSTRING(REGEXP_REPLACE(phone, '[^0-9]', ''), 2, 3), 
                   ') ', 
                   SUBSTRING(REGEXP_REPLACE(phone, '[^0-9]', ''), 5, 3),
                   '-',
                   SUBSTRING(REGEXP_REPLACE(phone, '[^0-9]', ''), 8, 4))
        ELSE 'INVALID_PHONE'
    END AS formatted_phone
FROM customers;

String Pattern Matching for Data Validation

Beyond cleaning, string functions help identify data quality issues. Here's how to build comprehensive validation checks:

-- Comprehensive data validation using string functions
WITH validation_checks AS (
    SELECT 
        customer_id,
        first_name,
        last_name,
        email,
        phone,
        
        -- Email validation using pattern matching
        CASE 
            WHEN email ~ '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$' 
            THEN 'VALID'
            ELSE 'INVALID'
        END AS email_status,
        
        -- Name validation (alphabetic characters and common punctuation)
        CASE 
            WHEN first_name ~ '^[A-Za-z\s''-]+$' 
            THEN 'VALID' 
            ELSE 'INVALID' 
        END AS first_name_status,
        
        -- Phone validation (after cleaning)
        CASE 
            WHEN LENGTH(REGEXP_REPLACE(phone, '[^0-9]', '')) BETWEEN 10 AND 11 
            THEN 'VALID'
            ELSE 'INVALID'
        END AS phone_status
    FROM customers
)
SELECT 
    customer_id,
    first_name,
    email,
    phone,
    email_status,
    first_name_status,
    phone_status,
    -- Overall record quality score
    CASE 
        WHEN email_status = 'VALID' AND first_name_status = 'VALID' AND phone_status = 'VALID'
        THEN 'HIGH_QUALITY'
        WHEN email_status = 'VALID' AND (first_name_status = 'VALID' OR phone_status = 'VALID')
        THEN 'MEDIUM_QUALITY'
        ELSE 'LOW_QUALITY'
    END AS data_quality_tier
FROM validation_checks
ORDER BY data_quality_tier, customer_id;

Date Functions: Mastering Temporal Data

Working with dates in SQL goes far beyond simple date arithmetic. Let's explore a sales analytics scenario where you need to handle multiple date formats, time zones, and complex business logic.

Date Parsing and Standardization

Consider this sales data with inconsistent date formats:

-- Sample sales data with date formatting challenges
SELECT * FROM sales_raw LIMIT 5;

/*
order_id | customer_id | order_date_str      | ship_date_str | amount | region
1001     | 123         | 2024-03-15         | Mar 20, 2024  | 1250.00| Americas
1002     | 456         | 15/03/2024         | 2024-03-18    | 890.50 | EMEA
1003     | 789         | March 15, 2024     | 03/20/24      | 2100.75| APAC
1004     | 234         | 2024-3-15          | 20240320      | 675.25 | Americas
1005     | 567         | 15-Mar-2024        | 2024/03/18    | 1450.00| EMEA
*/

Here's how to systematically parse and standardize these dates:

-- Comprehensive date parsing strategy
WITH parsed_dates AS (
    SELECT 
        order_id,
        customer_id,
        amount,
        region,
        order_date_str,
        ship_date_str,
        
        -- Parse order_date_str using multiple patterns
        CASE 
            -- ISO format: YYYY-MM-DD
            WHEN order_date_str ~ '^\d{4}-\d{2}-\d{2}$' 
            THEN order_date_str::DATE
            
            -- European format: DD/MM/YYYY
            WHEN order_date_str ~ '^\d{1,2}/\d{1,2}/\d{4}$' 
            THEN TO_DATE(order_date_str, 'DD/MM/YYYY')
            
            -- US text format: Month DD, YYYY
            WHEN order_date_str ~ '^[A-Za-z]+ \d{1,2}, \d{4}$' 
            THEN TO_DATE(order_date_str, 'Month DD, YYYY')
            
            -- Compact format: YYYY-M-DD
            WHEN order_date_str ~ '^\d{4}-\d{1,2}-\d{1,2}$' 
            THEN order_date_str::DATE
            
            -- European text: DD-Mon-YYYY
            WHEN order_date_str ~ '^\d{1,2}-[A-Za-z]{3}-\d{4}$' 
            THEN TO_DATE(order_date_str, 'DD-Mon-YYYY')
            
            ELSE NULL
        END AS order_date,
        
        -- Parse ship_date_str with similar logic
        CASE 
            WHEN ship_date_str ~ '^[A-Za-z]+ \d{1,2}, \d{4}$' 
            THEN TO_DATE(ship_date_str, 'Month DD, YYYY')
            
            WHEN ship_date_str ~ '^\d{4}-\d{2}-\d{2}$' 
            THEN ship_date_str::DATE
            
            WHEN ship_date_str ~ '^\d{1,2}/\d{1,2}/\d{2}$' 
            THEN TO_DATE(ship_date_str, 'MM/DD/YY')
            
            WHEN ship_date_str ~ '^\d{8}$' 
            THEN TO_DATE(ship_date_str, 'YYYYMMDD')
            
            WHEN ship_date_str ~ '^\d{4}/\d{2}/\d{2}$' 
            THEN TO_DATE(ship_date_str, 'YYYY/MM/DD')
            
            ELSE NULL
        END AS ship_date
    FROM sales_raw
)
SELECT 
    order_id,
    customer_id,
    order_date_str,
    order_date,
    ship_date_str,
    ship_date,
    amount,
    region,
    
    -- Data quality flags
    CASE WHEN order_date IS NULL THEN 'PARSE_FAILED' ELSE 'OK' END AS order_date_status,
    CASE WHEN ship_date IS NULL THEN 'PARSE_FAILED' ELSE 'OK' END AS ship_date_status,
    
    -- Business logic validation
    CASE 
        WHEN order_date IS NULL OR ship_date IS NULL THEN 'INCOMPLETE_DATES'
        WHEN ship_date < order_date THEN 'INVALID_SEQUENCE'
        WHEN ship_date - order_date > 30 THEN 'LONG_FULFILLMENT'
        ELSE 'VALID'
    END AS date_validation_status
    
FROM parsed_dates
ORDER BY order_date;

Advanced Date Calculations for Business Analytics

With clean dates, you can build sophisticated business intelligence queries. Here are patterns you'll use repeatedly:

-- Comprehensive sales analytics with advanced date functions
WITH sales_with_periods AS (
    SELECT 
        order_id,
        customer_id,
        order_date,
        ship_date,
        amount,
        region,
        
        -- Date component extraction
        EXTRACT(YEAR FROM order_date) AS order_year,
        EXTRACT(MONTH FROM order_date) AS order_month,
        EXTRACT(QUARTER FROM order_date) AS order_quarter,
        EXTRACT(DOW FROM order_date) AS order_day_of_week, -- 0=Sunday
        EXTRACT(WEEK FROM order_date) AS order_week,
        
        -- Fiscal year calculation (assuming April 1 start)
        CASE 
            WHEN EXTRACT(MONTH FROM order_date) >= 4 
            THEN EXTRACT(YEAR FROM order_date)
            ELSE EXTRACT(YEAR FROM order_date) - 1
        END AS fiscal_year,
        
        -- Business day calculations
        ship_date - order_date AS fulfillment_days,
        
        -- Weekend/weekday classification
        CASE 
            WHEN EXTRACT(DOW FROM order_date) IN (0, 6) 
            THEN 'Weekend' 
            ELSE 'Weekday' 
        END AS order_day_type,
        
        -- Month-to-date, quarter-to-date flags
        CASE 
            WHEN order_date >= DATE_TRUNC('month', CURRENT_DATE) 
            THEN 'MTD' 
            ELSE 'Prior' 
        END AS mtd_flag,
        
        CASE 
            WHEN order_date >= DATE_TRUNC('quarter', CURRENT_DATE) 
            THEN 'QTD' 
            ELSE 'Prior' 
        END AS qtd_flag
        
    FROM parsed_dates 
    WHERE order_date IS NOT NULL AND ship_date IS NOT NULL
)
SELECT 
    order_year,
    order_quarter,
    region,
    order_day_type,
    COUNT(*) AS order_count,
    SUM(amount) AS total_revenue,
    AVG(amount) AS avg_order_value,
    AVG(fulfillment_days) AS avg_fulfillment_days,
    MIN(order_date) AS first_order,
    MAX(order_date) AS last_order,
    
    -- Year-over-year growth calculation
    LAG(SUM(amount)) OVER (
        PARTITION BY order_quarter, region 
        ORDER BY order_year
    ) AS prior_year_revenue,
    
    ROUND(
        (SUM(amount) - LAG(SUM(amount)) OVER (
            PARTITION BY order_quarter, region 
            ORDER BY order_year
        )) / NULLIF(LAG(SUM(amount)) OVER (
            PARTITION BY order_quarter, region 
            ORDER BY order_year
        ), 0) * 100, 2
    ) AS yoy_growth_pct
    
FROM sales_with_periods
GROUP BY order_year, order_quarter, region, order_day_type
ORDER BY order_year, order_quarter, region, order_day_type;

Time Zone Handling and UTC Conversions

Modern applications often deal with global data requiring timezone awareness:

-- Working with timezone-aware data
WITH timezone_conversions AS (
    SELECT 
        order_id,
        customer_id,
        region,
        
        -- Original timestamps (assuming they're in local time)
        order_timestamp_local,
        
        -- Convert to UTC based on region
        CASE 
            WHEN region = 'Americas' THEN 
                order_timestamp_local AT TIME ZONE 'America/New_York' AT TIME ZONE 'UTC'
            WHEN region = 'EMEA' THEN 
                order_timestamp_local AT TIME ZONE 'Europe/London' AT TIME ZONE 'UTC'
            WHEN region = 'APAC' THEN 
                order_timestamp_local AT TIME ZONE 'Asia/Singapore' AT TIME ZONE 'UTC'
            ELSE order_timestamp_local AT TIME ZONE 'UTC'
        END AS order_timestamp_utc,
        
        amount
    FROM sales_with_timestamps
),
global_sales_analysis AS (
    SELECT 
        DATE_TRUNC('hour', order_timestamp_utc) AS order_hour_utc,
        EXTRACT(HOUR FROM order_timestamp_utc) AS hour_of_day_utc,
        region,
        COUNT(*) AS orders_count,
        SUM(amount) AS hourly_revenue
    FROM timezone_conversions
    WHERE order_timestamp_utc >= CURRENT_DATE - INTERVAL '7 days'
    GROUP BY DATE_TRUNC('hour', order_timestamp_utc), EXTRACT(HOUR FROM order_timestamp_utc), region
)
SELECT 
    hour_of_day_utc,
    region,
    AVG(orders_count) AS avg_orders_per_hour,
    AVG(hourly_revenue) AS avg_revenue_per_hour,
    -- Peak hours identification
    CASE 
        WHEN AVG(orders_count) > 
            AVG(AVG(orders_count)) OVER (PARTITION BY region) * 1.5 
        THEN 'PEAK' 
        ELSE 'NORMAL' 
    END AS hour_classification
FROM global_sales_analysis
GROUP BY hour_of_day_utc, region
ORDER BY region, hour_of_day_utc;

Combining String and Date Functions for Complex Transformations

Real-world scenarios often require combining string and date manipulations. Here's a comprehensive example that builds a customer segmentation system:

-- Advanced customer segmentation using combined string and date functions
WITH customer_analysis AS (
    SELECT 
        c.customer_id,
        INITCAP(TRIM(c.first_name)) || ' ' || INITCAP(TRIM(c.last_name)) AS full_name,
        LOWER(TRIM(c.email)) AS email,
        c.registration_date,
        
        -- Email domain extraction for B2B vs B2C classification
        SUBSTRING(LOWER(TRIM(c.email)) FROM '@(.+)$') AS email_domain,
        
        -- Account age calculation
        CURRENT_DATE - c.registration_date AS account_age_days,
        
        -- Registration cohort (month-year)
        TO_CHAR(c.registration_date, 'YYYY-MM') AS registration_cohort,
        
        -- Aggregate sales data
        COUNT(s.order_id) AS total_orders,
        COALESCE(SUM(s.amount), 0) AS total_spent,
        COALESCE(AVG(s.amount), 0) AS avg_order_value,
        MIN(s.order_date) AS first_order_date,
        MAX(s.order_date) AS last_order_date,
        
        -- Recency calculation
        CURRENT_DATE - MAX(s.order_date) AS days_since_last_order
        
    FROM customers c
    LEFT JOIN sales s ON c.customer_id = s.customer_id
    WHERE c.registration_date IS NOT NULL
    GROUP BY c.customer_id, c.first_name, c.last_name, c.email, c.registration_date
),
customer_segments AS (
    SELECT 
        customer_id,
        full_name,
        email,
        email_domain,
        registration_date,
        registration_cohort,
        account_age_days,
        total_orders,
        total_spent,
        avg_order_value,
        days_since_last_order,
        
        -- Domain-based customer type classification
        CASE 
            WHEN email_domain IN ('gmail.com', 'yahoo.com', 'hotmail.com', 'outlook.com', 
                                  'icloud.com', 'aol.com') 
            THEN 'B2C'
            WHEN email_domain LIKE '%.edu' 
            THEN 'Education'
            WHEN email_domain LIKE '%.gov' 
            THEN 'Government'
            ELSE 'B2B'
        END AS customer_type,
        
        -- RFM Segmentation (Recency, Frequency, Monetary)
        CASE 
            WHEN days_since_last_order IS NULL THEN 'Never Purchased'
            WHEN days_since_last_order <= 30 THEN 'Recent'
            WHEN days_since_last_order <= 90 THEN 'Active'
            WHEN days_since_last_order <= 180 THEN 'At Risk'
            ELSE 'Inactive'
        END AS recency_segment,
        
        CASE 
            WHEN total_orders = 0 THEN 'Never Purchased'
            WHEN total_orders = 1 THEN 'One-time'
            WHEN total_orders BETWEEN 2 AND 5 THEN 'Occasional'
            WHEN total_orders BETWEEN 6 AND 15 THEN 'Regular'
            ELSE 'Frequent'
        END AS frequency_segment,
        
        CASE 
            WHEN total_spent = 0 THEN 'No Spend'
            WHEN total_spent < 100 THEN 'Low Value'
            WHEN total_spent < 500 THEN 'Medium Value'
            WHEN total_spent < 2000 THEN 'High Value'
            ELSE 'Premium'
        END AS monetary_segment,
        
        -- Account lifecycle stage
        CASE 
            WHEN account_age_days < 30 THEN 'New'
            WHEN account_age_days < 90 THEN 'Growing'
            WHEN account_age_days < 365 THEN 'Established'
            ELSE 'Mature'
        END AS lifecycle_stage
        
    FROM customer_analysis
)
SELECT 
    registration_cohort,
    customer_type,
    lifecycle_stage,
    recency_segment,
    frequency_segment,
    monetary_segment,
    COUNT(*) AS customer_count,
    AVG(total_spent) AS avg_customer_value,
    AVG(account_age_days) AS avg_account_age,
    SUM(total_spent) AS cohort_total_value,
    
    -- Cohort performance metrics
    ROUND(
        COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (PARTITION BY registration_cohort), 
        2
    ) AS segment_pct_of_cohort,
    
    -- Generate actionable customer segment labels
    CONCAT(
        customer_type, ' - ',
        lifecycle_stage, ' - ',
        recency_segment, ' - ',
        frequency_segment, ' - ',
        monetary_segment
    ) AS detailed_segment
    
FROM customer_segments
GROUP BY registration_cohort, customer_type, lifecycle_stage, 
         recency_segment, frequency_segment, monetary_segment
HAVING COUNT(*) >= 10  -- Only show segments with meaningful size
ORDER BY registration_cohort DESC, customer_count DESC;

Hands-On Exercise: Building a Data Quality Dashboard

Now let's put everything together. You'll build a comprehensive data quality monitoring system that combines string and date functions to assess and report on data health across multiple dimensions.

Exercise Setup

First, create a mock dataset that represents typical data quality challenges:

-- Create sample data with intentional quality issues
WITH sample_data AS (
    SELECT 
        ROW_NUMBER() OVER() as record_id,
        first_name,
        last_name,
        email,
        phone,
        birth_date_str,
        registration_timestamp_str,
        last_login_str,
        country
    FROM VALUES 
        ('john', 'SMITH', 'John.Smith@GMAIL.com', '(555) 123-4567', '1985-06-15', '2024-01-15 10:30:00', '2024-03-10 14:22:33', 'USA'),
        ('', 'johnson', 'mary.j@company.co.uk', '555.123.4567', '15/06/1990', '15-Jan-2024 10:30', '2024-03-09', 'UK'),
        ('ROBERT', '  Lee  ', 'bob@personal-email.net', '+1-555-123-4567', 'June 15, 1988', '2024-01-15T10:30:00Z', '2024/03/08 2:15 PM', 'USA'),
        ('sarah', 'O''Connor', 'not-an-email', '5551234567', '1992-6-15', '2024-01-15', NULL, ''),
        (NULL, 'Brown', 'j.brown+work@email.com', '(555)123-4567', '15-Jun-1985', 'January 15, 2024', '2024-03-07T14:22:33', 'Canada'),
        ('  Jennifer  ', '', 'jennifer@work', '', '1990/06/15', '2024/01/15 10:30:00', '2024-03-06 14:22', 'USA'),
        ('Mike', 'Davis', 'mike.davis@company.com', '555-123-4567', '', '2024-01-20 10:30:00', '2024-03-05', 'Mexico')
    AS t(first_name, last_name, email, phone, birth_date_str, registration_timestamp_str, last_login_str, country)
)
SELECT * FROM sample_data;

Your Task

Build a comprehensive data quality assessment query that:

Standardizes all string fields using appropriate functions
Parses and validates all date fields
Calculates data quality scores for each record
Generates actionable quality improvement recommendations
Creates summary statistics for management reporting

Here's the solution framework to get you started:

-- Complete data quality assessment solution
WITH data_quality_analysis AS (
    SELECT 
        record_id,
        
        -- Original fields
        first_name, last_name, email, phone, birth_date_str, 
        registration_timestamp_str, last_login_str, country,
        
        -- String standardization and validation
        CASE 
            WHEN TRIM(COALESCE(first_name, '')) = '' THEN NULL
            ELSE INITCAP(TRIM(first_name))
        END AS clean_first_name,
        
        CASE 
            WHEN TRIM(COALESCE(last_name, '')) = '' THEN NULL
            ELSE INITCAP(TRIM(last_name))
        END AS clean_last_name,
        
        LOWER(TRIM(email)) AS clean_email,
        
        -- Email validation
        CASE 
            WHEN email ~ '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$' 
            THEN 'VALID'
            WHEN TRIM(COALESCE(email, '')) = '' 
            THEN 'MISSING'
            ELSE 'INVALID'
        END AS email_quality,
        
        -- Phone standardization
        REGEXP_REPLACE(COALESCE(phone, ''), '[^0-9]', '') AS clean_phone_digits,
        
        CASE 
            WHEN LENGTH(REGEXP_REPLACE(COALESCE(phone, ''), '[^0-9]', '')) = 10 
            THEN 'VALID'
            WHEN TRIM(COALESCE(phone, '')) = '' 
            THEN 'MISSING'
            ELSE 'INVALID'
        END AS phone_quality,
        
        -- Date parsing with multiple format support
        CASE 
            WHEN birth_date_str ~ '^\d{4}-\d{2}-\d{2}$' 
            THEN birth_date_str::DATE
            WHEN birth_date_str ~ '^\d{1,2}/\d{1,2}/\d{4}$' 
            THEN TO_DATE(birth_date_str, 'DD/MM/YYYY')
            WHEN birth_date_str ~ '^[A-Za-z]+ \d{1,2}, \d{4}$' 
            THEN TO_DATE(birth_date_str, 'Month DD, YYYY')
            WHEN birth_date_str ~ '^\d{4}-\d{1,2}-\d{1,2}$' 
            THEN birth_date_str::DATE
            WHEN birth_date_str ~ '^\d{1,2}-[A-Za-z]{3}-\d{4}$' 
            THEN TO_DATE(birth_date_str, 'DD-Mon-YYYY')
            WHEN birth_date_str ~ '^\d{4}/\d{2}/\d{2}$' 
            THEN TO_DATE(birth_date_str, 'YYYY/MM/DD')
            ELSE NULL
        END AS parsed_birth_date,
        
        -- Continue with timestamp parsing...
        -- [Add your timestamp parsing logic here]
        
        -- Country standardization
        CASE 
            WHEN UPPER(TRIM(COALESCE(country, ''))) IN ('USA', 'US', 'UNITED STATES') THEN 'USA'
            WHEN UPPER(TRIM(COALESCE(country, ''))) IN ('UK', 'UNITED KINGDOM', 'ENGLAND') THEN 'UK'
            WHEN UPPER(TRIM(COALESCE(country, ''))) IN ('CANADA', 'CA') THEN 'Canada'
            WHEN UPPER(TRIM(COALESCE(country, ''))) IN ('MEXICO', 'MX') THEN 'Mexico'
            WHEN TRIM(COALESCE(country, '')) = '' THEN NULL
            ELSE TRIM(country)
        END AS clean_country
        
    FROM sample_data
),
quality_scores AS (
    SELECT 
        *,
        -- Individual field quality scores (0-100)
        CASE 
            WHEN clean_first_name IS NOT NULL THEN 100 
            ELSE 0 
        END AS first_name_score,
        
        CASE 
            WHEN clean_last_name IS NOT NULL THEN 100 
            ELSE 0 
        END AS last_name_score,
        
        CASE 
            WHEN email_quality = 'VALID' THEN 100
            WHEN email_quality = 'MISSING' THEN 0
            ELSE 25  -- Invalid format gets partial credit
        END AS email_score,
        
        CASE 
            WHEN phone_quality = 'VALID' THEN 100
            WHEN phone_quality = 'MISSING' THEN 0
            ELSE 25
        END AS phone_score,
        
        CASE 
            WHEN parsed_birth_date IS NOT NULL AND 
                 parsed_birth_date BETWEEN '1900-01-01' AND CURRENT_DATE - INTERVAL '13 years'
            THEN 100
            WHEN TRIM(COALESCE(birth_date_str, '')) = '' 
            THEN 0
            ELSE 25
        END AS birth_date_score,
        
        CASE 
            WHEN clean_country IS NOT NULL THEN 100 
            ELSE 50  -- Missing country gets partial score
        END AS country_score
        
    FROM data_quality_analysis
)
SELECT 
    record_id,
    clean_first_name,
    clean_last_name,
    clean_email,
    email_quality,
    phone_quality,
    parsed_birth_date,
    clean_country,
    
    -- Overall quality score
    ROUND(
        (first_name_score + last_name_score + email_score + 
         phone_score + birth_date_score + country_score) / 6.0, 1
    ) AS overall_quality_score,
    
    -- Quality tier classification
    CASE 
        WHEN (first_name_score + last_name_score + email_score + 
              phone_score + birth_date_score + country_score) / 6.0 >= 90 
        THEN 'EXCELLENT'
        WHEN (first_name_score + last_name_score + email_score + 
              phone_score + birth_date_score + country_score) / 6.0 >= 75 
        THEN 'GOOD'
        WHEN (first_name_score + last_name_score + email_score + 
              phone_score + birth_date_score + country_score) / 6.0 >= 50 
        THEN 'FAIR'
        ELSE 'POOR'
    END AS quality_tier,
    
    -- Specific improvement recommendations
    ARRAY_TO_STRING(
        ARRAY[
            CASE WHEN first_name_score < 100 THEN 'Add first name' END,
            CASE WHEN last_name_score < 100 THEN 'Add last name' END,
            CASE WHEN email_score < 100 THEN 'Fix email format' END,
            CASE WHEN phone_score < 100 THEN 'Fix phone format' END,
            CASE WHEN birth_date_score < 100 THEN 'Fix birth date' END,
            CASE WHEN country_score < 100 THEN 'Add country' END
        ]::TEXT[], 
        '; '
    ) AS improvement_recommendations
    
FROM quality_scores
ORDER BY overall_quality_score DESC, record_id;

Common Mistakes & Troubleshooting

String Function Pitfalls

NULL Handling Issues The most common mistake is forgetting how string functions behave with NULL values. Consider this problematic code:

-- WRONG: This will return NULL for any NULL input
SELECT UPPER(first_name) || ' ' || UPPER(last_name) AS full_name
FROM customers;

-- BETTER: Handle NULLs explicitly
SELECT 
    COALESCE(UPPER(TRIM(first_name)), 'UNKNOWN') || ' ' || 
    COALESCE(UPPER(TRIM(last_name)), 'UNKNOWN') AS full_name
FROM customers;

Performance Problems with LIKE and Regular Expressions String pattern matching can be expensive without proper indexing:

-- SLOW: Full table scan with leading wildcard
SELECT * FROM customers WHERE email LIKE '%@gmail.com';

-- FASTER: Use functional index or restructure query
-- CREATE INDEX idx_email_domain ON customers (SUBSTRING(email FROM '@(.+)$'));
SELECT * FROM customers 
WHERE SUBSTRING(email FROM '@(.+)$') = 'gmail.com';

Character Encoding and Collation Issues Be aware of how different databases handle character comparisons:

-- May not work as expected with accented characters
SELECT * FROM customers WHERE UPPER(last_name) = 'MÜLLER';

-- Better approach using collation-aware comparison
SELECT * FROM customers 
WHERE last_name COLLATE "en_US.UTF-8" ILIKE 'müller';

Date Function Debugging

Time Zone Confusion Always be explicit about time zones when working with timestamps:

-- AMBIGUOUS: What timezone is this?
SELECT * FROM orders WHERE created_at >= '2024-01-01 00:00:00';

-- CLEAR: Explicitly handle timezone
SELECT * FROM orders 
WHERE created_at AT TIME ZONE 'UTC' >= '2024-01-01 00:00:00'::TIMESTAMP AT TIME ZONE 'UTC';

Date Range Boundary Issues Be careful with date comparisons that might miss records:

-- WRONG: Misses records from 2024-01-31
SELECT * FROM orders 
WHERE DATE(created_at) BETWEEN '2024-01-01' AND '2024-01-31';

-- BETTER: Use proper timestamp ranges
SELECT * FROM orders 
WHERE created_at >= '2024-01-01 00:00:00'
  AND created_at < '2024-02-01 00:00:00';

Leap Year and DST Calculations Date arithmetic can be tricky around edge cases:

-- Can fail on leap years
SELECT DATE '2024-02-29' + INTERVAL '1 year'; -- Results in 2025-02-28

-- More robust approach for anniversary calculations
SELECT DATE '2024-02-29' + 
       (INTERVAL '1 year' - INTERVAL '1 day') + 
       INTERVAL '1 day';

Cross-Database Compatibility

Different SQL databases have varying function syntax. Here's how to handle common variations:

-- String concatenation varies by database
-- PostgreSQL: 'Hello' || ' ' || 'World'
-- MySQL: CONCAT('Hello', ' ', 'World')
-- SQL Server: 'Hello' + ' ' + 'World' or CONCAT('Hello', ' ', 'World')

-- Date formatting differences
-- PostgreSQL: TO_CHAR(date_col, 'YYYY-MM-DD')
-- MySQL: DATE_FORMAT(date_col, '%Y-%m-%d')
-- SQL Server: FORMAT(date_col, 'yyyy-MM-dd')

Testing Strategy: Always test your string and date functions with edge cases: NULL values, empty strings, leap years, time zone boundaries, and maximum/minimum date values. Create unit tests that verify behavior across these scenarios.

Summary & Next Steps

You've now mastered the essential string and date functions that form the backbone of data cleaning and transformation work. These skills enable you to handle the messy realities of production data systems with confidence.

Key takeaways:

String standardization requires systematic approaches to handle NULL values, encoding issues, and format variations
Date parsing needs robust error handling and multiple format recognition patterns
Combining string and date functions enables sophisticated data quality and segmentation systems
Performance considerations become critical when applying these functions to large datasets
Cross-database compatibility requires understanding dialect differences

Immediate next steps:

Apply these patterns to your own datasets, starting with the data quality assessment framework
Create reusable functions or stored procedures for your most common transformation patterns
Build monitoring queries that alert you to data quality degradation over time
Practice the debugging techniques with real edge cases from your data

Advanced topics to explore:

Regular expression mastery for complex pattern matching
Window functions combined with date calculations for cohort analysis
JSON string parsing for semi-structured data
Performance optimization techniques for large-scale string/date operations
Building automated data quality pipelines using these functions

The functions you've learned here aren't just syntax—they're problem-solving tools that will serve you throughout your data career. Start applying them to real scenarios immediately, and you'll quickly discover additional patterns and optimizations that make your data work more effective and reliable.

Master SQL String and Date Functions: Essential Data Transformation Skills