You're working with a customer database that's been accumulating data for years. Names are inconsistent—some all caps, others mixed case, some with extra spaces. Dates are scattered across different formats and time zones. Phone numbers follow various patterns. Your analytics team needs clean, standardized data, and your reporting dashboard is breaking because date comparisons aren't working as expected.
This isn't just about knowing function syntax—it's about understanding how to systematically clean, transform, and manipulate text and temporal data at scale. By the end of this lesson, you'll have the skills to handle the messy realities of production data systems with confidence.
What you'll learn:
You should be comfortable writing SELECT statements, understand basic data types (VARCHAR, DATE, TIMESTAMP), and have experience with WHERE clauses and basic joins. Familiarity with NULL handling and CASE statements will help, though we'll review key concepts as needed.
Let's start with a realistic scenario. You've inherited a customer table where data entry standards were... flexible:
-- Sample customer data showing common real-world inconsistencies
SELECT
customer_id,
first_name,
last_name,
email,
phone,
company_name
FROM customers
LIMIT 5;
/*
customer_id | first_name | last_name | email | phone | company_name
1 | john | SMITH | John.Smith@GMAIL.com | (555) 123-4567 | ACME corp.
2 | Mary | johnson | mary.j@company.co.uk | 555.123.4567 | TechStart Inc
3 | ROBERT | Lee | bob@personal-email.net | +1-555-123-4567| Global Solutions LLC
4 | sarah | O'Connor | sarah.oconnor@work.org | 5551234567 | null
5 | Jennifer | Brown | j.brown+work@email.com | (555)123-4567 | Data & Analytics Co
*/
The foundation of string manipulation starts with cleaning functions. Here's how to systematically standardize this data:
-- Comprehensive name standardization
SELECT
customer_id,
-- Standardize names: proper case with trimmed whitespace
INITCAP(TRIM(first_name)) AS clean_first_name,
INITCAP(TRIM(last_name)) AS clean_last_name,
-- Email standardization: lowercase and trimmed
LOWER(TRIM(email)) AS clean_email,
-- Company name cleaning with NULL handling
CASE
WHEN TRIM(company_name) = '' OR UPPER(TRIM(company_name)) = 'NULL'
THEN NULL
ELSE INITCAP(TRIM(company_name))
END AS clean_company_name
FROM customers;
Database Dialect Note:
INITCAPisn't available in all SQL databases. In MySQL, use a combination ofUPPER(LEFT(column, 1))andLOWER(SUBSTRING(column, 2)). In SQL Server, consider creating a custom function or using multiple CASE statements.
Real-world scenarios often require extracting meaningful information from unstructured text. Let's parse those inconsistent phone numbers:
-- Phone number standardization using string functions
SELECT
customer_id,
phone AS original_phone,
-- Remove all non-numeric characters
REGEXP_REPLACE(phone, '[^0-9]', '') AS digits_only,
-- Format as (XXX) XXX-XXXX for 10-digit numbers
CASE
WHEN LENGTH(REGEXP_REPLACE(phone, '[^0-9]', '')) = 10 THEN
CONCAT('(',
SUBSTRING(REGEXP_REPLACE(phone, '[^0-9]', ''), 1, 3),
') ',
SUBSTRING(REGEXP_REPLACE(phone, '[^0-9]', ''), 4, 3),
'-',
SUBSTRING(REGEXP_REPLACE(phone, '[^0-9]', ''), 7, 4))
WHEN LENGTH(REGEXP_REPLACE(phone, '[^0-9]', '')) = 11 AND
LEFT(REGEXP_REPLACE(phone, '[^0-9]', ''), 1) = '1' THEN
CONCAT('(',
SUBSTRING(REGEXP_REPLACE(phone, '[^0-9]', ''), 2, 3),
') ',
SUBSTRING(REGEXP_REPLACE(phone, '[^0-9]', ''), 5, 3),
'-',
SUBSTRING(REGEXP_REPLACE(phone, '[^0-9]', ''), 8, 4))
ELSE 'INVALID_PHONE'
END AS formatted_phone
FROM customers;
Beyond cleaning, string functions help identify data quality issues. Here's how to build comprehensive validation checks:
-- Comprehensive data validation using string functions
WITH validation_checks AS (
SELECT
customer_id,
first_name,
last_name,
email,
phone,
-- Email validation using pattern matching
CASE
WHEN email ~ '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$'
THEN 'VALID'
ELSE 'INVALID'
END AS email_status,
-- Name validation (alphabetic characters and common punctuation)
CASE
WHEN first_name ~ '^[A-Za-z\s''-]+$'
THEN 'VALID'
ELSE 'INVALID'
END AS first_name_status,
-- Phone validation (after cleaning)
CASE
WHEN LENGTH(REGEXP_REPLACE(phone, '[^0-9]', '')) BETWEEN 10 AND 11
THEN 'VALID'
ELSE 'INVALID'
END AS phone_status
FROM customers
)
SELECT
customer_id,
first_name,
email,
phone,
email_status,
first_name_status,
phone_status,
-- Overall record quality score
CASE
WHEN email_status = 'VALID' AND first_name_status = 'VALID' AND phone_status = 'VALID'
THEN 'HIGH_QUALITY'
WHEN email_status = 'VALID' AND (first_name_status = 'VALID' OR phone_status = 'VALID')
THEN 'MEDIUM_QUALITY'
ELSE 'LOW_QUALITY'
END AS data_quality_tier
FROM validation_checks
ORDER BY data_quality_tier, customer_id;
Working with dates in SQL goes far beyond simple date arithmetic. Let's explore a sales analytics scenario where you need to handle multiple date formats, time zones, and complex business logic.
Consider this sales data with inconsistent date formats:
-- Sample sales data with date formatting challenges
SELECT * FROM sales_raw LIMIT 5;
/*
order_id | customer_id | order_date_str | ship_date_str | amount | region
1001 | 123 | 2024-03-15 | Mar 20, 2024 | 1250.00| Americas
1002 | 456 | 15/03/2024 | 2024-03-18 | 890.50 | EMEA
1003 | 789 | March 15, 2024 | 03/20/24 | 2100.75| APAC
1004 | 234 | 2024-3-15 | 20240320 | 675.25 | Americas
1005 | 567 | 15-Mar-2024 | 2024/03/18 | 1450.00| EMEA
*/
Here's how to systematically parse and standardize these dates:
-- Comprehensive date parsing strategy
WITH parsed_dates AS (
SELECT
order_id,
customer_id,
amount,
region,
order_date_str,
ship_date_str,
-- Parse order_date_str using multiple patterns
CASE
-- ISO format: YYYY-MM-DD
WHEN order_date_str ~ '^\d{4}-\d{2}-\d{2}$'
THEN order_date_str::DATE
-- European format: DD/MM/YYYY
WHEN order_date_str ~ '^\d{1,2}/\d{1,2}/\d{4}$'
THEN TO_DATE(order_date_str, 'DD/MM/YYYY')
-- US text format: Month DD, YYYY
WHEN order_date_str ~ '^[A-Za-z]+ \d{1,2}, \d{4}$'
THEN TO_DATE(order_date_str, 'Month DD, YYYY')
-- Compact format: YYYY-M-DD
WHEN order_date_str ~ '^\d{4}-\d{1,2}-\d{1,2}$'
THEN order_date_str::DATE
-- European text: DD-Mon-YYYY
WHEN order_date_str ~ '^\d{1,2}-[A-Za-z]{3}-\d{4}$'
THEN TO_DATE(order_date_str, 'DD-Mon-YYYY')
ELSE NULL
END AS order_date,
-- Parse ship_date_str with similar logic
CASE
WHEN ship_date_str ~ '^[A-Za-z]+ \d{1,2}, \d{4}$'
THEN TO_DATE(ship_date_str, 'Month DD, YYYY')
WHEN ship_date_str ~ '^\d{4}-\d{2}-\d{2}$'
THEN ship_date_str::DATE
WHEN ship_date_str ~ '^\d{1,2}/\d{1,2}/\d{2}$'
THEN TO_DATE(ship_date_str, 'MM/DD/YY')
WHEN ship_date_str ~ '^\d{8}$'
THEN TO_DATE(ship_date_str, 'YYYYMMDD')
WHEN ship_date_str ~ '^\d{4}/\d{2}/\d{2}$'
THEN TO_DATE(ship_date_str, 'YYYY/MM/DD')
ELSE NULL
END AS ship_date
FROM sales_raw
)
SELECT
order_id,
customer_id,
order_date_str,
order_date,
ship_date_str,
ship_date,
amount,
region,
-- Data quality flags
CASE WHEN order_date IS NULL THEN 'PARSE_FAILED' ELSE 'OK' END AS order_date_status,
CASE WHEN ship_date IS NULL THEN 'PARSE_FAILED' ELSE 'OK' END AS ship_date_status,
-- Business logic validation
CASE
WHEN order_date IS NULL OR ship_date IS NULL THEN 'INCOMPLETE_DATES'
WHEN ship_date < order_date THEN 'INVALID_SEQUENCE'
WHEN ship_date - order_date > 30 THEN 'LONG_FULFILLMENT'
ELSE 'VALID'
END AS date_validation_status
FROM parsed_dates
ORDER BY order_date;
With clean dates, you can build sophisticated business intelligence queries. Here are patterns you'll use repeatedly:
-- Comprehensive sales analytics with advanced date functions
WITH sales_with_periods AS (
SELECT
order_id,
customer_id,
order_date,
ship_date,
amount,
region,
-- Date component extraction
EXTRACT(YEAR FROM order_date) AS order_year,
EXTRACT(MONTH FROM order_date) AS order_month,
EXTRACT(QUARTER FROM order_date) AS order_quarter,
EXTRACT(DOW FROM order_date) AS order_day_of_week, -- 0=Sunday
EXTRACT(WEEK FROM order_date) AS order_week,
-- Fiscal year calculation (assuming April 1 start)
CASE
WHEN EXTRACT(MONTH FROM order_date) >= 4
THEN EXTRACT(YEAR FROM order_date)
ELSE EXTRACT(YEAR FROM order_date) - 1
END AS fiscal_year,
-- Business day calculations
ship_date - order_date AS fulfillment_days,
-- Weekend/weekday classification
CASE
WHEN EXTRACT(DOW FROM order_date) IN (0, 6)
THEN 'Weekend'
ELSE 'Weekday'
END AS order_day_type,
-- Month-to-date, quarter-to-date flags
CASE
WHEN order_date >= DATE_TRUNC('month', CURRENT_DATE)
THEN 'MTD'
ELSE 'Prior'
END AS mtd_flag,
CASE
WHEN order_date >= DATE_TRUNC('quarter', CURRENT_DATE)
THEN 'QTD'
ELSE 'Prior'
END AS qtd_flag
FROM parsed_dates
WHERE order_date IS NOT NULL AND ship_date IS NOT NULL
)
SELECT
order_year,
order_quarter,
region,
order_day_type,
COUNT(*) AS order_count,
SUM(amount) AS total_revenue,
AVG(amount) AS avg_order_value,
AVG(fulfillment_days) AS avg_fulfillment_days,
MIN(order_date) AS first_order,
MAX(order_date) AS last_order,
-- Year-over-year growth calculation
LAG(SUM(amount)) OVER (
PARTITION BY order_quarter, region
ORDER BY order_year
) AS prior_year_revenue,
ROUND(
(SUM(amount) - LAG(SUM(amount)) OVER (
PARTITION BY order_quarter, region
ORDER BY order_year
)) / NULLIF(LAG(SUM(amount)) OVER (
PARTITION BY order_quarter, region
ORDER BY order_year
), 0) * 100, 2
) AS yoy_growth_pct
FROM sales_with_periods
GROUP BY order_year, order_quarter, region, order_day_type
ORDER BY order_year, order_quarter, region, order_day_type;
Modern applications often deal with global data requiring timezone awareness:
-- Working with timezone-aware data
WITH timezone_conversions AS (
SELECT
order_id,
customer_id,
region,
-- Original timestamps (assuming they're in local time)
order_timestamp_local,
-- Convert to UTC based on region
CASE
WHEN region = 'Americas' THEN
order_timestamp_local AT TIME ZONE 'America/New_York' AT TIME ZONE 'UTC'
WHEN region = 'EMEA' THEN
order_timestamp_local AT TIME ZONE 'Europe/London' AT TIME ZONE 'UTC'
WHEN region = 'APAC' THEN
order_timestamp_local AT TIME ZONE 'Asia/Singapore' AT TIME ZONE 'UTC'
ELSE order_timestamp_local AT TIME ZONE 'UTC'
END AS order_timestamp_utc,
amount
FROM sales_with_timestamps
),
global_sales_analysis AS (
SELECT
DATE_TRUNC('hour', order_timestamp_utc) AS order_hour_utc,
EXTRACT(HOUR FROM order_timestamp_utc) AS hour_of_day_utc,
region,
COUNT(*) AS orders_count,
SUM(amount) AS hourly_revenue
FROM timezone_conversions
WHERE order_timestamp_utc >= CURRENT_DATE - INTERVAL '7 days'
GROUP BY DATE_TRUNC('hour', order_timestamp_utc), EXTRACT(HOUR FROM order_timestamp_utc), region
)
SELECT
hour_of_day_utc,
region,
AVG(orders_count) AS avg_orders_per_hour,
AVG(hourly_revenue) AS avg_revenue_per_hour,
-- Peak hours identification
CASE
WHEN AVG(orders_count) >
AVG(AVG(orders_count)) OVER (PARTITION BY region) * 1.5
THEN 'PEAK'
ELSE 'NORMAL'
END AS hour_classification
FROM global_sales_analysis
GROUP BY hour_of_day_utc, region
ORDER BY region, hour_of_day_utc;
Real-world scenarios often require combining string and date manipulations. Here's a comprehensive example that builds a customer segmentation system:
-- Advanced customer segmentation using combined string and date functions
WITH customer_analysis AS (
SELECT
c.customer_id,
INITCAP(TRIM(c.first_name)) || ' ' || INITCAP(TRIM(c.last_name)) AS full_name,
LOWER(TRIM(c.email)) AS email,
c.registration_date,
-- Email domain extraction for B2B vs B2C classification
SUBSTRING(LOWER(TRIM(c.email)) FROM '@(.+)$') AS email_domain,
-- Account age calculation
CURRENT_DATE - c.registration_date AS account_age_days,
-- Registration cohort (month-year)
TO_CHAR(c.registration_date, 'YYYY-MM') AS registration_cohort,
-- Aggregate sales data
COUNT(s.order_id) AS total_orders,
COALESCE(SUM(s.amount), 0) AS total_spent,
COALESCE(AVG(s.amount), 0) AS avg_order_value,
MIN(s.order_date) AS first_order_date,
MAX(s.order_date) AS last_order_date,
-- Recency calculation
CURRENT_DATE - MAX(s.order_date) AS days_since_last_order
FROM customers c
LEFT JOIN sales s ON c.customer_id = s.customer_id
WHERE c.registration_date IS NOT NULL
GROUP BY c.customer_id, c.first_name, c.last_name, c.email, c.registration_date
),
customer_segments AS (
SELECT
customer_id,
full_name,
email,
email_domain,
registration_date,
registration_cohort,
account_age_days,
total_orders,
total_spent,
avg_order_value,
days_since_last_order,
-- Domain-based customer type classification
CASE
WHEN email_domain IN ('gmail.com', 'yahoo.com', 'hotmail.com', 'outlook.com',
'icloud.com', 'aol.com')
THEN 'B2C'
WHEN email_domain LIKE '%.edu'
THEN 'Education'
WHEN email_domain LIKE '%.gov'
THEN 'Government'
ELSE 'B2B'
END AS customer_type,
-- RFM Segmentation (Recency, Frequency, Monetary)
CASE
WHEN days_since_last_order IS NULL THEN 'Never Purchased'
WHEN days_since_last_order <= 30 THEN 'Recent'
WHEN days_since_last_order <= 90 THEN 'Active'
WHEN days_since_last_order <= 180 THEN 'At Risk'
ELSE 'Inactive'
END AS recency_segment,
CASE
WHEN total_orders = 0 THEN 'Never Purchased'
WHEN total_orders = 1 THEN 'One-time'
WHEN total_orders BETWEEN 2 AND 5 THEN 'Occasional'
WHEN total_orders BETWEEN 6 AND 15 THEN 'Regular'
ELSE 'Frequent'
END AS frequency_segment,
CASE
WHEN total_spent = 0 THEN 'No Spend'
WHEN total_spent < 100 THEN 'Low Value'
WHEN total_spent < 500 THEN 'Medium Value'
WHEN total_spent < 2000 THEN 'High Value'
ELSE 'Premium'
END AS monetary_segment,
-- Account lifecycle stage
CASE
WHEN account_age_days < 30 THEN 'New'
WHEN account_age_days < 90 THEN 'Growing'
WHEN account_age_days < 365 THEN 'Established'
ELSE 'Mature'
END AS lifecycle_stage
FROM customer_analysis
)
SELECT
registration_cohort,
customer_type,
lifecycle_stage,
recency_segment,
frequency_segment,
monetary_segment,
COUNT(*) AS customer_count,
AVG(total_spent) AS avg_customer_value,
AVG(account_age_days) AS avg_account_age,
SUM(total_spent) AS cohort_total_value,
-- Cohort performance metrics
ROUND(
COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (PARTITION BY registration_cohort),
2
) AS segment_pct_of_cohort,
-- Generate actionable customer segment labels
CONCAT(
customer_type, ' - ',
lifecycle_stage, ' - ',
recency_segment, ' - ',
frequency_segment, ' - ',
monetary_segment
) AS detailed_segment
FROM customer_segments
GROUP BY registration_cohort, customer_type, lifecycle_stage,
recency_segment, frequency_segment, monetary_segment
HAVING COUNT(*) >= 10 -- Only show segments with meaningful size
ORDER BY registration_cohort DESC, customer_count DESC;
Now let's put everything together. You'll build a comprehensive data quality monitoring system that combines string and date functions to assess and report on data health across multiple dimensions.
First, create a mock dataset that represents typical data quality challenges:
-- Create sample data with intentional quality issues
WITH sample_data AS (
SELECT
ROW_NUMBER() OVER() as record_id,
first_name,
last_name,
email,
phone,
birth_date_str,
registration_timestamp_str,
last_login_str,
country
FROM VALUES
('john', 'SMITH', 'John.Smith@GMAIL.com', '(555) 123-4567', '1985-06-15', '2024-01-15 10:30:00', '2024-03-10 14:22:33', 'USA'),
('', 'johnson', 'mary.j@company.co.uk', '555.123.4567', '15/06/1990', '15-Jan-2024 10:30', '2024-03-09', 'UK'),
('ROBERT', ' Lee ', 'bob@personal-email.net', '+1-555-123-4567', 'June 15, 1988', '2024-01-15T10:30:00Z', '2024/03/08 2:15 PM', 'USA'),
('sarah', 'O''Connor', 'not-an-email', '5551234567', '1992-6-15', '2024-01-15', NULL, ''),
(NULL, 'Brown', 'j.brown+work@email.com', '(555)123-4567', '15-Jun-1985', 'January 15, 2024', '2024-03-07T14:22:33', 'Canada'),
(' Jennifer ', '', 'jennifer@work', '', '1990/06/15', '2024/01/15 10:30:00', '2024-03-06 14:22', 'USA'),
('Mike', 'Davis', 'mike.davis@company.com', '555-123-4567', '', '2024-01-20 10:30:00', '2024-03-05', 'Mexico')
AS t(first_name, last_name, email, phone, birth_date_str, registration_timestamp_str, last_login_str, country)
)
SELECT * FROM sample_data;
Build a comprehensive data quality assessment query that:
Here's the solution framework to get you started:
-- Complete data quality assessment solution
WITH data_quality_analysis AS (
SELECT
record_id,
-- Original fields
first_name, last_name, email, phone, birth_date_str,
registration_timestamp_str, last_login_str, country,
-- String standardization and validation
CASE
WHEN TRIM(COALESCE(first_name, '')) = '' THEN NULL
ELSE INITCAP(TRIM(first_name))
END AS clean_first_name,
CASE
WHEN TRIM(COALESCE(last_name, '')) = '' THEN NULL
ELSE INITCAP(TRIM(last_name))
END AS clean_last_name,
LOWER(TRIM(email)) AS clean_email,
-- Email validation
CASE
WHEN email ~ '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$'
THEN 'VALID'
WHEN TRIM(COALESCE(email, '')) = ''
THEN 'MISSING'
ELSE 'INVALID'
END AS email_quality,
-- Phone standardization
REGEXP_REPLACE(COALESCE(phone, ''), '[^0-9]', '') AS clean_phone_digits,
CASE
WHEN LENGTH(REGEXP_REPLACE(COALESCE(phone, ''), '[^0-9]', '')) = 10
THEN 'VALID'
WHEN TRIM(COALESCE(phone, '')) = ''
THEN 'MISSING'
ELSE 'INVALID'
END AS phone_quality,
-- Date parsing with multiple format support
CASE
WHEN birth_date_str ~ '^\d{4}-\d{2}-\d{2}$'
THEN birth_date_str::DATE
WHEN birth_date_str ~ '^\d{1,2}/\d{1,2}/\d{4}$'
THEN TO_DATE(birth_date_str, 'DD/MM/YYYY')
WHEN birth_date_str ~ '^[A-Za-z]+ \d{1,2}, \d{4}$'
THEN TO_DATE(birth_date_str, 'Month DD, YYYY')
WHEN birth_date_str ~ '^\d{4}-\d{1,2}-\d{1,2}$'
THEN birth_date_str::DATE
WHEN birth_date_str ~ '^\d{1,2}-[A-Za-z]{3}-\d{4}$'
THEN TO_DATE(birth_date_str, 'DD-Mon-YYYY')
WHEN birth_date_str ~ '^\d{4}/\d{2}/\d{2}$'
THEN TO_DATE(birth_date_str, 'YYYY/MM/DD')
ELSE NULL
END AS parsed_birth_date,
-- Continue with timestamp parsing...
-- [Add your timestamp parsing logic here]
-- Country standardization
CASE
WHEN UPPER(TRIM(COALESCE(country, ''))) IN ('USA', 'US', 'UNITED STATES') THEN 'USA'
WHEN UPPER(TRIM(COALESCE(country, ''))) IN ('UK', 'UNITED KINGDOM', 'ENGLAND') THEN 'UK'
WHEN UPPER(TRIM(COALESCE(country, ''))) IN ('CANADA', 'CA') THEN 'Canada'
WHEN UPPER(TRIM(COALESCE(country, ''))) IN ('MEXICO', 'MX') THEN 'Mexico'
WHEN TRIM(COALESCE(country, '')) = '' THEN NULL
ELSE TRIM(country)
END AS clean_country
FROM sample_data
),
quality_scores AS (
SELECT
*,
-- Individual field quality scores (0-100)
CASE
WHEN clean_first_name IS NOT NULL THEN 100
ELSE 0
END AS first_name_score,
CASE
WHEN clean_last_name IS NOT NULL THEN 100
ELSE 0
END AS last_name_score,
CASE
WHEN email_quality = 'VALID' THEN 100
WHEN email_quality = 'MISSING' THEN 0
ELSE 25 -- Invalid format gets partial credit
END AS email_score,
CASE
WHEN phone_quality = 'VALID' THEN 100
WHEN phone_quality = 'MISSING' THEN 0
ELSE 25
END AS phone_score,
CASE
WHEN parsed_birth_date IS NOT NULL AND
parsed_birth_date BETWEEN '1900-01-01' AND CURRENT_DATE - INTERVAL '13 years'
THEN 100
WHEN TRIM(COALESCE(birth_date_str, '')) = ''
THEN 0
ELSE 25
END AS birth_date_score,
CASE
WHEN clean_country IS NOT NULL THEN 100
ELSE 50 -- Missing country gets partial score
END AS country_score
FROM data_quality_analysis
)
SELECT
record_id,
clean_first_name,
clean_last_name,
clean_email,
email_quality,
phone_quality,
parsed_birth_date,
clean_country,
-- Overall quality score
ROUND(
(first_name_score + last_name_score + email_score +
phone_score + birth_date_score + country_score) / 6.0, 1
) AS overall_quality_score,
-- Quality tier classification
CASE
WHEN (first_name_score + last_name_score + email_score +
phone_score + birth_date_score + country_score) / 6.0 >= 90
THEN 'EXCELLENT'
WHEN (first_name_score + last_name_score + email_score +
phone_score + birth_date_score + country_score) / 6.0 >= 75
THEN 'GOOD'
WHEN (first_name_score + last_name_score + email_score +
phone_score + birth_date_score + country_score) / 6.0 >= 50
THEN 'FAIR'
ELSE 'POOR'
END AS quality_tier,
-- Specific improvement recommendations
ARRAY_TO_STRING(
ARRAY[
CASE WHEN first_name_score < 100 THEN 'Add first name' END,
CASE WHEN last_name_score < 100 THEN 'Add last name' END,
CASE WHEN email_score < 100 THEN 'Fix email format' END,
CASE WHEN phone_score < 100 THEN 'Fix phone format' END,
CASE WHEN birth_date_score < 100 THEN 'Fix birth date' END,
CASE WHEN country_score < 100 THEN 'Add country' END
]::TEXT[],
'; '
) AS improvement_recommendations
FROM quality_scores
ORDER BY overall_quality_score DESC, record_id;
NULL Handling Issues The most common mistake is forgetting how string functions behave with NULL values. Consider this problematic code:
-- WRONG: This will return NULL for any NULL input
SELECT UPPER(first_name) || ' ' || UPPER(last_name) AS full_name
FROM customers;
-- BETTER: Handle NULLs explicitly
SELECT
COALESCE(UPPER(TRIM(first_name)), 'UNKNOWN') || ' ' ||
COALESCE(UPPER(TRIM(last_name)), 'UNKNOWN') AS full_name
FROM customers;
Performance Problems with LIKE and Regular Expressions String pattern matching can be expensive without proper indexing:
-- SLOW: Full table scan with leading wildcard
SELECT * FROM customers WHERE email LIKE '%@gmail.com';
-- FASTER: Use functional index or restructure query
-- CREATE INDEX idx_email_domain ON customers (SUBSTRING(email FROM '@(.+)$'));
SELECT * FROM customers
WHERE SUBSTRING(email FROM '@(.+)$') = 'gmail.com';
Character Encoding and Collation Issues Be aware of how different databases handle character comparisons:
-- May not work as expected with accented characters
SELECT * FROM customers WHERE UPPER(last_name) = 'MÜLLER';
-- Better approach using collation-aware comparison
SELECT * FROM customers
WHERE last_name COLLATE "en_US.UTF-8" ILIKE 'müller';
Time Zone Confusion Always be explicit about time zones when working with timestamps:
-- AMBIGUOUS: What timezone is this?
SELECT * FROM orders WHERE created_at >= '2024-01-01 00:00:00';
-- CLEAR: Explicitly handle timezone
SELECT * FROM orders
WHERE created_at AT TIME ZONE 'UTC' >= '2024-01-01 00:00:00'::TIMESTAMP AT TIME ZONE 'UTC';
Date Range Boundary Issues Be careful with date comparisons that might miss records:
-- WRONG: Misses records from 2024-01-31
SELECT * FROM orders
WHERE DATE(created_at) BETWEEN '2024-01-01' AND '2024-01-31';
-- BETTER: Use proper timestamp ranges
SELECT * FROM orders
WHERE created_at >= '2024-01-01 00:00:00'
AND created_at < '2024-02-01 00:00:00';
Leap Year and DST Calculations Date arithmetic can be tricky around edge cases:
-- Can fail on leap years
SELECT DATE '2024-02-29' + INTERVAL '1 year'; -- Results in 2025-02-28
-- More robust approach for anniversary calculations
SELECT DATE '2024-02-29' +
(INTERVAL '1 year' - INTERVAL '1 day') +
INTERVAL '1 day';
Different SQL databases have varying function syntax. Here's how to handle common variations:
-- String concatenation varies by database
-- PostgreSQL: 'Hello' || ' ' || 'World'
-- MySQL: CONCAT('Hello', ' ', 'World')
-- SQL Server: 'Hello' + ' ' + 'World' or CONCAT('Hello', ' ', 'World')
-- Date formatting differences
-- PostgreSQL: TO_CHAR(date_col, 'YYYY-MM-DD')
-- MySQL: DATE_FORMAT(date_col, '%Y-%m-%d')
-- SQL Server: FORMAT(date_col, 'yyyy-MM-dd')
Testing Strategy: Always test your string and date functions with edge cases: NULL values, empty strings, leap years, time zone boundaries, and maximum/minimum date values. Create unit tests that verify behavior across these scenarios.
You've now mastered the essential string and date functions that form the backbone of data cleaning and transformation work. These skills enable you to handle the messy realities of production data systems with confidence.
Key takeaways:
Immediate next steps:
Advanced topics to explore:
The functions you've learned here aren't just syntax—they're problem-solving tools that will serve you throughout your data career. Start applying them to real scenarios immediately, and you'll quickly discover additional patterns and optimizations that make your data work more effective and reliable.
Learning Path: SQL Fundamentals