You're analyzing a customer database when you discover that critical fields are missing values. Customer records have NULL email addresses, NULL phone numbers, and NULL registration dates. Your dashboard is showing misleading metrics because NULL values are being excluded from calculations. Your data pipeline is failing because downstream systems can't handle NULLs properly.
Welcome to one of the most fundamental challenges in data work: NULL handling. NULLs aren't just "missing data" — they represent unknown, undefined, or inapplicable values that can derail your analysis if you don't handle them correctly. Unlike empty strings or zeros, NULLs have unique behavior that trips up even experienced analysts.
In this lesson, you'll master the three essential SQL functions for NULL handling: IS NULL for detection, COALESCE for substitution, and NULLIF for controlled NULL creation. You'll learn when to use each function, how to combine them for complex data cleaning scenarios, and how to avoid the subtle bugs that NULL handling can introduce.
What you'll learn:
You should be comfortable with basic SQL SELECT statements, WHERE clauses, and understand the concept of NULL values. Familiarity with aggregate functions (COUNT, SUM, AVG) will help you understand the examples, but isn't required.
Before diving into NULL handling functions, you need to understand how NULLs behave differently from other values. This isn't academic — these behaviors cause real bugs in production systems.
NULL represents "unknown" or "not applicable," which means it has three-valued logic: true, false, or unknown. This creates counterintuitive behavior:
-- These all return NULL, not true or false
SELECT NULL = NULL; -- NULL (not true!)
SELECT NULL <> 'hello'; -- NULL (not true!)
SELECT 5 + NULL; -- NULL
SELECT 'Hello' || NULL; -- NULL
This behavior affects your WHERE clauses in ways that catch developers off-guard:
-- Sample customer data
CREATE TABLE customers (
customer_id INT PRIMARY KEY,
name VARCHAR(100),
email VARCHAR(100),
phone VARCHAR(20),
loyalty_tier VARCHAR(20)
);
INSERT INTO customers VALUES
(1, 'Sarah Johnson', 'sarah@example.com', '555-0101', 'Gold'),
(2, 'Mike Chen', 'mike@example.com', NULL, 'Silver'),
(3, 'Anna Rodriguez', NULL, '555-0103', 'Gold'),
(4, 'David Kim', 'david@example.com', '555-0104', NULL),
(5, 'Lisa Wong', NULL, NULL, 'Bronze');
-- This query excludes customers with NULL phones
SELECT name, phone FROM customers WHERE phone <> '555-0101';
-- Returns: Sarah Johnson, Anna Rodriguez, David Kim
-- Missing: Mike Chen, Lisa Wong (their NULLs didn't match the <> condition)
The customers with NULL phone numbers disappear from your results because NULL <> '555-0101' evaluates to NULL, which is treated as false in WHERE clauses.
Critical insight: NULL comparisons always return NULL, never true or false. This means standard comparison operators won't work for finding or excluding NULL values.
The IS NULL and IS NOT NULL operators are your primary tools for detecting missing data. Unlike comparison operators, these return true or false, not NULL.
-- Find customers missing email addresses
SELECT customer_id, name, email
FROM customers
WHERE email IS NULL;
-- Results:
-- 3, Anna Rodriguez, NULL
-- 5, Lisa Wong, NULL
-- Find customers missing both email AND phone
SELECT customer_id, name, email, phone
FROM customers
WHERE email IS NULL AND phone IS NULL;
-- Results:
-- 5, Lisa Wong, NULL, NULL
IS NULL becomes powerful when you're auditing data quality across multiple fields:
-- Comprehensive data quality check
SELECT
customer_id,
name,
CASE WHEN email IS NULL THEN 'Missing Email' ELSE 'Has Email' END as email_status,
CASE WHEN phone IS NULL THEN 'Missing Phone' ELSE 'Has Phone' END as phone_status,
CASE WHEN loyalty_tier IS NULL THEN 'No Tier' ELSE loyalty_tier END as tier_status
FROM customers
ORDER BY customer_id;
For large datasets, you'll want aggregate views of missing data:
-- Summary of missing data by field
SELECT
COUNT(*) as total_customers,
COUNT(email) as has_email,
COUNT(*) - COUNT(email) as missing_email,
ROUND((COUNT(*) - COUNT(email)) * 100.0 / COUNT(*), 1) as pct_missing_email,
COUNT(phone) as has_phone,
COUNT(*) - COUNT(phone) as missing_phone,
ROUND((COUNT(*) - COUNT(phone)) * 100.0 / COUNT(*), 1) as pct_missing_phone
FROM customers;
-- Results show: 5 total, 3 have email (40% missing), 3 have phone (40% missing)
Pro tip: The COUNT() function automatically ignores NULL values, which is why
COUNT(email)only counts non-NULL emails. UseCOUNT(*)to count all rows including those with NULLs.
When you want complete records only:
-- Customers with complete contact information
SELECT customer_id, name, email, phone, loyalty_tier
FROM customers
WHERE email IS NOT NULL
AND phone IS NOT NULL
AND loyalty_tier IS NOT NULL;
-- Results:
-- 1, Sarah Johnson, sarah@example.com, 555-0101, Gold
COALESCE returns the first non-NULL value from a list of expressions. It's your go-to function for providing fallback values and cleaning up messy data displays.
-- Provide fallback values for missing contact info
SELECT
customer_id,
name,
COALESCE(email, 'No email provided') as email_display,
COALESCE(phone, 'No phone provided') as phone_display,
COALESCE(loyalty_tier, 'Standard') as tier_display
FROM customers;
-- Results include:
-- 2, Mike Chen, mike@example.com, No phone provided, Silver
-- 3, Anna Rodriguez, No email provided, 555-0103, Gold
-- 5, Lisa Wong, No email provided, No phone provided, Bronze
COALESCE really shines when you have multiple potential sources for the same information:
-- Extended customer table with backup contact methods
CREATE TABLE customer_contacts (
customer_id INT,
primary_email VARCHAR(100),
backup_email VARCHAR(100),
work_email VARCHAR(100),
primary_phone VARCHAR(20),
mobile_phone VARCHAR(20),
emergency_contact VARCHAR(20)
);
INSERT INTO customer_contacts VALUES
(1, 'sarah@example.com', NULL, 'sarah@work.com', '555-0101', '555-1001', '555-2001'),
(2, NULL, 'mike.backup@example.com', NULL, NULL, '555-1002', '555-2002'),
(3, NULL, NULL, NULL, '555-0103', NULL, '555-2003'),
(4, 'david@example.com', 'david.alt@example.com', 'david@work.com', NULL, NULL, NULL);
-- Find the best available contact method for each customer
SELECT
customer_id,
COALESCE(primary_email, backup_email, work_email, 'No email available') as best_email,
COALESCE(primary_phone, mobile_phone, emergency_contact, 'No phone available') as best_phone
FROM customer_contacts;
-- Results:
-- 1, sarah@example.com, 555-0101
-- 2, mike.backup@example.com, 555-1002
-- 3, No email available, 555-0103
-- 4, david@example.com, No phone available
COALESCE becomes critical when you're doing calculations with potentially NULL values:
-- Order data with potential NULL values
CREATE TABLE orders (
order_id INT,
customer_id INT,
base_amount DECIMAL(10,2),
tax_amount DECIMAL(10,2),
shipping_cost DECIMAL(10,2),
discount_amount DECIMAL(10,2)
);
INSERT INTO orders VALUES
(1001, 1, 100.00, 8.50, 15.00, NULL),
(1002, 2, 250.00, NULL, NULL, 25.00),
(1003, 3, 75.00, 6.25, 10.00, 5.00),
(1004, 4, 500.00, 42.50, NULL, NULL);
-- Calculate total order amount, treating NULLs as zero
SELECT
order_id,
base_amount,
COALESCE(tax_amount, 0) as tax_amount,
COALESCE(shipping_cost, 0) as shipping_cost,
COALESCE(discount_amount, 0) as discount_amount,
base_amount +
COALESCE(tax_amount, 0) +
COALESCE(shipping_cost, 0) -
COALESCE(discount_amount, 0) as total_amount
FROM orders;
-- Results:
-- 1001, 100.00, 8.50, 15.00, 0.00, 123.50
-- 1002, 250.00, 0.00, 0.00, 25.00, 225.00
-- 1003, 75.00, 6.25, 10.00, 5.00, 86.25
-- 1004, 500.00, 42.50, 0.00, 0.00, 542.50
Without COALESCE, any NULL value would make the entire calculation NULL:
-- This fails - returns NULL for orders with any NULL components
SELECT
order_id,
base_amount + tax_amount + shipping_cost - discount_amount as total_amount
FROM orders;
-- Results:
-- 1001, NULL (because discount_amount is NULL)
-- 1002, NULL (because tax_amount and shipping_cost are NULL)
-- 1003, 86.25 (only complete record)
-- 1004, NULL (because shipping_cost and discount_amount are NULL)
NULLIF returns NULL if two expressions are equal, otherwise returns the first expression. This might seem backwards, but it's incredibly useful for data cleaning and handling edge cases.
NULLIF excels at converting "placeholder" values into proper NULLs:
-- Raw data import with placeholder values
CREATE TABLE imported_customers (
customer_id INT,
name VARCHAR(100),
email VARCHAR(100),
phone VARCHAR(20),
age INT,
income DECIMAL(10,2)
);
INSERT INTO imported_customers VALUES
(1, 'Sarah Johnson', 'sarah@example.com', '555-0101', 28, 65000.00),
(2, 'Mike Chen', 'N/A', 'Unknown', 0, -1.00),
(3, 'Anna Rodriguez', '', '555-0103', 35, 0.00),
(4, 'David Kim', 'david@example.com', '555-0104', -999, 75000.00),
(5, 'Lisa Wong', 'TBD', 'NULL', 25, 999999.99);
-- Clean up the data by converting placeholders to NULL
SELECT
customer_id,
name,
NULLIF(NULLIF(NULLIF(email, 'N/A'), ''), 'TBD') as email,
NULLIF(NULLIF(phone, 'Unknown'), 'NULL') as phone,
NULLIF(NULLIF(age, 0), -999) as age,
NULLIF(NULLIF(income, -1.00), 999999.99) as income
FROM imported_customers;
-- Results:
-- 1, Sarah Johnson, sarah@example.com, 555-0101, 28, 65000.00
-- 2, Mike Chen, NULL, NULL, NULL, NULL
-- 3, Anna Rodriguez, NULL, 555-0103, 35, 0.00
-- 4, David Kim, david@example.com, 555-0104, NULL, 75000.00
-- 5, Lisa Wong, NULL, NULL, 25, NULL
You can chain NULLIF calls to handle multiple placeholder values, but a CASE statement is often clearer for complex scenarios:
-- More readable approach for multiple placeholder values
SELECT
customer_id,
name,
CASE
WHEN email IN ('N/A', '', 'TBD', 'Unknown') THEN NULL
ELSE email
END as email,
CASE
WHEN phone IN ('Unknown', 'NULL', '', 'N/A') THEN NULL
ELSE phone
END as phone,
CASE
WHEN age <= 0 OR age = -999 THEN NULL
ELSE age
END as age
FROM imported_customers;
NULLIF prevents division by zero errors by converting zero denominators to NULL:
-- Sales performance metrics
CREATE TABLE sales_reps (
rep_id INT,
rep_name VARCHAR(100),
calls_made INT,
meetings_scheduled INT,
deals_closed INT
);
INSERT INTO sales_reps VALUES
(1, 'Alex Turner', 150, 45, 12),
(2, 'Jessica Park', 200, 60, 18),
(3, 'Robert Davis', 0, 0, 0), -- New hire, no activity yet
(4, 'Maria Garcia', 175, 0, 0), -- Made calls but no meetings
(5, 'James Wilson', 125, 35, 8);
-- Calculate conversion rates, avoiding division by zero
SELECT
rep_name,
calls_made,
meetings_scheduled,
deals_closed,
ROUND(meetings_scheduled * 100.0 / NULLIF(calls_made, 0), 1) as meeting_conversion_rate,
ROUND(deals_closed * 100.0 / NULLIF(meetings_scheduled, 0), 1) as close_rate
FROM sales_reps;
-- Results:
-- Alex Turner, 150, 45, 12, 30.0, 26.7
-- Jessica Park, 200, 60, 18, 30.0, 30.0
-- Robert Davis, 0, 0, 0, NULL, NULL
-- Maria Garcia, 175, 0, 0, 0.0, NULL
-- James Wilson, 125, 35, 8, 28.0, 22.9
Without NULLIF, you'd get division by zero errors that crash your query.
NULLIF helps identify and clean invalid data patterns:
-- Product data with potential quality issues
CREATE TABLE products (
product_id INT,
product_name VARCHAR(100),
category VARCHAR(50),
price DECIMAL(10,2),
weight_kg DECIMAL(8,3)
);
INSERT INTO products VALUES
(1, 'Laptop Computer', 'Electronics', 999.99, 2.500),
(2, 'Coffee Mug', 'Kitchenware', 15.99, 0.350),
(3, 'Digital Camera', 'Electronics', 0.00, 0.000), -- Price/weight errors
(4, 'Notebook', 'Office', 5.99, 0.200),
(5, 'Smartphone', 'Electronics', 599.99, 0.000); -- Weight error
-- Identify products with data quality issues
SELECT
product_id,
product_name,
price,
weight_kg,
CASE
WHEN NULLIF(price, 0.00) IS NULL THEN 'Missing Price'
WHEN NULLIF(weight_kg, 0.000) IS NULL THEN 'Missing Weight'
ELSE 'OK'
END as data_quality_flag
FROM products;
-- Results flag products 3 and 5 as having data quality issues
Real-world data cleaning often requires combining these functions to handle complex scenarios. Here are patterns you'll use repeatedly in production.
When you have multiple potential sources and want to establish a clear priority:
-- Customer contact preferences with fallback logic
CREATE TABLE customer_contact_prefs (
customer_id INT,
preferred_email VARCHAR(100),
work_email VARCHAR(100),
personal_email VARCHAR(100),
sms_number VARCHAR(20),
home_phone VARCHAR(20),
contact_preference VARCHAR(20)
);
INSERT INTO customer_contact_prefs VALUES
(1, 'sarah.pref@example.com', 'sarah@work.com', 'sarah.personal@gmail.com', '555-1001', '555-2001', 'email'),
(2, NULL, NULL, 'mike@gmail.com', '555-1002', NULL, 'sms'),
(3, NULL, 'anna@work.com', NULL, NULL, '555-2003', 'phone'),
(4, 'david@example.com', NULL, NULL, NULL, NULL, 'email'),
(5, '', '', '', '', '', 'email'); -- Empty strings need cleaning
-- Determine best contact method based on preference and availability
SELECT
customer_id,
contact_preference,
CASE contact_preference
WHEN 'email' THEN
COALESCE(
NULLIF(preferred_email, ''),
NULLIF(work_email, ''),
NULLIF(personal_email, ''),
'No email available'
)
WHEN 'sms' THEN
COALESCE(
NULLIF(sms_number, ''),
NULLIF(home_phone, ''),
'No phone available'
)
WHEN 'phone' THEN
COALESCE(
NULLIF(home_phone, ''),
NULLIF(sms_number, ''),
'No phone available'
)
ELSE 'Unknown preference'
END as contact_method
FROM customer_contact_prefs;
-- Results:
-- 1, email, sarah.pref@example.com
-- 2, sms, 555-1002
-- 3, phone, 555-2003
-- 4, email, david@example.com
-- 5, email, No email available
Combining all three functions to clean and standardize messy data:
-- Survey responses with inconsistent data entry
CREATE TABLE survey_responses (
response_id INT,
age_group VARCHAR(20),
income_range VARCHAR(30),
satisfaction_score VARCHAR(10)
);
INSERT INTO survey_responses VALUES
(1, '25-34', '$50,000-$75,000', '8'),
(2, 'N/A', 'Prefer not to say', '10'),
(3, '', '$25,000-$50,000', 'N/A'),
(4, '35-44', '0', '7'),
(5, 'Unknown', 'NULL', ''),
(6, '45-54', '$75,000-$100,000', '99'); -- Invalid satisfaction score
-- Standardize and clean the survey data
SELECT
response_id,
-- Clean age_group: convert various "unknown" values to NULL
CASE
WHEN NULLIF(TRIM(age_group), '') IN ('N/A', 'Unknown', 'NULL') THEN NULL
ELSE NULLIF(TRIM(age_group), '')
END as age_group_clean,
-- Clean income_range: standardize "prefer not to say" and invalid entries
CASE
WHEN NULLIF(TRIM(income_range), '') IN ('Prefer not to say', 'NULL', '0') THEN NULL
ELSE NULLIF(TRIM(income_range), '')
END as income_range_clean,
-- Clean satisfaction_score: convert to integer, handle invalid values
CASE
WHEN NULLIF(TRIM(satisfaction_score), '') IN ('N/A', 'NULL') THEN NULL
WHEN CAST(NULLIF(TRIM(satisfaction_score), '') AS INT) > 10 THEN NULL
ELSE CAST(NULLIF(TRIM(satisfaction_score), '') AS INT)
END as satisfaction_score_clean,
-- Provide a data quality assessment
CASE
WHEN COALESCE(
NULLIF(TRIM(age_group), ''),
NULLIF(TRIM(income_range), ''),
NULLIF(TRIM(satisfaction_score), '')
) IS NULL THEN 'Incomplete Response'
ELSE 'Complete Response'
END as response_quality
FROM survey_responses;
Let's build a comprehensive data cleanup pipeline that demonstrates all the NULL handling techniques in a realistic scenario. You're working with a customer database that's been populated from multiple sources with inconsistent data entry practices.
-- Create the messy customer dataset
CREATE TABLE raw_customer_data (
import_id INT PRIMARY KEY,
customer_name VARCHAR(100),
email_address VARCHAR(100),
phone_number VARCHAR(20),
date_of_birth DATE,
annual_income DECIMAL(12,2),
employment_status VARCHAR(30),
credit_score INT,
last_login_date DATE,
account_balance DECIMAL(12,2),
marketing_opt_in VARCHAR(10)
);
-- Insert messy data that represents real-world import scenarios
INSERT INTO raw_customer_data VALUES
-- Good records
(1, 'Sarah Johnson', 'sarah.johnson@email.com', '555-123-4567', '1985-03-15', 65000.00, 'Employed', 750, '2023-12-01', 1250.50, 'Yes'),
(2, 'Michael Chen', 'mike.chen@gmail.com', '555-987-6543', '1990-07-22', 82000.00, 'Self-employed', 720, '2023-11-28', -150.75, 'No'),
-- Records with various NULL/empty patterns
(3, 'Anna Rodriguez', '', '555-555-0123', NULL, NULL, 'Unknown', 0, NULL, 0.00, ''),
(4, 'David Kim', 'david@example.com', 'N/A', '1988-12-03', -1.00, 'Retired', 999, '1900-01-01', 999999.99, 'TRUE'),
(5, '', 'lisa.wong@email.com', 'NULL', '2000-01-01', 0.00, '', 680, NULL, NULL, 'FALSE'),
(6, 'Robert Taylor', 'INVALID', '000-000-0000', '1975-05-18', 95000.00, 'Employed', 800, '2023-12-02', 2500.00, 'Y'),
(7, 'Jennifer White', 'jenny@email.com', '555-111-2222', NULL, 55000.00, 'Student', -999, '2023-11-30', 0, 'N'),
(8, 'TBD', 'TBD', 'TBD', '1980-01-01', 999999.00, 'TBD', 500, '2023-12-01', -999999.99, 'TBD');
-- Your task: Create a comprehensive cleanup query
SELECT
import_id,
-- Step 1: Clean customer names
CASE
WHEN NULLIF(TRIM(customer_name), '') IN ('TBD', 'Unknown', 'NULL') THEN NULL
ELSE NULLIF(TRIM(customer_name), '')
END as clean_customer_name,
-- Step 2: Validate and clean email addresses (basic validation)
CASE
WHEN NULLIF(TRIM(email_address), '') IN ('TBD', 'Unknown', 'NULL', 'INVALID', 'N/A') THEN NULL
WHEN NULLIF(TRIM(email_address), '') NOT LIKE '%@%' THEN NULL
ELSE NULLIF(TRIM(email_address), '')
END as clean_email,
-- Step 3: Clean phone numbers
CASE
WHEN NULLIF(TRIM(phone_number), '') IN ('TBD', 'Unknown', 'NULL', 'N/A', '000-000-0000') THEN NULL
ELSE NULLIF(TRIM(phone_number), '')
END as clean_phone,
-- Step 4: Handle invalid dates (assuming anything before 1930 or after today is invalid)
CASE
WHEN date_of_birth < '1930-01-01' THEN NULL
WHEN date_of_birth > CURRENT_DATE THEN NULL
ELSE date_of_birth
END as clean_date_of_birth,
-- Step 5: Clean income (negative values and obvious placeholders become NULL)
NULLIF(NULLIF(NULLIF(annual_income, -1.00), 0.00), 999999.00) as clean_annual_income,
-- Step 6: Standardize employment status
CASE
WHEN NULLIF(TRIM(employment_status), '') IN ('TBD', 'Unknown', 'NULL', '') THEN NULL
ELSE NULLIF(TRIM(employment_status), '')
END as clean_employment_status,
-- Step 7: Clean credit scores (0, negative, or obvious placeholders)
CASE
WHEN credit_score <= 0 OR credit_score = 999 OR credit_score = -999 THEN NULL
WHEN credit_score > 850 THEN NULL -- Scores typically max at 850
ELSE credit_score
END as clean_credit_score,
-- Step 8: Handle invalid login dates
CASE
WHEN last_login_date < '2020-01-01' THEN NULL -- Assume system started in 2020
WHEN last_login_date > CURRENT_DATE THEN NULL
ELSE last_login_date
END as clean_last_login,
-- Step 9: Clean account balances (remove obvious placeholders)
NULLIF(NULLIF(annual_income, 999999.99), -999999.99) as clean_account_balance,
-- Step 10: Standardize boolean opt-in values
CASE
WHEN UPPER(TRIM(marketing_opt_in)) IN ('YES', 'TRUE', 'Y', '1') THEN 'Yes'
WHEN UPPER(TRIM(marketing_opt_in)) IN ('NO', 'FALSE', 'N', '0') THEN 'No'
ELSE NULL
END as clean_marketing_opt_in,
-- Add data quality indicators
CASE
WHEN NULLIF(TRIM(customer_name), '') IS NULL THEN 'Missing Name'
WHEN NULLIF(TRIM(email_address), '') IS NULL THEN 'Missing Email'
ELSE 'Complete Core Data'
END as data_quality_flag,
-- Count of NULL fields after cleaning
(CASE WHEN NULLIF(TRIM(customer_name), '') IN ('TBD', 'Unknown', 'NULL') OR NULLIF(TRIM(customer_name), '') IS NULL THEN 1 ELSE 0 END +
CASE WHEN NULLIF(TRIM(email_address), '') IN ('TBD', 'Unknown', 'NULL', 'INVALID', 'N/A') OR NULLIF(TRIM(email_address), '') NOT LIKE '%@%' OR NULLIF(TRIM(email_address), '') IS NULL THEN 1 ELSE 0 END +
CASE WHEN NULLIF(TRIM(phone_number), '') IN ('TBD', 'Unknown', 'NULL', 'N/A', '000-000-0000') OR NULLIF(TRIM(phone_number), '') IS NULL THEN 1 ELSE 0 END
) as missing_contact_fields
FROM raw_customer_data
ORDER BY import_id;
Now, create a summary report showing the cleanup results:
-- Cleanup effectiveness summary
WITH cleaned_data AS (
-- [Insert the cleanup query from above here - abbreviated for space]
SELECT import_id,
CASE WHEN NULLIF(TRIM(customer_name), '') IN ('TBD', 'Unknown', 'NULL') THEN NULL ELSE NULLIF(TRIM(customer_name), '') END as clean_customer_name,
CASE WHEN NULLIF(TRIM(email_address), '') IN ('TBD', 'Unknown', 'NULL', 'INVALID', 'N/A') OR NULLIF(TRIM(email_address), '') NOT LIKE '%@%' THEN NULL ELSE NULLIF(TRIM(email_address), '') END as clean_email
FROM raw_customer_data
)
SELECT
'Original Data' as dataset,
COUNT(*) as total_records,
COUNT(customer_name) as has_name,
COUNT(email_address) as has_email,
ROUND(COUNT(customer_name) * 100.0 / COUNT(*), 1) as pct_complete_name,
ROUND(COUNT(email_address) * 100.0 / COUNT(*), 1) as pct_complete_email
FROM raw_customer_data
UNION ALL
SELECT
'Cleaned Data' as dataset,
COUNT(*) as total_records,
COUNT(clean_customer_name) as has_name,
COUNT(clean_email) as has_email,
ROUND(COUNT(clean_customer_name) * 100.0 / COUNT(*), 1) as pct_complete_name,
ROUND(COUNT(clean_email) * 100.0 / COUNT(*), 1) as pct_complete_email
FROM cleaned_data;
This exercise demonstrates how NULL handling functions work together to solve real data quality problems. The key insight is that data cleaning is rarely about just one function — it's about combining IS NULL detection, COALESCE fallbacks, and NULLIF conversions to create robust data pipelines.
Here are the most frequent NULL handling mistakes that cause production issues, along with how to diagnose and fix them:
-- WRONG: This never finds NULL values
SELECT * FROM customers WHERE email = NULL; -- Returns 0 rows always
-- CORRECT: Use IS NULL
SELECT * FROM customers WHERE email IS NULL;
Debugging tip: If your NULL-finding queries return zero rows unexpectedly, check for = NULL comparisons.
-- DANGEROUS: This excludes NULL values silently
SELECT customer_id, loyalty_tier
FROM customers
WHERE loyalty_tier <> 'Bronze';
-- SAFER: Be explicit about NULL handling
SELECT customer_id, loyalty_tier
FROM customers
WHERE loyalty_tier <> 'Bronze' OR loyalty_tier IS NULL;
Debugging tip: When row counts don't match expectations, check if NULLs are being excluded by comparison operators.
-- PROBLEMATIC: Mixing incompatible types
SELECT
customer_id,
COALESCE(email, phone, customer_id) as contact_method -- Error or unexpected behavior
FROM customers;
-- BETTER: Ensure compatible types
SELECT
customer_id,
COALESCE(email, phone, 'No contact info') as contact_method
FROM customers;
-- INCOMPLETE: Only handles NULLs, not empty strings
SELECT COALESCE(email, 'No email') FROM customers;
-- COMPLETE: Handles both NULLs and empty strings
SELECT COALESCE(NULLIF(TRIM(email), ''), 'No email') FROM customers;
-- SLOW: Multiple subqueries in COALESCE
SELECT
customer_id,
COALESCE(
(SELECT email FROM customer_contacts WHERE type = 'primary'),
(SELECT email FROM customer_contacts WHERE type = 'backup'),
(SELECT email FROM customer_contacts WHERE type = 'work')
) as best_email
FROM customers;
-- FASTER: Join first, then COALESCE
WITH contact_pivot AS (
SELECT
customer_id,
MAX(CASE WHEN type = 'primary' THEN email END) as primary_email,
MAX(CASE WHEN type = 'backup' THEN email END) as backup_email,
MAX(CASE WHEN type = 'work' THEN email END) as work_email
FROM customer_contacts
GROUP BY customer_id
)
SELECT
c.customer_id,
COALESCE(cp.primary_email, cp.backup_email, cp.work_email, 'No email') as best_email
FROM customers c
LEFT JOIN contact_pivot cp ON c.customer_id = cp.customer_id;
When NULL handling isn't working as expected, use these diagnostic queries:
-- Debug: Show exact values including NULLs and empty strings
SELECT
customer_id,
email,
CASE
WHEN email IS NULL THEN 'IS NULL'
WHEN email = '' THEN 'EMPTY STRING'
WHEN TRIM(email) = '' THEN 'WHITESPACE ONLY'
ELSE 'HAS VALUE: ' || email
END as email_diagnosis
FROM customers;
-- Debug: Test your COALESCE logic step by step
SELECT
customer_id,
primary_email,
CASE WHEN primary_email IS NULL THEN 'NULL' ELSE 'NOT NULL' END as primary_check,
backup_email,
CASE WHEN backup_email IS NULL THEN 'NULL' ELSE 'NOT NULL' END as backup_check,
COALESCE(primary_email, backup_email, 'DEFAULT') as coalesce_result
FROM customer_contacts;
You now have mastery over SQL's three essential NULL handling functions:
IS NULL / IS NOT NULL for reliable detection of missing data — remember that standard comparison operators won't work with NULLs due to three-valued logic.
COALESCE for providing fallback values and creating robust calculations. Use it whenever you need the first non-NULL value from a list, especially in calculations where NULLs would break your math.
NULLIF for converting problematic values to NULL, particularly useful for handling "placeholder" values in imported data and preventing division by zero errors.
The real power comes from combining these functions in data cleaning pipelines. You'll use patterns like COALESCE(NULLIF(TRIM(field), ''), 'Default Value') constantly to handle both NULL and empty string scenarios.
Key takeaways for production use:
Next steps to deepen your skills:
The principles you've learned here form the foundation for all advanced data cleaning and validation work. Practice these patterns with your own messy datasets — you'll be surprised how often you need exactly these techniques in real-world data projects.
Learning Path: SQL Fundamentals