
Picture this: your company's customer data is scattered across three different databases, sales reports take six hours to generate, and your analysts spend more time waiting for queries than analyzing insights. Your CEO wants real-time dashboards, your marketing team needs to segment millions of customers, and everyone's asking why the monthly reports are always delayed. Sound familiar?
You need a cloud data warehouse — but which one? The three major players are Amazon Redshift, Google BigQuery, and Snowflake, and each takes a fundamentally different approach to storing and processing massive amounts of data. By the end of this lesson, you'll understand how these platforms work under the hood, when to choose each one, and how to make an informed decision for your specific use case.
Prerequisites: This lesson assumes you understand basic SQL and have worked with traditional databases. You should be familiar with concepts like tables, queries, and indexes, but no prior experience with cloud platforms is required.
Before diving into specific platforms, let's establish what makes cloud data warehouses different from traditional databases.
A data warehouse is designed specifically for analytical workloads — running complex queries across large datasets to generate business insights. Unlike transactional databases that handle many small, fast operations (like processing orders), data warehouses excel at scanning millions of rows to answer questions like "What was our revenue growth by region last quarter?"
Cloud data warehouses bring three key advantages: near-unlimited elastic scale, separation of storage from compute so each can grow independently, and fully managed infrastructure that you pay for as you use it.
Now let's examine how our three platforms implement these concepts.
Redshift was Amazon's first major foray into cloud analytics, launched in 2012. It's essentially a massively parallel version of PostgreSQL, designed for traditional data warehouse workloads.
Redshift organizes resources into clusters. Each cluster consists of a leader node, which parses queries and coordinates execution, and one or more compute nodes, which store slices of the data and do the actual query work.
When you run a query, the leader node creates an execution plan and distributes work across compute nodes. Each node processes its slice of data simultaneously, then combines results.
Here's what this looks like in practice:
```sql
-- This query scans 50 million customer records
SELECT region, COUNT(*) AS customers, AVG(annual_spend)
FROM customers
WHERE signup_date >= '2023-01-01'
GROUP BY region;
```
Redshift might split this across 4 compute nodes, with each node holding roughly 12.5 million of the 50 million customer records.
Each node calculates regional counts and averages for its slice, then the leader node combines the results.
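The combine step is worth pausing on: the leader node cannot simply average the per-node averages, because slices may differ in size. It has to merge partial counts and sums, then divide at the end. Here is a minimal Python sketch of that merge (the node slices and spend values are invented for illustration):

```python
# Sketch: how a leader node merges partial aggregates from compute nodes.
# Each node returns (count, sum) per region for its slice of the data;
# averaging per-node averages would weight small slices incorrectly.
from collections import defaultdict

def node_partial(rows):
    """Per-node work: partial COUNT and SUM of annual_spend by region."""
    partial = defaultdict(lambda: [0, 0.0])  # region -> [count, sum]
    for region, spend in rows:
        partial[region][0] += 1
        partial[region][1] += spend
    return partial

def leader_combine(partials):
    """Leader work: merge partial counts/sums, then compute the final AVG."""
    merged = defaultdict(lambda: [0, 0.0])
    for p in partials:
        for region, (cnt, total) in p.items():
            merged[region][0] += cnt
            merged[region][1] += total
    return {r: (cnt, total / cnt) for r, (cnt, total) in merged.items()}

# Two "nodes", each holding a slice of (region, annual_spend) rows.
slices = [
    [("east", 100.0), ("east", 300.0)],
    [("east", 200.0), ("west", 50.0)],
]
result = leader_combine([node_partial(s) for s in slices])
print(result)  # {'east': (3, 200.0), 'west': (1, 50.0)}
```

The same count/sum trick is why distributed engines decompose AVG into two partial aggregates internally.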
Redshift stores data in a columnar format, which is perfect for analytical queries. Instead of storing complete rows together, it groups all values from each column. This means when you need the annual_spend column, Redshift only reads that data — not entire customer records.
However, Redshift requires manual optimization through:
Distribution keys: Tell Redshift how to spread data across nodes
```sql
CREATE TABLE sales (
    order_id INTEGER,
    customer_id INTEGER,
    product_id INTEGER,
    amount DECIMAL(10,2),
    order_date DATE
) DISTKEY (customer_id); -- Distribute by customer_id
```
Sort keys: Define how data is physically ordered
```sql
CREATE TABLE sales (
    -- columns as above
) DISTKEY (customer_id)
SORTKEY (order_date); -- Sort by date for time-series queries
```
Tip: Poor distribution and sort key choices can make queries 10x slower. Redshift requires more database tuning expertise than the other platforms.
Redshift works best for predictable, steady workloads, teams with PostgreSQL and database-tuning expertise, and organizations already committed to the AWS ecosystem.
BigQuery, launched in 2010, takes a radically different approach. It's completely serverless — you never provision clusters or manage infrastructure.
BigQuery separates storage and compute completely. When you run a query, your SQL is compiled into an execution plan, workers (called slots) are allocated on demand, each worker scans a portion of the columnar data, and the partial results are merged into the final answer.
For our customer analysis query:
```sql
SELECT region, COUNT(*) AS customers, AVG(annual_spend)
FROM `project.dataset.customers`
WHERE signup_date >= '2023-01-01'
GROUP BY region;
```
BigQuery might spin up 2,000 workers for 30 seconds, each processing 25,000 records in parallel. The massive parallelization means complex queries often complete in seconds, regardless of data size.
BigQuery stores data in Capacitor, Google's proprietary columnar format that includes advanced compression and encoding. The system automatically compresses, re-encodes, and reorganizes data in the background as usage patterns change.
You don't tune storage — BigQuery handles optimization automatically.
Partitioning is BigQuery's main performance feature:
```sql
CREATE TABLE sales (
    order_id INT64,
    customer_id INT64,
    amount NUMERIC(10, 2),
    order_date DATE
) PARTITION BY order_date; -- One partition per distinct date
```
When you query recent data, BigQuery only scans relevant partitions:
```sql
-- Only scans partitions from 2023-12-01 onward
SELECT SUM(amount) FROM sales
WHERE order_date >= '2023-12-01';
```
BigQuery is ideal for variable or spiky analytical workloads, teams that want zero infrastructure management, and organizations invested in Google Cloud's AI/ML services.
Snowflake, launched in 2014, was built from scratch for the cloud. It combines the best aspects of both previous approaches while running on any major cloud provider.
Snowflake's multi-cluster shared data architecture separates three layers: centralized storage, independently scalable compute (virtual warehouses), and a cloud services layer that handles query optimization, metadata, and security.
Here's the key insight: multiple compute clusters can access the same data simultaneously without interfering with each other.
```sql
-- Marketing team running customer segmentation
SELECT customer_segment, AVG(lifetime_value)
FROM customers
GROUP BY customer_segment;

-- Simultaneously, finance team running revenue analysis
SELECT DATE_TRUNC('month', order_date), SUM(amount)
FROM orders
GROUP BY 1;
```
These queries can run on separate virtual warehouses, each auto-scaling based on complexity, without impacting each other's performance.
Snowflake automatically organizes data into micro-partitions — small, immutable chunks typically containing 50-500MB of uncompressed data (considerably less once compressed). The system tracks min/max values for every column in each partition, enabling fast pruning during queries.
When you insert data:
```sql
INSERT INTO sales
SELECT order_id, customer_id, amount, order_date
FROM staging_table;
```
Snowflake automatically splits the incoming rows into micro-partitions, compresses each column, and records min/max metadata that later queries use for pruning.
For queries, this enables partition pruning:
```sql
-- Snowflake automatically skips 95% of partitions
SELECT * FROM sales
WHERE order_date = '2023-12-15';
```
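The pruning idea can be modeled in a few lines: keep min/max metadata per partition and scan only partitions whose range could contain the filter value. This is an illustrative sketch, not Snowflake's actual implementation, and the partition ranges are invented:

```python
# Sketch: partition pruning using per-partition min/max metadata.
# A system like Snowflake tracks ranges like these for every micro-partition
# and skips any partition whose range cannot match the WHERE clause.
from datetime import date

# Hypothetical partitions: (min_order_date, max_order_date, row_count)
partitions = [
    (date(2023, 1, 1), date(2023, 3, 31), 1_000_000),
    (date(2023, 4, 1), date(2023, 6, 30), 1_000_000),
    (date(2023, 12, 1), date(2023, 12, 31), 1_000_000),
]

def prune(partitions, target):
    """Return only partitions whose [min, max] range could contain target."""
    return [p for p in partitions if p[0] <= target <= p[1]]

to_scan = prune(partitions, date(2023, 12, 15))
print(len(to_scan), "of", len(partitions), "partitions scanned")  # 1 of 3
```

The same min/max trick is what makes Redshift sort keys and BigQuery partitions effective: ordered or partitioned data gives each chunk a narrow range, so most chunks can be skipped.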
Snowflake is perfect for organizations with many concurrent teams and workloads, multi-cloud or cloud-agnostic strategies, and situations where performance isolation between groups matters.
Understanding performance characteristics helps you choose the right platform for your workload.
For straightforward analytical queries (COUNT, SUM, AVG over large tables):
BigQuery typically wins due to massive parallelization. A query scanning 1TB of data might complete in 15-30 seconds.
Snowflake performs similarly for well-partitioned data, often within 20-40 seconds for the same query.
Redshift can match or beat both if properly tuned with good distribution and sort keys, but may take 60+ seconds without optimization.
For queries joining multiple large tables:
```sql
SELECT c.region, p.category, SUM(o.amount) AS revenue
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
JOIN products p ON o.product_id = p.product_id
WHERE o.order_date >= '2023-01-01'
GROUP BY c.region, p.category;
```
Redshift excels here when tables are co-located using proper distribution keys. The data locality reduces network overhead.
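Co-location follows from hashing the join key: if both tables are distributed on customer_id, every matching pair of rows lands on the same node, so the join runs node-locally without a network shuffle. A toy model (the node count, rows, and hash function are invented for illustration):

```python
# Sketch: hash distribution places matching join keys on the same node,
# so a join on customer_id can run node-locally with no data shuffle.
NODES = 4

def node_for(customer_id):
    """Both tables use the same distribution function on the join key."""
    return hash(customer_id) % NODES

customers = [(1, "east"), (2, "west"), (3, "east")]        # (customer_id, region)
orders = [(101, 1, 50.0), (102, 3, 75.0), (103, 1, 20.0)]  # (order_id, customer_id, amount)

# Every order row lands on the same node as its matching customer row,
# so each node can join its local slices independently.
co_located = all(
    node_for(o_cust) == node_for(c_id)
    for _, o_cust, _ in orders
    for c_id, _ in customers
    if o_cust == c_id
)
print(co_located)  # True
```

If the two tables were distributed on different keys, matching rows would sit on different nodes and the engine would have to redistribute one table over the network before joining — the overhead the paragraph above describes.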
Snowflake performs consistently well due to its query optimizer and automatic clustering.
BigQuery can struggle with complex joins unless you use partitioning and clustering strategically.
When multiple users run queries simultaneously:
Snowflake shines with its multi-cluster architecture. Each workload can have dedicated compute resources.
BigQuery handles concurrency well due to its serverless nature, though costs can spike with many simultaneous heavy queries.
Redshift requires careful workload management and may need multiple clusters for different user groups.
For ingesting new data:
BigQuery offers the fastest bulk loading, especially for streaming data. It can ingest millions of records per second.
Snowflake provides excellent loading performance with features like auto-clustering and variant data type support.
Redshift requires more tuning (COPY commands, compression options) but can achieve good performance when configured correctly.
Each platform uses a different pricing approach that significantly impacts total cost of ownership.
Redshift pricing resembles traditional cloud computing: you pay per node-hour, either on demand or at a discount with one- or three-year reserved instances, plus a separate managed-storage charge on RA3 nodes.
Example calculation for a medium workload:
```
2-node ra3.xlplus cluster:
- On-demand: $1.086/hour × 24 × 30 ≈ $782/month
- 1-year reserved: $0.652/hour × 24 × 30 ≈ $469/month
- Plus 5TB storage: 5,000GB × $0.024 = $120/month
Total reserved: ~$590/month
```
Cost optimization strategy: Size clusters for peak usage, then scale down during off-hours using scheduled scaling.
BigQuery charges based on data processed: on-demand queries cost $5 per TB scanned (the rate used in this lesson's examples), active storage runs about $0.02 per GB-month, and flat-rate slot reservations are available for heavy users.
Example calculation:
```
Analyzing 10TB of data monthly:
- Scanning 2TB per query × 50 queries = 100TB processed
- 100TB × $5 = $500/month query costs
- Plus 10TB storage: 10,000GB × $0.020 = $200/month
Total: $700/month
```
Cost optimization strategy: Use partitioning and clustering to minimize data scanned. Consider flat-rate pricing for heavy usage.
Snowflake uses a credit system: each virtual warehouse consumes credits per hour based on its size (an X-Small burns 1 credit/hour, and each size up doubles the rate), credits cost roughly $2-4 each depending on edition, and storage is billed separately per TB-month.
Example calculation:
```
Medium warehouse (4 credits/hour) running 8 hours/day:
- 4 credits × 8 hours × 30 days = 960 credits/month
- 960 credits × $3 = $2,880/month compute
- Plus 5TB storage: 5TB × $23 = $115/month
Total: $2,995/month
```
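These estimates are plain arithmetic, and it's worth scripting them when comparing vendors so you can vary the inputs. A sketch that reproduces the three totals above (all rates are the illustrative figures from this lesson, not current list prices):

```python
# Sketch: reproduce the monthly cost estimates from the pricing examples.
# All rates are the illustrative figures used in this lesson.
HOURS_PER_MONTH = 24 * 30

# Redshift: reserved 2-node cluster at $0.652/hour + 5,000GB managed storage
redshift = 0.652 * HOURS_PER_MONTH + 5_000 * 0.024   # 469.44 + 120

# BigQuery: 100TB scanned at $5/TB + 10,000GB storage at $0.02/GB
bigquery = 100 * 5 + 10_000 * 0.020                  # 500 + 200

# Snowflake: medium warehouse (4 credits/hr, 8 hrs/day) at $3/credit + 5TB storage
snowflake = (4 * 8 * 30) * 3 + 5 * 23                # 2880 + 115

print(round(redshift), round(bigquery), round(snowflake))
```

Swapping in your own node counts, scan volumes, and warehouse hours turns this into a quick first-pass comparison before you commit to any platform.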
Cost optimization strategy: Use auto-suspend and auto-resume features. Right-size warehouses for specific workloads.
For a typical scenario (5TB storage, moderate query volume), the worked examples above land at roughly $590/month for reserved Redshift, $700/month for BigQuery, and $2,995/month for Snowflake.
However, these costs vary dramatically based on query volume and complexity, how much data each query scans, idle time (Snowflake warehouses can auto-suspend, while a running Redshift cluster bills continuously), and any negotiated or committed-use discounts.
Warning: Snowflake can be 2-5x more expensive than alternatives for simple workloads, but may provide better value for complex, concurrent scenarios.
Let's work through a realistic scenario to practice platform selection. Imagine you're a data engineer at a growing e-commerce company with these requirements:
Current situation: roughly 20 analysts and data scientists, about 2TB of stored data, and on the order of 100TB of data scanned by queries each month.
Step 1: Map requirements to platform strengths
Create a scoring matrix:
| Requirement | Redshift | BigQuery | Snowflake |
| --- | --- | --- | --- |
| Predictable costs | 3 | 1 | 2 |
| Ad-hoc query flexibility | 1 | 3 | 3 |
| Multi-user concurrency | 2 | 2 | 3 |
| Complex analytics | 2 | 3 | 3 |
| Operational simplicity | 1 | 3 | 3 |
| PostgreSQL compatibility | 3 | 1 | 2 |

Scale: 1 = Poor fit, 2 = Adequate, 3 = Excellent fit
Step 2: Estimate costs for each platform
Based on 20 users, 2TB storage, 100TB monthly scanning:
Redshift: a reserved 2-node cluster plus 2TB of storage comes to roughly $520/month at the rates above.
BigQuery: 100TB scanned × $5/TB plus 2TB of storage is roughly $540/month.
Snowflake: a medium warehouse covering business hours plus storage lands near $2,900/month at $3/credit.
Step 3: Consider non-cost factors such as team expertise, existing cloud commitments, and how quickly you expect users and data to grow
Step 4: Make a recommendation
For this scenario, BigQuery might be the best choice: it scores highest on ad-hoc flexibility, complex analytics, and operational simplicity in the matrix, its serverless model removes the tuning burden from a small team, and disciplined use of partitioning keeps scan costs under control.
However, Snowflake could be worth the premium if you expect rapid user growth or need guaranteed performance isolation between teams.
Mistake: Poor distribution key choice

```sql
-- Bad: a low-cardinality key piles most rows onto a few nodes
CREATE TABLE orders (
    order_id INTEGER,
    customer_id INTEGER,
    order_status VARCHAR(20)
) DISTKEY (order_status);

-- Good: a high-cardinality join key spreads rows evenly
-- and co-locates orders with customers for joins
CREATE TABLE orders (
    order_id INTEGER,
    customer_id INTEGER,
    order_status VARCHAR(20)
) DISTKEY (customer_id);
```
Symptom: Queries take much longer than expected.
Solution: Monitor system tables to identify skewed distributions:
```sql
-- Tables where some slices hold far more rows than others
SELECT TRIM("schema") AS schema_name, "table", tbl_rows, skew_rows
FROM svv_table_info
WHERE skew_rows > 1.5;
```
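Redshift's `skew_rows` metric is, roughly, the ratio of the fullest slice to the emptiest. A small Python model of the same check (the slice row counts are invented):

```python
# Sketch: a row-skew ratio in the spirit of Redshift's skew_rows metric —
# rows on the fullest slice divided by rows on the emptiest slice.
def skew_ratio(rows_per_slice):
    return max(rows_per_slice) / min(rows_per_slice)

balanced = [1_000_000, 990_000, 1_010_000, 1_000_000]
skewed = [3_500_000, 200_000, 150_000, 150_000]

print(round(skew_ratio(balanced), 2))  # near 1.0: healthy distribution
print(round(skew_ratio(skewed), 2))    # far above 1.5: one slice does most of the work
```

A skewed table runs only as fast as its fullest slice, which is why a bad distribution key can slow a parallel query down even though most nodes sit idle.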
Mistake: Writing filters that defeat partition pruning

```sql
-- Expensive: wrapping the partition column in a function can block pruning
SELECT customer_id FROM orders
WHERE CAST(order_date AS STRING) = '2023-12-01';

-- Cheap: filtering directly on the partition column enables pruning
SELECT customer_id FROM orders
WHERE order_date = '2023-12-01';
```
Solution: Always check how much data a query will scan before running it — the console shows an estimate as you type, and `bq query --dry_run` reports bytes processed without executing the query.
Mistake: Leaving large warehouses running 24/7

```sql
-- Wasteful: extra-large warehouse that idles for a full day before suspending
ALTER WAREHOUSE analytics_wh SET WAREHOUSE_SIZE = 'X-LARGE';
ALTER WAREHOUSE analytics_wh SET AUTO_SUSPEND = 86400; -- 24 hours
```
Solution: Configure aggressive auto-suspend and right-size warehouses:

```sql
-- Efficient: smaller warehouse with fast auto-suspend
ALTER WAREHOUSE analytics_wh SET WAREHOUSE_SIZE = 'LARGE';
ALTER WAREHOUSE analytics_wh SET AUTO_SUSPEND = 60; -- 1 minute
ALTER WAREHOUSE analytics_wh SET AUTO_RESUME = TRUE;
```
Mistake: Assuming SQL compatibility across platforms
Each platform has subtle dialect differences. For example, `DATE_TRUNC()` takes its arguments in a different order: Redshift and Snowflake use `DATE_TRUNC('month', order_date)`, while BigQuery uses `DATE_TRUNC(order_date, MONTH)`.
Solution: Create a migration checklist and test every query against the target platform's dialect before cutting over.
You now understand the fundamental differences between the three leading cloud data warehouses:
Redshift offers the most cost-effective solution for predictable workloads when you have database optimization expertise and are committed to AWS.
BigQuery provides the best serverless experience with excellent price/performance for variable analytical workloads, especially when integrated with Google Cloud's AI/ML services.
Snowflake delivers the most flexible multi-cloud solution with superior concurrency handling, though at a premium price point.
Your choice depends on balancing cost, performance requirements, team expertise, and strategic cloud direction. Most importantly, all three platforms can handle enterprise-scale analytics — the "best" choice depends entirely on your specific context.
Immediate next steps: spin up a free trial of each platform, load a representative sample of your own data, and benchmark the queries your team actually runs.
The cloud data warehouse landscape continues evolving rapidly, but understanding these core architectural differences will help you make informed decisions as new features and competitors emerge.