Implementing dbt Tests and Data Quality Checks in Production Pipelines

Introduction

You've built a dbt project that transforms raw event data into clean, analysis-ready tables. Your models run cleanly in development, stakeholders are happy with the dashboards, and you've shipped the pipeline to production. Then, three weeks later, a null user_id sneaks through an upstream API change, silently propagates through five dependent models, and your revenue dashboard shows a 20% drop that sends your finance team into a panic. The data was never wrong in the database — it was wrong in the business logic, and you had no automated way to catch it.

This is the exact problem dbt tests exist to solve. But there's a meaningful gap between knowing that dbt has a not_null test and actually running a mature, production-grade data quality framework that catches real issues before stakeholders do. In this lesson, we'll close that gap. We'll move from the basic built-in tests most teams start with, through custom singular and generic tests, all the way to a thoughtful test execution strategy that doesn't grind your production pipeline to a halt.

By the end of this lesson, you'll be able to build a layered testing strategy that covers data freshness, referential integrity, business rule validation, and statistical anomaly detection — and you'll know how to configure your pipeline so failures are actionable, not just noisy.

What you'll learn:

How to configure and layer dbt's four built-in generic tests for maximum coverage
How to write singular tests for complex business rules that generic tests can't express
How to build reusable custom generic tests for patterns that repeat across your project
How to use test severity, --store-failures, and failure metadata to make test results operationally useful
How to design a test execution strategy that balances thoroughness against pipeline performance

Prerequisites

You should already be comfortable with:

Writing dbt models (SQL ref() and source() functions)
Running dbt run and dbt test from the CLI
Basic YAML configuration in schema.yml files
SQL at an intermediate level (window functions, aggregates, subqueries)

You don't need prior experience with data quality frameworks or testing theory — we'll build that context as we go.

Understanding What dbt Tests Actually Do

Before we get into configuration, it's worth being precise about the mechanism. When you run dbt test, dbt compiles each test into a SQL query that returns rows. If the query returns zero rows, the test passes. If it returns any rows, the test fails. That's the entire contract.

This is an elegant design because it means every test — whether it's a built-in generic test or something you wrote yourself — behaves the same way. You can always inspect the compiled SQL to understand exactly what's being checked. You'll find compiled test SQL in target/compiled/<project_name>/models/ after a test run, which is invaluable for debugging.

Understanding this also tells you what dbt tests are not. They're not data transformations. They don't modify data. They don't run automatically in real-time — they run when you invoke them. A dbt test is a scheduled assertion, and its power comes from how strategically you define and schedule those assertions.

Layering the Four Generic Tests

dbt ships with four built-in generic tests: unique, not_null, accepted_values, and relationships. Most practitioners have used all four. The question isn't whether you know they exist — it's whether you're applying them at the right layers of your pipeline.

Think of your dbt project as having three broad layers: staging (raw-to-clean), intermediate (business logic assembly), and marts (analytics-ready). Each layer should have a distinct testing philosophy.

Staging layer — Test the contract with your source systems. Your primary concern here is: did the data arrive, and does it have the shape we expect? At this layer, not_null and unique tests on primary keys are non-negotiable.

Intermediate layer — Test the joins and business logic. This is where referential integrity tests (relationships) catch join conditions that silently drop or duplicate rows.

Mart layer — Test the business rules. This is where accepted_values and more complex custom tests live, because this is the layer stakeholders actually query.

Here's what a well-configured staging schema file looks like for a SaaS product's orders data:

# models/staging/schema.yml
version: 2

sources:
  - name: stripe
    database: raw_db
    schema: stripe
    tables:
      - name: charges
        loaded_at_field: created_at
        freshness:
          warn_after: {count: 6, period: hour}
          error_after: {count: 24, period: hour}
        columns:
          - name: id
            tests:
              - unique
              - not_null
          - name: customer_id
            tests:
              - not_null
          - name: status
            tests:
              - accepted_values:
                  values: ['succeeded', 'pending', 'failed', 'refunded']
          - name: amount
            tests:
              - not_null

models:
  - name: stg_stripe__charges
    description: >
      Cleaned and typed charge records from Stripe.
      One row per charge attempt.
    columns:
      - name: charge_id
        tests:
          - unique
          - not_null
      - name: customer_id
        tests:
          - not_null
      - name: charge_status
        tests:
          - accepted_values:
                values: ['succeeded', 'pending', 'failed', 'refunded']
      - name: amount_usd
        tests:
          - not_null

Notice the freshness configuration on the source table. This is separate from column tests — it checks whether new data has arrived within your expected window. Running dbt source freshness in your CI/CD pipeline before dbt run gives you an early warning when upstream pipelines stall.

Tip: Always test both the source and the staging model for primary key uniqueness and nullability. It sounds redundant, but staging models often apply transformations (coalescing, renaming, casting) that can themselves introduce nulls or duplicates. Testing both layers catches problems at the right level of granularity.

Now let's look at the intermediate layer, where referential integrity matters most:

# models/intermediate/schema.yml
version: 2

models:
  - name: int_orders__with_customers
    description: >
      Orders joined to customer profiles.
      One row per order.
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: customer_id
        tests:
          - not_null
          - relationships:
              to: ref('stg_stripe__charges')
              field: customer_id
      - name: product_sku
        tests:
          - relationships:
              to: ref('stg_products')
              field: sku

The relationships test here is doing something important: it's verifying that every customer_id in your orders model exists in your customer staging model. If someone runs a backfill that introduces orphaned orders, this catches it.

Writing Singular Tests for Business Rules

Generic tests cover structure. Business rules need something more expressive. That's what singular tests are for.

A singular test is a .sql file in the tests/ directory of your dbt project. It follows the same contract: return rows if something is wrong, return zero rows if everything is fine.

Here's a real-world example. In a SaaS billing pipeline, you might have a rule that a customer cannot be charged more than their contracted maximum amount in a single billing cycle. No generic test can express that — it requires looking at multiple columns across an aggregated set of rows.

-- tests/assert_charges_within_contract_limits.sql

with monthly_charges as (
    select
        customer_id,
        date_trunc('month', charge_date) as billing_month,
        sum(amount_usd)                  as total_charged
    from {{ ref('fct_charges') }}
    where charge_status = 'succeeded'
    group by 1, 2
),

contract_limits as (
    select
        customer_id,
        monthly_cap_usd
    from {{ ref('dim_customer_contracts') }}
    where is_current = true
)

select
    mc.customer_id,
    mc.billing_month,
    mc.total_charged,
    cl.monthly_cap_usd
from monthly_charges mc
inner join contract_limits cl
    on mc.customer_id = cl.customer_id
where mc.total_charged > cl.monthly_cap_usd

If this query returns rows, it means customers were charged beyond their contractual limits — a serious business and compliance issue. The test will fail and surface exactly which customers and billing months are affected.

Another common singular test pattern is checking for duplicate records across a composite key when a single-column uniqueness test isn't enough:

-- tests/assert_no_duplicate_order_line_items.sql

select
    order_id,
    product_sku,
    count(*) as row_count
from {{ ref('fct_order_line_items') }}
group by 1, 2
having count(*) > 1

And here's one for a financial reporting model where you'd want to verify that the sum of daily revenue reconciles to an independently-calculated monthly total — a "cross-model consistency" test:

-- tests/assert_daily_revenue_reconciles_to_monthly.sql

with daily_sum as (
    select
        date_trunc('month', order_date) as revenue_month,
        sum(revenue_usd)                as daily_rolled_up
    from {{ ref('fct_daily_revenue') }}
    group by 1
),

monthly_total as (
    select
        revenue_month,
        total_revenue_usd as monthly_reported
    from {{ ref('fct_monthly_revenue_summary') }}
)

select
    d.revenue_month,
    d.daily_rolled_up,
    m.monthly_reported,
    abs(d.daily_rolled_up - m.monthly_reported) as discrepancy
from daily_sum d
inner join monthly_total m using (revenue_month)
where abs(d.daily_rolled_up - m.monthly_reported) > 0.01  -- allow $0.01 rounding tolerance

Warning: Singular tests that join large fact tables can be slow. If a singular test regularly takes more than 60 seconds, consider whether the logic can be pushed into a model and tested with a simpler assertion, or whether the test should be scheduled less frequently.

Building Custom Generic Tests

When you find yourself writing the same singular test logic repeatedly for different columns or models, it's time to promote it into a custom generic test. Generic tests are parametrized macros — you write the logic once and apply it anywhere in your YAML.

Let's build one for a very common pattern: ensuring a numeric column contains no negative values. This seems like a job for not_null but it isn't — a database will happily store -500.00 in a revenue_usd column.

Create a file in tests/generic/:

-- tests/generic/non_negative.sql

{% test non_negative(model, column_name) %}

select {{ column_name }}
from {{ model }}
where {{ column_name }} < 0

{% endtest %}

Now you can apply it in any schema YAML file:

- name: fct_charges
  columns:
    - name: amount_usd
      tests:
        - non_negative
    - name: refund_amount_usd
      tests:
        - non_negative

Let's build a more sophisticated custom test — one that accepts parameters. A common need is validating that values fall within an expected range, with configurable bounds:

-- tests/generic/within_range.sql

{% test within_range(model, column_name, min_value, max_value) %}

select {{ column_name }}
from {{ model }}
where {{ column_name }} < {{ min_value }}
   or {{ column_name }} > {{ max_value }}

{% endtest %}

Applied in YAML:

- name: fct_user_events
  columns:
    - name: session_duration_seconds
      tests:
        - within_range:
            min_value: 0
            max_value: 86400  # sessions can't be longer than 24 hours
    - name: conversion_rate
      tests:
        - within_range:
            min_value: 0
            max_value: 1

Here's one more custom test that's surprisingly powerful in practice — checking that a column's values are monotonically increasing. This is useful for sequence IDs, timestamps on append-only tables, and version numbers:

-- tests/generic/is_monotonically_increasing.sql

{% test is_monotonically_increasing(model, column_name, partition_by=None) %}

with ordered as (
    select
        {{ column_name }},
        lag({{ column_name }}) over (
            {% if partition_by %}
            partition by {{ partition_by }}
            {% endif %}
            order by {{ column_name }}
        ) as previous_value
    from {{ model }}
)

select *
from ordered
where {{ column_name }} < previous_value

{% endtest %}

- name: stg_audit_log
  columns:
    - name: event_sequence_number
      tests:
        - is_monotonically_increasing

Tip: Store custom generic tests in tests/generic/ or macros/ (if you want to share them across packages). If you're building a set of tests that would be useful across multiple dbt projects, consider packaging them as a dbt package and sharing via dbt Hub or a private registry.

Configuring Test Severity and Failure Behavior

By default, every dbt test failure blocks your pipeline — it exits with a non-zero code, which means a CI/CD system or orchestrator will halt downstream steps. That's the right behavior for critical data quality issues, but it's too blunt for tests that should generate warnings rather than alerts.

dbt supports two severity levels: error and warn. A warn test failure is logged but doesn't cause dbt test to exit with a failure code.

You can set severity at the test level:

- name: fct_marketing_attribution
  columns:
    - name: attributed_revenue_usd
      tests:
        - non_negative:
            severity: error  # This is a blocking failure
        - within_range:
            min_value: 0
            max_value: 1000000
            severity: warn   # Flag unusual values but don't block

You can also configure a warn_if / error_if threshold using the number of failing rows, which is useful when small numbers of exceptions are acceptable but large numbers indicate a systemic problem:

- name: stg_mobile_events
  columns:
    - name: device_id
      tests:
        - not_null:
            warn_if: ">10"
            error_if: ">100"

This configuration says: warn us if more than 10 rows have a null device_id (maybe a few records dropped the field), but fail hard if more than 100 do (probably a systemic upstream issue).

The config block gives you even more control, including the ability to set a limit on how many rows the test will scan — critical for performance on large tables:

- name: fct_page_views
  columns:
    - name: session_id
      tests:
        - not_null:
            config:
              severity: error
              limit: 1000  # Only scan 1000 rows; fail fast if any are null

Warning: Using limit with a test like unique can give you false confidence — you might scan 1000 rows and find no duplicates, but duplicates exist further in the table. Use limit only for tests where partial scanning still gives you meaningful signal (like not_null on a column that's null by design in chunks).

Storing and Querying Test Failures

Running tests in CI is valuable, but losing the failure context after the job exits means every incident requires re-running tests to understand what broke. The --store-failures flag tells dbt to write failing rows into your data warehouse, in a schema called dbt_test__audit by default.

dbt test --store-failures

When a test fails with this flag enabled, dbt creates a table like:

dbt_test__audit.not_null_fct_charges_customer_id
dbt_test__audit.relationships_int_orders__with_customers_customer_id__customer_id__ref_stg_stripe__charges_

You can query these tables directly to see which rows failed, not just that the test failed. This is the difference between a 3 AM alert that says "not_null test failed on fct_charges" and one that says "here are the 47 charge IDs with null customer_id."

Configure where failure tables are stored at the project level in dbt_project.yml:

# dbt_project.yml

tests:
  +store_failures: true
  +schema: data_quality_failures

Or override it at the model or test level for selective storage:

- name: fct_charges
  tests:
    - dbt_expectations.expect_table_row_count_to_be_between:
        min_value: 1000
        max_value: 10000000
        config:
          store_failures: true
          schema: critical_test_failures

Tip: In production, only store failures for error-severity tests. Storing failures for warn tests as well creates noise and burns warehouse storage on non-critical issues. Use project-level store_failures: true with model-level overrides to false for warn-only models.

Testing with dbt_expectations for Statistical Checks

The built-in generic tests cover structural integrity well, but statistical data quality checks — detecting anomalies, validating distributions, checking row counts — require a community package called dbt_expectations, which implements a dbt-native version of the Great Expectations framework.

Add it to your packages.yml:

packages:
  - package: calogica/dbt_expectations
    version: [">=0.10.0", "<0.11.0"]

Run dbt deps to install it.

Now you have access to tests like:

Row count validation — catch when upstream data loads silently deliver empty or truncated extracts:

- name: fct_daily_active_users
  tests:
    - dbt_expectations.expect_table_row_count_to_be_between:
        min_value: 500
        max_value: 50000
        row_condition: "activity_date = current_date - interval '1 day'"

Column value distribution — useful for catching drastic shifts that might indicate data quality issues:

- name: fct_order_items
  columns:
    - name: unit_price_usd
      tests:
        - dbt_expectations.expect_column_mean_to_be_between:
            min_value: 10
            max_value: 500
        - dbt_expectations.expect_column_quantile_values_to_be_between:
            quantile: 0.95
            min_value: 0
            max_value: 2000

Pattern matching — validate that string fields conform to expected formats:

- name: stg_crm__contacts
  columns:
    - name: email_address
      tests:
        - dbt_expectations.expect_column_values_to_match_regex:
            regex: "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$"

The dbt_expectations package turns your dbt test suite into something that can genuinely catch data drift and anomalies, not just structural issues.

Designing a Production Test Execution Strategy

Here's the thing that trips up most teams: they add lots of tests, then realize their dbt test job takes 45 minutes and their pipeline window is 30. Testing everything on every run doesn't scale. You need a strategy.

Tier your tests by how often they run.

Create test tags in your schema YAML:

- name: fct_charges
  columns:
    - name: charge_id
      tests:
        - unique:
            tags: ['critical', 'every_run']
        - not_null:
            tags: ['critical', 'every_run']
    - name: amount_usd
      tests:
        - non_negative:
            tags: ['critical', 'every_run']
        - within_range:
            min_value: 0
            max_value: 100000
            tags: ['statistical', 'daily']

Then in your orchestration layer (Airflow, Prefect, Dagster, or whatever you're using), run different test tiers on different schedules:

# Runs after every dbt run (every 30 min)
dbt test --select tag:critical

# Runs once per hour
dbt test --select tag:every_run

# Runs once per day, overnight
dbt test --select tag:statistical tag:daily

Use --select to run tests on only the changed models. In a CI/CD pipeline, there's no reason to re-test models that didn't change. The --select flag accepts the same model selection syntax as dbt run:

# Test only models touched in this PR and their downstream dependencies
dbt test --select state:modified+

This requires configuring a manifest.json from your production run as the comparison baseline — worth setting up if your project has more than a few dozen models.

Run source freshness checks before the pipeline, not after. Freshenss checks on source data tell you whether it's worth running at all. In Airflow, this looks like a DbtSourceFreshnessOperator or a BashOperator running dbt source freshness --select source:stripe as the first task in your DAG. If sources are stale, fail fast before spending compute on transformations.

Parallelize test execution. dbt runs tests in parallel by default, up to the threads value in your profile. In production, increase threads specifically for test runs:

dbt test --threads 8 --select tag:critical

Hands-On Exercise

In this exercise, you'll build a complete test suite for a simplified e-commerce analytics pipeline. You'll work with three models: stg_orders, fct_orders, and dim_customers.

Setup: Create a new dbt project or use an existing one with a DuckDB or Snowflake connection. Add three models with this structure:

-- models/staging/stg_orders.sql
select
    order_id::varchar          as order_id,
    customer_id::varchar       as customer_id,
    order_date::date           as order_date,
    order_status::varchar      as order_status,
    subtotal_usd::numeric(10,2) as subtotal_usd,
    tax_usd::numeric(10,2)     as tax_usd,
    total_usd::numeric(10,2)   as total_usd
from {{ source('ecommerce', 'raw_orders') }}

Step 1: Write the staging schema YAML with unique and not_null on order_id, accepted_values on order_status with values ['pending', 'processing', 'shipped', 'delivered', 'cancelled', 'refunded'], and not_null on customer_id, order_date, and total_usd.

Step 2: Write a singular test called assert_total_equals_subtotal_plus_tax.sql that catches orders where total_usd doesn't equal subtotal_usd + tax_usd (with a $0.01 rounding tolerance).

Step 3: Write a custom generic test called not_in_future.sql that checks whether a date column contains any dates after the current date. Apply it to order_date in your staging schema YAML.

Step 4: Configure severity. Set the unique and not_null tests on order_id to severity: error. Set the accepted_values test on order_status to warn_if: ">5", error_if: ">50".

Step 5: Add tags for critical (the not_null and unique tests) and business_logic (the singular test). Run only the critical tests with dbt test --select tag:critical.

Bonus: Install dbt_expectations and add an expect_table_row_count_to_be_between test to stg_orders with a minimum of 1 row. Observe how the test compiles and what SQL it generates in target/compiled/.

Common Mistakes & Troubleshooting

"My tests pass in dev but fail in prod."

This is usually a data volume issue. Dev environments often use a subset of data (via dbt_project.yml vars or limit clauses), so edge cases don't appear. The accepted_values test passing in dev because all the bad statuses are in the records not included in your sample. Solution: run dbt test against prod data at least weekly, even in dev environments. Better yet, use dbt test --target prod directly after merges.

"Tests are taking forever."

Usually caused by one of three things: test queries that do full table scans on large tables, too many tests running sequentially due to low thread counts, or --store-failures writing to a slow schema. Diagnose by running dbt test with --debug to see query execution times. Fix by adding limit to non-critical tests, increasing threads, and ensuring your store_failures schema is in the same region as your warehouse.

"The relationships test is failing but the join looks fine."

The relationships test fails if any value in the column doesn't exist in the referenced column. Check whether the reference column has nulls — a null customer_id in fct_orders won't exist in dim_customers.customer_id, so the test fails even though your join might use left join and handle nulls fine. Add a where {{ column_name }} is not null condition using the where config option:

- name: customer_id
  tests:
    - relationships:
        to: ref('dim_customers')
        field: customer_id
        config:
          where: "customer_id is not null"

"I'm getting duplicate test names in my schema YAML."

If you apply the same test to the same column twice with different configurations, dbt will generate a name collision. Use the name property to disambiguate:

- name: amount_usd
  tests:
    - within_range:
        name: amount_usd_non_negative_range
        min_value: 0
        max_value: 10000
    - within_range:
        name: amount_usd_business_range
        min_value: 1
        max_value: 5000
        severity: warn

"Singular tests are referencing models that don't exist in my current selection."

When you run dbt test --select stg_orders, dbt will only run tests defined on stg_orders. Singular tests are standalone — they always run when you include them. If your singular test references fct_orders and that model wasn't built in this run, the test may fail with a "table not found" error. Tag your singular tests carefully and be explicit about when to include them.

"I can't find which rows failed."

If you didn't run with --store-failures, the only information you have is the count of failing rows in the log output. Retrospectively, you can re-run just the failed test with --store-failures added. Going forward, configure store_failures: true at the project level for all error-severity tests. The storage overhead is trivial compared to the operational value of having the failing rows queryable in your warehouse.

Summary & Next Steps

You now have the building blocks for a production-grade data quality framework in dbt. Here's what we covered:

Generic tests are your first line of defense — apply them at every model layer, and test both sources and staging models
Singular tests express business rules in plain SQL — use them for validation logic that involves aggregation, joins between models, or multi-column conditions
Custom generic tests let you package repeated patterns as reusable macros that accept parameters
Severity configuration separates blocking failures from warnings — critical for keeping pipelines running when non-critical quality issues appear
--store-failures turns test results from a binary pass/fail into queryable, actionable data in your warehouse
Test execution strategy using tags and --select is how you scale to hundreds of tests without grinding your pipeline

The natural next step from here is connecting your test results to alerting. When a critical test fails in production, your team should know within minutes — not when a stakeholder notices a dashboard anomaly. Integrate dbt test exit codes with your orchestrator's alerting (Airflow's on_failure_callback, Prefect's failure hooks, or Dagster's asset checks), and consider shipping test metadata to an observability tool like Elementary or Datafold for trend tracking and anomaly detection over time.

Beyond that, explore the full dbt_expectations package — it has over 50 test types, including statistical tests that can detect data drift across pipeline runs, which is increasingly important as ML features pipelines become part of the modern data stack.

The goal of data quality work isn't perfect data — it's known data. When issues arise (and they always do), a mature test suite tells you exactly where things broke and why. That's the difference between a 3 AM incident and a 9 AM Slack message.

Implementing dbt Tests and Data Quality Checks in Production Pipelines

Implementing dbt Tests and Data Quality Checks in Production Pipelines

Introduction

Prerequisites

Understanding What dbt Tests Actually Do

Layering the Four Generic Tests

Writing Singular Tests for Business Rules

Building Custom Generic Tests

Configuring Test Severity and Failure Behavior

Storing and Querying Test Failures

Testing with dbt_expectations for Statistical Checks

Designing a Production Test Execution Strategy

Hands-On Exercise

Common Mistakes & Troubleshooting

Summary & Next Steps

Related Articles

Secrets Management and Credential Rotation for Data Pipelines in Production

Slowly Changing Dimensions in Practice: Handling Historical Data Changes in Your Warehouse

Batch vs. Stream Processing: Choosing the Right Ingestion Pattern for Your Pipeline

Related Articles

Data Engineering⚡ Practitioner
Secrets Management and Credential Rotation for Data Pipelines in Production
25 min

Data Engineering🌱 Foundation
Slowly Changing Dimensions in Practice: Handling Historical Data Changes in Your Warehouse
15 min

Data Engineering🌱 Foundation
Batch vs. Stream Processing: Choosing the Right Ingestion Pattern for Your Pipeline
17 min