Skip to content

BigQuery Data Quality Testing

Data quality testing in Datablast ensures your BigQuery data meets business requirements and maintains consistency across your pipeline. The platform provides a comprehensive testing framework with both built-in and custom test capabilities.

  • Built-in Tests: Standard data quality checks (not_null, unique, positive, etc.)
  • Custom Tests: SQL-based validation for business logic
  • Blocking Control: Configure tests to stop pipelines or continue with warnings
  • Comprehensive Coverage: Column-level and cross-table validation
  • Performance Monitoring: Query performance validation

Column tests validate individual columns against standard data quality rules.

Ensures column values are not null.

tests:
columns:
user_id:
- name: "not_null" # blocking: true (default)
email:
- name: "not_null"
blocking: false # Override default blocking behavior

Validates column uniqueness.

tests:
columns:
user_id:
- name: "unique"
blocking: true
order_id:
- name: "unique"
blocking: false # Non-blocking uniqueness check

Ensures numeric values are positive.

tests:
columns:
revenue:
- name: "positive"
blocking: true
quantity:
- name: "positive"
blocking: false

Validates against allowed values.

tests:
columns:
status:
- name: "accepted_values"
accepted_values: ["active", "inactive", "pending"]
blocking: true
category:
- name: "accepted_values"
accepted_values: ["electronics", "clothing", "books", "home"]
blocking: false

Custom tests allow you to implement business-specific validation logic using SQL queries.

tests:
custom:
- name: "revenue_consistency"
equal_to: 0
query: |
SELECT count(*)
FROM users u
JOIN revenue r ON u.user_id = r.user_id
WHERE u.status = 'active' AND r.revenue <= 0
blocking: true
tests:
custom:
- name: "date_range_check"
equal_to: 0
query: |
SELECT count(*)
FROM events
WHERE event_date > CURRENT_DATE()
blocking: false
tests:
custom:
- name: "user_engagement_validation"
equal_to: 0
query: |
SELECT count(*)
FROM analytics.user_metrics um
JOIN analytics.user_sessions us ON um.user_id = us.user_id
WHERE um.daily_active_users > 0
AND us.session_count = 0
AND um.dt = us.session_date
blocking: true

Blocking tests stop pipeline execution if they fail.

tests:
columns:
user_id:
- name: "not_null" # Pipeline stops if test fails
- name: "unique" # Pipeline stops if test fails
revenue:
- name: "positive" # Pipeline stops if test fails

Use Cases:

  • Critical data quality requirements
  • Data integrity checks
  • Business rule validation
  • Production data validation

Non-blocking tests run but don’t stop pipeline execution if they fail.

tests:
columns:
user_id:
- name: "not_null"
blocking: false # Test runs but doesn't stop pipeline
revenue:
- name: "positive"
blocking: false # Test runs but doesn't stop pipeline

Use Cases:

  • Data quality monitoring
  • Warning conditions
  • Non-critical validations
  • Data quality trends
tests:
columns:
user_id:
- name: "not_null"
blocking: true
- name: "unique"
blocking: true
email:
- name: "not_null"
blocking: true
- name: "accepted_values"
accepted_values: ["@company.com", "@partner.com"]
blocking: false
revenue:
- name: "not_null"
blocking: true
- name: "positive"
blocking: true
status:
- name: "accepted_values"
accepted_values: ["active", "inactive", "pending", "suspended"]
blocking: true
tests:
columns:
user_id:
- name: "not_null"
- name: "unique"
revenue:
- name: "positive"
custom:
- name: "revenue_threshold_check"
equal_to: 0
query: |
SELECT count(*)
FROM analytics.daily_revenue
WHERE revenue > 1000000 AND user_count < 100
blocking: true
- name: "data_freshness_check"
equal_to: 0
query: |
SELECT count(*)
FROM analytics.daily_metrics
WHERE created_at < TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 2 HOUR)
blocking: false
Test NameDescriptionUse CaseExample
not_nullEnsures column values are not nullRequired fieldsuser_id, email
uniqueValidates column uniquenessPrimary keysorder_id, user_id
positiveEnsures numeric values are positiveMetrics, countsrevenue, quantity
accepted_valuesValidates against allowed valuesEnums, categoriesstatus, category
tests:
custom:
- name: "user_revenue_consistency"
equal_to: 0
query: |
SELECT count(*)
FROM users u
JOIN user_revenue ur ON u.user_id = ur.user_id
WHERE u.status = 'active' AND ur.total_revenue < 0
blocking: true
tests:
custom:
- name: "order_user_validation"
equal_to: 0
query: |
SELECT count(*)
FROM orders o
LEFT JOIN users u ON o.user_id = u.user_id
WHERE u.user_id IS NULL
blocking: true
tests:
custom:
- name: "discount_validation"
equal_to: 0
query: |
SELECT count(*)
FROM orders
WHERE discount_percentage > 100
OR discount_amount > order_total
blocking: true
tests:
custom:
- name: "data_freshness"
equal_to: 0
query: |
SELECT count(*)
FROM analytics.daily_metrics
WHERE created_at < TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
blocking: false

Focus on the most important data quality requirements first.

tests:
columns:
# Critical business fields
user_id:
- name: "not_null"
- name: "unique"
revenue:
- name: "positive"
# Important but not critical
email:
- name: "not_null"
blocking: false

Configure blocking based on business impact.

tests:
columns:
# Critical - must block
user_id:
- name: "not_null"
blocking: true
# Important - can warn
email:
- name: "not_null"
blocking: false
# Nice to have - monitor only
phone:
- name: "not_null"
blocking: false

Test all critical data quality aspects.

tests:
columns:
user_id:
- name: "not_null"
- name: "unique"
email:
- name: "not_null"
- name: "accepted_values"
accepted_values: ["@company.com", "@partner.com"]
revenue:
- name: "not_null"
- name: "positive"
custom:
- name: "business_logic_validation"
equal_to: 0
query: |
SELECT count(*)
FROM analytics.user_metrics
WHERE daily_active_users > total_users
blocking: true

Write efficient test queries to minimize impact on pipeline performance.

-- Good: Efficient test query
SELECT 1
FROM analytics.daily_metrics
WHERE user_id IS NULL
LIMIT 1
-- Avoid: Inefficient test query
SELECT COUNT(*)
FROM analytics.daily_metrics
WHERE user_id IS NULL

Choose the right test type for your use case.

# Good: Use column tests for simple validations
tests:
columns:
user_id:
- name: "not_null"
- name: "unique"
# Good: Use custom tests for complex logic
tests:
custom:
- name: "complex_business_rule"
equal_to: 0
query: |
SELECT count(*)
FROM users u
JOIN orders o ON u.user_id = o.user_id
WHERE u.status = 'active' AND o.order_count = 0

Track test execution time and optimize as needed.

tests:
custom:
- name: "performance_monitoring"
equal_to: 0
query: |
SELECT count(*)
FROM analytics.large_table
WHERE created_at >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
blocking: false

Problem: Tests failing unexpectedly

Solutions:

  • Review test logic and data expectations
  • Check for data quality issues
  • Verify test configuration
  • Analyze test results and trends

Debug Steps:

  1. Check test query syntax
  2. Review actual data values
  3. Verify test conditions
  4. Check for data changes

Problem: Tests running slowly

Solutions:

  • Optimize test queries
  • Use appropriate test types
  • Implement efficient validation logic
  • Monitor test performance

Debug Steps:

  1. Analyze query execution plans
  2. Check for missing indexes
  3. Review query complexity
  4. Monitor resource usage

Problem: Tests failing due to expected data patterns

Solutions:

  • Review test logic
  • Adjust test conditions
  • Use non-blocking tests for expected patterns
  • Implement more sophisticated validation

Debug Steps:

  1. Analyze failing test results
  2. Review business requirements
  3. Check data patterns and trends
  4. Adjust test parameters

Test your custom queries independently:

-- Test your custom query logic
SELECT 1
FROM users u
JOIN revenue r ON u.user_id = r.user_id
WHERE u.status = 'active' AND r.revenue <= 0
LIMIT 10

Use data sampling for large datasets:

tests:
custom:
- name: "sampled_validation"
equal_to: 0
query: |
SELECT count(*)
FROM analytics.large_table
TABLESAMPLE SYSTEM (1 PERCENT)
WHERE user_id IS NULL
blocking: false

Implement incremental testing for large datasets:

tests:
custom:
- name: "incremental_validation"
equal_to: 0
query: |
SELECT count(*)
FROM analytics.daily_metrics
WHERE dt = '{{ ds }}'
AND user_id IS NULL
blocking: true

Monitor test results and trends to identify data quality issues.

tests:
custom:
- name: "data_quality_trend"
equal_to: 0
query: |
SELECT count(*)
FROM analytics.daily_metrics
WHERE data_quality_score < 0.95
blocking: false

Set up alerts for critical test failures.

tests:
columns:
user_id:
- name: "not_null"
blocking: true
# Alert on failure
revenue:
- name: "positive"
blocking: true
# Alert on failure