BigQuery Data Quality Testing
Overview
Section titled “Overview”Data quality testing in Datablast ensures your BigQuery data meets business requirements and maintains consistency across your pipeline. The platform provides a comprehensive testing framework with both built-in and custom test capabilities.
Key Features
Section titled “Key Features”- Built-in Tests: Standard data quality checks (not_null, unique, positive, etc.)
- Custom Tests: SQL-based validation for business logic
- Blocking Control: Configure tests to stop pipelines or continue with warnings
- Comprehensive Coverage: Column-level and cross-table validation
- Performance Monitoring: Query performance validation
Built-in Testing Framework
Section titled “Built-in Testing Framework”Column Tests
Section titled “Column Tests”Column tests validate individual columns against standard data quality rules.
Not Null Test
Section titled “Not Null Test”Ensures column values are not null.
tests: columns: user_id: - name: "not_null" # blocking: true (default) email: - name: "not_null" blocking: false # Override default blocking behaviorUnique Test
Section titled “Unique Test”Validates column uniqueness.
tests: columns: user_id: - name: "unique" blocking: true order_id: - name: "unique" blocking: false # Non-blocking uniqueness checkPositive Test
Section titled “Positive Test”Ensures numeric values are positive.
tests: columns: revenue: - name: "positive" blocking: true quantity: - name: "positive" blocking: falseAccepted Values Test
Section titled “Accepted Values Test”Validates against allowed values.
tests: columns: status: - name: "accepted_values" accepted_values: ["active", "inactive", "pending"] blocking: true category: - name: "accepted_values" accepted_values: ["electronics", "clothing", "books", "home"] blocking: falseCustom Tests
Section titled “Custom Tests”Custom tests allow you to implement business-specific validation logic using SQL queries.
Basic Custom Test
Section titled “Basic Custom Test”tests: custom: - name: "revenue_consistency" equal_to: 0 query: | SELECT count(*) FROM users u JOIN revenue r ON u.user_id = r.user_id WHERE u.status = 'active' AND r.revenue <= 0 blocking: trueNon-blocking Custom Test
Section titled “Non-blocking Custom Test”tests: custom: - name: "date_range_check" equal_to: 0 query: | SELECT count(*) FROM events WHERE event_date > CURRENT_DATE() blocking: falseComplex Business Logic Test
Section titled “Complex Business Logic Test”tests: custom: - name: "user_engagement_validation" equal_to: 0 query: | SELECT count(*) FROM analytics.user_metrics um JOIN analytics.user_sessions us ON um.user_id = us.user_id WHERE um.daily_active_users > 0 AND us.session_count = 0 AND um.dt = us.session_date blocking: trueTest Execution
Section titled “Test Execution”Blocking vs Non-blocking Tests
Section titled “Blocking vs Non-blocking Tests”Blocking Tests (Default)
Section titled “Blocking Tests (Default)”Blocking tests stop pipeline execution if they fail.
tests: columns: user_id: - name: "not_null" # Pipeline stops if test fails - name: "unique" # Pipeline stops if test fails revenue: - name: "positive" # Pipeline stops if test failsUse Cases:
- Critical data quality requirements
- Data integrity checks
- Business rule validation
- Production data validation
Non-blocking Tests
Section titled “Non-blocking Tests”Non-blocking tests run but don’t stop pipeline execution if they fail.
tests: columns: user_id: - name: "not_null" blocking: false # Test runs but doesn't stop pipeline revenue: - name: "positive" blocking: false # Test runs but doesn't stop pipelineUse Cases:
- Data quality monitoring
- Warning conditions
- Non-critical validations
- Data quality trends
Test Configuration Examples
Section titled “Test Configuration Examples”Comprehensive Column Testing
Section titled “Comprehensive Column Testing”tests: columns: user_id: - name: "not_null" blocking: true - name: "unique" blocking: true email: - name: "not_null" blocking: true - name: "accepted_values" accepted_values: ["@company.com", "@partner.com"] blocking: false revenue: - name: "not_null" blocking: true - name: "positive" blocking: true status: - name: "accepted_values" accepted_values: ["active", "inactive", "pending", "suspended"] blocking: trueMixed Custom and Column Tests
Section titled “Mixed Custom and Column Tests”tests: columns: user_id: - name: "not_null" - name: "unique" revenue: - name: "positive" custom: - name: "revenue_threshold_check" equal_to: 0 query: | SELECT count(*) FROM analytics.daily_revenue WHERE revenue > 1000000 AND user_count < 100 blocking: true - name: "data_freshness_check" equal_to: 0 query: | SELECT count(*) FROM analytics.daily_metrics WHERE created_at < TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 2 HOUR) blocking: falseTest Types Reference
Section titled “Test Types Reference”Standard Column Tests
Section titled “Standard Column Tests”| Test Name | Description | Use Case | Example |
|---|---|---|---|
not_null | Ensures column values are not null | Required fields | user_id, email |
unique | Validates column uniqueness | Primary keys | order_id, user_id |
positive | Ensures numeric values are positive | Metrics, counts | revenue, quantity |
accepted_values | Validates against allowed values | Enums, categories | status, category |
Custom Test Patterns
Section titled “Custom Test Patterns”Data Consistency Tests
Section titled “Data Consistency Tests”tests: custom: - name: "user_revenue_consistency" equal_to: 0 query: | SELECT count(*) FROM users u JOIN user_revenue ur ON u.user_id = ur.user_id WHERE u.status = 'active' AND ur.total_revenue < 0 blocking: trueCross-Table Validation
Section titled “Cross-Table Validation”tests: custom: - name: "order_user_validation" equal_to: 0 query: | SELECT count(*) FROM orders o LEFT JOIN users u ON o.user_id = u.user_id WHERE u.user_id IS NULL blocking: trueBusiness Rule Validation
Section titled “Business Rule Validation”tests: custom: - name: "discount_validation" equal_to: 0 query: | SELECT count(*) FROM orders WHERE discount_percentage > 100 OR discount_amount > order_total blocking: trueData Freshness Tests
Section titled “Data Freshness Tests”tests: custom: - name: "data_freshness" equal_to: 0 query: | SELECT count(*) FROM analytics.daily_metrics WHERE created_at < TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR) blocking: falseBest Practices
Section titled “Best Practices”Test Design
Section titled “Test Design”1. Start with Critical Tests
Section titled “1. Start with Critical Tests”Focus on the most important data quality requirements first.
tests: columns: # Critical business fields user_id: - name: "not_null" - name: "unique" revenue: - name: "positive" # Important but not critical email: - name: "not_null" blocking: false2. Use Appropriate Blocking Behavior
Section titled “2. Use Appropriate Blocking Behavior”Configure blocking based on business impact.
tests: columns: # Critical - must block user_id: - name: "not_null" blocking: true # Important - can warn email: - name: "not_null" blocking: false # Nice to have - monitor only phone: - name: "not_null" blocking: false3. Implement Comprehensive Coverage
Section titled “3. Implement Comprehensive Coverage”Test all critical data quality aspects.
tests: columns: user_id: - name: "not_null" - name: "unique" email: - name: "not_null" - name: "accepted_values" accepted_values: ["@company.com", "@partner.com"] revenue: - name: "not_null" - name: "positive" custom: - name: "business_logic_validation" equal_to: 0 query: | SELECT count(*) FROM analytics.user_metrics WHERE daily_active_users > total_users blocking: truePerformance Considerations
Section titled “Performance Considerations”1. Optimize Test Queries
Section titled “1. Optimize Test Queries”Write efficient test queries to minimize impact on pipeline performance.
-- Good: Efficient test querySELECT 1FROM analytics.daily_metricsWHERE user_id IS NULLLIMIT 1
-- Avoid: Inefficient test querySELECT COUNT(*)FROM analytics.daily_metricsWHERE user_id IS NULL2. Use Appropriate Test Types
Section titled “2. Use Appropriate Test Types”Choose the right test type for your use case.
# Good: Use column tests for simple validationstests: columns: user_id: - name: "not_null" - name: "unique"
# Good: Use custom tests for complex logictests: custom: - name: "complex_business_rule" equal_to: 0 query: | SELECT count(*) FROM users u JOIN orders o ON u.user_id = o.user_id WHERE u.status = 'active' AND o.order_count = 03. Monitor Test Performance
Section titled “3. Monitor Test Performance”Track test execution time and optimize as needed.
tests: custom: - name: "performance_monitoring" equal_to: 0 query: | SELECT count(*) FROM analytics.large_table WHERE created_at >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY) blocking: falseTroubleshooting
Section titled “Troubleshooting”Common Issues
Section titled “Common Issues”Test Failures
Section titled “Test Failures”Problem: Tests failing unexpectedly
Solutions:
- Review test logic and data expectations
- Check for data quality issues
- Verify test configuration
- Analyze test results and trends
Debug Steps:
- Check test query syntax
- Review actual data values
- Verify test conditions
- Check for data changes
Performance Issues
Section titled “Performance Issues”Problem: Tests running slowly
Solutions:
- Optimize test queries
- Use appropriate test types
- Implement efficient validation logic
- Monitor test performance
Debug Steps:
- Analyze query execution plans
- Check for missing indexes
- Review query complexity
- Monitor resource usage
False Positives
Section titled “False Positives”Problem: Tests failing due to expected data patterns
Solutions:
- Review test logic
- Adjust test conditions
- Use non-blocking tests for expected patterns
- Implement more sophisticated validation
Debug Steps:
- Analyze failing test results
- Review business requirements
- Check data patterns and trends
- Adjust test parameters
Debugging Tips
Section titled “Debugging Tips”1. Test Query Validation
Section titled “1. Test Query Validation”Test your custom queries independently:
-- Test your custom query logicSELECT 1FROM users uJOIN revenue r ON u.user_id = r.user_idWHERE u.status = 'active' AND r.revenue <= 0LIMIT 102. Data Sampling
Section titled “2. Data Sampling”Use data sampling for large datasets:
tests: custom: - name: "sampled_validation" equal_to: 0 query: | SELECT count(*) FROM analytics.large_table TABLESAMPLE SYSTEM (1 PERCENT) WHERE user_id IS NULL blocking: false3. Incremental Testing
Section titled “3. Incremental Testing”Implement incremental testing for large datasets:
tests: custom: - name: "incremental_validation" equal_to: 0 query: | SELECT count(*) FROM analytics.daily_metrics WHERE dt = '{{ ds }}' AND user_id IS NULL blocking: trueMonitoring and Alerting
Section titled “Monitoring and Alerting”Test Result Monitoring
Section titled “Test Result Monitoring”Monitor test results and trends to identify data quality issues.
tests: custom: - name: "data_quality_trend" equal_to: 0 query: | SELECT count(*) FROM analytics.daily_metrics WHERE data_quality_score < 0.95 blocking: falseAlert Configuration
Section titled “Alert Configuration”Set up alerts for critical test failures.
tests: columns: user_id: - name: "not_null" blocking: true # Alert on failure revenue: - name: "positive" blocking: true # Alert on failure