Pipeline Configuration
The pipeline.yml file contains all the necessary information to build and configure a data pipeline. This file defines the pipeline’s schedule, connections, notifications, and other settings.
Basic Pipeline Configuration
Here’s a comprehensive example pipeline.yml file:

    id: analytics-pipeline
    schedule: "0 6 * * *"
    start_date: "2025-08-28"
    default_connections:
      gcpConnectionId: analytics-gcp
    notifications:
      slack:
        - name: data-team
          connection: "slack-data-team"
          success: ":tada: Pipeline has finished successfully!"
          failure: ":red_circle: Pipeline has failed!"
    description: |
      The data in this pipeline is obtained from an external API.
      Tables in BigQuery can be found under project "analytics-data".
      It runs every day at 06:00 UTC.

Configuration Parameters
Required Parameters
id
- Type: String
- Description: Unique identifier for the pipeline
- Example: marketing-analytics
schedule
- Type: String (Cron format)
- Description: Cron expression that defines when the pipeline runs
- Examples:
  - "0 4 * * *": Daily at 4 AM UTC
  - "0 */6 * * *": Every 6 hours
  - "0 0 * * 1": Weekly on Monday at midnight
start_date
- Type: String (YYYY-MM-DD format)
- Description: Start date for the pipeline (useful for backfills)
- Example: 2022-09-01
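
Putting the three required parameters together, a minimal pipeline.yml might look like the sketch below. It assumes that every key listed under Optional Parameters can simply be left out; the values are reused from the examples above and are placeholders, not recommendations.

```yaml
# Minimal sketch: only the required parameters.
# Assumption: all optional keys may be omitted.
id: marketing-analytics        # unique identifier for the pipeline
schedule: "0 4 * * *"          # daily at 4 AM UTC
start_date: "2022-09-01"       # YYYY-MM-DD
```

Optional keys such as default_connections and notifications can then be added one at a time as the pipeline grows.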
Optional Parameters
description
- Type: String (Multi-line)
- Description: Detailed description of the pipeline’s purpose and data flow
- Example: See the description field in the example under Basic Pipeline Configuration
default_connections
- Type: Object
- Description: Default connections to use for tasks if no task-specific connection is specified
- Supported Connections:
  - gcpConnectionId: Google Cloud Platform connection
  - aws_conn_id: AWS connection
  - snowflake: Snowflake connection
  - postgres: PostgreSQL connection
notifications
- Type: Object
- Description: Notification channels for pipeline success/failure
- Supported Channels:
  - slack: Slack notifications
  - discord: Discord notifications
project_config
- Type: Object
- Description: Project-specific configuration settings
- Parameters:
  - name: Project name
  - flags: Feature flags
  - defaults: Default settings for specific services
Schedule Examples
Common Schedule Patterns

    # Daily at 1 AM UTC
    schedule: "0 1 * * *"

    # Every 6 hours
    schedule: "0 */6 * * *"

    # Weekly on Sunday at 2 AM UTC
    schedule: "0 2 * * 0"

    # Monthly on the 1st at 3 AM UTC
    schedule: "0 3 1 * *"

    # Business days only (Monday-Friday) at 8 AM UTC
    schedule: "0 8 * * 1-5"

    # Every 15 minutes
    schedule: "*/15 * * * *"

Timezone Considerations
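By default, schedules use UTC. To run a pipeline at a specific local time, convert that time to UTC first:

    # For EST (UTC-5): 4 AM EST = 9 AM UTC
    schedule: "0 9 * * *"

    # For PST (UTC-8): 4 AM PST = 12 PM UTC
    schedule: "0 12 * * *"

These offsets apply to standard time. During daylight saving time (EDT is UTC-4, PDT is UTC-7), the same UTC schedules run one hour later by the local clock.

Connection Configuration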
Google Cloud Platform

    default_connections:
      gcpConnectionId: my-gcp-connection

AWS Integration

    default_connections:
      aws_conn_id: my-aws-connection

Multiple Connections

    default_connections:
      gcpConnectionId: analytics-gcp
      aws_conn_id: analytics-aws
      snowflake: analytics-snowflake

Notification Configuration
Slack Notifications

    notifications:
      slack:
        - name: data-team
          connection: "slack-data-team"
          success: ":tada: Pipeline has finished successfully!"
          failure: ":red_circle: Pipeline has failed!"

Discord Notifications

    notifications:
      discord:
        - name: alerts
          connection: "discord-alerts"
          success: "Pipeline has finished successfully!"
          failure: "Pipeline has failed!"

Multiple Notification Channels

    notifications:
      slack:
        - name: data-team
          connection: "slack-data-team"
          success: ":tada: Pipeline has finished successfully!"
          failure: ":red_circle: Pipeline has failed!"
      discord:
        - name: alerts
          connection: "discord-alerts"
          success: "Pipeline has finished successfully!"
          failure: "Pipeline has failed!"

Environment-Specific Configuration
Development Environment

    id: analytics-pipeline-dev
    schedule: "0 8 * * *"  # Later schedule for dev
    start_date: "2025-08-28"
    default_connections:
      gcpConnectionId: dev-gcp-conn
    notifications:
      slack:
        - name: dev-team
          connection: "dev-slack"
          success: ":tada: Dev pipeline has finished successfully!"
          failure: ":red_circle: Dev pipeline has failed!"

Production Environment

    id: analytics-pipeline-prod
    schedule: "0 6 * * *"  # Early morning schedule for prod
    start_date: "2025-08-28"
    default_connections:
      gcpConnectionId: prod-gcp-conn
    notifications:
      slack:
        - name: data-team
          connection: "slack-data-team"
          success: ":tada: Pipeline has finished successfully!"
          failure: ":red_circle: Pipeline has failed!"
        - name: oncall
          connection: "oncall-slack"
          failure: ":red_circle: Pipeline has failed!"

Best Practices
1. Naming Conventions
- Use descriptive, hierarchical names: analytics-pipeline, user-engagement-daily
- Include an environment suffix: analytics-pipeline-prod, analytics-pipeline-dev
- Use kebab-case for consistency
2. Schedule Optimization
- Schedule pipelines during off-peak hours
- Schedule runs after upstream data is expected to be available
- Match the run interval to how fresh the data needs to be
3. Connection Management
- Use environment-specific connections
- Implement proper access controls
- Monitor connection health
4. Notification Strategy
- Send success notifications to relevant teams
- Send failure notifications to on-call teams
- Include relevant context in messages
5. Documentation
- Provide comprehensive descriptions
- Document data flow and dependencies
- Include troubleshooting information
Troubleshooting
Common Issues
Pipeline Not Starting
- Check cron schedule syntax
- Verify start_date is not in the future
- Ensure pipeline is enabled
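
As a reference point for the first two checks, the fragment below sketches a schedule and start_date in the forms used throughout this page. The quoting is a precaution based on YAML itself rather than a documented requirement of the scheduler: an unquoted value that begins with an asterisk is read as a YAML alias and fails to parse, and some YAML parsers turn an unquoted date into a date object instead of a string.

```yaml
# Five space-separated cron fields:
# minute, hour, day of month, month, day of week.
# Quoted so the leading "*" is read as a string, not a YAML alias.
schedule: "*/15 * * * *"

# YYYY-MM-DD, quoted so it stays a string.
# Must be on or before today's date for the pipeline to start.
start_date: "2025-08-28"
```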
Connection Failures
- Verify connection IDs exist
- Check connection credentials
- Test connections independently
Notification Issues
- Verify webhook URLs
- Check notification channel permissions
- Test notifications manually
Debugging Tips
- Check Pipeline Logs: Review execution logs for errors
- Validate Configuration: Use YAML validators
- Test Connections: Verify all connections work
- Monitor Resources: Check resource usage and limits