Pipeline Configuration

The pipeline.yml file contains all the necessary information to build and configure a data pipeline. This file defines the pipeline’s schedule, connections, notifications, and other settings.

Here’s a comprehensive example pipeline.yml file:

id: analytics-pipeline
schedule: "0 6 * * *"
start_date: "2025-08-28"
default_connections:
  gcpConnectionId: analytics-gcp
notifications:
  slack:
    - name: data-team
      connection: "slack-data-team"
      success: ":tada: Pipeline has finished successfully!"
      failure: ":red_circle: Pipeline has failed!"
description: |
  The data in this pipeline is obtained from an external API.
  Tables in BigQuery can be found under project "analytics-data".
  It runs every day at 06:00 UTC.

id
  • Type: String
  • Description: Unique identifier for the pipeline
  • Example: marketing-analytics

schedule
  • Type: String (Cron format)
  • Description: Schedule on which the pipeline runs
  • Examples:
    • 0 4 * * * - Daily at 4 AM UTC
    • 0 */6 * * * - Every 6 hours
    • 0 0 * * 1 - Weekly on Monday at midnight UTC

start_date
  • Type: String (YYYY-MM-DD format)
  • Description: Start date for the pipeline (useful for backfills)
  • Example: 2022-09-01

description
  • Type: String (Multi-line)
  • Description: Detailed description of the pipeline’s purpose and data flow
  • Example: See the configuration above

default_connections
  • Type: Object
  • Description: Default connections that tasks use when no task-specific connection is specified
  • Supported Connections:
    • gcpConnectionId: Google Cloud Platform connection
    • aws_conn_id: AWS connection
    • snowflake: Snowflake connection
    • postgres: PostgreSQL connection

notifications
  • Type: Object
  • Description: Notification channels for pipeline success/failure
  • Supported Channels:
    • slack: Slack notifications
    • discord: Discord notifications

  • Type: Object
  • Description: Project-specific configuration settings
  • Parameters:
    • name: Project name
    • flags: Feature flags
    • defaults: Default settings for specific services
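
The types listed above can be checked mechanically once the file has been parsed. The following is a minimal sketch, not part of the pipeline tooling: it assumes the YAML has already been loaded into a Python dict, and the names `FIELD_TYPES` and `check_field_types` are hypothetical.

```python
import re
from datetime import date

# Expected Python types for the documented fields, after YAML parsing.
FIELD_TYPES = {
    "id": str,
    "schedule": str,
    "start_date": str,
    "description": str,
    "default_connections": dict,
    "notifications": dict,
}


def check_field_types(config: dict) -> list[str]:
    """Return a list of problems; an empty list means none were found."""
    problems = []
    for field, expected in FIELD_TYPES.items():
        if field in config and not isinstance(config[field], expected):
            actual = type(config[field]).__name__
            problems.append(f"{field}: expected {expected.__name__}, got {actual}")

    # start_date must be a string in YYYY-MM-DD form and a real calendar date.
    start = config.get("start_date")
    if isinstance(start, str):
        if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", start):
            problems.append("start_date: expected YYYY-MM-DD")
        else:
            try:
                date.fromisoformat(start)
            except ValueError:
                problems.append("start_date: not a real calendar date")
    return problems
```

Note that some YAML parsers (PyYAML, for example) load an unquoted date as a date object rather than a string, which this check would flag; quoting the value, as in the example configuration, keeps it a string.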
# Daily at 1 AM UTC
schedule: "0 1 * * *"
# Every 6 hours
schedule: "0 */6 * * *"
# Weekly on Sunday at 2 AM UTC
schedule: "0 2 * * 0"
# Monthly on the 1st at 3 AM UTC
schedule: "0 3 1 * *"
# Business days only (Monday-Friday) at 8 AM UTC
schedule: "0 8 * * 1-5"
# Every 15 minutes
schedule: "*/15 * * * *"
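
Schedules like the ones above can be syntax-checked before deployment. This sketch assumes the standard five-field cron layout (minute, hour, day of month, month, day of week) with numeric values only; it does not handle names such as MON or JAN, and the function names are hypothetical rather than part of any pipeline tool.

```python
# Allowed numeric range for each of the five cron fields, in order:
# minute, hour, day of month, month, day of week (0 and 7 both mean Sunday).
FIELD_RANGES = [(0, 59), (0, 23), (1, 31), (1, 12), (0, 7)]


def _valid_part(part: str, lo: int, hi: int) -> bool:
    """Check one comma-separated part: '*', 'n', 'a-b', optionally with '/step'."""
    base, slash, step = part.partition("/")
    if slash and not (step.isdecimal() and int(step) > 0):
        return False
    if base == "*":
        return True
    start, dash, end = base.partition("-")
    if not start.isdecimal():
        return False
    if dash and not end.isdecimal():
        return False
    first = int(start)
    last = int(end) if dash else first
    return lo <= first <= last <= hi


def is_valid_cron(expr: str) -> bool:
    """Return True if expr is a syntactically valid five-field cron expression."""
    fields = expr.split()
    if len(fields) != 5:
        return False
    return all(
        all(_valid_part(part, lo, hi) for part in field.split(","))
        for field, (lo, hi) in zip(fields, FIELD_RANGES)
    )
```

A check like this only catches malformed expressions; it says nothing about whether the schedule matches the intended run time.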

Schedules are interpreted in UTC by default. To run a pipeline at a given local time, convert that time to UTC first:

# For EST (UTC-5): 4 AM EST = 9 AM UTC
schedule: "0 9 * * *"
# For PST (UTC-8): 4 AM PST = 12 PM UTC
schedule: "0 12 * * *"
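
The arithmetic behind these conversions can be sketched in a few lines. The helper below is hypothetical and assumes a fixed UTC offset, as the EST and PST examples do; it does not account for daylight saving time (EDT, for instance, is UTC-4).

```python
def local_time_to_utc_cron(hour: int, minute: int, utc_offset_hours: int) -> str:
    """Build a daily cron expression in UTC for a local wall-clock time.

    utc_offset_hours is the fixed local offset from UTC, e.g. -5 for EST.
    """
    # UTC = local time minus the offset; wrap around midnight with modulo 24.
    utc_hour = (hour - utc_offset_hours) % 24
    return f"{minute} {utc_hour} * * *"


print(local_time_to_utc_cron(4, 0, -5))  # 0 9 * * *
print(local_time_to_utc_cron(4, 0, -8))  # 0 12 * * *
```

When the conversion crosses midnight (8 PM EST is 1 AM UTC the next day), a daily schedule is unaffected, but any day-of-week or day-of-month field would need to shift by one day as well.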

# Google Cloud Platform
default_connections:
  gcpConnectionId: my-gcp-connection

# AWS
default_connections:
  aws_conn_id: my-aws-connection

# Multiple connections
default_connections:
  gcpConnectionId: analytics-gcp
  aws_conn_id: analytics-aws
  snowflake: analytics-snowflake

# Slack
notifications:
  slack:
    - name: data-team
      connection: "slack-data-team"
      success: ":tada: Pipeline has finished successfully!"
      failure: ":red_circle: Pipeline has failed!"

# Discord
notifications:
  discord:
    - name: alerts
      connection: "discord-alerts"
      success: "Pipeline has finished successfully!"
      failure: "Pipeline has failed!"

# Multiple channels
notifications:
  slack:
    - name: data-team
      connection: "slack-data-team"
      success: ":tada: Pipeline has finished successfully!"
      failure: ":red_circle: Pipeline has failed!"
  discord:
    - name: alerts
      connection: "discord-alerts"
      success: "Pipeline has finished successfully!"
      failure: "Pipeline has failed!"

id: analytics-pipeline-dev
schedule: "0 8 * * *" # Later schedule for dev
start_date: "2025-08-28"
default_connections:
  gcpConnectionId: dev-gcp-conn
notifications:
  slack:
    - name: dev-team
      connection: "dev-slack"
      success: ":tada: Dev pipeline has finished successfully!"
      failure: ":red_circle: Dev pipeline has failed!"

id: analytics-pipeline-prod
schedule: "0 6 * * *" # Early morning schedule for prod
start_date: "2025-08-28"
default_connections:
  gcpConnectionId: prod-gcp-conn
notifications:
  slack:
    - name: data-team
      connection: "slack-data-team"
      success: ":tada: Pipeline has finished successfully!"
      failure: ":red_circle: Pipeline has failed!"
    - name: oncall
      connection: "oncall-slack"
      failure: ":red_circle: Pipeline has failed!"
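
In the production configuration, the oncall entry defines only a failure message. One plausible reading is that a target is notified only for the outcomes it defines. The sketch below illustrates that reading; it is an assumption about the behavior, not a description of how the platform resolves notifications, and `messages_for` is a hypothetical helper.

```python
def messages_for(notifications: dict, outcome: str) -> list[tuple[str, str, str]]:
    """Return (channel, connection, message) for each target that defines `outcome`.

    outcome is "success" or "failure". Targets without a message for that
    outcome are skipped (assumed behavior).
    """
    selected = []
    for channel, targets in notifications.items():
        for target in targets:
            message = target.get(outcome)
            if message:
                selected.append((channel, target["connection"], message))
    return selected


# The notifications block of the production example, as a parsed dict.
prod_notifications = {
    "slack": [
        {
            "name": "data-team",
            "connection": "slack-data-team",
            "success": ":tada: Pipeline has finished successfully!",
            "failure": ":red_circle: Pipeline has failed!",
        },
        {
            "name": "oncall",
            "connection": "oncall-slack",
            "failure": ":red_circle: Pipeline has failed!",
        },
    ]
}
```

Under this reading, a successful run would message only the data-team connection, while a failed run would message both.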

Naming
  • Use descriptive, hierarchical names: analytics-pipeline, user-engagement-daily
  • Include an environment suffix: analytics-pipeline-prod, analytics-pipeline-dev
  • Use kebab-case for consistency

Scheduling
  • Schedule pipelines during off-peak hours
  • Consider data availability windows
  • Choose intervals that match data freshness requirements

Connections
  • Use environment-specific connections
  • Implement proper access controls
  • Monitor connection health

Notifications
  • Send success notifications to relevant teams
  • Send failure notifications to on-call teams
  • Include relevant context in messages

Documentation
  • Provide comprehensive descriptions
  • Document data flow and dependencies
  • Include troubleshooting information
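
The naming recommendations (kebab-case, environment suffix) are easy to enforce in a review script. This is a minimal sketch under stated assumptions: the environment names in `ENVIRONMENTS` are taken from the dev and prod examples and may differ in your setup, and `check_pipeline_id` is a hypothetical helper.

```python
import re

# Lowercase alphanumeric words separated by single hyphens.
KEBAB_CASE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")

# Assumed environment names; adjust to the environments you actually use.
ENVIRONMENTS = ("dev", "prod")


def check_pipeline_id(pipeline_id: str) -> list[str]:
    """Return naming problems for a pipeline id; an empty list means it conforms."""
    problems = []
    if not KEBAB_CASE.fullmatch(pipeline_id):
        problems.append("not kebab-case")
    if not pipeline_id.endswith(tuple(f"-{env}" for env in ENVIRONMENTS)):
        problems.append("missing environment suffix")
    return problems
```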

Pipeline not running
  • Check cron schedule syntax
  • Verify start_date is not in the future
  • Ensure the pipeline is enabled

Connection errors
  • Verify connection IDs exist
  • Check connection credentials
  • Test connections independently

Notifications not sent
  • Verify webhook URLs
  • Check notification channel permissions
  • Test notifications manually

Debugging steps
  1. Check Pipeline Logs: Review execution logs for errors
  2. Validate Configuration: Use YAML validators
  3. Test Connections: Verify all connections work
  4. Monitor Resources: Check resource usage and limits
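
Two of the checks above, a start_date in the future and connection IDs that do not exist, can be automated against a parsed configuration. The sketch below assumes the set of known connection IDs is obtained elsewhere and passed in as `known_connections`; the function name `preflight` and its parameters are hypothetical.

```python
from datetime import date


def preflight(config: dict, known_connections: set[str], today: date) -> list[str]:
    """Return a list of likely causes for a pipeline that does not run or connect."""
    problems = []

    # A start_date in the future means no run is due yet.
    start = config.get("start_date")
    if start and date.fromisoformat(start) > today:
        problems.append(f"start_date {start} is in the future")

    # Collect every connection ID the config refers to.
    referenced = set(config.get("default_connections", {}).values())
    for targets in config.get("notifications", {}).values():
        referenced.update(t["connection"] for t in targets if "connection" in t)

    for conn in sorted(referenced - known_connections):
        problems.append(f"unknown connection: {conn}")
    return problems
```

Passing `today` explicitly rather than calling `date.today()` inside the function keeps the check deterministic and easy to test.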