Pipeline Configuration

The pipeline.yml file defines how a data pipeline is built and configured: its schedule, connections, notifications, and other settings.

Basic Pipeline Configuration

Here’s an example pipeline.yml file that covers the most common settings:

id: analytics-pipeline
schedule: "0 6 * * *"
start_date: "2025-08-28"
default_connections:
  gcpConnectionId: analytics-gcp
notifications:
    slack:
        - name: data-team
          connection: "slack-data-team"
          success: ":tada: Pipeline has finished successfully!"
          failure: ":red_circle: Pipeline has failed!"
description: |
  The data in this pipeline is obtained from an external API.
  Tables in BigQuery can be found under project "analytics-data".
  It runs every day at 06:00 UTC.

Configuration Parameters

Required Parameters

`id`

Type: String
Description: Unique identifier for the pipeline
Example: marketing-analytics

`schedule`

Type: String (Cron format)
Description: Cron expression that determines when the pipeline runs (UTC by default)
Examples:
- 0 4 * * * - Daily at 4 AM UTC
- 0 */6 * * * - Every 6 hours
- 0 0 * * 1 - Weekly on Monday at midnight UTC

`start_date`

Type: String (YYYY-MM-DD format)
Description: Start date for the pipeline (useful for backfills)
Example: 2022-09-01

Optional Parameters

`description`

Type: String (Multi-line)
Description: Detailed description of the pipeline’s purpose and data flow
Example: See the description field in the example configuration above

`default_connections`

Type: Object
Description: Default connections to use for tasks if no task-specific connection is specified
Supported Connections:
- gcpConnectionId: Google Cloud Platform connection
- aws_conn_id: AWS connection
- snowflake: Snowflake connection
- postgres: PostgreSQL connection

`notifications`

Type: Object
Description: Notification channels for pipeline success/failure
Supported Channels:
- slack: Slack notifications
- discord: Discord notifications

`project_config`

Type: Object
Description: Project-specific configuration settings
Parameters:
- name: Project name
- flags: Feature flags
- defaults: Default settings for specific services

Schedule Examples

Common Schedule Patterns

# Daily at 1 AM UTC
schedule: "0 1 * * *"
# Every 6 hours
schedule: "0 */6 * * *"
# Weekly on Sunday at 2 AM UTC
schedule: "0 2 * * 0"
# Monthly on the 1st at 3 AM UTC
schedule: "0 3 1 * *"
# Business days only (Monday-Friday) at 8 AM UTC
schedule: "0 8 * * 1-5"
# Every 15 minutes
schedule: "*/15 * * * *"
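
The patterns above all use the standard five-field cron layout (minute, hour, day of month, month, day of week). As a rough pre-deployment check, a sketch like the following can catch malformed expressions. It is not the scheduler's own parser: the helper name `cron_errors` is hypothetical, and it only understands `*`, numbers, ranges, comma lists, and `/step` suffixes, so named values such as MON or JAN are reported as errors.

```python
import re

# Allowed numeric range for each of the five standard cron fields.
FIELD_RANGES = [
    ("minute", 0, 59),
    ("hour", 0, 23),
    ("day of month", 1, 31),
    ("month", 1, 12),
    ("day of week", 0, 7),  # both 0 and 7 are commonly accepted as Sunday
]

# One comma-separated item: "*", "N", or "N-M", optionally followed by "/step".
ITEM = re.compile(r"(\*|[0-9]+(?:-[0-9]+)?)(?:/([0-9]+))?")


def cron_errors(expression):
    """Return a list of problems found in a five-field cron expression."""
    fields = expression.split()
    if len(fields) != len(FIELD_RANGES):
        return [f"expected 5 fields, found {len(fields)}"]
    errors = []
    for value, (name, low, high) in zip(fields, FIELD_RANGES):
        for item in value.split(","):
            match = ITEM.fullmatch(item)
            if match is None:
                errors.append(f"{name}: cannot parse {item!r}")
                continue
            base, step = match.groups()
            if step is not None and int(step) == 0:
                errors.append(f"{name}: step must be positive in {item!r}")
            if base == "*":
                continue
            bounds = [int(part) for part in base.split("-")]
            if any(b < low or b > high for b in bounds):
                errors.append(f"{name}: {item!r} is outside {low}-{high}")
            elif len(bounds) == 2 and bounds[0] > bounds[1]:
                errors.append(f"{name}: range {item!r} is reversed")
    return errors
```

An empty list means the expression passed these basic checks; it does not guarantee that the scheduler interprets the expression the way you intend.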

Timezone Considerations

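By default, schedules are evaluated in UTC. To run a pipeline at a given local time, convert that time to UTC. The offsets below are for standard time; while daylight saving time is in effect (EDT is UTC-4, PDT is UTC-7), the same expressions fire one hour later by the local clock: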

# For EST (UTC-5): 4 AM EST = 9 AM UTC
schedule: "0 9 * * *"
# For PST (UTC-8): 4 AM PST = 12 PM UTC
schedule: "0 12 * * *"
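
The arithmetic behind these conversions can be sketched in a few lines. The helper name `daily_cron_for_local_time` is hypothetical, and the sketch assumes a daily schedule and a fixed whole-hour UTC offset. It does not follow daylight saving time, and schedules that name a day of week or day of month would also need the day shifted when the conversion crosses midnight.

```python
def daily_cron_for_local_time(hour, minute, utc_offset_hours):
    """Build a daily UTC cron expression for a local time at a fixed UTC offset.

    utc_offset_hours is the local offset from UTC, for example -5 for EST.
    """
    # Local time = UTC + offset, so UTC = local - offset, wrapped to 0-23.
    utc_hour = (hour - utc_offset_hours) % 24
    return f"{minute} {utc_hour} * * *"


# 4 AM EST (UTC-5) and 4 AM PST (UTC-8), matching the examples above.
print(daily_cron_for_local_time(4, 0, -5))
print(daily_cron_for_local_time(4, 0, -8))
```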

Connection Configuration

Google Cloud Platform

default_connections:
  gcpConnectionId: my-gcp-connection

AWS Integration

default_connections:
  aws_conn_id: my-aws-connection

Multiple Connections

default_connections:
  gcpConnectionId: analytics-gcp
  aws_conn_id: analytics-aws
  snowflake: analytics-snowflake

Notification Configuration

Slack Notifications

notifications:
    slack:
        - name: data-team
          connection: "slack-data-team"
          success: ":tada: Pipeline has finished successfully!"
          failure: ":red_circle: Pipeline has failed!"

Discord Notifications

notifications:
    discord:
        - name: alerts
          connection: "discord-alerts"
          success: "Pipeline has finished successfully!"
          failure: "Pipeline has failed!"

Multiple Notification Channels

notifications:
    slack:
        - name: data-team
          connection: "slack-data-team"
          success: ":tada: Pipeline has finished successfully!"
          failure: ":red_circle: Pipeline has failed!"
    discord:
        - name: alerts
          connection: "discord-alerts"
          success: "Pipeline has finished successfully!"
          failure: "Pipeline has failed!"
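
A structural check of the notifications block can be sketched as follows. The rules are inferred from the examples in this document and are not a published schema: each entry is assumed to need a name and a connection, plus at least one of success or failure. The helper name `notification_errors` is hypothetical, and the input is assumed to be the notifications mapping already loaded into a dictionary.

```python
def notification_errors(notifications):
    """Return a list of problems found in a notifications mapping."""
    errors = []
    for channel, entries in notifications.items():
        # A channel with no entries (empty value in YAML) is treated as an empty list.
        for index, entry in enumerate(entries or []):
            label = f"{channel}[{index}]"
            for key in ("name", "connection"):
                if not entry.get(key):
                    errors.append(f"{label}: missing '{key}'")
            # Assumption: an entry is only useful with at least one message.
            if not (entry.get("success") or entry.get("failure")):
                errors.append(f"{label}: needs a 'success' or 'failure' message")
    return errors


notifications = {
    "slack": [
        {
            "name": "data-team",
            "connection": "slack-data-team",
            "success": ":tada: Pipeline has finished successfully!",
            "failure": ":red_circle: Pipeline has failed!",
        }
    ],
    "discord": [
        {
            "name": "alerts",
            "connection": "discord-alerts",
            "success": "Pipeline has finished successfully!",
            "failure": "Pipeline has failed!",
        }
    ],
}
print(notification_errors(notifications))
```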

Environment-Specific Configuration

Development Environment

id: analytics-pipeline-dev
schedule: "0 8 * * *"  # Later schedule for dev
start_date: "2025-08-28"
default_connections:
  gcpConnectionId: dev-gcp-conn
notifications:
    slack:
        - name: dev-team
          connection: "dev-slack"
          success: ":tada: Dev pipeline has finished successfully!"
          failure: ":red_circle: Dev pipeline has failed!"

Production Environment

id: analytics-pipeline-prod
schedule: "0 6 * * *"  # Early morning schedule for prod
start_date: "2025-08-28"
default_connections:
  gcpConnectionId: prod-gcp-conn
notifications:
    slack:
        - name: data-team
          connection: "slack-data-team"
          success: ":tada: Pipeline has finished successfully!"
          failure: ":red_circle: Pipeline has failed!"
        - name: oncall
          connection: "oncall-slack"
          failure: ":red_circle: Pipeline has failed!"

Best Practices

1. Naming Conventions

- Use descriptive, hierarchical names: analytics-pipeline, user-engagement-daily
- Include environment suffix: analytics-pipeline-prod, analytics-pipeline-dev
- Use kebab-case for consistency
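
The naming rules above can be enforced with a small check, for example in a CI step. This is a sketch only: the helper name `pipeline_id_errors` is hypothetical, and the set of environment suffixes (dev and prod) is assumed from the examples in this document.

```python
import re

# Lowercase letters and digits in groups separated by single hyphens.
KEBAB_CASE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")

# Assumed environment names, taken from the examples in this document.
ENVIRONMENTS = ("dev", "prod")


def pipeline_id_errors(pipeline_id, environments=ENVIRONMENTS):
    """Return a list of naming-convention problems for a pipeline id."""
    errors = []
    if KEBAB_CASE.fullmatch(pipeline_id) is None:
        errors.append("id is not kebab-case")
    if not any(pipeline_id.endswith(f"-{env}") for env in environments):
        errors.append("id has no environment suffix")
    return errors


print(pipeline_id_errors("analytics-pipeline-prod"))
print(pipeline_id_errors("Analytics_Pipeline"))
```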

2. Schedule Optimization

- Schedule pipelines during off-peak hours
- Consider data availability windows
- Use appropriate intervals for data freshness requirements

3. Connection Management

- Use environment-specific connections
- Implement proper access controls
- Monitor connection health

4. Notification Strategy

- Send success notifications to relevant teams
- Send failure notifications to on-call teams
- Include relevant context in messages

5. Documentation

- Provide comprehensive descriptions
- Document data flow and dependencies
- Include troubleshooting information

Troubleshooting

Common Issues

Pipeline Not Starting

- Check cron schedule syntax
- Verify start_date is not in the future
- Ensure the pipeline is enabled
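
The start_date check can be automated. The sketch below assumes the value arrives as a string, as in the quoted examples in this document (some YAML parsers convert unquoted dates into date objects instead). The helper name `start_date_errors` is hypothetical.

```python
import re
from datetime import date, datetime


def start_date_errors(value, today=None):
    """Return a list of problems with a start_date string."""
    today = today or date.today()
    # Require zero-padded YYYY-MM-DD; strptime alone also accepts "2025-8-28".
    if re.fullmatch(r"[0-9]{4}-[0-9]{2}-[0-9]{2}", value) is None:
        return ["start_date must use the YYYY-MM-DD format"]
    try:
        parsed = datetime.strptime(value, "%Y-%m-%d").date()
    except ValueError:
        return ["start_date is not a real calendar date"]
    if parsed > today:
        return [f"start_date {value} is in the future"]
    return []


# Passing today explicitly keeps the example reproducible.
print(start_date_errors("2025-08-28", today=date(2025, 9, 1)))
print(start_date_errors("2025-08-28", today=date(2025, 8, 1)))
```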

Connection Failures

- Verify connection IDs exist
- Check connection credentials
- Test connections independently

Notification Issues

- Verify webhook URLs
- Check notification channel permissions
- Test notifications manually

Debugging Tips

- Check Pipeline Logs: Review execution logs for errors
- Validate Configuration: Use YAML validators
- Test Connections: Verify all connections work
- Monitor Resources: Check resource usage and limits
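
A YAML validator only confirms that the file parses; it does not confirm that the required parameters are present. A minimal follow-up check might look like this, assuming the file has already been loaded into a dictionary by a YAML parser of your choice. The helper name `missing_required_keys` is hypothetical.

```python
# The three parameters this document lists as required.
REQUIRED_KEYS = ("id", "schedule", "start_date")


def missing_required_keys(config):
    """Return the required pipeline.yml keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not config.get(key)]


config = {"id": "analytics-pipeline", "schedule": "0 6 * * *"}
print(missing_required_keys(config))
```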