Getting Started

Welcome to Datablast! This guide will help you build your first data pipeline in under 10 minutes. By the end of this tutorial, you’ll have a working pipeline that processes data and runs on a schedule.

What You’ll Build

In this quick start, you’ll create a simple data pipeline that:

Processes user data from a raw table
Cleans and transforms the data
Runs daily on a schedule
Sends notifications on success or failure

Prerequisites

Before you begin, make sure you have:

Git repository access for your project
Credentials for your data platform (GCP, AWS, or Snowflake)
Basic knowledge of SQL and Python
Familiarity with data pipeline concepts (helpful, but not required)

Don’t worry if you’re new to data pipelines – this guide will walk you through everything step by step!

Quick Start

1. Set up your project structure

First, create a new directory for your project and move into it:

mkdir my-first-pipeline
cd my-first-pipeline

Inside that directory, create the following files and folders:

my-first-pipeline/
├── pipeline.yml          # Main pipeline configuration
└── tasks/                # Task definitions
    └── staging/          # Raw data processing
        ├── users.task.yaml
        └── users.sql
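
If you would rather script this step, the layout above can be scaffolded with a few lines of Python. This is a minimal sketch using only the standard library; `scaffold` is a hypothetical helper name, not part of Datablast, and it only creates empty placeholder files that you fill in during the following steps.

```python
from pathlib import Path

# Files from the directory tree above, relative to the project root.
PROJECT_FILES = [
    "pipeline.yml",
    "tasks/staging/users.task.yaml",
    "tasks/staging/users.sql",
]


def scaffold(root: Path) -> list[Path]:
    """Create the quick-start layout under `root` and return the created paths."""
    created = []
    for relative in PROJECT_FILES:
        path = root / relative
        path.parent.mkdir(parents=True, exist_ok=True)  # creates tasks/staging/ as needed
        path.touch(exist_ok=True)  # leaves any existing file contents untouched
        created.append(path)
    return created
```

Calling `scaffold(Path("."))` from inside my-first-pipeline/ creates the three files in place.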

2. Create your first task

Create the task configuration file, tasks/staging/users.task.yaml, which defines what your task does:

name: staging.users
type: bq.sql
description: Load and clean user data from raw table
run: users.sql
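
A typo in the `run` value is an easy mistake to make, so you may want a quick local check before committing. The following is a minimal sketch, assuming the config stays as flat `key: value` lines like the example above and that `run` is resolved relative to the task file's directory (as the project layout suggests). `read_flat_config` and `missing_run_file` are hypothetical helpers, not Datablast APIs; configs with nested YAML would need a real YAML parser.

```python
from pathlib import Path


def read_flat_config(text: str) -> dict[str, str]:
    """Parse flat `key: value` lines, skipping blank lines and # comments."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, value = line.split(":", 1)
        config[key.strip()] = value.strip()
    return config


def missing_run_file(task_file: Path) -> bool:
    """Return True if the task's `run` target is not found next to the task file."""
    config = read_flat_config(task_file.read_text())
    # Assumption: `run` is resolved relative to the task file's directory.
    return not (task_file.parent / config["run"]).is_file()
```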

3. Write your SQL transformation

Now create the SQL file that will process your data:

-- tasks/staging/users.sql
-- @blast.name: staging.users
-- @blast.type: bq.sql
-- @blast.description: Load and clean user data from raw table

SELECT
    user_id,
    email,
    created_at,
    updated_at,
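    -- Flag email quality so downstream tasks can filter on it
    CASE
        WHEN email IS NULL THEN 'missing_email'
        WHEN email NOT LIKE '%@%' THEN 'invalid_email'
        ELSE 'valid'
    END AS email_status
FROM raw.users
-- No WHERE filter on email here: excluding NULL or malformed emails
-- would leave email_status = 'valid' on every remaining row.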
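
To sanity-check the classification logic before running it against real data, you can mirror the `CASE` expression in plain Python. This is an illustrative sketch only: `email_status` here is a hypothetical local helper, and it reproduces the SQL `LIKE '%@%'` test as a simple substring check.

```python
from typing import Optional


def email_status(email: Optional[str]) -> str:
    """Mirror the CASE expression from users.sql for a single value."""
    if email is None:        # WHEN email IS NULL
        return "missing_email"
    if "@" not in email:     # WHEN email NOT LIKE '%@%'
        return "invalid_email"
    return "valid"           # ELSE


samples = [None, "not-an-email", "ada@example.com"]
print([email_status(s) for s in samples])
# prints ['missing_email', 'invalid_email', 'valid']
```

Note that this check is deliberately loose: any string containing `@` counts as valid, exactly as in the SQL.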

4. Configure your pipeline

Create the main pipeline configuration file, pipeline.yml, in the project root:

id: my-first-pipeline
schedule: "0 4 * * *"      # Run daily at 4 AM UTC
start_date: "2024-01-01"

default_connections:
    gcpConnectionId: my-gcp-connection

notifications:
    slack:
        - name: data-team
          connection: slack-data-team
          success: "✅ Pipeline completed successfully!"
          failure: "❌ Pipeline failed!"

description: |
  My first Datablast pipeline that processes user data.
  This pipeline cleans and validates user email addresses.
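
The `schedule` value is a standard five-field cron expression: minute `0`, hour `4`, every day of the month, every month, every day of the week. As a rough illustration of what that means in practice, the sketch below computes the next 04:00 UTC after a given moment. It covers only this fixed daily case and says nothing about how Datablast's scheduler is implemented; `next_daily_run` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone


def next_daily_run(now: datetime, hour: int = 4, minute: int = 0) -> datetime:
    """Return the first daily run time strictly after `now`, in UTC."""
    now = now.astimezone(timezone.utc)
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(days=1)  # today's slot has passed; use tomorrow's
    return candidate


print(next_daily_run(datetime(2024, 1, 1, 3, 30, tzinfo=timezone.utc)))
# prints 2024-01-01 04:00:00+00:00
print(next_daily_run(datetime(2024, 1, 1, 4, 0, tzinfo=timezone.utc)))
# prints 2024-01-02 04:00:00+00:00
```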

5. Deploy and run

Now it’s time to deploy your pipeline:

# Add all files to git
git add .

# Commit your changes
git commit -m "Add first Datablast pipeline"

# Push to your repository
git push

6. Monitor your pipeline

Once you push your code, Datablast will:

Detect the changes in your repository
Deploy your pipeline automatically
Schedule it to run daily at 4 AM UTC
Send a Slack notification each time a run succeeds or fails

You can monitor your pipeline’s progress in the Datablast UI, where you’ll see:

Pipeline execution status
Task logs and outputs
Data lineage visualization
Performance metrics

Understanding Your Pipeline

Let’s break down what you just created:

Pipeline Structure

pipeline.yml – Defines when and how your pipeline runs
tasks/ – Contains your data processing logic
Task files – Each task has a .task.yaml config and a .sql or .py file

Key Concepts

Tasks – Individual units of work (like functions in code)
Dependencies – Tasks can depend on other tasks
Schedules – Pipelines run automatically on a schedule
Connections – Link to your databases and services
Notifications – Get alerts when things succeed or fail
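
The Dependencies concept is easiest to see with a small example. The sketch below uses Python's standard `graphlib` module to derive a valid run order from a dependency map. Every task name other than `staging.users` is hypothetical, and this illustrates the general idea of dependency ordering rather than Datablast's actual scheduler.

```python
from graphlib import TopologicalSorter

# Map each task to the set of tasks it depends on.
# Only staging.users comes from this guide; the others are made up for illustration.
dependencies = {
    "staging.users": set(),
    "marts.active_users": {"staging.users"},
    "reports.daily_summary": {"marts.active_users"},
}

# static_order() yields each task only after all of its dependencies.
run_order = list(TopologicalSorter(dependencies).static_order())
print(run_order)
# prints ['staging.users', 'marts.active_users', 'reports.daily_summary']
```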

File Naming Conventions

Datablast uses the following naming conventions for each asset type:

YAML Task Files: Must end with .task.yaml (e.g., users.task.yaml)
SQL Assets: Use the .sql extension (e.g., users.sql, orders.sql)
Python Assets: Use the .py extension (e.g., data_processor.py, ml_model.py)
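
If you want to catch misnamed files before pushing, the conventions above translate into a simple suffix check. This is a minimal sketch; `asset_kind` and the labels it returns are hypothetical, chosen here for illustration, and are not Datablast identifiers.

```python
from pathlib import Path


def asset_kind(filename: str) -> str:
    """Classify a file by the naming conventions listed above."""
    name = Path(filename).name
    if name.endswith(".task.yaml"):
        return "yaml_task"
    if name.endswith(".sql"):
        return "sql_asset"
    if name.endswith(".py"):
        return "python_asset"
    return "unknown"  # e.g. users.yaml, which lacks the .task.yaml suffix


for f in ["tasks/staging/users.task.yaml", "users.sql", "data_processor.py", "users.yaml"]:
    print(f, "->", asset_kind(f))
```

A file such as users.yaml falls through to `unknown` because it lacks the required .task.yaml suffix.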

What’s Next?

Congratulations! You’ve built your first Datablast pipeline. Here’s what you can explore next:

Learn More About Datablast

Platform Overview – Discover all Datablast capabilities

Configure Your Projects

Project Structure – Learn best practices for organizing code
Pipeline Configuration – Advanced pipeline settings
Task Configuration – Detailed task configuration options

Master Development Techniques

SQL Development – Master SQL transformations and optimization
Python Development – Build complex data processing workflows
Shared Utilities – Create reusable code patterns
Multiple Repositories – Manage multiple pipelines effectively

Advanced Topics

Sensor Tasks – Wait for external conditions
Jinja Templates – Use dynamic date and time references
BigQuery Development – Leverage BigQuery features
Snowflake Development – Optimize Snowflake workflows
Athena Development – AWS Athena best practices

Need Help?

Search Documentation – Use ⌘/Ctrl + K to search across all docs
Browse Guides – Explore our comprehensive guides section
Check Examples – See practical implementations and patterns
Contact Support – Get help from our team

Ready to build more? Explore our Guides section for comprehensive tutorials and best practices!