
Getting Started

Welcome to Datablast! This guide will help you build your first data pipeline in under 10 minutes. By the end of this tutorial, you’ll have a working pipeline that processes data and runs on a schedule.

In this quick start, you’ll create a simple data pipeline that:

  • Processes user data from a raw table
  • Cleans and transforms the data
  • Runs daily on a schedule
  • Sends notifications on success or failure

Before you begin, make sure you have:

  • Git repository access for your project
  • Credentials for your cloud provider or data warehouse (GCP, AWS, or Snowflake)
  • Basic knowledge of SQL and Python
  • Understanding of data pipeline concepts

Don’t worry if you’re new to data pipelines – this guide will walk you through everything step by step!

First, create a new directory for your project and set up the basic structure:

mkdir my-first-pipeline
cd my-first-pipeline

Create the following directory structure:

my-first-pipeline/
├── pipeline.yml              # Main pipeline configuration
└── tasks/                    # Task definitions
    └── staging/              # Raw data processing
        ├── users.task.yaml
        └── users.sql
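
If you would rather script this step, the layout above can be created with a few lines of Python. This is a minimal sketch using only the standard library. The `scaffold` function name is an illustrative choice and not part of Datablast, and the files are created empty so you can fill them in during the next steps.

```python
from pathlib import Path


def scaffold(root: str = ".") -> list[Path]:
    """Create the empty quick-start layout under `root` and return the file paths."""
    base = Path(root)
    files = [
        base / "pipeline.yml",
        base / "tasks" / "staging" / "users.task.yaml",
        base / "tasks" / "staging" / "users.sql",
    ]
    for path in files:
        # Creates tasks/ and tasks/staging/ on demand.
        path.parent.mkdir(parents=True, exist_ok=True)
        # Creates the file if it is missing; existing content is left alone.
        path.touch(exist_ok=True)
    return files
```

Calling `scaffold()` from inside `my-first-pipeline` creates the three files relative to the current directory. Calling it a second time is harmless.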

Create a task configuration file that defines what your task does:

tasks/staging/users.task.yaml
name: staging.users
type: bq.sql
description: Load and clean user data from raw table
run: users.sql
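
Because `run:` points at another file, a quick local check can catch a typo in that path before you deploy. The sketch below is an illustration and not a Datablast tool. It reads only flat `key: value` lines, which is enough for the file above but is not a general YAML parser. It also assumes that the `run` path is resolved relative to the task file's own directory, and its list of required keys is simply the four keys used in this guide.

```python
from pathlib import Path

# Keys used by the task file in this guide; assumed, not an official schema.
REQUIRED_KEYS = ("name", "type", "description", "run")


def read_flat_yaml(path: Path) -> dict[str, str]:
    """Parse flat `key: value` lines; not a general YAML parser."""
    config = {}
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        config[key.strip()] = value.strip().strip("\"'")
    return config


def check_task_file(path: Path) -> list[str]:
    """Return a list of problems found in a task file (empty if none)."""
    config = read_flat_yaml(path)
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if not config.get(key)]
    run_target = config.get("run")
    # Assumption: `run` is relative to the directory containing the task file.
    if run_target and not (path.parent / run_target).is_file():
        problems.append(f"run target not found: {run_target}")
    return problems
```

For example, `check_task_file(Path("tasks/staging/users.task.yaml"))` returns an empty list once `users.sql` exists next to the task file.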

Now create the SQL file that will process your data:

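-- tasks/staging/users.sql
-- @blast.name: staging.users
-- @blast.type: bq.sql
-- @blast.description: Load and clean user data from raw table
SELECT
    user_id,
    email,
    created_at,
    updated_at,
    -- Label each row with a simple email quality status
    CASE
        WHEN email IS NULL THEN 'missing_email'
        WHEN email NOT LIKE '%@%' THEN 'invalid_email'
        ELSE 'valid'
    END AS email_status
FROM raw.users
-- Note: this filter keeps only rows that pass both checks, so email_status
-- is 'valid' for every row returned. Remove the filter if you want to keep
-- and label the bad rows instead.
WHERE email IS NOT NULL
    AND email LIKE '%@%'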
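
To see how the `CASE` expression and the `WHERE` clause interact, it can help to mirror them in plain Python. The following is a rough sketch for reasoning about the logic, not something Datablast runs. The function names and sample addresses are made up for illustration.

```python
from typing import Optional


def email_status(email: Optional[str]) -> str:
    """Mirror the CASE expression: NULL -> missing, no '@' -> invalid, else valid."""
    if email is None:
        return "missing_email"
    if "@" not in email:  # same test as NOT LIKE '%@%'
        return "invalid_email"
    return "valid"


def keep_row(email: Optional[str]) -> bool:
    """Mirror the WHERE clause: email IS NOT NULL AND email LIKE '%@%'."""
    return email is not None and "@" in email


# Made-up sample values covering each branch.
sample = [None, "", "no-at-sign", "ada@example.com"]
for value in sample:
    print(repr(value), email_status(value), keep_row(value))
```

Every value that `keep_row` accepts has the status `valid`, which is why the query as written never returns a `missing_email` or `invalid_email` row.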

Create the main pipeline configuration file:

pipeline.yml
id: my-first-pipeline
schedule: "0 4 * * *" # Run daily at 4 AM UTC
start_date: "2024-01-01"
default_connections:
  gcpConnectionId: my-gcp-connection
notifications:
  slack:
    - name: data-team
      connection: slack-data-team
      success: "✅ Pipeline completed successfully!"
      failure: "❌ Pipeline failed!"
description: |
  My first Datablast pipeline that processes user data.
  This pipeline cleans and validates user email addresses.
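
The `schedule` value is a standard five-field cron expression: minute `0`, hour `4`, then every day of the month, every month, and every day of the week. As a sanity check, the sketch below computes when such a daily 04:00 UTC schedule would next fire. It covers only this daily case and is not a general cron parser. It also does not model how Datablast itself handles `start_date` or missed runs, and the function name is illustrative.

```python
from datetime import datetime, time, timedelta, timezone


def next_daily_run(now: datetime, hour: int = 4, minute: int = 0) -> datetime:
    """Return the first run time strictly after `now` for a daily UTC schedule.

    `now` must be timezone-aware.
    """
    now = now.astimezone(timezone.utc)
    candidate = datetime.combine(now.date(), time(hour, minute), tzinfo=timezone.utc)
    if candidate <= now:
        # Today's slot has already passed (or is exactly now), so use tomorrow's.
        candidate += timedelta(days=1)
    return candidate


print(next_daily_run(datetime(2024, 1, 1, 3, 30, tzinfo=timezone.utc)))
# 2024-01-01 04:00:00+00:00
print(next_daily_run(datetime(2024, 1, 1, 4, 0, tzinfo=timezone.utc)))
# 2024-01-02 04:00:00+00:00
```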

Now it’s time to deploy your pipeline:

# Add all files to git
git add .
# Commit your changes
git commit -m "Add first Datablast pipeline"
# Push to your repository
git push

Once you push your code, Datablast will:

  • Detect the changes in your repository
  • Deploy your pipeline automatically
  • Schedule it to run daily at 4 AM UTC
  • Send notifications to your Slack channel

You can monitor your pipeline’s progress in the Datablast UI, where you’ll see:

  • Pipeline execution status
  • Task logs and outputs
  • Data lineage visualization
  • Performance metrics

Let’s break down what you just created:

  • pipeline.yml – Defines when and how your pipeline runs
  • tasks/ – Contains your data processing logic
  • Task files – Each task has a .task.yaml config and a .sql or .py file
  • Tasks – Individual units of work (like functions in code)
  • Dependencies – Tasks can depend on other tasks
  • Schedules – Pipelines run automatically on a schedule
  • Connections – Link to your databases and services
  • Notifications – Get alerts when things succeed or fail
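
To make the idea of task dependencies concrete, the sketch below orders a small set of tasks so that each one runs only after the tasks it depends on. It illustrates the general concept with Python's standard `graphlib` module (Python 3.9+). It does not show how dependencies are declared in Datablast, and apart from `staging.users` the task names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks: each key maps to the set of tasks it depends on.
dependencies = {
    "staging.users": set(),
    "staging.orders": set(),
    "marts.user_orders": {"staging.users", "staging.orders"},
}

# static_order() yields every task after all of its dependencies.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

The two staging tasks have no dependencies on each other, so either may come first. `marts.user_orders` always comes last because it depends on both.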

The platform follows specific naming conventions for different asset types:

  • YAML Task Files: Must end with .task.yaml (e.g., users.task.yaml)
  • SQL Assets: Use .sql extension (e.g., users.sql, orders.sql)
  • Python Assets: Use .py extension (e.g., data_processor.py, ml_model.py)
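
These conventions are easy to check mechanically. The sketch below classifies a file by its name according to the three rules above. The function name and the returned labels are illustrative choices and not part of Datablast.

```python
from pathlib import Path


def asset_kind(filename: str) -> str:
    """Classify a file by the naming conventions listed above."""
    name = Path(filename).name
    if name.endswith(".task.yaml"):
        return "yaml_task"
    if name.endswith(".sql"):
        return "sql_asset"
    if name.endswith(".py"):
        return "python_asset"
    return "unknown"


for example in ["tasks/staging/users.task.yaml", "users.sql", "ml_model.py", "users.task.yml"]:
    print(example, "->", asset_kind(example))
```

Note that `users.task.yml` is reported as `unknown` because task files must end with `.task.yaml`.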

Congratulations! You’ve built your first Datablast pipeline. Here’s what you can explore next:

  • Search Documentation – Use ⌘/Ctrl + K to search across all docs
  • Browse Guides – Explore our comprehensive guides section
  • Check Examples – See practical implementations and patterns
  • Contact Support – Get help from our team

Ready to build more? Explore our Guides section for comprehensive tutorials and best practices!