Getting Started
Welcome to Datablast! This guide will help you build your first data pipeline in under 10 minutes. By the end of this tutorial, you’ll have a working pipeline that processes data and runs on a schedule.
What You’ll Build
In this quick start, you’ll create a simple data pipeline that:
- Processes user data from a raw table
- Cleans and transforms the data
- Runs daily on a schedule
- Sends notifications on success or failure
Prerequisites
Before you begin, make sure you have:
- Git repository access for your project
- Credentials for your cloud provider or data warehouse (GCP, AWS, or Snowflake)
- Basic knowledge of SQL and Python
- Understanding of data pipeline concepts
Don’t worry if you’re new to data pipelines – this guide will walk you through everything step by step!
Quick Start
1. Set up your project structure
First, create a new directory for your project and set up the basic structure:

    mkdir my-first-pipeline
    cd my-first-pipeline

Create the following directory structure:

    my-first-pipeline/
    ├── pipeline.yml            # Main pipeline configuration
    └── tasks/                  # Task definitions
        └── staging/            # Raw data processing
            ├── users.task.yaml
            └── users.sql

2. Create your first task
Create a task configuration file, tasks/staging/users.task.yaml, that defines what your task does:

    name: staging.users
    type: bq.sql
    description: Load and clean user data from raw table
    run: users.sql

3. Write your SQL transformation
Now create the SQL file that will process your data:

    -- tasks/staging/users.sql
    -- @blast.name: staging.users
    -- @blast.type: bq.sql
    -- @blast.description: Load and clean user data from raw table

    SELECT
        user_id,
        email,
        created_at,
        updated_at,
        -- Add some data quality checks
        CASE
            WHEN email IS NULL THEN 'missing_email'
            WHEN email NOT LIKE '%@%' THEN 'invalid_email'
            ELSE 'valid'
        END AS email_status
    FROM raw.users
    -- Keep only rows with a usable email. With this filter in place,
    -- email_status is 'valid' for every row that is returned.
    WHERE email IS NOT NULL
      AND email LIKE '%@%'

4. Configure your pipeline
Create the main pipeline configuration file, pipeline.yml:

    id: my-first-pipeline
    schedule: "0 4 * * *"  # Run daily at 4 AM UTC
    start_date: "2024-01-01"

    default_connections:
      gcpConnectionId: my-gcp-connection

    notifications:
      slack:
        - name: data-team
          connection: slack-data-team
          success: "✅ Pipeline completed successfully!"
          failure: "❌ Pipeline failed!"

    description: |
      My first Datablast pipeline that processes user data.
      This pipeline cleans and validates user email addresses.

5. Deploy and run
Now it’s time to deploy your pipeline:

    # Add all files to git
    git add .

    # Commit your changes
    git commit -m "Add first Datablast pipeline"

    # Push to your repository
    git push

6. Monitor your pipeline
Once you push your code, Datablast will:
- Detect the changes in your repository
- Deploy your pipeline automatically
- Schedule it to run daily at 4 AM UTC
- Send notifications to your Slack channel
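To make the schedule concrete: the cron expression 0 4 * * * in pipeline.yml fires once a day at 04:00 UTC. The sketch below works out the next such run time using only the Python standard library. It illustrates what the expression means and is not Datablast's scheduler; the function name next_daily_run is hypothetical.

```python
from datetime import datetime, time, timedelta, timezone


def next_daily_run(now: datetime, hour: int = 4, minute: int = 0) -> datetime:
    """Return the first daily run time strictly after `now`, in UTC.

    `now` is expected to be timezone-aware. The defaults mirror the
    "0 4 * * *" schedule used in this guide (hypothetical helper).
    """
    now_utc = now.astimezone(timezone.utc)
    # Today's run slot at hour:minute UTC.
    candidate = datetime.combine(
        now_utc.date(), time(hour, minute), tzinfo=timezone.utc
    )
    # If today's slot has already passed (or is right now), use tomorrow's.
    if candidate <= now_utc:
        candidate += timedelta(days=1)
    return candidate


if __name__ == "__main__":
    print(next_daily_run(datetime.now(timezone.utc)).isoformat())
```

A push at 03:00 UTC would therefore be followed by a run one hour later, while a push at 04:00 UTC or later waits until the next day's slot, assuming the deployment itself completes before then.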
You can monitor your pipeline’s progress in the Datablast UI, where you’ll see:
- Pipeline execution status
- Task logs and outputs
- Data lineage visualization
- Performance metrics
Understanding Your Pipeline
Let’s break down what you just created:
Pipeline Structure
- pipeline.yml – Defines when and how your pipeline runs
- tasks/ – Contains your data processing logic
- Task files – Each task has a .task.yaml config and a .sql or .py file
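Before pushing, it can help to confirm that a project follows this layout. The following is a minimal local check written for this guide, not a Datablast command. It assumes each .task.yaml config sits next to a .sql or .py file with the same base name, as users.task.yaml and users.sql do above; in a real project the link may instead come from the run field in the config. The function name check_layout is hypothetical.

```python
from pathlib import Path


def check_layout(root: Path) -> list[str]:
    """Return a list of layout problems found under `root` (empty if none).

    Assumption: a task config `<name>.task.yaml` is paired with
    `<name>.sql` or `<name>.py` in the same directory.
    """
    problems = []
    if not (root / "pipeline.yml").is_file():
        problems.append("missing pipeline.yml")

    tasks_dir = root / "tasks"
    if not tasks_dir.is_dir():
        problems.append("missing tasks/ directory")
        return problems

    # Walk every task config and look for its sibling logic file.
    for config in sorted(tasks_dir.rglob("*.task.yaml")):
        stem = config.name[: -len(".task.yaml")]
        has_logic = any(
            (config.parent / f"{stem}{ext}").is_file() for ext in (".sql", ".py")
        )
        if not has_logic:
            relative = config.relative_to(root).as_posix()
            problems.append(f"{relative} has no matching .sql or .py file")
    return problems


if __name__ == "__main__":
    for problem in check_layout(Path(".")):
        print(problem)
```

Run from the project root, the script prints one line per problem and prints nothing when the layout matches the structure described above.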
Key Concepts
- Tasks – Individual units of work (like functions in code)
- Dependencies – Tasks can depend on other tasks
- Schedules – Pipelines run automatically on a schedule
- Connections – Link to your databases and services
- Notifications – Get alerts when things succeed or fail
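The idea behind dependencies is that a task runs only after everything it depends on has finished. The sketch below shows that ordering with Python's standard graphlib module. It is a conceptual illustration, not how Datablast resolves dependencies; only staging.users appears in this guide, and the other task names and the run_order helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Maps each task to the set of tasks it depends on.
# Only staging.users comes from this guide; the rest are made-up examples.
dependencies = {
    "staging.users": set(),
    "staging.orders": set(),
    "marts.user_orders": {"staging.users", "staging.orders"},
    "reports.daily_summary": {"marts.user_orders"},
}


def run_order(deps: dict[str, set[str]]) -> list[str]:
    """Return one valid execution order: every task after its dependencies.

    Raises graphlib.CycleError if the tasks depend on each other in a loop.
    """
    return list(TopologicalSorter(deps).static_order())


if __name__ == "__main__":
    for task in run_order(dependencies):
        print(task)
```

Both staging tasks always come before marts.user_orders, which in turn comes before reports.daily_summary. The two staging tasks have no dependency on each other, so either may come first.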
File Naming Conventions
The platform follows specific naming conventions for different asset types:
- YAML Task Files: Must end with .task.yaml (e.g., users.task.yaml)
- SQL Assets: Use .sql extension (e.g., users.sql, orders.sql)
- Python Assets: Use .py extension (e.g., data_processor.py, ml_model.py)
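These conventions are simple suffix rules, so they are easy to check in a script or pre-commit hook. The sketch below classifies a file name by the three suffixes listed above. The function name asset_kind and the labels it returns are hypothetical and are not part of Datablast.

```python
def asset_kind(filename: str) -> str:
    """Classify a file name by the naming conventions in this guide.

    The returned labels are illustrative only.
    """
    if filename.endswith(".task.yaml"):
        return "yaml_task"
    if filename.endswith(".sql"):
        return "sql"
    if filename.endswith(".py"):
        return "python"
    return "unknown"


if __name__ == "__main__":
    for name in ("users.task.yaml", "users.sql", "ml_model.py", "users.yaml"):
        print(name, "->", asset_kind(name))
```

Note that a plain users.yaml is reported as unknown: under the convention above, a task config is recognized only when its name ends with .task.yaml.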
What’s Next?
Congratulations! You’ve built your first Datablast pipeline. Here’s what you can explore next:
Learn More About Datablast
- Platform Overview – Discover all Datablast capabilities
Configure Your Projects
- Project Structure – Learn best practices for organizing code
- Pipeline Configuration – Advanced pipeline settings
- Task Configuration – Detailed task configuration options
Master Development Techniques
- SQL Development – Master SQL transformations and optimization
- Python Development – Build complex data processing workflows
- Shared Utilities – Create reusable code patterns
- Multiple Repositories – Manage multiple pipelines effectively
Advanced Topics
- Sensor Tasks – Wait for external conditions
- Jinja Templates – Use dynamic date and time references
- BigQuery Development – Leverage BigQuery features
- Snowflake Development – Optimize Snowflake workflows
- Athena Development – AWS Athena best practices
Need Help?
- Search Documentation – Use ⌘/Ctrl + K to search across all docs
- Browse Guides – Explore our comprehensive guides section
- Check Examples – See practical implementations and patterns
- Contact Support – Get help from our team
Ready to build more? Explore our Guides section for comprehensive tutorials and best practices!