Python Task Overview

Python tasks in Datablast allow you to execute complex data processing, machine learning, and custom logic using Python. This guide covers the basic configuration methods and key features.

Configuration Methods

Method 1: Annotation-based Configuration

Define task information directly in your Python file using annotations:

# @blast.type: python
# @blast.description: Generate churn predictions using trained model
# @blast.depends: ml_models.churn_model_train
# @blast.instance: d1.large
# @blast.secrets: ML_API_KEY:ML_API_KEY,MODEL_SECRET:MODEL_SECRET

import os
import pandas as pd
import numpy as np
from datetime import datetime

def process_data(execution_date):
    """Process data for the given execution date."""
    # Implementation here
    return "Processing completed"


# Get the execution date from the environment
execution_date = os.getenv('BLAST_START_DATE')

# Your Python logic here (process_data must be defined before it is called)
result = process_data(execution_date)

print(f"Successfully processed data for {execution_date}")

Method 2: YAML Configuration

Define task information in a separate YAML file:

name: "ml_models.churn_prediction"
type: "python"
description: "Generate churn predictions using trained model"
depends:
  - ml_models.churn_model_train
run: "churn_prediction.py"
instance: "d1.large"
secrets:
  - "ML_API_KEY:ML_API_KEY"
  - "MODEL_SECRET:MODEL_SECRET"

Basic Configuration

Required Fields

name: "task.name"                    # Unique task identifier
type: "python"                       # Task type
description: "Task description"      # Human-readable description
run: "script.py"                     # Script file to execute

Optional Fields

depends:
  - task1
  - task2
root_dir: "tasks/ml_models"          # Root directory for task files
instance: "d1.medium"                # Compute instance type
secrets:
  - "SECRET_NAME:ENV_VAR_NAME"       # Secret management

Instance Types

The platform supports the following instance types for Python tasks:

Instance Type	CPU Limit	Memory Limit	CPU Request	Memory Request	Use Case
`d1.nano`	250m	512Mi	250m	256Mi	Lightweight tasks, testing (Default)
`d1.small`	500m	1200Mi	500m	1Gi	Small data processing
`d1.medium`	750m	2400Mi	750m	2Gi	Medium workloads
`d1.large`	1	4400Mi	1	4Gi	Large data processing
`d1.xlarge`	2	6600Mi	2	6Gi	Heavy workloads, ML training

Default Instance: d1.nano. You only need to set the instance field when a task requires more resources.

⚠️ Important: Using instance types other than d1.nano may incur additional charges. Please consult with your Datablast representative for pricing details before upgrading instance types.
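As a rough aid for choosing from the table above, the following sketch picks the smallest instance whose memory limit covers an estimated peak memory footprint. The memory limits are copied from the table; the function name `smallest_instance_for` and the 20% headroom are illustrative assumptions, not part of the platform, and the peak estimate has to come from your own measurements.

```python
# Memory limits in MiB, copied from the instance table (smallest first).
INSTANCE_MEMORY_LIMIT_MIB = {
    "d1.nano": 512,
    "d1.small": 1200,
    "d1.medium": 2400,
    "d1.large": 4400,
    "d1.xlarge": 6600,
}


def smallest_instance_for(peak_mib, headroom=0.2):
    """Return the smallest instance whose memory limit fits peak_mib plus headroom.

    headroom=0.2 is an arbitrary safety margin chosen for this sketch.
    Returns None when no listed instance is large enough.
    """
    needed = peak_mib * (1 + headroom)
    # Dicts preserve insertion order, so instances are checked smallest first.
    for name, limit in INSTANCE_MEMORY_LIMIT_MIB.items():
        if limit >= needed:
            return name
    return None


if __name__ == "__main__":
    print(smallest_instance_for(900))
```

This only considers memory; CPU-bound tasks may need a larger instance than the one this helper suggests.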

Environment Variables

Python tasks receive date information through environment variables:

import os
from datetime import datetime

# Access date variables through environment variables
data_interval_start = os.getenv('BLAST_DATA_INTERVAL_START')
data_interval_end = os.getenv('BLAST_DATA_INTERVAL_END')
start_date = os.getenv('BLAST_START_DATE')
end_date = os.getenv('BLAST_END_DATE')
start_date_nodash = os.getenv('BLAST_START_DATE_NODASH')
end_date_nodash = os.getenv('BLAST_END_DATE_NODASH')

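# Convert to datetime objects if needed. os.getenv returns None when a
# variable is unset, so this assumes the task is running under the scheduler.
start_dt = datetime.fromisoformat(data_interval_start.replace('Z', '+00:00'))
end_dt = datetime.fromisoformat(data_interval_end.replace('Z', '+00:00'))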

print(f"Processing data from {start_dt} to {end_dt}")

Available Environment Variables

Variable	Description	Example
`BLAST_DATA_INTERVAL_START`	Start of data interval	`2024-01-15T00:00:00+00:00`
`BLAST_DATA_INTERVAL_END`	End of data interval	`2024-01-16T00:00:00+00:00`
`BLAST_START_DATE`	Data interval start date	`2024-01-15`
`BLAST_END_DATE`	Data interval end date	`2024-01-16`
`BLAST_START_DATE_NODASH`	Start date without dashes	`20240115`
`BLAST_END_DATE_NODASH`	End date without dashes	`20240116`
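When a script may also run outside the scheduler (for example during local testing), it can help to read these variables through a small helper that fails with a clear message if one is missing. The sketch below assumes the value formats shown in the table; the helper names `read_interval` and `read_nodash_date` are hypothetical.

```python
import os
from datetime import date, datetime


def read_interval(env=None):
    """Return (start, end) datetimes parsed from the BLAST_DATA_INTERVAL_* variables."""
    env = os.environ if env is None else env
    parsed = []
    for name in ("BLAST_DATA_INTERVAL_START", "BLAST_DATA_INTERVAL_END"):
        raw = env.get(name)
        if not raw:
            # Fail early instead of hitting an AttributeError on None later.
            raise RuntimeError(f"{name} is not set")
        # Normalize a trailing 'Z', which fromisoformat rejects before Python 3.11.
        parsed.append(datetime.fromisoformat(raw.replace("Z", "+00:00")))
    return parsed[0], parsed[1]


def read_nodash_date(raw):
    """Parse a value such as '20240115' (the *_NODASH format) into a date."""
    return datetime.strptime(raw, "%Y%m%d").date()
```

Passing a plain dictionary as `env` makes the helper easy to exercise in tests without touching the real environment.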

Secret Management

Using Secrets in Python Tasks

# @blast.secrets: my_secret:my_secret_in_env, another_secret:another_secret_var

import os

# Access secrets through environment variables
first_secret = os.getenv("my_secret_in_env")
second_secret = os.getenv("another_secret_var")

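# Any other secret is read the same way, provided it is declared in the
# task's secrets configuration (for example "ML_API_KEY:ML_API_KEY")
api_key = os.getenv("ML_API_KEY")
database_password = os.getenv("DB_PASSWORD")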

Secret Configuration

secrets:
  - "ML_API_KEY:ML_API_KEY"
  - "DB_PASSWORD:DB_PASSWORD"
  - "ENCRYPTION_KEY:ENCRYPTION_KEY"

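Each entry uses the format name_on_scheduler:name_to_be_exported_on_script. The left side is the secret's name on the scheduler; the right side is the name of the environment variable exposed to your script.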
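To make the mapping concrete, here is a minimal sketch that splits a secrets string into scheduler-name and environment-variable pairs, plus a check that the expected variables are actually present at runtime. This is not the platform's own parser; `parse_secret_mappings` and `require_secrets` are hypothetical helpers written only to illustrate the format.

```python
import os


def parse_secret_mappings(spec):
    """Split 'a:b, c:d' into {'a': 'b', 'c': 'd'} (scheduler name -> env var name)."""
    mappings = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        scheduler_name, sep, env_name = item.partition(":")
        scheduler_name, env_name = scheduler_name.strip(), env_name.strip()
        if not sep or not scheduler_name or not env_name:
            raise ValueError(f"Expected 'name_on_scheduler:env_var_name', got {item!r}")
        mappings[scheduler_name] = env_name
    return mappings


def require_secrets(env_names, env=None):
    """Return {env_name: value}, raising if any expected variable is missing or empty."""
    env = os.environ if env is None else env
    missing = [name for name in env_names if not env.get(name)]
    if missing:
        raise RuntimeError("Missing secrets: " + ", ".join(missing))
    return {name: env[name] for name in env_names}
```

Calling `require_secrets` at the top of a script surfaces a misconfigured mapping immediately, rather than as a `None` value deep inside the task logic.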

Python Dependencies

Requirements File Location

The platform searches for requirements.txt files hierarchically:

Task Directory: Look for requirements.txt in the same directory as your Python task
Parent Directories: Search upward through parent directories
Repository Root: Check the root directory of your repository
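The search order above can be sketched as an upward walk from the task's directory. This is an illustration only: it assumes the nearest requirements.txt wins, which the description above does not state explicitly, and `find_requirements` is a hypothetical helper rather than a platform function.

```python
from pathlib import Path


def find_requirements(task_file, repo_root):
    """Return the nearest requirements.txt at or above the task's directory.

    The walk starts in the task's own directory and stops after checking
    repo_root. Returns None if no file is found. Assumes task_file lives
    somewhere under repo_root.
    """
    task_dir = Path(task_file).resolve().parent
    root = Path(repo_root).resolve()
    for directory in [task_dir, *task_dir.parents]:
        candidate = directory / "requirements.txt"
        if candidate.is_file():
            return candidate
        if directory == root:
            # Do not search above the repository root.
            break
    return None
```

Running this locally against a task file can help predict which requirements file applies before pushing a change.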

Example Requirements Organization

your-project/
├── requirements.txt          # Global dependencies
└── tasks/
    ├── ml_models/
    │   ├── churn_model.py
    │   └── requirements.txt  # ML-specific dependencies
    └── export/
        ├── csv_export.py
        └── requirements.txt  # Export-specific dependencies

Best Practices

Code Structure

Single Responsibility: Each task should have one clear purpose
Error Handling: Implement proper error handling and logging
Resource Efficiency: Use appropriate resources for task complexity
Testing: Include comprehensive tests for critical functions
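One way to apply the error handling and logging advice is to wrap the task's entry point so that failures are logged with a traceback and turned into a non-zero exit code. The sketch below uses only the standard library; `run_task` is a hypothetical helper, and the idea that the scheduler marks a task as failed on a non-zero exit code is an assumption to confirm for your setup.

```python
import logging

logger = logging.getLogger("task")


def run_task(task_fn, *args):
    """Run task_fn(*args); log the outcome and return a process exit code."""
    try:
        result = task_fn(*args)
    except Exception:
        # logger.exception records the message together with the full traceback.
        logger.exception("Task failed")
        return 1
    logger.info("Task finished: %s", result)
    return 0


# Typical use at the bottom of a task script (assumed convention):
#   logging.basicConfig(level=logging.INFO)
#   sys.exit(run_task(process_data, execution_date))
```

Keeping the task logic in a plain function such as `process_data` also makes it straightforward to unit test without going through the wrapper.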

Performance

Resource Right-sizing: Match resources to task requirements
Data Processing: Optimize data processing and use appropriate data structures
Memory Management: Monitor memory usage and data sizes
Caching: Implement caching where appropriate
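For the caching point, the standard library's `functools.lru_cache` is often enough when a pure function is called repeatedly with the same arguments within a single run. The function below is a hypothetical stand-in for a more expensive lookup; note that the cache lives in process memory, so it does not persist between task runs.

```python
from functools import lru_cache


@lru_cache(maxsize=128)
def normalize_country(code):
    """Hypothetical stand-in for an expensive, deterministic lookup."""
    return code.strip().upper()


if __name__ == "__main__":
    for value in ["tr", "tr", "de"]:
        normalize_country(value)
    # cache_info() reports hits and misses, which helps judge whether caching pays off.
    print(normalize_country.cache_info())
```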

Next Steps

Instance Types – Resource allocation and instance configuration
Environment Variables – Date variables and dynamic configuration
Dependencies – Python package management and requirements
Secret Management – Secure credential handling
Code Structure – Best practices for Python task development
Error Handling – Robust error handling and logging