Python Task Overview

Python tasks in Datablast allow you to execute complex data processing, machine learning, and custom logic using Python. This guide covers the basic configuration methods and key features.

Define task information directly in your Python file using annotations:

ml_models.churn_prediction
# @blast.type: python
# @blast.description: Generate churn predictions using trained model
# @blast.depends: ml_models.churn_model_train
# @blast.instance: d1.large
# @blast.secrets: ML_API_KEY:ML_API_KEY,MODEL_SECRET:MODEL_SECRET
import os
import pandas as pd
import numpy as np
from datetime import datetime


def process_data(execution_date):
    """Process data for the given execution date."""
    # Implementation here
    return "Processing completed"


# Get execution date from environment variables
execution_date = os.getenv('BLAST_START_DATE')

# Your Python logic here (process_data must be defined before it is called)
result = process_data(execution_date)
print(f"Successfully processed data for {execution_date}")
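
For local tooling or tests, it can be handy to read the annotation header back out of a task file. The sketch below only illustrates the comment format shown above: `parse_blast_annotations` is a hypothetical helper, not part of Datablast, and the platform's own parser may treat edge cases differently.

```python
import re

# Matches lines such as "# @blast.type: python" (format taken from the example above).
_ANNOTATION = re.compile(r"^#\s*@blast\.(\w+):\s*(.*)$")


def parse_blast_annotations(source: str) -> dict:
    """Collect `# @blast.<key>: <value>` lines into a dict (hypothetical helper)."""
    annotations = {}
    for line in source.splitlines():
        match = _ANNOTATION.match(line.strip())
        if match:
            key, value = match.groups()
            annotations[key] = value.strip()
    return annotations


if __name__ == "__main__":
    header = """\
# @blast.type: python
# @blast.depends: ml_models.churn_model_train
# @blast.instance: d1.large
import os
"""
    print(parse_blast_annotations(header))
```

Values are kept as raw strings, so a comma-separated entry such as the `secrets` annotation still needs to be split by the caller.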

Define task information in a separate YAML file:

name: "ml_models.churn_prediction"
type: "python"
description: "Generate churn predictions using trained model"
depends:
- ml_models.churn_model_train
run: "churn_prediction.py"
instance: "d1.large"
secrets:
- "ML_API_KEY:ML_API_KEY"
- "MODEL_SECRET:MODEL_SECRET"

The YAML file supports the following fields:

name: "task.name"                  # Unique task identifier
type: "python"                     # Task type
description: "Task description"    # Human-readable description
run: "script.py"                   # Script file to execute
depends:                           # Upstream tasks
- task1
- task2
root_dir: "tasks/ml_models"        # Root directory for task files
instance: "d1.medium"              # Compute instance type
secrets:
- "SECRET_NAME:ENV_VAR_NAME"       # Secret management

The platform supports the following instance types for Python tasks:

| Instance Type | CPU Limit | Memory Limit | CPU Request | Memory Request | Use Case |
|---|---|---|---|---|---|
| d1.nano | 250m | 512Mi | 250m | 256Mi | Lightweight tasks, testing (Default) |
| d1.small | 500m | 1200Mi | 500m | 1Gi | Small data processing |
| d1.medium | 750m | 2400Mi | 750m | 2Gi | Medium workloads |
| d1.large | 1 | 4400Mi | 1 | 4Gi | Large data processing |
| d1.xlarge | 2 | 6600Mi | 2 | 6Gi | Heavy workloads, ML training |

Default Instance: d1.nano. You only need to specify an instance type when a task needs more resources.

⚠️ Important: Using instance types other than d1.nano may incur additional charges. Please consult with your Datablast representative for pricing details before upgrading instance types.
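
One way to apply the table is to pick the smallest instance whose memory limit covers your estimated peak usage. The sketch below hard-codes the memory limits listed above; `smallest_instance_for` is a hypothetical helper, and the memory estimate you pass in is your own assumption about the task.

```python
# Memory limits in MiB, copied from the instance table above.
INSTANCE_MEMORY_LIMIT_MIB = {
    "d1.nano": 512,
    "d1.small": 1200,
    "d1.medium": 2400,
    "d1.large": 4400,
    "d1.xlarge": 6600,
}


def smallest_instance_for(required_mib: int) -> str:
    """Return the smallest instance whose memory limit covers required_mib."""
    by_size = sorted(INSTANCE_MEMORY_LIMIT_MIB.items(), key=lambda item: item[1])
    for name, limit in by_size:
        if required_mib <= limit:
            return name
    raise ValueError(f"No instance type offers {required_mib} MiB of memory")


if __name__ == "__main__":
    # Example: a task estimated to peak at about 2000 MiB.
    print(smallest_instance_for(2000))
```

It is usually worth leaving some headroom in the estimate, since a task that reaches its memory limit may be terminated.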

Python tasks receive date information through environment variables:

import os
from datetime import datetime
# Access date variables through environment variables
data_interval_start = os.getenv('BLAST_DATA_INTERVAL_START')
data_interval_end = os.getenv('BLAST_DATA_INTERVAL_END')
start_date = os.getenv('BLAST_START_DATE')
end_date = os.getenv('BLAST_END_DATE')
start_date_nodash = os.getenv('BLAST_START_DATE_NODASH')
end_date_nodash = os.getenv('BLAST_END_DATE_NODASH')
# Convert to datetime objects if needed
start_dt = datetime.fromisoformat(data_interval_start.replace('Z', '+00:00'))
end_dt = datetime.fromisoformat(data_interval_end.replace('Z', '+00:00'))
print(f"Processing data from {start_dt} to {end_dt}")

| Variable | Description | Example |
|---|---|---|
| BLAST_DATA_INTERVAL_START | Start of data interval | 2024-01-15T00:00:00+00:00 |
| BLAST_DATA_INTERVAL_END | End of data interval | 2024-01-16T00:00:00+00:00 |
| BLAST_START_DATE | Data interval start date | 2024-01-15 |
| BLAST_END_DATE | Data interval end date | 2024-01-16 |
| BLAST_START_DATE_NODASH | Start date without dashes | 20240115 |
| BLAST_END_DATE_NODASH | End date without dashes | 20240116 |
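
When a script runs outside the platform (for example in a local test), these variables may be unset, and `os.getenv` then returns `None`. The sketch below wraps the parsing in a hypothetical helper, `read_interval`, that fails with a clear message instead; the variable names come from the table above, everything else is an assumption.

```python
import os
from datetime import datetime


def read_interval(env=os.environ):
    """Return (start, end) datetimes from the BLAST_DATA_INTERVAL_* variables."""
    parsed = {}
    for name in ("BLAST_DATA_INTERVAL_START", "BLAST_DATA_INTERVAL_END"):
        raw = env.get(name)
        if not raw:
            raise RuntimeError(f"Environment variable {name} is not set")
        # Accept a trailing 'Z' as well as an explicit offset such as +00:00.
        parsed[name] = datetime.fromisoformat(raw.replace("Z", "+00:00"))
    return parsed["BLAST_DATA_INTERVAL_START"], parsed["BLAST_DATA_INTERVAL_END"]


if __name__ == "__main__":
    # Passing a plain dict makes the helper easy to exercise locally.
    start, end = read_interval({
        "BLAST_DATA_INTERVAL_START": "2024-01-15T00:00:00+00:00",
        "BLAST_DATA_INTERVAL_END": "2024-01-16T00:00:00+00:00",
    })
    print(f"Processing data from {start} to {end}")
```

Accepting the environment as a parameter keeps the function testable without modifying `os.environ`.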

Declare secrets with an annotation in the Python file, then read them from environment variables:

# @blast.secrets: my_secret:my_secret_in_env, another_secret:another_secret_var
import os
# Access secrets through environment variables
first_secret = os.getenv("my_secret_in_env")
second_secret = os.getenv("another_secret_var")

# Use secrets in your code
api_key = os.getenv("ML_API_KEY")
database_password = os.getenv("DB_PASSWORD")

In a YAML file, list secrets under the secrets key:

secrets:
- "ML_API_KEY:ML_API_KEY"
- "DB_PASSWORD:DB_PASSWORD"
- "ENCRYPTION_KEY:ENCRYPTION_KEY"

The format is: name_on_scheduler:name_to_be_exported_on_script
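
To make the mapping format concrete, the sketch below splits a mapping string into pairs and checks that the exported variables are actually present before the task does any work. Both functions are hypothetical helpers written for illustration; they are not part of the platform.

```python
import os


def parse_secret_mappings(spec: str) -> dict:
    """Map exported env var name -> scheduler secret name for a spec like 'a:b, c:d'."""
    mappings = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        scheduler_name, sep, env_name = item.partition(":")
        if not sep or not scheduler_name.strip() or not env_name.strip():
            raise ValueError(f"Expected 'scheduler_name:env_name', got {item!r}")
        mappings[env_name.strip()] = scheduler_name.strip()
    return mappings


def require_secrets(env_names, env=os.environ) -> dict:
    """Return the values of the given env vars, failing fast if any is missing."""
    missing = [name for name in env_names if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing secrets: {', '.join(missing)}")
    return {name: env[name] for name in env_names}


if __name__ == "__main__":
    spec = "my_secret:my_secret_in_env, another_secret:another_secret_var"
    print(parse_secret_mappings(spec))
```

Failing early on a missing secret tends to give a clearer error than a `None` value surfacing later inside an API call.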

The platform searches for requirements.txt files hierarchically:

  1. Task Directory: Look for requirements.txt in the same directory as your Python task
  2. Parent Directories: Search upward through parent directories
  3. Repository Root: Check the root directory of your repository

your-project/
├── requirements.txt              # Global dependencies
└── tasks/
    ├── ml_models/
    │   ├── churn_model.py
    │   └── requirements.txt      # ML-specific dependencies
    └── export/
        ├── csv_export.py
        └── requirements.txt      # Export-specific dependencies
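
The upward search described above can be sketched as follows. This is an illustration of the lookup order only: `find_requirements` is a hypothetical helper, and this guide does not state whether the platform installs just the nearest file or combines several, so the sketch simply returns every match, nearest first.

```python
import tempfile
from pathlib import Path


def find_requirements(task_file, repo_root):
    """List requirements.txt files from the task directory up to the repository root."""
    repo_root = Path(repo_root).resolve()
    current = Path(task_file).resolve().parent
    found = []
    while True:
        candidate = current / "requirements.txt"
        if candidate.is_file():
            found.append(candidate)
        # Stop at the repository root (or the filesystem root as a safety net).
        if current == repo_root or current == current.parent:
            break
        current = current.parent
    return found


if __name__ == "__main__":
    # Build a throwaway copy of the layout above and run the search on it.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp).resolve()
        task_dir = root / "tasks" / "ml_models"
        task_dir.mkdir(parents=True)
        (root / "requirements.txt").write_text("requests\n")
        (task_dir / "requirements.txt").write_text("scikit-learn\n")
        for path in find_requirements(task_dir / "churn_model.py", root):
            print(path.relative_to(root))
```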

Best practices for task design:

  1. Single Responsibility: Each task should have one clear purpose
  2. Error Handling: Implement proper error handling and logging
  3. Resource Efficiency: Use appropriate resources for task complexity
  4. Testing: Include comprehensive tests for critical functions
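
As an illustration of the error handling and logging point, the sketch below wraps a task's entry point so that failures are logged with a traceback and turned into an exit code. `run_task` is a hypothetical helper, and the sketch assumes, without confirmation from this guide, that a non-zero exit code is how a failed run is reported.

```python
import logging
import sys

logger = logging.getLogger("task")


def run_task(step, *args):
    """Run one step, log the outcome, and return a process exit code."""
    try:
        result = step(*args)
    except Exception:
        # logger.exception records the message together with the full traceback.
        logger.exception("Task failed")
        return 1
    logger.info("Task finished: %s", result)
    return 0


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    # Placeholder step; replace the lambda with the task's real entry point.
    sys.exit(run_task(lambda: "Processing completed"))
```

Keeping the wrapper separate from the business logic also makes the logic itself easier to unit test.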

Best practices for performance:

  1. Resource Right-sizing: Match resources to task requirements
  2. Data Processing: Optimize data processing and use appropriate data structures
  3. Memory Management: Monitor memory usage and data sizes
  4. Caching: Implement caching where appropriate
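
Two of these points, memory management and caching, can be sketched with the standard library alone. The example below processes input in fixed-size chunks so only one chunk is held in memory at a time, and caches a repeated lookup with `functools.lru_cache`. The function names and the country-code example are illustrative assumptions, not part of the platform.

```python
from functools import lru_cache
from itertools import islice


def chunked(iterable, size):
    """Yield lists of at most `size` items so only one chunk is held in memory."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


@lru_cache(maxsize=1024)
def normalize_country(code: str) -> str:
    """Stand-in for an expensive lookup; repeated inputs are served from the cache."""
    return code.strip().upper()


if __name__ == "__main__":
    rows = ["tr", "us", "tr", "de", "us"]
    for batch in chunked(rows, 2):
        print([normalize_country(code) for code in batch])
    print(normalize_country.cache_info())
```

Smaller chunks lower peak memory at the cost of more iterations, so the chunk size is best tuned against the memory limit of the chosen instance type.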