
Python Dependencies

This guide explains how to manage dependencies for Python tasks in the Datablast Data Platform.

Python tasks in the Datablast Data Platform can use custom Python libraries by defining them in requirements.txt files. The platform automatically installs these dependencies before executing your Python code.

  • Automatic Installation: Dependencies are installed automatically before task execution
  • Hierarchical Resolution: Supports both global and task-specific dependencies
  • Version Pinning: Specify exact versions for reproducible builds
  • Isolated Environment: Each task runs in its own isolated environment

The platform searches for requirements.txt files hierarchically: it starts in the task directory, moves up the directory tree, and uses the first file it finds.

  1. Task Directory: Look for requirements.txt in the same directory as your Python task
  2. Parent Directories: Search upward through parent directories
  3. Repository Root: Check the root directory of your repository

Single-pipeline layout:

your-project/
├── requirements.txt # Global dependencies for all tasks
├── pipeline.yml
└── tasks/
    ├── staging/
    │   ├── users.task.yaml
    │   ├── users.sql
    │   ├── events.task.yaml
    │   └── events.sql
    ├── core_model/
    │   ├── user_metrics.task.yaml
    │   ├── user_metrics.py
    │   ├── requirements.txt # Additional dependencies for core_model tasks
    │   └── user_metrics.sql
    └── ml_models/
        ├── churn_model.task.yaml
        ├── churn_model.py
        └── requirements.txt # ML-specific dependencies

Multi-pipeline layout:

your-project/
├── requirements.txt # Global dependencies
├── pipeline.yml
├── pipeline-2.yml
└── tasks/
    ├── pipeline-1/
    │   ├── staging/
    │   │   ├── users.task.yaml
    │   │   ├── users.py
    │   │   └── requirements.txt # Pipeline-1 specific dependencies
    │   └── core_model/
    │       ├── metrics.task.yaml
    │       └── metrics.py
    └── pipeline-2/
        ├── analytics/
        │   ├── reports.task.yaml
        │   ├── reports.py
        │   └── requirements.txt # Pipeline-2 specific dependencies
        └── ml_models/
            ├── model.task.yaml
            └── model.py

The requirements.txt file follows the standard Python pip format.

# Package name
package_name
# With version
package_name==1.2.3
# Version range
package_name>=1.2.0,<2.0.0
# Comments
# This is a comment
package_name==1.2.3 # Inline comment

# requirements.txt
pandas==2.1.4
numpy>=1.21.0
google-cloud-bigquery==3.11.4
requests>=2.28.0

# tasks/ml_models/requirements.txt
scikit-learn==1.3.0
lightgbm==4.0.0
joblib==1.3.2
matplotlib==3.7.2
seaborn==0.12.2

# tasks/export/requirements.txt
sshtunnel==0.4.0
pandas==1.5.2
pandas-gbq==0.18.0
pymssql==2.2.7
google-auth==2.14.1

# tasks/metabase_checker/requirements.txt
requests==2.28.1

  1. Task Execution: When a Python task starts, the platform searches for requirements.txt
  2. File Discovery: Finds the closest requirements.txt file using hierarchical search
  3. Installation: Runs pip install -r requirements.txt -t . to install dependencies
  4. Execution: Runs your Python code with the installed dependencies available
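
The installation and execution steps can be approximated locally with a short script. This is only a sketch under stated assumptions: the helper names build_install_command and install_and_run are hypothetical, the task directory is assumed to be the working directory, and pip is invoked as python -m pip rather than the bare pip command the platform is described as running.

```python
import subprocess
import sys
from pathlib import Path


def build_install_command(requirements_file: Path) -> list[str]:
    """Return a command equivalent to `pip install -r <file> -t .`."""
    return [
        sys.executable, "-m", "pip", "install",
        "-r", str(requirements_file.resolve()),
        "-t", ".",  # install into the current working directory
    ]


def install_and_run(task_dir: Path, requirements_file: Path, script_name: str) -> None:
    """Install dependencies into task_dir, then run the task script from there.

    Assumption: the script lives in task_dir, so the packages installed
    with `-t .` sit next to it and are importable when it runs.
    """
    subprocess.run(build_install_command(requirements_file), cwd=task_dir, check=True)
    subprocess.run([sys.executable, script_name], cwd=task_dir, check=True)
```

Calling install_and_run(Path("tasks/ml_models"), Path("tasks/ml_models/requirements.txt"), "churn_model.py") would reproduce steps 3 and 4 on a local checkout.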

For a task located at tasks/ml_models/churn_prediction.py:

  1. Check tasks/ml_models/requirements.txt ✅ Found
  2. tasks/requirements.txt is not checked, because a file was already found
  3. requirements.txt (repository root) is not checked, for the same reason

Result: Uses tasks/ml_models/requirements.txt
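
The lookup above can be sketched in a few lines of Python. The function name find_requirements and the repo_root parameter are illustrative assumptions, not part of the platform's API; the sketch only mirrors the closest-file-wins behavior described in this guide.

```python
from pathlib import Path
from typing import Optional


def find_requirements(task_dir: Path, repo_root: Path) -> Optional[Path]:
    """Return the closest requirements.txt at or above task_dir, up to repo_root."""
    task_dir = task_dir.resolve()
    repo_root = repo_root.resolve()
    # Candidate directories: the task directory first, then each parent in turn.
    for directory in [task_dir, *task_dir.parents]:
        candidate = directory / "requirements.txt"
        if candidate.is_file():
            return candidate  # the first match wins; higher directories are not checked
        if directory == repo_root:
            break  # do not search above the repository root
    return None
```

For example, find_requirements(Path("tasks/ml_models"), Path(".")) returns the absolute path of tasks/ml_models/requirements.txt when that file exists, and falls back to the root requirements.txt when it does not.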

Good:

pandas==2.1.4
numpy==1.24.3
requests==2.28.1

Avoid:

pandas
numpy
requests
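
A small check can flag unpinned entries before they reach the platform. The following is a minimal sketch, not a platform feature: the function name unpinned_requirements is hypothetical, and the parsing is deliberately simple (it treats everything after a # as a comment and skips pip option lines such as -r).

```python
import re

# Matches "name==version", with an optional extras suffix such as name[extra].
PINNED = re.compile(r"^[A-Za-z0-9._-]+(\[[^\]]+\])?\s*==\s*[^\s;]+")


def unpinned_requirements(text: str) -> list[str]:
    """Return requirement lines that do not pin an exact version with ==."""
    unpinned = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop full-line and inline comments
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        if not PINNED.match(line):
            unpinned.append(line)
    return unpinned
```

Running it over the contents of a requirements.txt returns the lines that should be changed to the package==version form.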

Good:

# Core data processing
pandas==2.1.4
numpy==1.24.3
# Machine learning
scikit-learn==1.3.0
lightgbm==4.0.0
# Cloud services
google-cloud-bigquery==3.11.4
google-auth==2.14.1

Good:

# Only include what you actually use
pandas==2.1.4
requests==2.28.1

Avoid:

# Don't include unused packages
pandas==2.1.4
requests==2.28.1
matplotlib==3.7.2 # Not used in this task
seaborn==0.12.2 # Not used in this task
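
To see which libraries a task actually uses, its imports can be listed with the standard ast module. This is a rough sketch with a hypothetical helper name, imported_modules. Note that import names do not always match package names (for example, scikit-learn is imported as sklearn), so the result is a starting point for a manual comparison with requirements.txt, not an automatic verdict.

```python
import ast


def imported_modules(source: str) -> set[str]:
    """Return the top-level module names imported anywhere in the source."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                modules.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # level == 0 excludes relative imports such as "from . import x"
            modules.add(node.module.split(".")[0])
    return modules
```

For a task file, call it as imported_modules(Path("your_task.py").read_text()) after importing Path from pathlib.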

Good:

# tasks/ml_models/requirements.txt
scikit-learn==1.3.0
lightgbm==4.0.0
joblib==1.3.2

# tasks/export/requirements.txt
sshtunnel==0.4.0
pymssql==2.2.7

Avoid:

# Single requirements.txt with everything
scikit-learn==1.3.0
lightgbm==4.0.0
sshtunnel==0.4.0
pymssql==2.2.7

Commonly used packages, grouped by purpose:

# Core data manipulation
pandas==2.1.4
numpy==1.24.3
# Data validation
great-expectations==0.17.0
pandera==0.17.0

# Traditional ML
scikit-learn==1.3.0
lightgbm==4.0.0
xgboost==1.7.6
# Deep learning
tensorflow==2.13.0
torch==2.0.1
# Model serialization
joblib==1.3.2
pickle-mixin==1.0.2

# Google Cloud
google-cloud-bigquery==3.11.4
google-cloud-storage==2.10.0
google-auth==2.14.1
# AWS
boto3==1.28.0
botocore==1.31.0
# Snowflake
snowflake-connector-python==3.0.0

# HTTP requests
requests==2.28.1
httpx==0.24.1
# Authentication
authlib==1.2.1
oauthlib==3.2.2

# SQL Server
pymssql==2.2.7
pyodbc==4.0.39
# PostgreSQL
psycopg2-binary==2.9.7
# MySQL
PyMySQL==1.1.0

Problem: ModuleNotFoundError: No module named 'package_name'

Solution: Add the package to your requirements.txt file

package_name==1.2.3
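
Before adding a package, it can help to confirm which imports are actually missing in the environment. The sketch below uses the standard importlib.util.find_spec function; the helper name missing_modules is hypothetical, and it expects top-level module names only (no dotted paths).

```python
import importlib.util


def missing_modules(module_names: list[str]) -> list[str]:
    """Return the top-level module names that cannot be found for import."""
    return [
        name for name in module_names
        if importlib.util.find_spec(name) is None
    ]
```

Each name it returns corresponds to a package that likely needs an entry in requirements.txt, keeping in mind that the package name may differ from the import name.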

Problem: Different tasks require different versions of the same package

Solution: Use separate requirements.txt files for different task groups

# tasks/ml_models/requirements.txt
pandas==2.1.4

# tasks/export/requirements.txt
pandas==1.5.2

Problem: Dependencies not being installed

Solution: Ensure requirements.txt is in the correct location

your-project/
├── requirements.txt # Global dependencies
└── tasks/
    └── your_task/
        ├── your_task.py
        └── requirements.txt # Task-specific dependencies

Problem: Task fails due to long dependency installation

Solution:

  • Use lighter alternatives
  • Pin versions to avoid resolution conflicts
  • Consider pre-built images for heavy dependencies

Problem: Task fails due to memory constraints during installation

Solution:

  • Use smaller packages
  • Install dependencies in smaller chunks
  • Consider using the d1.large instance type

# Verify requirements.txt exists
find . -name "requirements.txt" -type f

# Test installation
pip install -r requirements.txt

# Check installed versions
python -c "import package_name; print(package_name.__version__)"
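
Not every package defines a __version__ attribute. As an alternative sketch, the standard importlib.metadata module (Python 3.8 and later) reads the version from the installed distribution's metadata. The helper name installed_version is hypothetical, and the argument is the distribution name as written in requirements.txt, not the import name.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None
```

Comparing installed_version("pandas") with the pin in requirements.txt shows whether the expected version was installed.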

Look for installation messages in task logs:

Installing dependencies from requirements.txt
Successfully installed package-name-1.2.3
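
If task logs are collected as text, the installed versions can be pulled out of the "Successfully installed" line. This is a sketch that assumes the log line has exactly the shape shown above, with space-separated name-version items; the helper name parse_installed is hypothetical.

```python
def parse_installed(log_line: str) -> dict[str, str]:
    """Map package names to versions from a 'Successfully installed ...' line."""
    prefix = "Successfully installed "
    if not log_line.startswith(prefix):
        return {}
    installed = {}
    for item in log_line[len(prefix):].split():
        # The version follows the last hyphen: "package-name-1.2.3".
        name, _, ver = item.rpartition("-")
        if name:
            installed[name] = ver
    return installed
```

Lines that do not start with the expected prefix produce an empty dictionary, so the function can be applied to every log line in turn.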

Here’s a complete example of setting up Python dependencies:

# tasks/ml_models/requirements.txt
pandas==2.1.4
numpy==1.24.3
scikit-learn==1.3.0
lightgbm==4.0.0
joblib==1.3.2
google-cloud-bigquery==3.11.4

# tasks/ml_models/train_model.py
# @blast.name: ml_models.train_model
# @blast.type: python
# @blast.description: Train machine learning model

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from lightgbm import LGBMClassifier
import joblib
from google.cloud import bigquery

# Your training code here
print("Training model...")
# ... implementation

# tasks/ml_models/train_model.task.yaml
name: "ml_models.train_model"
type: "python"
description: "Train machine learning model"
run: "train_model.py"
instance: "d1.large"

The platform will automatically:

  1. Find tasks/ml_models/requirements.txt
  2. Install the specified dependencies
  3. Execute your Python code with the dependencies available

This ensures your Python tasks have access to all the libraries they need while maintaining isolation and reproducibility.