Python Dependencies
This guide explains how to manage Python dependencies for your Python tasks in the Datablast Data Platform.
Overview
Python tasks in the Datablast Data Platform can use custom Python libraries by defining them in requirements.txt files. The platform automatically installs these dependencies before executing your Python code.
Key Features
- Automatic Installation: Dependencies are installed automatically before task execution
- Hierarchical Resolution: Supports both global and task-specific dependencies
- Version Pinning: Specify exact versions for reproducible builds
- Isolated Environment: Each task runs in its own isolated environment
Requirements File Location
The platform searches for requirements.txt files hierarchically, starting from the task directory and moving up the directory tree.
Search Order
- Task Directory: Look for requirements.txt in the same directory as your Python task
- Parent Directories: Search upward through parent directories
- Repository Root: Check the root directory of your repository
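The search order above can be sketched in a few lines of Python. This is only an illustration of the described behavior, not the platform's actual implementation; the function name `find_requirements` and its parameters are hypothetical.

```python
from pathlib import Path
from typing import Optional


def find_requirements(task_dir: Path, repo_root: Path) -> Optional[Path]:
    """Return the closest requirements.txt at or above task_dir, or None.

    The walk stops at repo_root, so directories outside the repository
    are never inspected.
    """
    task_dir = task_dir.resolve()
    repo_root = repo_root.resolve()
    if task_dir != repo_root and repo_root not in task_dir.parents:
        raise ValueError(f"{task_dir} is not inside {repo_root}")

    current = task_dir
    while True:
        candidate = current / "requirements.txt"
        if candidate.is_file():
            return candidate      # closest file wins; stop searching
        if current == repo_root:
            return None           # reached the root without finding one
        current = current.parent  # move one level up the tree
```

Calling `find_requirements(Path("tasks/ml_models"), Path("."))` from the repository root returns the task-level file when one exists and falls back to a parent-level or root-level file otherwise.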
Examples
Single Pipeline Repository
    your-project/
    ├── requirements.txt          # Global dependencies for all tasks
    ├── pipeline.yml
    └── tasks/
        ├── staging/
        │   ├── users.task.yaml
        │   ├── users.sql
        │   ├── events.task.yaml
        │   └── events.sql
        ├── core_model/
        │   ├── user_metrics.task.yaml
        │   ├── user_metrics.py
        │   ├── requirements.txt  # Additional dependencies for core_model tasks
        │   └── user_metrics.sql
        └── ml_models/
            ├── churn_model.task.yaml
            ├── churn_model.py
            └── requirements.txt  # ML-specific dependencies
Multiple Pipeline Repository
    your-project/
    ├── requirements.txt              # Global dependencies
    ├── pipeline.yml
    ├── pipeline-2.yml
    └── tasks/
        ├── pipeline-1/
        │   ├── staging/
        │   │   ├── users.task.yaml
        │   │   ├── users.py
        │   │   └── requirements.txt  # Pipeline-1 specific dependencies
        │   └── core_model/
        │       ├── metrics.task.yaml
        │       └── metrics.py
        └── pipeline-2/
            ├── analytics/
            │   ├── reports.task.yaml
            │   ├── reports.py
            │   └── requirements.txt  # Pipeline-2 specific dependencies
            └── ml_models/
                ├── model.task.yaml
                └── model.py
Requirements File Format
The requirements.txt file follows the standard Python pip format.
Basic Format
    # Package name
    package_name

    # With version
    package_name==1.2.3

    # Version range
    package_name>=1.2.0,<2.0.0

    # Comments
    # This is a comment
    package_name==1.2.3  # Inline comment
Example Requirements Files
Global Dependencies (Root Level)
    pandas==2.1.4
    numpy>=1.21.0
    google-cloud-bigquery==3.11.4
    requests>=2.28.0
ML-Specific Dependencies
    scikit-learn==1.3.0
    lightgbm==4.0.0
    joblib==1.3.2
    matplotlib==3.7.2
    seaborn==0.12.2
Data Export Dependencies
    sshtunnel==0.4.0
    pandas==1.5.2
    pandas-gbq==0.18.0
    pymssql==2.2.7
    google-auth==2.14.1
API Integration Dependencies
    requests==2.28.1
Dependency Resolution
How It Works
- Task Execution: When a Python task starts, the platform searches for requirements.txt
- File Discovery: Finds the closest requirements.txt file using hierarchical search
- Installation: Runs pip install -r requirements.txt -t . to install dependencies
- Execution: Runs your Python code with the installed dependencies available
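The installation and execution steps can be approximated locally with the sketch below. It assumes the platform installs into the task's own directory (the `-t .` in the command above) and then runs the script from there; the helper names `build_install_command` and `install_and_run` are hypothetical, and `python -m pip` is used here in place of a bare `pip` so that the same interpreter handles both steps.

```python
import subprocess
import sys
from pathlib import Path
from typing import List


def build_install_command(requirements: Path, target: Path) -> List[str]:
    """Build the equivalent of: pip install -r <requirements> -t <target>."""
    return [
        sys.executable, "-m", "pip", "install",
        "-r", str(requirements),
        "-t", str(target),
    ]


def install_and_run(requirements: Path, script: Path) -> int:
    """Install into the script's directory, then run the script from there."""
    workdir = script.parent
    # check=True raises CalledProcessError if the installation fails.
    subprocess.run(build_install_command(requirements, workdir), check=True)
    # Packages installed with -t land in workdir. Python puts the script's
    # directory first on sys.path, so those packages are importable.
    result = subprocess.run([sys.executable, script.name], cwd=workdir)
    return result.returncode
```

Running this locally leaves the installed packages next to the task file, so it is best tried in a scratch copy of the task directory.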
Example Resolution
For a task located at tasks/ml_models/churn_prediction.py:
- tasks/ml_models/requirements.txt: checked first, ✅ found, so the search stops here
- tasks/requirements.txt: not checked, because a file was already found
- requirements.txt: not checked, because a file was already found
Result: Uses tasks/ml_models/requirements.txt
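The example suggests that only the closest file is installed and that files higher up the tree are not merged in. Assuming that reading is correct, a package listed only in the root file would be missing for this task. The sketch below lists such packages so they can be copied into the task-level file if needed; the function names are hypothetical and the line parsing is deliberately simple (it ignores option lines such as `-r other.txt` and does not handle URL requirements).

```python
import re
from typing import Set

# A requirement line starts with the package name.
_NAME = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*")


def package_names(requirements_text: str) -> Set[str]:
    """Extract normalized package names from requirements.txt content."""
    names = set()
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        match = _NAME.match(line)
        if match:
            # Normalize so that "Foo_Bar" and "foo-bar" compare equal.
            names.add(re.sub(r"[-_.]+", "-", match.group(0)).lower())
    return names


def not_carried_over(root_text: str, nearest_text: str) -> Set[str]:
    """Packages in the root file that the nearest file does not list."""
    return package_names(root_text) - package_names(nearest_text)
```

Applied to the root-level and ML-specific example files in this guide, the result is pandas, numpy, google-cloud-bigquery, and requests.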
Best Practices
1. Version Pinning
✅ Good:
    pandas==2.1.4
    numpy==1.24.3
    requests==2.28.1
❌ Avoid:
    pandas
    numpy
    requests
2. Logical Grouping
✅ Good:
    # Core data processing
    pandas==2.1.4
    numpy==1.24.3

    # Machine learning
    scikit-learn==1.3.0
    lightgbm==4.0.0

    # Cloud services
    google-cloud-bigquery==3.11.4
    google-auth==2.14.1
3. Minimal Dependencies
✅ Good:
    # Only include what you actually use
    pandas==2.1.4
    requests==2.28.1
❌ Avoid:
    # Don't include unused packages
    pandas==2.1.4
    requests==2.28.1
    matplotlib==3.7.2  # Not used in this task
    seaborn==0.12.2    # Not used in this task
4. Separate Requirements by Use Case
✅ Good:
    # requirements.txt for ML tasks
    scikit-learn==1.3.0
    lightgbm==4.0.0
    joblib==1.3.2

    # tasks/export/requirements.txt
    sshtunnel==0.4.0
    pymssql==2.2.7
❌ Avoid:
    # Single requirements.txt with everything
    scikit-learn==1.3.0
    lightgbm==4.0.0
    sshtunnel==0.4.0
    pymssql==2.2.7
Common Dependencies
Data Processing
    # Core data manipulation
    pandas==2.1.4
    numpy==1.24.3

    # Data validation
    great-expectations==0.17.0
    pandera==0.17.0
Machine Learning
    # Traditional ML
    scikit-learn==1.3.0
    lightgbm==4.0.0
    xgboost==1.7.6

    # Deep learning
    tensorflow==2.13.0
    torch==2.0.1

    # Model serialization
    joblib==1.3.2
    pickle-mixin==1.0.2
Cloud Services
    # Google Cloud
    google-cloud-bigquery==3.11.4
    google-cloud-storage==2.10.0
    google-auth==2.14.1

    # AWS
    boto3==1.28.0
    botocore==1.31.0

    # Snowflake
    snowflake-connector-python==3.0.0
API Integration
    # HTTP requests
    requests==2.28.1
    httpx==0.24.1

    # Authentication
    authlib==1.2.1
    oauthlib==3.2.2
Database Connectivity
    # SQL Server
    pymssql==2.2.7
    pyodbc==4.0.39

    # PostgreSQL
    psycopg2-binary==2.9.7

    # MySQL
    PyMySQL==1.1.0
Troubleshooting
Common Issues
1. Package Not Found
Problem: ModuleNotFoundError: No module named 'package_name'
Solution: Add the package to your requirements.txt file
    package_name==1.2.3
2. Version Conflicts
Problem: Different tasks require different versions of the same package
Solution: Use separate requirements.txt files for different task groups
    # requirements.txt for one task group
    pandas==2.1.4

    # tasks/export/requirements.txt
    pandas==1.5.2
3. Requirements File Not Found
Problem: Dependencies are not being installed
Solution: Ensure requirements.txt is in the correct location
    your-project/
    ├── requirements.txt          # Global dependencies
    └── tasks/
        └── your_task/
            ├── your_task.py
            └── requirements.txt  # Task-specific dependencies
4. Installation Timeout
Problem: Task fails due to long dependency installation
Solution:
- Use lighter alternatives
- Pin versions to avoid resolution conflicts
- Consider pre-built images for heavy dependencies
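To act on the version-pinning suggestion, a small script can flag requirement lines that lack an exact `==` pin before the file is committed. This is a minimal sketch with a hypothetical function name; it uses a simple regular expression rather than a full requirements parser, and it treats wildcard pins such as `==4.*` as unpinned.

```python
import re
from typing import List

# name, optional [extras], then "==" and a version without wildcards or commas
_PINNED = re.compile(
    r"^[A-Za-z0-9][A-Za-z0-9._-]*(\[[^\]]*\])?\s*==\s*[^\s,*]+$"
)


def unpinned_lines(requirements_text: str) -> List[str]:
    """Return requirement lines that are not pinned to one exact version."""
    flagged = []
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        spec = line.split(";", 1)[0].strip()  # ignore environment markers
        if not _PINNED.match(spec):
            flagged.append(line)
    return flagged
```

For the "Avoid" example under Version Pinning (pandas, numpy, requests with no versions), all three lines are flagged; for the "Good" example, none are.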
5. Memory Issues
Problem: Task fails due to memory constraints during installation
Solution:
- Use smaller packages
- Install dependencies in smaller chunks
- Consider using the d1.large instance type
Debugging Tips
1. Check Requirements File Location
    # Verify requirements.txt exists
    find . -name "requirements.txt" -type f
2. Test Dependencies Locally
    # Test installation
    pip install -r requirements.txt
3. Validate Package Versions
    # Check installed versions
    import package_name
    print(package_name.__version__)
4. Check Task Logs
Look for installation messages in task logs:
    Installing dependencies from requirements.txt
    Successfully installed package-name-1.2.3
Example: Complete Setup
Here’s a complete example of setting up Python dependencies:
1. Create Requirements File
    pandas==2.1.4
    numpy==1.24.3
    scikit-learn==1.3.0
    lightgbm==4.0.0
    joblib==1.3.2
    google-cloud-bigquery==3.11.4
2. Create Python Task
    # @blast.name: ml_models.train_model
    # @blast.type: python
    # @blast.description: Train machine learning model

    import pandas as pd
    import numpy as np
    from sklearn.model_selection import train_test_split
    from lightgbm import LGBMClassifier
    import joblib
    from google.cloud import bigquery

    # Your training code here
    print("Training model...")
    # ... implementation
3. Create Task Configuration
    name: "ml_models.train_model"
    type: "python"
    description: "Train machine learning model"
    run: "train_model.py"
    instance: "d1.large"
The platform will automatically:
- Find tasks/ml_models/requirements.txt
- Install the specified dependencies
- Execute your Python code with the dependencies available
This ensures your Python tasks have access to all the libraries they need while maintaining isolation and reproducibility.
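As a reproducibility check, a task can compare the versions actually installed against the versions it expects. Not every package exposes a `__version__` attribute, so this sketch reads installed metadata through the standard library's `importlib.metadata` (Python 3.8+) instead. The function name `check_pins` is hypothetical, and the comparison is a plain string match, so equivalent spellings such as `1.0` and `1.0.0` are reported as different.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict


def check_pins(pins: Dict[str, str]) -> Dict[str, str]:
    """Return {package: problem} for each pin the environment does not meet."""
    problems = {}
    for name, wanted in pins.items():
        try:
            installed = version(name)  # version recorded at install time
        except PackageNotFoundError:
            problems[name] = "not installed"
            continue
        if installed != wanted:
            problems[name] = f"installed {installed}, pinned {wanted}"
    return problems
```

A task could call `check_pins({"pandas": "2.1.4", "lightgbm": "4.0.0"})` at startup and log or fail on any non-empty result.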