Prerequisites
Before using this skill, ensure you have:
- Python environment with AutoML libraries (Auto-sklearn, TPOT, H2O AutoML, or PyCaret)
- Training dataset in accessible format (CSV, Parquet, or database)
- Understanding of problem type (classification, regression, time-series)
- Sufficient computational resources for automated search
- Knowledge of evaluation metrics appropriate for the task
- Target variable and feature columns clearly defined
Instructions
Step 1: Define Pipeline Requirements
Specify the machine learning task and constraints (a requirements sketch follows this list):
- Identify problem type (binary/multi-class classification, regression, etc.)
- Define evaluation metrics (accuracy, F1, RMSE, etc.)
- Set time and resource budgets for AutoML search
- Specify feature types and preprocessing needs
- Determine model interpretability requirements
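These requirements can be captured up front as a simple specification object. A minimal sketch, assuming a Python dictionary; the keys, values, and column names are illustrative, not a fixed schema:

# Illustrative requirements specification; keys, values, and column names are assumptions
pipeline_requirements = {
    "problem_type": "binary_classification",   # or "regression", "multiclass", "time_series"
    "primary_metric": "f1",                    # metric the search should optimize
    "time_budget_seconds": 3600,               # wall-clock budget for the AutoML search
    "max_memory_gb": 16,                       # resource ceiling for training
    "interpretability_required": True,         # restricts the search to explainable models
    "categorical_features": ["region", "plan_type"],   # hypothetical columns
    "numerical_features": ["age", "monthly_spend"],    # hypothetical columns
}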
Step 2: Prepare Data Infrastructure
Set up data access and preprocessing; a data-preparation sketch follows the list:
- Load training data using Read tool
- Perform initial data quality assessment
- Configure train/validation/test split strategy
- Define feature engineering transformations
- Set up data validation checks
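A minimal sketch of loading, a quick quality check, and a stratified split, assuming pandas and scikit-learn, a hypothetical CSV path, and a target column named "target":

import pandas as pd
from sklearn.model_selection import train_test_split

# Load training data (path and target column name are assumptions)
df = pd.read_csv("data/training.csv")

# Initial data quality assessment: missing-value rates and duplicate rows
print(df.isna().mean().sort_values(ascending=False).head(10))
print(f"Duplicate rows: {df.duplicated().sum()}")

# Hold out a test set; stratify on the target for classification tasks
X = df.drop(columns=["target"])
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)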
Step 3: Configure AutoML Pipeline
Build the automated pipeline configuration; a framework-specific sketch follows the list:
- Select AutoML framework based on requirements
- Define search space for algorithms (random forest, XGBoost, neural networks, etc.)
- Configure feature preprocessing steps (scaling, encoding, imputation)
- Set hyperparameter tuning strategy (Bayesian optimization, random search, grid search)
- Establish early stopping criteria and timeout limits
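As one example, a minimal TPOT configuration covering the algorithm search, scoring metric, cross-validation, and timeout; the parameter values are illustrative, and other frameworks expose equivalent options:

from tpot import TPOTClassifier

# Genetic-programming search over scikit-learn pipelines (values are illustrative)
automl = TPOTClassifier(
    generations=5,          # number of search iterations
    population_size=20,     # candidate pipelines per generation
    scoring="f1",           # evaluation metric chosen in Step 1
    cv=5,                   # cross-validation folds
    max_time_mins=60,       # overall timeout for the search
    random_state=42,
    verbosity=2,
)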
Step 4: Execute Pipeline Training
Run the automated training process (see the sketch after this list):
- Initialize AutoML pipeline with configuration
- Execute automated feature engineering
- Perform model selection across algorithm families
- Conduct hyperparameter optimization for top models
- Evaluate models using cross-validation
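Continuing the TPOT sketch from Step 3, training, evaluation, and export reduce to scikit-learn-style calls; the variable names follow the earlier sketches:

# Run the automated search; preprocessing, model selection, and tuning happen inside
automl.fit(X_train, y_train)

# Score the best pipeline on the held-out test set
print(f"Test score: {automl.score(X_test, y_test):.4f}")

# Export the winning pipeline as standalone scikit-learn code
automl.export("best_pipeline.py")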
Step 5: Analyze and Export Results
Evaluate pipeline performance and prepare for deployment; a sketch follows the list:
- Compare model performance across metrics
- Extract best model and configuration
- Generate feature importance analysis
- Create model performance visualizations
- Export trained pipeline for deployment
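A sketch of post-search analysis and export using permutation importance and joblib; it assumes the best pipeline is available as a fitted scikit-learn estimator (TPOT exposes it as fitted_pipeline_):

import joblib
from sklearn.inspection import permutation_importance

best_model = automl.fitted_pipeline_   # the winning scikit-learn pipeline

# Feature importance via permutation on the held-out test set
result = permutation_importance(best_model, X_test, y_test, n_repeats=10, random_state=42)
ranking = sorted(zip(X_test.columns, result.importances_mean), key=lambda p: p[1], reverse=True)
for name, score in ranking[:10]:
    print(f"{name}: {score:.4f}")

# Serialize the full preprocessing + model pipeline for deployment
joblib.dump(best_model, "automl_best_model.joblib")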
Output
The skill generates comprehensive AutoML pipeline artifacts:
Pipeline Configuration Files
# {baseDir}/automl_config.py
AUTOML_CONFIG = {
    "task_type": "classification",
    "time_budget": 3600,
    "algorithms": ["rf", "xgboost", "catboost"],
    "preprocessing": ["scaling", "encoding"],
    "tuning_strategy": "bayesian",
    "cv_folds": 5,
}
Pipeline Code
- Complete Python implementation of AutoML pipeline
- Data loading and preprocessing functions
- Feature engineering transformations
- Model training and evaluation logic
- Hyperparameter search configuration
Model Performance Report
- Best model architecture and hyperparameters
- Cross-validation scores with confidence intervals
- Feature importance rankings
- Confusion matrix or residual plots
- ROC curves and precision-recall curves for classification (see the plotting sketch below)
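Such classification diagnostics can be produced with scikit-learn's display helpers; a sketch assuming a fitted binary classifier named best_model and matplotlib installed:

import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, RocCurveDisplay

# Confusion matrix and ROC curve for the performance report (binary case assumed)
ConfusionMatrixDisplay.from_estimator(best_model, X_test, y_test)
plt.savefig("confusion_matrix.png")

RocCurveDisplay.from_estimator(best_model, X_test, y_test)
plt.savefig("roc_curve.png")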
Training Artifacts
- Serialized best model file (pickle, joblib, or ONNX)
- Feature preprocessing pipeline
- Training history and search logs
- Model performance metrics on test set
- Documentation for model deployment
Deployment Package
- Prediction API code for serving the model (see the API sketch after this list)
- Input validation and preprocessing scripts
- Model loading and inference functions
- Example usage documentation
- Requirements file with dependencies
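A minimal sketch of a prediction endpoint, assuming FastAPI and the joblib artifact from Step 5; the route name and feature schema are hypothetical:

import joblib
import pandas as pd
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("automl_best_model.joblib")   # pipeline exported in Step 5

class PredictionRequest(BaseModel):
    # Hypothetical feature schema; replace with the real feature columns
    age: float
    monthly_spend: float
    region: str

@app.post("/predict")
def predict(request: PredictionRequest):
    features = pd.DataFrame([request.dict()])   # use model_dump() on pydantic v2
    prediction = model.predict(features)[0]
    return {"prediction": int(prediction)}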
Error Handling
Common issues and solutions:
Insufficient Training Time
- Error: AutoML search terminated before finding a good model
- Solution: Increase time budget, reduce search space, or use faster algorithms
Memory Exhaustion
- Error: Out of memory during pipeline training
- Solution: Reduce dataset size through sampling, use incremental learning, or simplify feature engineering
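For example, subsampling before the search or switching to an estimator that supports partial_fit keeps memory bounded; a sketch reusing the file path, target column, and df assumptions from Step 2:

import pandas as pd
from sklearn.linear_model import SGDClassifier

# Option 1: subsample the training data before running the AutoML search
df_small = df.sample(frac=0.1, random_state=42)

# Option 2: incremental learning on chunks that fit in memory
# (assumes numeric features and a binary target; encode categoricals first)
clf = SGDClassifier()
for chunk in pd.read_csv("data/training.csv", chunksize=50_000):
    X_chunk = chunk.drop(columns=["target"])
    y_chunk = chunk["target"]
    clf.partial_fit(X_chunk, y_chunk, classes=[0, 1])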
Poor Model Performance
- Error: Best model accuracy falls below the acceptable threshold
- Solution: Collect more data, engineer better features, expand algorithm search space, or adjust evaluation metrics
Feature Engineering Failures
- Error: Automated feature transformations produce invalid values
- Solution: Add data validation checks, handle missing values explicitly, restrict transformation types
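A sketch of simple validation checks that can run after each transformation step; the specific checks are illustrative:

import numpy as np
import pandas as pd

def validate_features(X: pd.DataFrame) -> None:
    # Reject missing values left behind by the transformation pipeline
    assert not X.isna().any().any(), "Transformed features contain missing values"
    # Reject infinities or NaNs introduced by divisions, logs, etc.
    numeric = X.select_dtypes(include="number")
    assert np.isfinite(numeric.to_numpy()).all(), "Non-finite values produced"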
Model Convergence Issues
- Error: Optimization fails to converge for certain algorithms
- Solution: Adjust hyperparameter ranges, increase iteration limits, or exclude problematic algorithms
Resources
AutoML Frameworks
- Auto-sklearn: Automated scikit-learn pipeline construction with meta-learning
- TPOT: Genetic programming for pipeline optimization
- H2O AutoML: Scalable AutoML with ensemble methods
- PyCaret: Low-code ML library with automated workflows
Feature Engineering
- Automated feature selection techniques
- Categorical encoding strategies (one-hot, target, ordinal)
- Numerical transformation methods (scaling, binning, polynomial features)
- Time-series feature extraction
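For instance, scikit-learn's ColumnTransformer combines imputation, scaling, and one-hot encoding in a single preprocessing step; a sketch with hypothetical column names:

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Impute then scale numeric columns; impute then one-hot encode categoricals
preprocessor = ColumnTransformer([
    ("numeric", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), ["age", "monthly_spend"]),            # hypothetical numeric columns
    ("categorical", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), ["region", "plan_type"]),             # hypothetical categorical columns
])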
Hyperparameter Optimization
- Bayesian optimization with Gaussian processes
- Random search and grid search strategies
- Hyperband and successive halving algorithms
- Multi-objective optimization for multiple metrics
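A minimal Bayesian-style optimization sketch with Optuna (TPE sampler by default), assuming optuna is installed, numeric-only features, and the X_train/y_train variables from the earlier sketches; the search space is illustrative:

import optuna
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def objective(trial):
    # Illustrative search space for a single algorithm family
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 50, 500),
        "max_depth": trial.suggest_int("max_depth", 2, 16),
        "min_samples_leaf": trial.suggest_int("min_samples_leaf", 1, 20),
    }
    model = RandomForestClassifier(**params, random_state=42)
    return cross_val_score(model, X_train, y_train, cv=5, scoring="f1").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params)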
Evaluation Strategies
- Cross-validation techniques (k-fold, stratified, time-series)
- Evaluation metrics selection guide
- Model ensembling and stacking approaches
- Bias-variance tradeoff analysis
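A short sketch contrasting stratified k-fold with a time-series split, assuming scikit-learn and the variables from the earlier sketches:

from sklearn.model_selection import StratifiedKFold, TimeSeriesSplit, cross_val_score

# Stratified k-fold preserves class balance in every fold (classification)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(best_model, X_train, y_train, cv=cv, scoring="f1")
print(f"F1: {scores.mean():.3f} +/- {scores.std():.3f}")

# Time-series split keeps temporal order intact (no shuffling across time)
ts_cv = TimeSeriesSplit(n_splits=5)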
Best Practices
- Start with baseline models before AutoML
- Balance automation with domain knowledge
- Monitor resource consumption during search
- Validate model performance on holdout data
- Document pipeline decisions for reproducibility