Data validation and pipeline testing utilities for ML training projects. Validates datasets, model checkpoints, training pipelines, and dependencies. Use when validating training data, checking model outputs, testing ML pipelines, verifying dependencies, debugging training failures, or ensuring data quality before training.
Limited to specific tools
This skill is limited to using the following tools:

Additional assets for this skill:
- examples/data-validation-example.md
- examples/pipeline-testing-example.md
- scripts/check-dependencies.sh
- scripts/test-pipeline.sh
- scripts/validate-data.sh
- scripts/validate-model.sh
- templates/test-config.yaml
- templates/validation-schema.json

Purpose: Production-ready validation and testing utilities for ML training workflows. Ensures data quality, model integrity, pipeline correctness, and dependency availability before and during training.
Activation Triggers:
- Validating training data before a run
- Checking model outputs and checkpoints
- Testing ML training and inference pipelines
- Verifying system and package dependencies
- Debugging training failures
- Ensuring data quality before training
Key Resources:
- scripts/validate-data.sh - Comprehensive dataset validation
- scripts/validate-model.sh - Model checkpoint and config validation
- scripts/test-pipeline.sh - End-to-end pipeline testing
- scripts/check-dependencies.sh - System and package dependency checking
- templates/test-config.yaml - Pipeline test configuration template
- templates/validation-schema.json - Data validation schema template
- examples/data-validation-example.md - Dataset validation workflow
- examples/pipeline-testing-example.md - Complete pipeline testing guide

bash scripts/validate-data.sh \
--data-path ./data/train.jsonl \
--format jsonl \
--schema templates/validation-schema.json
bash scripts/validate-model.sh \
--model-path ./checkpoints/epoch-3 \
--framework pytorch \
--check-weights
bash scripts/test-pipeline.sh \
--config templates/test-config.yaml \
--data ./data/sample.jsonl \
--verbose
bash scripts/check-dependencies.sh \
--framework pytorch \
--gpu-required \
--min-vram 16
Purpose: Validate training datasets for format compliance, data quality, and schema conformance.
Usage:
bash scripts/validate-data.sh [OPTIONS]
Options:
- --data-path PATH - Path to dataset file or directory (required)
- --format FORMAT - Data format: jsonl, csv, parquet, arrow (default: jsonl)
- --schema PATH - Path to validation schema JSON file
- --sample-size N - Number of samples to validate (default: all)
- --check-duplicates - Check for duplicate entries
- --check-null - Check for null/missing values
- --check-length - Validate text length constraints
- --check-tokens - Validate tokenization compatibility
- --tokenizer MODEL - Tokenizer to use for token validation (default: gpt2)
- --max-length N - Maximum sequence length (default: 2048)
- --output REPORT - Output validation report path (default: validation-report.json)

Validation Checks:
- Format compliance (JSONL, CSV, Parquet, Arrow parsing)
- Schema conformance against the supplied JSON schema
- Duplicate entries
- Null or missing values
- Text length constraints
- Token count against the configured tokenizer and maximum sequence length
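For example, a full quality pass that samples a subset of a large dataset and enables every check might look like this; the paths, tokenizer, and sample size are illustrative:

bash scripts/validate-data.sh \
  --data-path ./data/train.jsonl \
  --format jsonl \
  --schema templates/validation-schema.json \
  --sample-size 5000 \
  --check-duplicates \
  --check-null \
  --check-length \
  --check-tokens \
  --tokenizer gpt2 \
  --max-length 2048 \
  --output validation-report.json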
Example Output:
{
"status": "PASS",
"total_samples": 10000,
"valid_samples": 9987,
"invalid_samples": 13,
"validation_errors": [
{
"sample_id": 42,
"field": "text",
"error": "Exceeds max token length: 2150 > 2048"
},
{
"sample_id": 156,
"field": "label",
"error": "Invalid label value: 'unknwn' (typo)"
}
],
"statistics": {
"avg_text_length": 487,
"avg_token_count": 128,
"label_distribution": {
"positive": 4892,
"negative": 4895,
"neutral": 213
}
},
"recommendations": [
"Remove or fix 13 invalid samples before training",
"Label distribution is imbalanced - consider class weighting"
]
}
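Because the report is plain JSON, it can be post-processed with jq in a gating step. A minimal sketch, assuming the report was written to the default validation-report.json and using the invalid_samples field shown above:

# Fail the step when any invalid samples remain in the report
invalid=$(jq -r '.invalid_samples' validation-report.json)
if [ "$invalid" -gt 0 ]; then
  echo "Found $invalid invalid samples - see validation-report.json"
  exit 1
fi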
Exit Codes:
0 - All validations passed
1 - Validation errors found
2 - Script error (invalid arguments, file not found)

Purpose: Validate model checkpoints, configurations, and inference readiness.
Usage:
bash scripts/validate-model.sh [OPTIONS]
Options:
- --model-path PATH - Path to model checkpoint directory (required)
- --framework FRAMEWORK - Framework: pytorch, tensorflow, jax (default: pytorch)
- --check-weights - Verify model weights are loadable
- --check-config - Validate model configuration file
- --check-tokenizer - Verify tokenizer files are present
- --check-inference - Test inference with sample input
- --sample-input TEXT - Sample text for inference test
- --expected-output TEXT - Expected output for verification
- --output REPORT - Output validation report path

Validation Checks:
- Checkpoint file structure
- Model configuration file
- Weight loadability
- Tokenizer files
- Inference with a sample input
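An invocation that exercises the inference check alongside the weight and tokenizer checks might look like this; the checkpoint path, prompt, and expected output are illustrative:

bash scripts/validate-model.sh \
  --model-path ./checkpoints/epoch-3 \
  --framework pytorch \
  --check-weights \
  --check-tokenizer \
  --check-inference \
  --sample-input "Hello, world!" \
  --expected-output "Hello"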
Example Output:
{
"status": "PASS",
"model_path": "./checkpoints/llama-7b-finetuned",
"framework": "pytorch",
"checks": {
"file_structure": "PASS",
"config": "PASS",
"weights": "PASS",
"tokenizer": "PASS",
"inference": "PASS"
},
"model_info": {
"architecture": "LlamaForCausalLM",
"parameters": "7.2B",
"precision": "float16",
"lora_enabled": true,
"lora_rank": 16
},
"memory_estimate": {
"model_size_gb": 13.5,
"inference_vram_gb": 16.2,
"training_vram_gb": 24.8
},
"inference_test": {
"input": "Hello, world!",
"output": "Hello, world! How can I help you today?",
"latency_ms": 142
}
}
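The memory_estimate block can also be used to sanity-check a planned training run before launch. A minimal sketch with jq, assuming the report was written to model-report.json via --output (that filename is illustrative):

bash scripts/validate-model.sh \
  --model-path ./checkpoints/epoch-3 \
  --check-weights \
  --output model-report.json

# Pull the estimated training VRAM from the report produced above
need=$(jq -r '.memory_estimate.training_vram_gb' model-report.json)
echo "Estimated training VRAM: ${need} GB"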
Purpose: Test end-to-end ML training and inference pipelines with sample data.
Usage:
bash scripts/test-pipeline.sh [OPTIONS]
Options:
- --config PATH - Pipeline test configuration file (required)
- --data PATH - Sample data for testing
- --steps STEPS - Comma-separated steps to test (default: all)
- --verbose - Enable detailed output
- --output REPORT - Output test report path
- --fail-fast - Stop on first failure
- --cleanup - Clean up temporary files after test

Pipeline Steps:
- data_loading
- tokenization
- model_loading
- training_step
- validation_step
- checkpoint_save
- checkpoint_load
- inference
- metrics
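With --steps you can exercise only part of the pipeline; for instance, a quick data-and-tokenization smoke test (step names taken from the list above, paths illustrative):

bash scripts/test-pipeline.sh \
  --config templates/test-config.yaml \
  --data ./data/sample.jsonl \
  --steps data_loading,tokenization \
  --fail-fast \
  --cleanup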
Test Configuration (test-config.yaml):
pipeline:
name: llama-7b-finetuning
framework: pytorch
data:
train_path: ./data/sample-train.jsonl
val_path: ./data/sample-val.jsonl
format: jsonl
model:
base_model: meta-llama/Llama-2-7b-hf
checkpoint_path: ./checkpoints/test
load_in_8bit: false
lora:
enabled: true
r: 16
alpha: 32
training:
batch_size: 1
gradient_accumulation: 1
learning_rate: 2e-4
max_steps: 5
testing:
sample_size: 10
timeout_seconds: 300
expected_loss_range: [0.5, 3.0]
Example Output:
Pipeline Test Report
====================
Pipeline: llama-7b-finetuning
Started: 2025-11-01 12:34:56
Duration: 127 seconds
Test Results:
✓ data_loading (2.3s) - Loaded 10 samples successfully
✓ tokenization (1.1s) - Tokenized all samples, avg length: 128 tokens
✓ model_loading (8.7s) - Model loaded, 7.2B parameters
✓ training_step (15.4s) - Training step completed, loss: 1.847
✓ validation_step (12.1s) - Validation step completed, loss: 1.923
✓ checkpoint_save (3.2s) - Checkpoint saved to ./checkpoints/test
✓ checkpoint_load (6.8s) - Checkpoint loaded successfully
✓ inference (2.9s) - Inference completed, latency: 142ms
✓ metrics (0.4s) - Metrics calculated correctly
Overall: PASS (8/8 tests passed)
Performance Metrics:
- Total time: 127s
- GPU memory used: 15.2 GB
- CPU memory used: 8.4 GB
Recommendations:
- Pipeline is ready for full training
- Consider increasing batch_size to improve throughput
Purpose: Verify all required system dependencies, packages, and GPU availability.
Usage:
bash scripts/check-dependencies.sh [OPTIONS]
Options:
- --framework FRAMEWORK - ML framework: pytorch, tensorflow, jax (default: pytorch)
- --gpu-required - Require GPU availability
- --min-vram GB - Minimum GPU VRAM required (GB)
- --cuda-version VERSION - Required CUDA version
- --packages FILE - Path to requirements.txt or packages list
- --platform PLATFORM - Platform: modal, lambda, runpod, local (default: local)
- --fix - Attempt to install missing packages
- --output REPORT - Output dependency report path

Dependency Checks:
- Python version
- Framework installation and CUDA availability
- GPU count, type, VRAM, and driver version
- Required packages (installed, missing, outdated)
- Available storage
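For example, checking a Modal deployment against a project requirements file and a specific CUDA version might look like this; the requirements path and version pins are illustrative:

bash scripts/check-dependencies.sh \
  --framework pytorch \
  --platform modal \
  --gpu-required \
  --min-vram 16 \
  --cuda-version 12.1 \
  --packages requirements.txt \
  --output dependency-report.json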
Example Output:
{
"status": "PASS",
"platform": "modal",
"checks": {
"python": {
"status": "PASS",
"version": "3.10.12",
"required": ">=3.9"
},
"pytorch": {
"status": "PASS",
"version": "2.1.0",
"cuda_available": true,
"cuda_version": "12.1"
},
"gpu": {
"status": "PASS",
"count": 1,
"type": "NVIDIA A100",
"vram_gb": 40,
"driver_version": "535.129.03"
},
"packages": {
"status": "PASS",
"installed": 42,
"missing": 0,
"outdated": 3
},
"storage": {
"status": "PASS",
"available_gb": 128,
"required_gb": 50
}
},
"recommendations": [
"Update transformers to latest version (4.36.0 -> 4.37.2)",
"Consider upgrading to PyTorch 2.2.0 for better performance"
]
}
Defines expected data structure and validation rules for training datasets.
{
"$schema": "http://json-schema.org/draft-07/schema#",
"type": "object",
"required": ["text", "label"],
"properties": {
"text": {
"type": "string",
"minLength": 10,
"maxLength": 5000,
"description": "Input text for training"
},
"label": {
"type": "string",
"enum": ["positive", "negative", "neutral"],
"description": "Classification label"
},
"metadata": {
"type": "object",
"properties": {
"source": {
"type": "string"
},
"timestamp": {
"type": "string",
"format": "date-time"
}
}
}
},
"validation_rules": {
"max_token_length": 2048,
"tokenizer": "meta-llama/Llama-2-7b-hf",
"check_duplicates": true,
"min_label_count": 100,
"max_label_imbalance_ratio": 10.0
}
}
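As a quick sanity check, a single record that conforms to this schema can be written to a temporary file and run through the data validator; the record contents below are made up for illustration:

cat > /tmp/sample.jsonl <<'EOF'
{"text": "The battery life on this laptop is excellent.", "label": "positive", "metadata": {"source": "reviews", "timestamp": "2025-01-15T10:30:00Z"}}
EOF

bash scripts/validate-data.sh \
  --data-path /tmp/sample.jsonl \
  --format jsonl \
  --schema templates/validation-schema.json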
Customization:
- Update the required fields for your use case

Defines pipeline testing parameters and expected behaviors.
pipeline:
name: test-pipeline
framework: pytorch
platform: modal # modal, lambda, runpod, local
data:
train_path: ./data/sample-train.jsonl
val_path: ./data/sample-val.jsonl
format: jsonl # jsonl, csv, parquet
sample_size: 10 # Number of samples to use for testing
model:
base_model: meta-llama/Llama-2-7b-hf
checkpoint_path: ./checkpoints/test
quantization: null # null, 8bit, 4bit
lora:
enabled: true
r: 16
alpha: 32
dropout: 0.05
target_modules: ["q_proj", "v_proj"]
training:
batch_size: 1
gradient_accumulation: 1
learning_rate: 2e-4
max_steps: 5
warmup_steps: 0
eval_steps: 2
testing:
sample_size: 10
timeout_seconds: 300
expected_loss_range: [0.5, 3.0]
fail_fast: true
cleanup: true
validation:
metrics: ["loss", "perplexity"]
check_gradients: true
check_memory_leak: true
gpu:
required: true
min_vram_gb: 16
allow_cpu_fallback: false
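A common pattern is to copy this template, point it at your own sample data and base model, and run the pipeline tester against the copy; the file name below is illustrative:

cp templates/test-config.yaml my-test-config.yaml
# Edit data.train_path, data.val_path, and model.base_model in my-test-config.yaml, then:
bash scripts/test-pipeline.sh --config my-test-config.yaml --verbose --cleanup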
# 1. Check system dependencies
bash scripts/check-dependencies.sh \
--framework pytorch \
--gpu-required \
--min-vram 16
# 2. Validate training data
bash scripts/validate-data.sh \
--data-path ./data/train.jsonl \
--schema templates/validation-schema.json \
--check-duplicates \
--check-tokens
# 3. Test pipeline with sample data
bash scripts/test-pipeline.sh \
--config templates/test-config.yaml \
--verbose
# 1. Validate model checkpoint
bash scripts/validate-model.sh \
--model-path ./checkpoints/final \
--framework pytorch \
--check-weights \
--check-inference
# 2. Test inference pipeline
bash scripts/test-pipeline.sh \
--config templates/test-config.yaml \
--steps inference,metrics
# .github/workflows/validate-training.yml
- name: Validate Training Data
run: |
bash plugins/ml-training/skills/validation-scripts/scripts/validate-data.sh \
--data-path ./data/train.jsonl \
--schema ./validation-schema.json
- name: Test Training Pipeline
run: |
bash plugins/ml-training/skills/validation-scripts/scripts/test-pipeline.sh \
--config ./test-config.yaml \
--fail-fast
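Since validate-data.sh exits non-zero when validation fails (see Exit Codes above), each workflow step fails the job automatically. Assuming the other scripts follow the same convention, a local pre-training gate can chain them the same way:

bash scripts/check-dependencies.sh --framework pytorch --gpu-required &&
bash scripts/validate-data.sh --data-path ./data/train.jsonl --schema templates/validation-schema.json &&
bash scripts/test-pipeline.sh --config templates/test-config.yaml --fail-fast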
Data Validation Failures:
Model Validation Failures:
Pipeline Test Failures:
All scripts support verbose debugging:
bash scripts/validate-data.sh --data-path ./data/train.jsonl --verbose --debug
Outputs:
Tips:
- Use --sample-size for quick validation of large datasets
- Use --fail-fast in CI/CD to save time

Script Not Found:
# Ensure you're in the correct directory
cd /path/to/ml-training/skills/validation-scripts
bash scripts/validate-data.sh --help
Permission Denied:
# Make scripts executable
chmod +x scripts/*.sh
Missing Dependencies:
# Install required tools
pip install jsonschema pandas pyarrow
sudo apt-get install jq bc
Plugin: ml-training
Version: 1.0.0
Supported Frameworks: PyTorch, TensorFlow, JAX
Platforms: Modal, Lambda Labs, RunPod, Local