**Version**: 1.0.0
Inherits all available tools
Additional assets for this skill
This skill inherits all available tools. When active, it can use any tool Claude has access to.
- README.md
- agents/ml-debugger-specialist.prompt
- examples/convergence-debugging.py
- examples/overfitting-detection.py
- examples/vanishing-gradients.py
- resources/readme.md
- resources/scripts/gradient-debugger.py
- resources/scripts/loss-analyzer.py
- resources/scripts/overfitting-detector.js
- resources/scripts/training-monitor.sh
- resources/templates/debug-config.yaml
- resources/templates/loss-curve-template.json
- resources/templates/training-metrics.yaml
- tests/test-gradient-analysis.py
- tests/test-loss-divergence.py
- tests/test-mode-collapse.js

name: ml-training-debugger
description: 'Version: 1.0.0'
version: 1.0.0
category: specialists
tags:
Version: 1.0.0
Type: Agent-based skill with SDK implementation
Domain: Machine learning training diagnostics
Diagnose machine learning training failures including loss divergence, mode collapse, gradient issues, architecture problems, and optimization failures. This skill spawns a specialist ML debugging agent that systematically analyzes training artifacts to identify root causes and propose evidence-based fixes.
Use this skill when encountering training failures, when loss curves exhibit pathological behavior, when models produce degenerate outputs, when experiencing GPU memory issues, or when hyperparameter tuning produces inconsistent results.
This skill activates when users request:
The skill handles:
The ML debugging agent handles:
```json
{
  "task": "Diagnose training failure",
  "artifacts": {
    "training_logs": "path/to/logs.txt",
    "loss_curves": "path/to/losses.csv",
    "model_code": ["model.py", "trainer.py"],
    "error_messages": ["error1.txt"],
    "config": "config.yaml"
  },
  "symptoms": [
    "Loss diverged at epoch 7",
    "Mode collapse to single token",
    "Gradient norm exploded"
  ],
  "constraints": {
    "max_analysis_time": "5 minutes",
    "output_format": "structured_diagnosis"
  }
}
```
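The artifact paths in this context can be sanity-checked before the agent is spawned. A minimal sketch, assuming a simple existence check is wanted; the helper name `validate_context` is illustrative and not part of the documented contract:

```python
from pathlib import Path


def validate_context(context: dict) -> list[str]:
    """Return artifact paths from the input contract that do not exist on disk."""
    artifacts = context.get("artifacts", {})
    paths = [
        artifacts.get("training_logs"),
        artifacts.get("loss_curves"),
        artifacts.get("config"),
        *artifacts.get("model_code", []),
        *artifacts.get("error_messages", []),
    ]
    return [p for p in paths if p and not Path(p).exists()]
```

Missing paths can then be reported back to the user instead of spawning the specialist agent on an incomplete set of artifacts.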
```json
{
  "status": "diagnosis_complete",
  "root_causes": [
    {
      "issue": "Learning rate too high for Muon optimizer",
      "severity": "critical",
      "evidence": ["grad_norm spike at step 24590", "loss increased 15% in epoch 7"],
      "fix": "Reduce muon_lr from 1e-2 to 5e-3",
      "confidence": 0.95
    }
  ],
  "quick_fixes": ["Reduce LR by 50%", "Enable gradient clipping"],
  "analysis_artifacts": {
    "gradient_analysis": "path/to/grad_analysis.md",
    "loss_visualization": "path/to/loss_plot.png"
  }
}
```
```python
from claude_agent_sdk import ClaudeSDKClient, ClaudeAgentOptions
import asyncio


async def execute_ml_debugger(context: dict):
    """Spawn ML debugging specialist agent."""
    # Load specialist agent prompt
    with open('agents/ml-debugger-specialist.prompt', 'r') as f:
        specialist_prompt = f.read()

    # Configure agent
    options = ClaudeAgentOptions(
        model='claude-sonnet-4-5',
        system_prompt=specialist_prompt,
        permission_mode='default',  # Read-only for safety
        allowed_tools=['Read', 'Grep', 'Bash'],  # Analysis tools only
        setting_sources=['project']
    )

    client = ClaudeSDKClient(options)
    try:
        await client.connect()

        # Format task for agent
        task = f"""Diagnose ML training failure:

Symptoms: {context['symptoms']}

Artifacts available:
- Training logs: {context['artifacts']['training_logs']}
- Loss curves: {context['artifacts']['loss_curves']}
- Model code: {', '.join(context['artifacts']['model_code'])}

Perform systematic analysis and provide structured diagnosis."""

        await client.query(task)

        # Collect diagnosis
        diagnosis = []
        async for message in client.receive_messages():
            if message.type == 'assistant':
                diagnosis.append(message.content)

        return parse_diagnosis(diagnosis)
    finally:
        await client.disconnect()
```
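The implementation above calls `parse_diagnosis()`, which is not defined here. A minimal sketch, assuming the specialist agent is prompted to emit its final diagnosis as a JSON object matching the output contract above; the fallback behavior is an assumption:

```python
import json


def parse_diagnosis(messages: list) -> dict:
    """Extract the structured diagnosis (see output contract above) from agent output.

    Assumes the agent emits its final answer as a JSON object somewhere in its
    messages; falls back to returning the raw text if nothing parseable is found.
    """
    text = "\n".join(str(m) for m in messages)
    # Try the widest brace-delimited span first, then progressively narrower ones.
    start = text.find("{")
    end = text.rfind("}")
    while start != -1 and end > start:
        try:
            return json.loads(text[start:end + 1])
        except json.JSONDecodeError:
            end = text.rfind("}", start, end)
    return {"status": "unparsed", "raw_output": text}
```

The exact message objects returned by receive_messages() depend on the SDK version, so the `str(m)` coercion above is only a placeholder.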
- scripts/analyze_loss_curve.py - Loss curve analysis and visualization
- scripts/check_gradients.py - Gradient flow analysis
- scripts/count_parameters.py - Model parameter counting and distribution
- scripts/profile_memory.py - GPU memory profiling
- references/common-failure-modes.md - Catalog of ML training failures
- references/debugging-checklist.md - Systematic debugging workflow
- references/fix-templates.md - Code templates for common fixes
- extract_training_metrics() - Parse logs for key metrics
- visualize_loss_curve() - Generate loss/gradient plots
- analyze_architecture() - Check model architecture balance

User: "My model was training fine until epoch 7, then loss started increasing. Help debug this."
Skill gathers:
- Training logs from epochs 1-10
- Loss curve data
- trainer.py and model.py
- Hyperparameter config
Agent diagnoses:
- Root cause: Learning rate too high for curriculum transition
- Evidence: Loss increased 15% at epoch 7, gradient norm spiked
- Fix: Reduce learning rate by 50%, add cosine annealing
- Confidence: 95%
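For the log-gathering step, a minimal sketch of what a helper like extract_training_metrics() (listed above) might do; the key=value log line format is an assumption about the training logs, not a documented interface:

```python
import re
from collections import defaultdict

# Assumed log line shape, e.g. "epoch=7 step=24590 loss=3.41 grad_norm=182.5"
METRIC_PATTERN = re.compile(r"(\w+)=([0-9]+\.?[0-9]*(?:e[+-]?\d+)?)", re.IGNORECASE)


def extract_training_metrics(log_path: str) -> dict:
    """Parse key=value metrics per log line into {metric: [values...]} series."""
    series = defaultdict(list)
    with open(log_path) as f:
        for line in f:
            for key, value in METRIC_PATTERN.findall(line):
                series[key].append(float(value))
    return dict(series)
```

Spikes in the resulting loss and grad_norm series are exactly the kind of evidence cited in the diagnosis above.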
User: "Model only outputs colons (::::) regardless of input. What's wrong?"
Skill gathers:
- Model checkpoint
- Inference test logs
- Training loss history
- Model architecture code
Agent diagnoses:
- Root cause: Embedding layer has 79% of params, transformer underparameterized
- Evidence: Training loss decreased but model has no capacity to learn patterns
- Fix: Rebalance architecture (50% embeddings, 50% transformers)
- Confidence: 90%
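A minimal sketch of the kind of check behind this diagnosis, in the spirit of scripts/count_parameters.py and analyze_architecture(); it assumes a PyTorch nn.Module whose embedding parameters are identifiable by name, which may not hold for every model:

```python
import torch.nn as nn


def embedding_parameter_share(model: nn.Module, embedding_keywords=("embed",)) -> float:
    """Return the fraction of trainable parameters held by embedding layers."""
    total = 0
    embedding = 0
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        n = param.numel()
        total += n
        if any(k in name.lower() for k in embedding_keywords):
            embedding += n
    return embedding / total if total else 0.0
```

A share approaching 0.8, as in this example, suggests the transformer blocks are underparameterized relative to the vocabulary.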
User: "Getting warning 'var(): degrees of freedom is <= 0' during training"
Skill gathers:
- Full error traceback
- Gradient statistics from logs
- ACT head implementation code
Agent diagnoses:
- Root cause: ACT variance = 0 (all tokens use same halting steps)
- Evidence: Warning appears in ACT loss computation
- Fix: Add diversity regularization to ACT loss
- Confidence: 98%
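A minimal sketch of such a diversity term, assuming halting_steps is a per-token tensor of halting counts from the ACT head; the weights and the exact form of the regularizer are illustrative, not the skill's prescribed fix:

```python
import torch


def act_loss_with_diversity(halting_steps: torch.Tensor,
                            ponder_weight: float = 0.01,
                            diversity_weight: float = 0.1) -> torch.Tensor:
    """Ponder cost plus a term that rewards variation in per-token halting."""
    steps = halting_steps.float()
    ponder = ponder_weight * steps.mean()
    # Use the biased variance so a constant or single-element batch yields 0
    # instead of the "degrees of freedom is <= 0" warning.
    diversity = -diversity_weight * steps.var(unbiased=False)
    return ponder + diversity
```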
The skill processes the agent's diagnosis into a user-friendly format:
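For example (formatting choices assumed, not specified by the skill), the structured diagnosis could be rendered as:

```python
def format_diagnosis(diagnosis: dict) -> str:
    """Render the output-contract dict above as a short, readable report."""
    lines = [f"Status: {diagnosis.get('status', 'unknown')}"]
    for cause in diagnosis.get("root_causes", []):
        lines.append(f"[{cause.get('severity', '?').upper()}] {cause.get('issue', '')}"
                     f" (confidence {cause.get('confidence', 0):.0%})")
        for item in cause.get("evidence", []):
            lines.append(f"  evidence: {item}")
        lines.append(f"  fix: {cause.get('fix', '')}")
    if diagnosis.get("quick_fixes"):
        lines.append("Quick fixes: " + "; ".join(diagnosis["quick_fixes"]))
    return "\n".join(lines)
```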
The ML debugging agent must:
This skill can be used in conjunction with:
If the agent cannot diagnose the issue:
The agent should NEVER:
Test the skill with:
- agents/ml-debugger-specialist.prompt
- index.py
- ml-training-debugger-process.dot
- tests/README.md

Next Steps: