Methodology-first deep learning training framework. Ideas are cheap; infrastructure that lets you validate ideas fast is valuable.
Use this agent to diagnose a failed or stalled training run by reading recent logs, metrics, config, and traces. Trigger when the user asks "why did this crash", reports NaN/OOM/divergence, or shows an unhealthy loss curve. Produces a structured diagnosis with ranked candidate causes and fixes.
Use this agent to propose an Optuna search space and kick off a hyperparameter study. Trigger when the user asks to "tune hyperparameters", "set up an Optuna sweep", "search over LR and weight decay", or "find the best hyperparameters for this experiment". Operates only on configs that already passed Stage 3 pre-validation.
Use this agent to compare two completed training runs and produce a concise variance-aware markdown diff (config, metrics, stability, verdict). Trigger when the user asks "did this change help", "is run A better than run B", or "compare two experiments". Reads run journals only; does not re-run training.
Use this agent to scaffold a new model package (config.py, model.py, checkpoint.py, protocol.py + Hydra config) inside a curryTrain project. Trigger when the user asks to "add a new model called X", "scaffold an experiment", or "generate a curryTrain model from this HF model".
Run a short, reproducible benchmark of one optimizer step (forward + backward + optimizer step over N microbatches) using the project's registered runtime. Activate when the user asks to "benchmark a training step", "measure throughput", "time one optimizer step", or "smoke test the runtime". Wraps run_accumulated_step from curry_train.benchmark.
Diagnose a training failure or stall by inspecting recent logs, loss curves, OOM traces, NaN events, and config. Activate when the user asks "why did my training crash", "loss went to NaN", "OOM during step X", "training is not improving", or "help me debug this run". Delegates to the failure-diagnoser agent.
Lightning Fabric integration recipe: a minimal 5-line setup that gives DDP / FSDP / mixed precision while keeping a raw PyTorch training loop. Activate when the user asks about "Lightning Fabric", "torchrun", "DDP setup", "FSDP setup", or "mixed precision", or wires up the launch script.
Hydra + OmegaConf configuration layout for curryTrain projects — composable defaults, structured configs, CLI override syntax, sweep integration. Activate when the user asks "Hydra setup", "config management", "compose configs", "override CLI", "Hydra defaults list", or builds the experiment configuration.
Concrete recipe for running an Optuna-driven hyperparameter sweep through Hydra, with TPE/CMA-ES/Hyperband, distributed multi-rank trials, study persistence, and per-trial run journal. Activate when the user asks "set up an Optuna sweep", "run hyperparameter search", "Hydra Optuna sweeper", or "parallel HPO".
Uses power tools
Uses Bash, Write, or Edit tools
A methodology-first deep learning training framework, packaged as a Claude Code plugin.
Ideas are cheap. Infrastructure that lets you validate ideas fast is valuable.
curryTrain organizes deep learning training around the actual end-to-end workflow, not around an algorithm catalog. The plugin provides Skills, Agents, and a minimal Python template that scaffolds a new training project and assists you through six well-defined stages.
| Stage | Question it answers | Representative skills |
|---|---|---|
| 1. Skeleton | Does the architecture exist and does data flow through it? | scaffolder, preflight-asserts, data-pipeline |
| 2. Sanity | Is the implementation actually correct? | overfit-single-batch, init-loss-check, grad-flow-viz |
| 3. Pre-validate | Will this idea pay off, before I burn the compute? | lr-range-test, small-scale-ablation, multi-seed-variance, mup-coord-check, scaling-fit, surrogate-task, compute-budget, kill-criterion |
| 4. Scale-up | Will it scale stably to the target size? | capacity-sweep, optuna-integration, parallel-primitive-intro |
| 5. Stabilize | Will it survive a long run? | warmup-cosine, loss-spike-rollback, checkpoint-cadence, run-journal |
| 6. Iterate | Which experiment was actually better? | variance-aware-decision, error-cluster, ablation-matrix, runs-diff |
Stage 3 is where most projects waste compute and where curryTrain provides the most differentiated value.
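To make the multi-seed-variance idea from the Stage 3 row concrete, here is a minimal sketch of a variance-aware comparison. It is not curryTrain's implementation: the function name `compare_runs`, the two-standard-error threshold, and the sample losses are illustrative assumptions.

```python
import math
import statistics


def compare_runs(baseline, candidate, k=2.0):
    """Compare final validation losses from two sets of seeds (lower is better).

    Hypothetical helper, not part of curryTrain. The candidate counts as
    "better" only if its mean beats the baseline mean by more than k standard
    errors of the difference; otherwise the gap is treated as seed noise.
    """
    if len(baseline) < 2 or len(candidate) < 2:
        raise ValueError("need at least two seeds per run to estimate variance")
    diff = statistics.mean(baseline) - statistics.mean(candidate)
    # Standard error of the difference between two independent means.
    se = math.sqrt(
        statistics.variance(baseline) / len(baseline)
        + statistics.variance(candidate) / len(candidate)
    )
    if diff > k * se:
        verdict = "candidate better"
    elif diff < -k * se:
        verdict = "baseline better"
    else:
        verdict = "inconclusive"
    return {"diff": diff, "se": se, "verdict": verdict}


# Made-up losses from three seeds each, for illustration only.
print(compare_runs([2.31, 2.35, 2.29], [2.30, 2.36, 2.28])["verdict"])  # inconclusive
print(compare_runs([2.31, 2.35, 2.29], [2.10, 2.12, 2.08])["verdict"])  # candidate better
```

The first pair differs by far less than the seed-to-seed spread, so a single-seed comparison could have pointed either way. That is the situation a cheap multi-seed check is meant to catch before a large run is committed.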
- /curry-train:init is exposed as a slash command; the other 46 skills auto-activate from natural-language phrasing in your messages.
- template/curry_train/: a minimal layered scaffold (Runtime / Primitive / Model) you copy into your project via /curry-train:init

In Claude Code, run:
/plugin marketplace add curryfromuestc/curry-train
/plugin install curry-train@curry-train
This adds the GitHub repo as a marketplace and installs the curry-train plugin from it. After installation, the /curry-train:init slash command and all description-activated skills (workflow, methodology, primitive, infra) become available in your sessions.
If you cloned this repo locally and want to edit the plugin while using it:
git clone https://github.com/curryfromuestc/curry-train.git
mkdir -p ~/.claude/plugins
ln -s "$(pwd)/curry-train" ~/.claude/plugins/curry-train
Reload Claude Code (or run /reload-plugins) and the plugin will be picked up.
claude --plugin-dir /path/to/curry-train
/curry-train:init is the only explicit slash command; everything else activates from natural-language phrasing.
# Bootstrap a new training project (copies the Python template into ./curry_train)
/curry-train:init my-experiment
Then drive the rest of the workflow by describing what you want:
- new-experiment skill (Stage 1)
- bench skill
- diagnose skill
- runs-diff skill

This is by design: the methodology lives in skills and triggers on what you describe, so you don't have to memorize a command surface.
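As described in the skill listing, the bench skill times one optimizer step: forward + backward + optimizer step over N microbatches. The sketch below shows what that measurement covers, using a toy one-parameter model in pure Python. The signature of run_accumulated_step is not given here, so `accumulated_step` and `benchmark_step` are hypothetical stand-ins, not the framework's API.

```python
import time


def accumulated_step(w, microbatches, lr=0.1):
    """One optimizer step with gradient accumulation on a toy model y = w * x.

    Hypothetical stand-in for illustration; not curry_train.benchmark's
    run_accumulated_step, whose real signature is not shown in this document.
    """
    grad = 0.0
    for xs, ys in microbatches:
        # "Forward + backward": gradient of mean squared error w.r.t. w.
        g = sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        # Accumulate, scaled so the total is the average over microbatches.
        grad += g / len(microbatches)
    # "Optimizer step": plain SGD, applied once after all microbatches.
    return w - lr * grad


def benchmark_step(w, microbatches, repeats=5):
    """Return (median seconds per step, samples per second)."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        accumulated_step(w, microbatches)
        timings.append(time.perf_counter() - start)
    timings.sort()
    median = timings[len(timings) // 2]
    n_samples = sum(len(xs) for xs, _ in microbatches)
    throughput = n_samples / median if median > 0 else float("inf")
    return median, throughput


# Four microbatches of 256 samples each; targets follow y = 3x.
xs = [float(i % 7) for i in range(256)]
batches = [(xs, [3.0 * x for x in xs])] * 4
seconds, throughput = benchmark_step(0.0, batches)
print(f"{seconds * 1e3:.3f} ms/step, {throughput:.0f} samples/s")
```

Reporting the median over a few repeats, rather than a single timing, keeps the number reasonably stable against one-off stalls. A real GPU benchmark would also need warm-up iterations and device synchronization, which this sketch omits.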
- Logger protocol with TensorBoard as the default backend (no lock-in to W&B / MLflow)
- torchrun for launch (no custom launcher)

Architecture inspired by NVIDIA Bumblebee's three-layer split (Runtime ↔ Primitive ↔ Model). Workflow inspired by Karpathy's "A Recipe for Training Neural Networks". Built for engineers who train models, including unconventional ones (SNN, CV, multimodal), and need fast, trustworthy iteration.
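The Logger protocol mentioned above can be illustrated with `typing.Protocol`. This sketch is an assumption about the shape of such an interface, not curryTrain's actual API: `MetricLogger`, `log_scalar`, and `InMemoryLogger` are hypothetical names. An in-memory backend stands in for TensorBoard so the example runs without extra dependencies.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class MetricLogger(Protocol):
    """Hypothetical logging interface; any backend with this method fits."""

    def log_scalar(self, name: str, value: float, step: int) -> None: ...


class InMemoryLogger:
    """Toy backend that stores scalars in a dict (stand-in for TensorBoard)."""

    def __init__(self) -> None:
        self.history = {}  # maps metric name -> list of (step, value)

    def log_scalar(self, name: str, value: float, step: int) -> None:
        self.history.setdefault(name, []).append((step, value))


def train(logger: MetricLogger, steps: int = 3) -> None:
    """Training-loop stub that depends only on the protocol, not a backend."""
    for step in range(steps):
        loss = 1.0 / (step + 1)  # placeholder loss, not a real model
        logger.log_scalar("train/loss", loss, step)


logger = InMemoryLogger()
train(logger)
print(isinstance(logger, MetricLogger))  # True: structural match, no inheritance
print(logger.history["train/loss"])
```

Because the loop only sees the protocol, swapping backends means writing another class with the same method. A TensorBoard version could, for example, forward each call to `SummaryWriter.add_scalar`.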
The framework intentionally keeps the Python core small. The framework's value is in methodology (skills), not in re-implementing what Lightning Fabric / Accelerate / DeepSpeed already do well.
npx claudepluginhub curryfromuestc/curry-train --plugin curry-train