From tao-skill-bank
Fine-tune HuggingFace CV/VLM/LLM models on local NVIDIA GPUs in NGC PyTorch containers. Supports full/LoRA tuning, Hub push, and reproducible pipelines.
Slash command: /tao-skill-bank:tao-finetune-huggingface-model
<!-- Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. Licensed under the Apache License, Version 2.0; see http://www.apache.org/licenses/LICENSE-2.0 -->
Local NVIDIA GPU fine-tuning for HuggingFace models, grounded in live-fetched documentation with curated references as a fallback safety net. One NGC container, a few focused scripts, one push to HF Hub. Follow the rules in this file; don't improvise.
Order of authority (highest first):
1. User inputs: model_id, dataset_id, training_method, config.yaml overrides.
2. Live research (references/research-priorities.md).
3. Curated references (references/*.md): fallback when live research is silent/ambiguous.

Conflict resolution between (2) and (3) and the source-line discrepancy note are in references/research-priorities.md.
Required:
- model_id: HuggingFace model ID, e.g. google/vit-base-patch16-224

Conditional credentials (read from the session environment, exported before launching when present):
- HF_TOKEN: only when the model/dataset is gated (read) or push_to_hub is on (write); a public model + public dataset + push_to_hub: false needs none. The value is never read; presence-only check via [ -n "$HF_TOKEN" ].
- WANDB_API_KEY, WANDB_PROJECT: only when WandB is enabled; WANDB_MODE=disabled opts out.

Dataset (exactly one):
- dataset_id: HuggingFace dataset ID (source: hf)
- local_dataset_path: local folder or file (source: local); optional local_dataset_format ∈ {auto, imagefolder, coco, voc, jsonl, arrow, parquet, csv} (default: auto-detect).
- … recommend)

Optional (have defaults):
- task_type: auto-detected from config + model card
- n_train=10000, n_eval=1000, n_epochs=3, lora_r=16
- output_dir=./output/<model_short_name>
- hf_model_repo: push target; if unset and HF_TOKEN has write access, auto-derived as <whoami>/<model_short_name>-finetuned.
- push_to_hub=True: set to False to skip
- skip_baseline=False: set to True to skip the zero-shot baseline eval

Optional deliverables (off by default):
emit_progress_log: false # output_dir/PROGRESS.md (per-step journal)
emit_report: false # reports/report.{pdf,html} with curves & samples
emit_unit_tests: false # tests/ with fake-data heterogeneous-batch tests
All values live in output_dir/config.yaml. Never hardcode in Python.
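The credential and Hub-target rules above can be sketched in a few lines. This is a minimal illustration, not the skill's own code: the function names are hypothetical, and `model_short_name` is taken as an input because the source does not say how it is derived from `model_id`.

```python
import os
from typing import Mapping, Optional


def hf_token_required(gated: bool, push_to_hub: bool) -> bool:
    """HF_TOKEN is needed only for gated assets (read) or a Hub push (write)."""
    return gated or push_to_hub


def hf_token_present(env: Optional[Mapping[str, str]] = None) -> bool:
    """Presence-only check, mirroring `[ -n "$HF_TOKEN" ]`.

    The value is only tested for non-emptiness; it is never printed or stored.
    """
    env = os.environ if env is None else env
    return bool(env.get("HF_TOKEN"))


def derive_hub_repo(whoami: str, model_short_name: str) -> str:
    """Default push target: <whoami>/<model_short_name>-finetuned."""
    return f"{whoami}/{model_short_name}-finetuned"
```

A public model with a public dataset and `push_to_hub: false` therefore passes without any token, while every other combination should fail fast when `hf_token_present()` is false.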
This skill orchestrates what to run; the platform skills own how to run it on a GPU host — read them first.
| Concern | Authoritative skill |
|---|---|
| GPU host runtime (driver 580, CUDA Toolkit 13.0, NVIDIA Container Toolkit 1.19.0) | tao-skill-bank:tao-setup-nvidia-gpu-host |
| docker run flags, NGC auth, mounts, env passthrough | tao-skill-bank:tao-run-on-docker |
| Local Docker job preflight (daemon, GPU smoke) | tao-skill-bank:tao-run-on-local-docker |
Default platform: local-docker — build a one-off image (run-<short>:latest)
and run it on the local Docker daemon. Ask only when the user explicitly needs a
different backend (Brev remote GPU, SLURM/Kubernetes); then run that platform's
Preflight first and route the Steps 4–5 docker run commands through it. The
GPU-runtime and presence-only credential preflights (values never read), the
canonical docker run flag set, the list_tao_platforms.py selection command, and
the workflow-specific flags (--entrypoint /bin/bash -lc, PYTORCH_CUDA_ALLOC_CONF,
--name hft_train) are in references/workflow-intake-preflight.md.
The curated references (references/*.md) are consulted only when live research is silent, ambiguous, or unavailable; live
docs always win for the specific model and current API. Each step links the
references it needs; full catalog in references/detailed-workflow.md.
Always-on: core-rules.md, error-playbook.md, compat-workarounds.md,
model-discovery.md, dataset-recommendations.md, dataset-sources.md,
dataset-patterns.md, hardware-container.md, research-priorities.md,
cv-scripts.md, vlm-scripts.md, docker-runs.md, hub-push.md,
pipeline-skill-template.md, deliverables.md. Opt-in (when their flag/need
applies): progress-tracking.md, testing.md, reporting.md,
workflow-intake-preflight.md, workflow-generate-train.md, workflow-push-rerun.md.
Rule: before falling back, log the live source you tried and why it was
insufficient (config.yaml notes:, and PROGRESS.md if enabled). [FETCH LIVE]
markers in cv-scripts.md / vlm-scripts.md are a research checklist, not code to
inline — refetch the listed URL if a block has no Step 3 finding.
Non-negotiable behaviors. Short version (full enumeration —
hallucinated-imports list, never-without-approval list, full error-recovery and
hardware-sizing tables — in references/core-rules.md, consult before any
training-time decision):
- … --max_steps 1 before any full run; no batch launches without a verified smoke.
- … prepare_data.py; restructuring needed → stop and ask.

Single pass, sequential; each step has a clear gate before the next begins.
Goal: decide whether to proceed. Probe model + dataset, apply accept/reject,
register applicable compat fixes, write the initial config.yaml.
Prerequisites: MODEL_ID, optional DATASET_ID / local_dataset_path,
optional HF_TOKEN, OUTPUT_DIR (default ./output/<model_short_name>). Probes
run in a CPU-only python:3.12-slim Docker container (bind-mounted .probe/
scratch) so the host needs no virtualenv — Docker must exist first. Docker-presence
guard, container env, full probe invocation, and the model/dataset probe scripts
are in references/workflow-intake-preflight.md, references/model-discovery.md,
and references/dataset-sources.md.
Probe requirements:
- … AutoConfig, read model-card tags, detect task from architectures + tags + card examples (fallback logging in model-discovery.md).
- … dataset-recommendations.md; for local data, bind-mount read-only and use dataset-sources.md format detection.
- … compat-workarounds.md against the model/task; defer hardware-dependent rules to Step 2.

Write the initial config.yaml (model_id, task, dataset_id or
local_dataset_path, research_sources: [] filled in Step 3,
applicable_workarounds: from Step 1, notes: [] for reference fallbacks,
push_to_hub: true default — annotated template in
references/workflow-intake-preflight.md). Optionally rm -rf "$OUTPUT_DIR/.probe"
once the gate is met.
Gate: config.yaml exists with model, dataset, task, applicable_workarounds;
do not proceed if any field is missing.
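As an illustration of the Step 1 gate, the sketch below lists which required fields are absent from an already-loaded config mapping. It assumes the YAML has been parsed into a plain dict elsewhere, and it treats an empty `applicable_workarounds` list as valid (only a missing key fails); both are assumptions, and the function name is hypothetical.

```python
def step1_missing_fields(cfg: dict) -> list[str]:
    """Return the Step 1 gate fields that are missing from the config dict."""
    missing = []
    if not cfg.get("model_id"):
        missing.append("model_id")
    if not cfg.get("task"):
        missing.append("task")
    # Either dataset source satisfies the "dataset" requirement.
    if not (cfg.get("dataset_id") or cfg.get("local_dataset_path")):
        missing.append("dataset_id|local_dataset_path")
    # An empty list is acceptable here: it means no workaround applies.
    if "applicable_workarounds" not in cfg:
        missing.append("applicable_workarounds")
    return missing
```

An empty return value means the gate is met; anything else should stop the workflow before Step 2.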
Goal: verify Docker + GPU + disk, pick the NGC PyTorch image live, finalize hardware-dependent compat rules.
2a. Audit (hard gate) — three checks (commands in
references/workflow-intake-preflight.md):
- … tao-setup-nvidia-gpu-host's setup-nvidia-gpu-host.sh --backend docker --check-only; on fail, ask approval then re-run with --install --yes.
- … MIN_DISK_GB (default 100 GB); recommend ≥ 100 GB for NGC base (~20 GB) + HF cache + checkpoints + data.
- … HF_TOKEN only when gated or push_to_hub is on; WANDB_* only when WandB is on.

Do not proceed to Step 4 on a hard-fail: Step 4's docker build pulls a
20+ GB NGC base, and a missing nvidia-container-toolkit only surfaces later as
could not select device driver "" with capabilities: [[gpu]]. Record gpu_count,
gpu_name, driver_major, vram_gb_per_gpu in config.yaml.
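The disk part of the audit is simple to express with the standard library. The sketch below assumes decimal gigabytes (1 GB = 1e9 bytes) and checks the filesystem that holds the working directory; the real commands live in references/workflow-intake-preflight.md and may differ.

```python
import shutil


def free_disk_gb(path: str = ".") -> float:
    """Free space, in decimal GB, on the filesystem containing `path`."""
    return shutil.disk_usage(path).free / 1e9


def disk_check(free_gb: float, min_disk_gb: float = 100.0) -> bool:
    """Hard gate: pass only when free space meets MIN_DISK_GB (default 100)."""
    return free_gb >= min_disk_gb
```

Calling `disk_check(free_disk_gb(output_dir))` before the first `docker build` avoids discovering a full disk halfway through the NGC base pull.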
2b. Pick NGC image (live): from the NVIDIA Deep Learning Frameworks support
matrix (https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html),
PyTorch NGC container section, pick the highest-versioned image where
Min driver ≤ detected driver_major and container CUDA ≤ host CUDA Toolkit
(match closely so cuDNN / TensorRT line up). Do not reject an image for an
aN/bN/rcN PyTorch tag — NGC validates the full image; pick the newest
CUDA-aligned one and let compat-workarounds.md handle per-version issues. If the
matrix is unreachable, use the fallbacks in references/hardware-container.md;
default nvcr.io/nvidia/pytorch:24.09-py3 (driver ≥ 545; SDPA+GQA bug — if
num_key_value_heads < num_attention_heads, set attn_implementation: "eager").
Record ngc_image in config.yaml.
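The selection rule in 2b can be sketched as a filter followed by a max. The rows used in any call are supplied by the caller from the live support matrix; the `NgcImage` type and function names here are hypothetical, and tags are assumed to follow the `YY.MM-py3` pattern seen in the default fallback.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NgcImage:
    tag: str                # release tag, assumed to look like "YY.MM-py3"
    min_driver: int         # minimum host driver major version
    cuda: tuple[int, int]   # container CUDA version as (major, minor)


def _version_key(tag: str) -> tuple[int, int]:
    """Turn "YY.MM-py3" into (YY, MM) so releases compare numerically."""
    year, month = tag.split("-")[0].split(".")
    return int(year), int(month)


def pick_image(rows, driver_major: int, host_cuda: tuple[int, int]) -> Optional[NgcImage]:
    """Highest-versioned image whose driver and CUDA needs fit the host."""
    fits = [r for r in rows if r.min_driver <= driver_major and r.cuda <= host_cuda]
    return max(fits, key=lambda r: _version_key(r.tag)) if fits else None


def attn_override(num_key_value_heads: int, num_attention_heads: int) -> Optional[str]:
    """Fallback-image rule: GQA models get attn_implementation "eager"."""
    return "eager" if num_key_value_heads < num_attention_heads else None
```

When `pick_image` returns `None`, no matrix row fits the host and the fallbacks in references/hardware-container.md apply.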
2c. Re-evaluate hardware-dependent compat rules: re-run the
compat-workarounds.md walk for entries whose detect needs hw; update
applicable_workarounds: in place.
2d. Model-fit check: estimate param_bytes ≈ 2×param_count (bf16); if
param_bytes > 60% of vram_gb_per_gpu × 1e9, recommend LoRA in the
user-facing summary.
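A worked version of the model-fit arithmetic, reading the rule as "recommend LoRA when the bf16 weights alone exceed 60% of one GPU's VRAM". The function name is hypothetical, and the estimate deliberately ignores optimizer state, gradients, and activations, as the 2×param_count approximation in the text does.

```python
def recommend_lora(param_count: int, vram_gb_per_gpu: float, threshold: float = 0.60) -> bool:
    """True when bf16 weights exceed `threshold` of a single GPU's VRAM."""
    param_bytes = 2 * param_count              # 2 bytes per parameter in bf16
    budget_bytes = threshold * vram_gb_per_gpu * 1e9
    return param_bytes > budget_bytes
```

For example, on a 24 GB GPU the budget is 14.4e9 bytes: a 7e9-parameter model (14e9 bytes) stays just under it, while an 8e9-parameter model (16e9 bytes) crosses it and triggers the LoRA recommendation.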
Gate: config.yaml has ngc_image, gpu_count, gpu_name, driver_major,
vram_gb_per_gpu; hardware-dependent compat fixes recorded.
Goal: fetch the live recipe — training-data knowledge of
transformers/trl/peft is suspect, so Step 3 is non-negotiable. Walk
references/research-priorities.md in priority order (Priority 1 → 6); stop once
you have, for the detected task:
- … AutoModel / processor class
- … compute_metrics

Record findings in meta/recipe.md, append source URLs to
config.yaml: research_sources:. A slot with no live finding falls back to the
matching scaffold (cv-scripts.md / vlm-scripts.md), logged as "fallback to
scaffold — no live source for <slot>" under notes:. Conflict-resolution rules
are in references/research-priorities.md.
Gate: every required slot filled, with a source URL or scaffold-fallback note.
Goal: write all scripts, build the image, prepare data, run a 1-step smoke on
real data (one docker build, two docker runs).
4a. Generate project files in output_dir/: config.yaml, Dockerfile,
requirements.txt, prepare_data.py, train.py, run_eval.py, infer.py,
optional merge_lora.py, optional tests/, .gitignore. Live Step 3 research is
authority; cv-scripts.md / vlm-scripts.md give scaffold shape only. Apply every
applicable_workarounds entry as a Dockerfile block, requirement pin, config
override, or runtime env var. Hard rules: run_eval.py keeps that exact filename
(avoids colliding with the HF evaluate package); every generated .py starts
with the NVIDIA Apache-2.0 copyright header and any emitter fails when it is
missing; emit_unit_tests: true generates and runs tests per
references/testing.md. Script bodies, Dockerfile shape, and the emitter contract
are in references/workflow-generate-train.md.
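The "emitter fails when the header is missing" rule could look like the sketch below. The exact header wording for .py files is defined in references/workflow-generate-train.md, so the marker strings here are an assumption modelled on the copyright comment at the top of this file, and both function names are hypothetical.

```python
from pathlib import Path

# Assumed markers; the authoritative header text lives in the references.
REQUIRED_MARKERS = ("Copyright (c)", "NVIDIA CORPORATION", "Apache License")


def has_copyright_header(source: str, max_lines: int = 5) -> bool:
    """True when every marker appears within the first `max_lines` lines."""
    head = "\n".join(source.splitlines()[:max_lines])
    return all(marker in head for marker in REQUIRED_MARKERS)


def emit_python_file(path: str, source: str) -> None:
    """Write a generated script, refusing to emit one without the header."""
    if not has_copyright_header(source):
        raise ValueError(f"refusing to write {path}: copyright header missing")
    Path(path).write_text(source, encoding="utf-8")
```

Raising inside the emitter, instead of checking afterwards, means a headerless script never reaches output_dir in the first place.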
4b. Build, prepare, smoke — docker build -t run-<short>:latest ., then
prepare_data and the --smoke --max_steps 1 run (references/docker-runs.md
§1-3). Smoke pass criteria (in logs/smoke.log):
- … 0.0, not NaN)
- grad_norm > 0 at step 1

If emit_unit_tests: true, also run pytest tests/ in the container. Any failure → STOP.
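The smoke criteria reduce to two numeric checks on the step-1 values. The first criterion is truncated in this text, so the sketch assumes it means "loss is neither 0.0 nor NaN"; extracting the numbers from logs/smoke.log is left out because the log format is not specified here. The function name is hypothetical.

```python
import math


def smoke_passed(loss: float, grad_norm: float) -> bool:
    """Apply the step-1 smoke criteria to already-parsed values."""
    # Assumed reading of the truncated criterion: finite and not exactly 0.0.
    loss_ok = math.isfinite(loss) and loss != 0.0
    # A zero gradient norm means nothing is being trained.
    grad_ok = math.isfinite(grad_norm) and grad_norm > 0.0
    return loss_ok and grad_ok
```

A loss of exactly 0.0 or a zero gradient norm at step 1 usually points at label masking or frozen parameters, which is why either one stops the run before full training.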
4c. Preflight summary — before full training, print and verify: reference URL, dataset columns, Hub target, monitoring target, NGC image, hardware, smoke loss/grad norm.
Gate: project files written, image built, smoke PASSED, preflight has no blank fields.
Goal: baseline eval, full training, post-train eval, optional LoRA merge, 5
inference samples (all commands: references/docker-runs.md §4-8).
| Sub-step | docker-runs.md | Skip if |
|---|---|---|
| 5a. Baseline eval (zero-shot) | §4 | skip_baseline: true |
| 5b. Full training (detached) | §5 | — |
| 5c. LoRA merge | §6 | not VLM+LoRA |
| 5d. Post-train eval | §7 | — |
| 5e. Inference (5 samples) | §8 | — |
Multi-GPU: launch with torchrun --nproc_per_node=$gpu_count train.py in place of python train.py.
While training streams, watch docker logs -f hft_train: loss should drop within
10-20 steps; flat loss (collator/label-masking bug), NaN (LR too high), and OOM
all stop the run — recovery in references/core-rules.md. If emit_report: true,
run report.py after Step 5e per references/reporting.md.
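The Step 5 gate that follows can be checked mechanically. The sketch below assumes the primary metric is stored as a top-level key in eval_results.json (the key name is passed in, since it depends on the task) and that each inference sample is one file; both are assumptions, and the function name is hypothetical.

```python
import json
from pathlib import Path


def step5_gate_failures(output_dir, lora=False, skip_baseline=False, primary_metric="accuracy"):
    """Return a list of unmet Step 5 gate conditions (empty list = pass)."""
    root = Path(output_dir)
    failures = []

    ckpt = root / "checkpoints" / ("merged" if lora else "final")
    if not ckpt.is_dir():
        failures.append(f"missing {ckpt}")

    eval_path = root / "reports" / "eval_results.json"
    try:
        value = json.loads(eval_path.read_text()).get(primary_metric)
    except (OSError, ValueError, AttributeError):
        value = None  # unreadable, invalid JSON, or not a JSON object
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        failures.append(f"{eval_path} has no numeric '{primary_metric}'")

    if not skip_baseline and not (root / "reports" / "baseline_results.json").is_file():
        failures.append("missing reports/baseline_results.json")

    samples = root / "reports" / "inference_samples"
    count = sum(1 for p in samples.iterdir() if p.is_file()) if samples.is_dir() else 0
    if count < 5:
        failures.append(f"reports/inference_samples/ has {count} samples, need 5")

    return failures
```

Returning the full list of failures, instead of stopping at the first, lets the final message report everything that still blocks the push.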
Gate: all of:
- checkpoints/final/ (or checkpoints/merged/ for LoRA) exists
- reports/eval_results.json has a numeric primary metric
- reports/baseline_results.json exists (unless skipped)
- reports/inference_samples/ has 5 samples

Goal: publish the run and make it reproducible without re-research.
Push per references/hub-push.md (weights, model card, eval/baseline JSONs,
config.yaml, Dockerfile, requirements.txt, inference samples, reports when
emitted) unless push_to_hub: false is explicit. Emit
<output_dir>/skills/run-<short>/SKILL.md from
references/pipeline-skill-template.md — substitute every placeholder, include
full YAML metadata + the NVIDIA copyright HTML comment, and make any emitter fail
if those are missing.
Gate (Done criteria): all of:
- … results/ (unless push_to_hub: false)
- <output_dir>/skills/run-<short>/SKILL.md exists, no <placeholder> left, with metadata + copyright HTML comment per pipeline-skill-template.md

Final message: wandb URL, HF Hub URL, baseline -> fine-tuned primary metric,
reports/inference_samples/, and the rerun skill path.
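A possible validator for the emitted SKILL.md is sketched below. It assumes placeholders are lowercase names in angle brackets (as in `<short>` and `<output_dir>`), that YAML metadata means front matter opening with `---` on the first line, and that the copyright comment resembles the one at the top of this file; the authoritative format is in references/pipeline-skill-template.md, and the function name is hypothetical.

```python
import re

# Assumed placeholder syntax: <lowercase_name>; HTML comments start with "<!"
# and therefore do not match.
PLACEHOLDER = re.compile(r"<[a-z_][a-z0-9_-]*>")
COPYRIGHT_COMMENT = re.compile(r"<!--.*?Copyright \(c\).*?NVIDIA CORPORATION.*?-->", re.S)


def skill_md_problems(text: str) -> list[str]:
    """Return the reasons an emitted SKILL.md should be rejected."""
    problems = []
    if not text.startswith("---\n"):
        problems.append("YAML metadata block missing")
    if not COPYRIGHT_COMMENT.search(text):
        problems.append("copyright HTML comment missing")
    leftovers = sorted(set(PLACEHOLDER.findall(text)))
    if leftovers:
        problems.append("unsubstituted placeholders: " + ", ".join(leftovers))
    return problems
```

An emitter would raise when this list is non-empty, which matches the requirement that emission fails instead of writing an incomplete rerun skill.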
On a known runtime error, consult the symptom → minimal-fix table in
references/error-playbook.md (NGC entrypoint, PyTorch/Transformers regressions,
numpy ABI, Albumentations bbox, PEFT/checkpointing, LoRA target breadth, CV
augmentation gaps, OOM at step 0) before redesigning anything. When a row there
fires twice across runs, lift it into compat-workarounds.md with a detect rule
— auto-applied in Step 1 before the error can fire.
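The "fires twice across runs" promotion rule can be tracked with a simple counter. In this sketch each run contributes a list of playbook row identifiers that fired; the identifiers and the function name are hypothetical, and a row is counted at most once per run, which is one reading of "across runs".

```python
from collections import Counter


def rows_to_promote(runs, threshold: int = 2) -> list[str]:
    """Playbook rows that fired in at least `threshold` distinct runs."""
    counts = Counter()
    for fired_rows in runs:
        counts.update(set(fired_rows))  # count each row once per run
    return sorted(row for row, n in counts.items() if n >= threshold)
```

Rows returned here are the candidates to lift into compat-workarounds.md with a detect rule, so the fix is applied in Step 1 on the next run.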