From togetherai-skills
Provisions and manages on-demand/reserved GPU clusters (H100, H200, B200) on Together AI with Kubernetes or Slurm orchestration, shared storage, credentials, and scaling for ML/HPC workloads.
Slash command
/togetherai-skills:together-gpu-clusters
Use Together AI GPU clusters when the user needs infrastructure control instead of a managed inference product.
Typical fits:
Use a different skill when another product fits better:
- together-dedicated-endpoints for managed single-model hosting
- together-dedicated-containers for containerized inference without owning the full cluster
- together-sandboxes for short-lived remote Python execution
- together-fine-tuning for managed training jobs instead of raw cluster operations

- [...] together>=2.0.0). If the user is on an older version, they must upgrade first: uv pip install --upgrade "together>=2.0.0".
- [...] shared_volume over creating a volume separately and attaching via volume_id. A separately created volume may land in a different datacenter partition than the cluster, which causes a "does not exist in the datacenter" error even when the volume shows as available.
- [...] list_regions() first and be prepared to try multiple regions.
- [...] cuda_version and nvidia_driver_version as separate fields in addition to the combined driver_version string. Pass them via extra_body in the Python SDK.
- [...] slurm.conf) are Slinky v1.0 only. A non-zero exit from a worker prolog or epilog drains the node, and calling Slurm commands (squeue, scontrol, sacctmgr) inside any prolog/epilog can deadlock the scheduler.

npx claudepluginhub togethercomputer/skills --plugin togetherai-skills
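The version requirement, the separate driver fields, and the advice to try multiple regions can be sketched as three small helpers. This is a minimal illustration and not the Together SDK's own API. The helper names (sdk_needs_upgrade, driver_extra_body, create_in_first_available_region) are hypothetical. So is create_cluster, a caller-supplied callable that would wrap whichever SDK call actually creates the cluster. Only the field names cuda_version and nvidia_driver_version and the 2.0.0 minimum come from the text above.

```python
import re
from typing import Any, Callable, Iterable, Tuple


def sdk_needs_upgrade(installed: str, minimum: Tuple[int, int, int] = (2, 0, 0)) -> bool:
    """Return True if a version string such as "1.5.31" is below `minimum`.

    Only the leading digits of the first three dot-separated parts are
    compared, so pre-release suffixes (e.g. "2.0.0rc1") are ignored.
    """
    parts = []
    for piece in installed.split(".")[:3]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts) < minimum


def driver_extra_body(cuda_version: str, nvidia_driver_version: str) -> dict:
    """Build a dict carrying the two separate driver fields named in the notes.

    The caller would pass this as `extra_body`; the exact SDK method that
    accepts it is not shown here and is left to the caller.
    """
    return {
        "cuda_version": cuda_version,
        "nvidia_driver_version": nvidia_driver_version,
    }


def create_in_first_available_region(
    regions: Iterable[str],
    create_cluster: Callable[[str], Any],  # hypothetical wrapper around the real SDK call
) -> Tuple[str, Any]:
    """Try each region in order and return (region, result) for the first success."""
    errors = {}
    for region in regions:
        try:
            return region, create_cluster(region)
        except Exception as exc:  # broad on purpose: SDK error types are not assumed
            errors[region] = exc
    raise RuntimeError(f"no region accepted the request: {errors}")
```

In practice the regions would come from list_regions(), and create_cluster would close over the cluster settings, including the extra_body dict. The installed version string can be read with importlib.metadata.version("together") from the standard library.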