From sagemaker-ai
Debug-only skill that identifies and classifies Slurm scheduler and node-daemon issues on Amazon SageMaker HyperPod clusters.
How this skill is triggered — by the user, by Claude, or both
Slash command
/sagemaker-ai:hyperpod-slurm-debugger
The summary Claude sees in its skill listing, used to decide when to auto-load this skill
Diagnostic-only. Identify and classify Slurm scheduler and node-daemon issues on HyperPod Slurm clusters. Do not run, recommend, or print any state-mutating command. For remediation, link to the official AWS or Slurm documentation.
Invoke when the user reports any of the symptoms in the decision table.
When NOT to invoke:
- Orchestrator.Eks: invoke hyperpod-node-debugger or hyperpod-nccl.
- hyperpod-node-debugger.
- hyperpod-nccl.
- hyperpod-ssm.

Canonical recovery URLs: references/slurm-details.md → Authoritative recovery documentation.
Prerequisites:
- sagemaker:DescribeCluster, sagemaker:ListClusterNodes
- ssm:StartSession on the HyperPod-created SSM document
- jq ≥ 1.6.
- unbuffer (from the expect package). Required: without it, aws ssm start-session returns empty stdout intermittently with Cannot perform start session: EOF, and every check silently misreports. Install: expect package on Amazon Linux / RHEL / Debian / Ubuntu / macOS. Script exits at prerequisite check if missing.

Ask the user for:
aws sagemaker describe-cluster --cluster-name <NAME/ARN> --region <REGION> \
--query 'Orchestrator' --output json
If Orchestrator.Eks is present, stop. Route per When NOT to invoke.
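
The orchestrator check can be reduced to a single text value before routing. This is a minimal sketch, not the skill's own logic: the helper name `route_by_orchestrator` is hypothetical, and it assumes the AWS CLI prints `None` in text mode when the queried field is absent.

```bash
# Hypothetical helper: decide whether Slurm diagnostics should continue.
# $1 is the text output of a read-only call such as:
#   aws sagemaker describe-cluster --cluster-name <NAME/ARN> --region <REGION> \
#     --query 'Orchestrator.Eks.ClusterArn' --output text
# Assumption: the CLI prints "None" when Orchestrator.Eks is absent.
route_by_orchestrator() {
  case "$1" in
    '')   echo "unknown: check describe-cluster output" ;;
    None) echo "slurm: continue" ;;
    *)    echo "eks: stop" ;;
  esac
}
```

Any non-empty value other than `None` is treated as an EKS cluster ARN, which matches the rule to stop and route to the EKS-side skills.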
bash scripts/slurm-diagnose.sh --cluster <NAME> --region <REGION>
# Scope to a node:
bash scripts/slurm-diagnose.sh --cluster <NAME> --region <REGION> --node <SLURM_NODE>
Relay the script output to the user verbatim.
For each finding, look up the section in the decision table and link the user to the corresponding AWS / Slurm doc. Do not type out remediation commands.
| Symptom (sinfo -o "%N %T %30E" or script finding) | Section |
|---|---|
| Node state = down or down*, reason other than below | A: Node Down |
| Node state = down*, Reason = Node unexpectedly rebooted | B: Unexpected Reboot |
| Jobs PENDING with REASON=Resources while nodes are idle | C: Controller State |
| Jobs stuck COMPLETING after node replacement | C: Controller State |
| scontrol ping returns DOWN for the controller | C: Controller State |
| GRES (GPU) counts incorrect or not released | C: Controller State |
| state=fail issued but no recovery occurred | D: Action Reason Mismatch |
| Accounting errors or RPC errors mentioning dbd | C: Controller State (slurmdbd) |
| slurm.conf edited; new partitions or nodes not visible | C: Controller State (config) |
| Job exited on a hardware failure but did not restart | E: Auto-resume |

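
The two node-state rows of the decision table can be expressed as a small lookup. This is a sketch only: the function name `classify_node` is hypothetical, it covers only sections A and B, and it assumes the state and reason have already been split out of the sinfo output.

```bash
# Hypothetical helper: map a node state and reason to a decision-table section.
# $1 = node state as printed by %T, $2 = reason as printed by %E
classify_node() {
  case "$1" in
    'down*')
      case "$2" in
        *"Node unexpectedly rebooted"*) echo "B: Unexpected Reboot" ;;
        *)                              echo "A: Node Down" ;;
      esac ;;
    down) echo "A: Node Down" ;;
    *)    echo "see decision table" ;;
  esac
}
```

Reasons contain spaces, so a delimiter such as `|` in the sinfo format string makes the two fields easier to separate before calling the helper.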
| Behavior | Default | Override |
|---|---|---|
| Mode | read-only — always; no remediation flag exists | n/a |
| Region | $AWS_DEFAULT_REGION, falling back to us-east-1 | --region <R> |
| Scope | all nodes in down / drain / fail / "unexpectedly rebooted" | --node <SLURM_NODE_NAME> |
| Output | colorized terminal | --no-color |
| SSM target format | sagemaker-cluster:<clusterId>_<instanceGroupName>-<instanceId> (derived) | n/a |
| Controller discovery | --controller-group (if set) → SlurmConfig.NodeType=Controller → provisioning_parameters.json | --controller-group <N> |
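
The SSM target format in the table is a plain string template. A minimal sketch of the derivation follows; the function name `ssm_target` is hypothetical, and taking `<clusterId>` from the final path segment of the cluster ARN is an assumption rather than something this page states.

```bash
# Hypothetical helper: build the SSM target string for one node.
# $1 = cluster ARN, $2 = instance group name, $3 = EC2 instance ID
# Assumption: <clusterId> is the last path segment of the cluster ARN.
ssm_target() {
  printf 'sagemaker-cluster:%s_%s-%s\n' "${1##*/}" "$2" "$3"
}
```

The result is what would be passed as `--target` to `aws ssm start-session`, run under `unbuffer` as the prerequisites require.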

| Failure | Skill behavior | Required user action |
|---|---|---|
| describe-cluster fails | Print AWS error; exit 1 | Fix credentials/region; verify cluster name |
| Cluster has Orchestrator.Eks | Exit 1 with pointer to EKS-side skills | Use hyperpod-node-debugger or hyperpod-nccl |
| session-manager-plugin missing / SSM unreachable | sinfo returns empty; exit 1 | Install plugin; verify node InService |
| Disk ≥ 95 % full on a down node | Report finding disk-full-<node> | Refer to AWS troubleshooting docs |
| Missing jq or aws | Exit 1 at prerequisite check | Install per Prerequisites |
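
A prerequisite check of the kind described in the last row might look like the following. It is a sketch under assumptions: the function name `missing_tools` is hypothetical, and the actual check in scripts/slurm-diagnose.sh may differ.

```bash
# Hypothetical helper: print each required command that is not on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Example: collect anything missing before running diagnostics.
# missing="$(missing_tools aws jq unbuffer session-manager-plugin)"
# [ -z "$missing" ] || { echo "missing: $missing" >&2; exit 1; }
```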

A: Node Down
Node is down because slurmd stopped responding. Causes: slurmd crash, disk full,
OOM, network partition, hardware fault.
Script checks: systemctl is-active slurmd, srun -w <NODE> hostname (RPC layer), disk,
memory.
If node returns to down after a manual resume → escalate to hyperpod-node-debugger.
Context: references/slurm-details.md § A.
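
The disk part of these checks can be illustrated with a read-only filter over `df -P` output, using the 95 % threshold from the failure table. This is a sketch: the function name `full_mounts` is hypothetical, and it does not handle mount points that contain spaces.

```bash
# Hypothetical helper: read `df -P` output on stdin and print mount points
# whose usage is at or above the threshold (default 95, per the failure table).
full_mounts() {
  awk -v limit="${1:-95}" '
    NR > 1 {                      # skip the header line
      use = $5
      sub(/%/, "", use)           # "96%" -> "96"
      if (use + 0 >= limit) print $6
    }'
}

# Example: df -P | full_mounts
```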

B: Unexpected Reboot
Node is down* with Reason "Node unexpectedly rebooted" because slurmd
re-registered after an out-of-band reboot. Upstream Slurm behavior, not HyperPod.
Node is typically healthy.
Links:
state=resume semantics)

If node reboots again within minutes → escalate to hyperpod-node-debugger.
Context: references/slurm-details.md § B.

C: Controller State
slurmctld in-memory state can desync from the on-disk state. A controller restart reloads from StateSaveLocation and clears bad caches. User decides and executes.
Restart may help:
| Symptom | Why |
|---|---|
| PENDING with REASON=Resources, idle nodes | Re-evaluates the queue |
| Jobs stuck COMPLETING after node replacement | Controller held a reference to the old node |
| GRES (GPU, EFA) not released after a job ends | Resource accounting de-synced |
| Nodes stuck Unknown after reboot, slurmd is up | Re-registration was not processed |
| scontrol ping times out | Controller event loop is hung |
| Lost connection to slurmdbd / RPC errors | DBD connection wedged |

Do NOT restart when:
- Action:Replace) in progress on any node: concurrent changes fail the replacement.
- slurmd on that node.
- sinfo and squeue are responsive: problem is elsewhere.
- journalctl -u slurmctld not reviewed yet: panic / OOM will reproduce.
- slurm.conf was just edited: try scontrol reconfigure first.

sacct fails, accounting fields show Unknown, controller log spams Unable to contact slurmdbd. Restore slurmdbd before considering controller restart.
https://slurm.schedmd.com/accounting.html · details.

slurm.conf / topology.conf mtime > slurmctld start. scontrol reconfigure first; restart is fallback.
https://slurm.schedmd.com/scontrol.html · details.

Restart procedure / what's preserved:
Context: references/slurm-details.md § C.
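
The "config mtime > slurmctld start" condition is a timestamp comparison and can be checked without changing anything. The sketch below assumes GNU `stat`; the function name `conf_newer_than` is hypothetical, and the config path and systemd query in the comments are assumptions about the controller node.

```bash
# Hypothetical helper: succeed if a config file was modified after the
# controller started. Read-only; requires GNU stat.
# $1 = config file path, $2 = controller start time in epoch seconds
conf_newer_than() (
  mtime="$(stat -c %Y "$1")" || exit 2
  [ "$mtime" -gt "$2" ]
)

# Example (path and unit name are assumptions):
# started="$(date -d "$(systemctl show slurmctld -p ActiveEnterTimestamp --value)" +%s)"
# conf_newer_than /opt/slurm/etc/slurm.conf "$started" && echo "config edited after controller start"
```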

D: Action Reason Mismatch
scontrol update state=fail reason=... was issued with a reason that does not match
Action:Reboot or Action:Replace exactly. HyperPod silently ignores anything else.
Script detects near-misses on nodes in fail state.
Required strings (case-sensitive, no whitespace, no punctuation):
- Action:Reboot
- Action:Replace

Context: references/slurm-details.md § Action reason-string validation.
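
The exact-match rule and the idea of a near-miss can be illustrated on a reason string that has already been read from a node. This is a sketch, not the script's detection logic: the function name `check_action_reason` is hypothetical, and the normalization used to define "near-miss" (lowercase, letters only) is an assumption.

```bash
# Hypothetical helper: classify a reason string read from a node in fail state.
# Prints "exact", "near-miss", or "unrelated". It only inspects the string.
check_action_reason() {
  case "$1" in
    Action:Reboot|Action:Replace) echo "exact"; return 0 ;;
  esac
  # Assumed near-miss rule: same letters once case, spaces, and punctuation are ignored.
  case "$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]' | tr -cd 'a-z')" in
    actionreboot|actionreplace) echo "near-miss" ;;
    *)                          echo "unrelated" ;;
  esac
}
```

Under this rule, `action:reboot`, `Action: Replace`, and `Action:Replace.` are all reported as near-misses, which are the cases HyperPod would silently ignore.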

E: Auto-resume
--auto-resume=1 is an srun step option. It re-runs the step after HMA (the Health
Monitoring Agent) flags a node and Automatic node recovery replaces it.
Why it didn't restart the job:
- sbatch not srun: per-step; sbatch directives are silently ignored.
- NodeRecovery is None: faulty nodes are labeled but not replaced.

Link: https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-resiliency-slurm-auto-resume.html
Context: references/slurm-details.md § HyperPod auto-resume.
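
Where the flag was placed can be checked by reading the job script. The sketch below is illustrative only: the function name `auto_resume_placement` is hypothetical, and it matches single-line `srun` invocations, so an `srun` command continued across lines with backslashes may be missed.

```bash
# Hypothetical helper: report where --auto-resume appears in a job script.
# $1 = path to the job script. Prints "srun", "sbatch-directive", or "absent".
auto_resume_placement() {
  if grep -Eq '^[[:space:]]*srun[[:space:]].*--auto-resume=1' "$1"; then
    echo "srun"
  elif grep -Eq '^#SBATCH[[:space:]].*--auto-resume' "$1"; then
    echo "sbatch-directive"
  else
    echo "absent"
  fi
}
```

A result of `sbatch-directive` corresponds to the first cause listed above: the option is per-step, so the directive has no effect.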

| Condition | Next skill |
|---|---|
| Node returns to down shortly after a manual resume | hyperpod-node-debugger (hardware) |
| slurmd logs contain CUDA / NVIDIA / XID errors | hyperpod-node-debugger § G |
| Disk full or /dev/shm exhausted | hyperpod-node-debugger § I |
| Node unreachable via SSM | hyperpod-ssm |
| Controller restart does not clear COMPLETING after 2 attempts | hyperpod-issue-report + AWS Support |

npx claudepluginhub awslabs/agent-plugins --plugin sagemaker-ai