From NVIDIA BioNeMo Agent Toolkit
Submits Proteina-Complexa pipelines (binder search, monomer design, distributed training) to a remote SLURM cluster via bash launcher scripts. Always dry-runs before submitting and emits a replayable manifest.
Slash command
/bionemo-agent-toolkit:complexa-slurm
Drive slurm_utils/launch_protein_binder_search.sh,
slurm_utils/launch_laproteina_design_pipeline.sh,
slurm_utils/launch_monomer_eval_from_pdb_dir.sh, and
slurm_utils/launch_laproteina_train.sh to submit Proteina-Complexa jobs to a
remote SLURM cluster. Probe the local + cluster environment first, gather the
right flags, always preview with --dry-run and surface the resolved sbatch
to the user before submission, then submit, capture the job IDs, and emit
slurm_manifest.json for replay.
No complexa CLI involvement. SLURM submission is bash-script-only: there is no complexa slurm subcommand. The launchers source .env, rsync the repo, generate sbatch scripts, and call sbatch over SSH. Don't try to replace them with ssh ... sbatch <<EOF; you'll lose rsync gating, sweeper expansion, and the --dry-run safety net.
- … --targets-file.
- Drives all three Complexa design pipelines: protein binder (default),
  ligand binder, and AME / enzyme scaffolding. Each is selected by passing the
  matching configs/search_*_pipeline.yaml to the launcher.
- … (launch_laproteina_design_pipeline.sh, default config
  design_monomer_pipeline.yaml) for unconditional protein generation.
- … (nnodes_, ngpus_per_node_, ncpus_per_task_train_, run_name).
- … --runtime.
- … (--sweeper) and per-config overrides (--override) across the cluster.
- … (--skip-download).

The local preflight (_shared/scripts/preflight.sh) covers GPU / ckpts / tools
on the local host. SLURM submission additionally needs cluster
reachability, an sbatch binary, and a live partition. Run both preflights.
bash .claude/skills/_shared/scripts/preflight.sh # local
bash .claude/skills/complexa-slurm/scripts/cluster_preflight.sh # cluster
cluster_preflight.sh writes cluster_preflight.json and checks:
- .env Section 5 (CLUSTER_USER, CLUSTER_HOST, CLUSTER_ACCOUNT,
  CLUSTER_PARTITION, CLUSTER_ROOT_REMOTE, CLUSTER_DATA_PATH,
  CLUSTER_RUNTIME) is populated. Runtime-specific:
  CLUSTER_CONTAINER_IMAGE (docker) or CLUSTER_UV_VENV + CLUSTER_CACHE_DIR
  (uv).
- ssh -o BatchMode=yes -o ConnectTimeout=5 $CLUSTER_USER@$CLUSTER_HOST true
  succeeds (passwordless SSH must already be set up).
- sbatch on $PATH.
- sinfo -p $CLUSTER_PARTITION -h returns at least one node.
- CLUSTER_CKPT_PATH and CLUSTER_SHARED_MODELS_PATH (if set) exist on the
  cluster.

Do not proceed to Step 2 if any required check fails; fix .env, SSH, or
permissions first. See reference/cluster_env.md for what each var means.
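
The .env part of the checks above can be approximated in a few lines before running the full preflight. This is a minimal sketch, not the logic of cluster_preflight.sh itself: the function name missing_cluster_vars is hypothetical, and it only tests that each required variable named above is non-empty in the current shell.

```shell
# Hypothetical helper: print the name of every required cluster variable
# that is unset or empty in the current shell.
missing_cluster_vars() {
  for name in CLUSTER_USER CLUSTER_HOST CLUSTER_ACCOUNT CLUSTER_PARTITION \
              CLUSTER_ROOT_REMOTE CLUSTER_DATA_PATH CLUSTER_RUNTIME; do
    # Indirect lookup via eval keeps this POSIX sh compatible.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      printf '%s\n' "$name"
    fi
  done
}

# Usage: source .env first, then
#   missing="$(missing_cluster_vars)"
#   [ -z "$missing" ] || { echo "Unset in .env: $missing" >&2; exit 1; }
```

The runtime-specific variables (CLUSTER_CONTAINER_IMAGE, or CLUSTER_UV_VENV plus CLUSTER_CACHE_DIR) are left out of the sketch because which ones are required depends on the runtime.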
| Intent | Launch script |
|---|---|
| Complexa design pipeline (protein binder, ligand binder, AME) | slurm_utils/launch_protein_binder_search.sh (pass the matching configs/search_*_pipeline.yaml) |
| LaProteina unconditional monomer design | slurm_utils/launch_laproteina_design_pipeline.sh |
| Monomer evaluation from PDB dir | slurm_utils/launch_monomer_eval_from_pdb_dir.sh |
| Distributed training (fine-tune, RL, base training) | slurm_utils/launch_laproteina_train.sh |
Full flag matrices in reference/slurm_workloads.md.
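
The intent-to-script table above can be expressed as a small lookup. This is an illustrative sketch: the function name launcher_for and the workload keywords are assumptions (only binder_search also appears in the manifest example), while the script paths are taken from the table.

```shell
# Hypothetical helper: map a workload keyword to its launch script.
# Keywords are illustrative; script paths come from the table above.
launcher_for() {
  case "$1" in
    binder_search)  echo "slurm_utils/launch_protein_binder_search.sh" ;;
    monomer_design) echo "slurm_utils/launch_laproteina_design_pipeline.sh" ;;
    monomer_eval)   echo "slurm_utils/launch_monomer_eval_from_pdb_dir.sh" ;;
    training)       echo "slurm_utils/launch_laproteina_train.sh" ;;
    *) echo "unknown workload: $1" >&2; return 1 ;;
  esac
}

# Usage:
#   script="$(launcher_for binder_search)" && "./$script" --dry-run 22_DerF21
```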
Ask only what's missing from context. Suggested order:
- Target: 22_DerF21, 02_PDL1, 39_7V11_LIGAND, or M0096_1chm; or
  --targets-file path for batch mode.
- Training config: configs/training_local_latents.yaml; ask if the user has a
  different one.
- Runtime: .env default, or force docker / uv?
- --sweeper FILE? (binder only)
- --targets-file FILE and --num-runs N? (binder only)
- --on-cluster if the user is sitting on the login node (skips SSH/rsync).
  Binder script only.
- --skip-download to leave outputs on the cluster. Default is to rsync them
  back.

Build the command and run it with --dry-run. Show the user the full output,
specifically the generated sbatch script body, the Hydra overrides, the
rsync target, and the resource block (--nodes, --gres=gpu, --time,
--partition). Get an explicit yes/no before submitting.
./slurm_utils/launch_protein_binder_search.sh --dry-run 22_DerF21
./slurm_utils/launch_laproteina_design_pipeline.sh --dry-run
./slurm_utils/launch_laproteina_train.sh --dry-run --config configs/training_local_latents.yaml
The dry-run logs "DRY RUN MODE — No changes will be made" and "Would create slurm_batch_script_*.sh with content: …". Surface that block verbatim.
Why this gate is non-negotiable: every real submission allocates an account
quota, schedules potentially many node-hours, may evict another user's job,
and creates a run directory on $CLUSTER_ROOT_REMOTE that the script refuses
to overwrite. A pre-submission read is the cheapest way to catch wrong
partitions, wrong run names, or a typo in a Hydra override.
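
The dry-run gate described above could be wrapped as follows. This is a sketch under assumptions: confirm and submit_with_gate are hypothetical names, --dry-run is placed before the remaining arguments as in the example commands, and the preview is assumed not to need stdin.

```shell
# Hypothetical helper: return 0 only on an explicit yes read from stdin.
confirm() {
  printf '%s [yes/no] ' "$1" >&2
  read -r answer || return 1
  case "$answer" in
    yes|y|Y|YES) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical gate: preview with --dry-run, ask, then submit for real.
# Arguments: the launcher, followed by its own flags and positionals.
submit_with_gate() {
  launcher="$1"; shift
  # Preview only. stdin is detached so the preview cannot consume the answer.
  "$launcher" --dry-run "$@" </dev/null || return 1
  confirm "Submit this job for real?" || { echo "Aborted." >&2; return 1; }
  "$launcher" "$@"
}

# Usage:
#   submit_with_gate ./slurm_utils/launch_protein_binder_search.sh 22_DerF21
```

Anything other than an explicit yes aborts, so an empty answer or a closed stdin never results in a submission.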
Re-run the same command without --dry-run. The launch scripts source .env,
rsync code under a .rsync_lock/, write slurm_batch_script_*.sh to the run
directory on the cluster, and call sbatch via SSH.
./slurm_utils/launch_protein_binder_search.sh 22_DerF21
./slurm_utils/launch_laproteina_design_pipeline.sh
./slurm_utils/launch_laproteina_train.sh --config configs/training_local_latents.yaml
Watch stdout for Submitted batch job <ID> (the helper logs
Submitted <stage> [i/N] with Job ID: <ID>). Training is submitted with
--dependency=singleton repeated --num-jobs times (default 5); record all
IDs. Binder search and LaProteina design pipelines submit stages sequentially
and call wait_for_job between stages, so the caller blocks until the next
stage's IDs are available.
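
Capturing the IDs from launcher output can be done with a small filter. This is a minimal sketch: extract_job_ids is a hypothetical name, and it assumes the two message shapes quoted above appear on their own lines with a purely numeric ID.

```shell
# Hypothetical helper: read launcher output on stdin and print one SLURM
# job ID per line. Matches both message shapes quoted above.
extract_job_ids() {
  sed -n \
    -e 's/.*Submitted batch job \([0-9][0-9]*\).*/\1/p' \
    -e 's/.*with Job ID: \([0-9][0-9]*\).*/\1/p'
}

# Usage (launch.log and job_ids.txt are assumed file names):
#   ./slurm_utils/launch_laproteina_train.sh --config configs/training_local_latents.yaml \
#     | tee launch.log | extract_job_ids > job_ids.txt
```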
Take one snapshot at submission time and surface it to the user. Do not poll — re-invocation of this skill is cheap.
ssh "$CLUSTER_USER@$CLUSTER_HOST" "squeue -j <id1>,<id2>,... -u $CLUSTER_USER"
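
The comma-separated ID list for squeue -j can be built from one ID per line. A small sketch, with join_job_ids as a hypothetical helper name:

```shell
# Hypothetical helper: turn one job ID per line (stdin) into the
# comma-separated list that squeue -j expects.
join_job_ids() {
  paste -s -d, -
}

# Usage (job_ids.txt is an assumed file with one ID per line):
#   ids="$(join_job_ids < job_ids.txt)"
#   ssh "$CLUSTER_USER@$CLUSTER_HOST" "squeue -j $ids -u $CLUSTER_USER"
```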
Log locations on the cluster (under the run directory
$CLUSTER_ROOT_REMOTE/<run_name>/):
| Script | Stage | Log path |
|---|---|---|
| binder search | generation | slurm_run_outputs/inf/slurm_<jobid>_<array>.{out,err} |
| binder search | filter | slurm_run_outputs/filter/slurm_<jobid>_<array>.{out,err} |
| binder search | evaluation | slurm_run_outputs/eval/slurm_<jobid>_<array>.{out,err} |
| binder search | analyze | slurm_run_outputs/agg/slurm_<jobid>_<array>.{out,err} |
| LaProteina design pipeline | gen/eval/agg | slurm_run_outputs/{gen,eval,agg}/... |
| training | all | slurm_run_outputs/slurm_<jobid>.{out,err} |
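
The binder-search rows of the table above can be turned into concrete paths. This is a sketch: binder_log_paths is a hypothetical helper, and the stage-to-subdirectory mapping is copied from the table rather than read from the launch scripts.

```shell
# Hypothetical helper: print the .out and .err log paths for one array task
# of a binder-search stage, following the table above.
# Arguments: run directory, stage, job ID, array task index.
binder_log_paths() {
  run_dir="$1"; stage="$2"; jobid="$3"; task="$4"
  case "$stage" in
    generation) sub=inf ;;
    filter)     sub=filter ;;
    evaluation) sub=eval ;;
    analyze)    sub=agg ;;
    *) echo "unknown stage: $stage" >&2; return 1 ;;
  esac
  for ext in out err; do
    printf '%s/slurm_run_outputs/%s/slurm_%s_%s.%s\n' \
      "$run_dir" "$sub" "$jobid" "$task" "$ext"
  done
}

# Usage:
#   binder_log_paths "$CLUSTER_ROOT_REMOTE/<run_name>" generation 1234567 0
```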
For follow-ups (job still running? failed?), have the user re-invoke this
skill, or ssh $CLUSTER_HOST and run squeue -u $CLUSTER_USER /
sacct -j <id> themselves.
When the user reports the job is done:
- Default: launch_*.sh auto-rsyncs results back when the pipeline
  finishes. Binder search downloads to
  <run_name>-<YYYY_MM_DD_HH>/{inference,evaluation_results} and aggregates
  via script_utils/aggregate_successful_samples/aggregate_successful_samples.py.
  LaProteina design pipeline downloads to
  results_downloaded/<run_name>-<YYYY_MM_DD_HH>/.
- --skip-download was passed: results stay at
  $CLUSTER_ROOT_REMOTE/<run_name>/ on the cluster. Pull manually with
  rsync -az $CLUSTER_USER@$CLUSTER_HOST:$CLUSTER_ROOT_REMOTE/<run_name>/ ./local_results/.
- --on-cluster was passed: pipeline ran on the login node; results stay
  under the current working directory on the cluster.

Training does not auto-download; checkpoints live at
$CLUSTER_ROOT_REMOTE/<run_name>/ (or $CLUSTER_CKPT_PATH if configured).
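
The <run_name>-<YYYY_MM_DD_HH> naming pattern can be reproduced when creating a matching local directory for a manual pull. This is a sketch only: download_dir_name is a hypothetical helper, and the source does not say which clock the launchers use for the stamp, so the local time of the call is assumed here.

```shell
# Hypothetical helper: build a <run_name>-<YYYY_MM_DD_HH> directory name.
# Assumption: the stamp is the local time at which this function is called.
download_dir_name() {
  printf '%s-%s\n' "$1" "$(date +%Y_%m_%d_%H)"
}

# Usage:
#   dest="$(download_dir_name my_run)"
#   mkdir -p "$dest"
```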
Drop a single manifest, slurm_manifest.json, into ./complexa_slurm/ so the
user has one file with everything needed for replay.
{
"timestamp": "2026-05-15T18:32:01Z",
"workload": "binder_search",
"launch_cmd": "./slurm_utils/launch_protein_binder_search.sh --runtime docker 22_DerF21",
"cluster_host": "login.mycluster.example.com",
"cluster_partition": "gpu,compute",
"cluster_account": "my_slurm_account",
"runtime": "docker",
"run_name": "search_binder_pipeline-search-22_DerF21",
"remote_run_dir": "/lustre/.../runs/search_binder_pipeline-search-22_DerF21",
"sbatch_scripts": [
"slurm_batch_script_gen.sh",
"slurm_batch_script_filter.sh",
"slurm_batch_script_eval.sh",
"slurm_batch_script_analyze.sh"
],
"slurm_job_ids": {"generation": "1234567", "filter": "...", "evaluation": "...", "analyze": "..."},
"git_sha": "abc1234"
}
The launch scripts already write .githash into the remote run dir; mirror
that SHA into the local manifest.
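
Writing the manifest could look roughly like this. It is a minimal sketch: write_manifest and the MANIFEST_DIR override are hypothetical, the runtime field is assumed to come from CLUSTER_RUNTIME, and remote_run_dir, sbatch_scripts, and slurm_job_ids are omitted for brevity.

```shell
# Hypothetical helper: write a minimal slurm_manifest.json.
# Arguments: workload, launch command, run name, git SHA.
# Values are inserted verbatim, so they must not contain double quotes or
# backslashes; a JSON-aware tool is safer for arbitrary input.
write_manifest() {
  out_dir="${MANIFEST_DIR:-./complexa_slurm}"
  mkdir -p "$out_dir" || return 1
  cat > "$out_dir/slurm_manifest.json" <<EOF
{
  "timestamp": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
  "workload": "$1",
  "launch_cmd": "$2",
  "cluster_host": "${CLUSTER_HOST:-}",
  "cluster_partition": "${CLUSTER_PARTITION:-}",
  "cluster_account": "${CLUSTER_ACCOUNT:-}",
  "runtime": "${CLUSTER_RUNTIME:-}",
  "run_name": "$3",
  "git_sha": "$4"
}
EOF
}

# Usage:
#   write_manifest binder_search \
#     "./slurm_utils/launch_protein_binder_search.sh 22_DerF21" \
#     search_binder_pipeline-search-22_DerF21 abc1234
```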
| Workload | Per-job nodes | GPUs/node | CPUs/task | Walltime (default) | Notes |
|---|---|---|---|---|---|
| Binder search (any stage) | 1 | 1 (gpus-per-node 1) | from ncpus_ in pipeline YAML | 04:00:00 | Job array, one job per config |
| LaProteina design pipeline (any stage) | 1 | 1 | from ncpus_ in pipeline YAML | 02:00:00 | Job array, one job per config |
| Training | from nnodes_ (e.g. 12) | from ngpus_per_node_ (e.g. 8) | from ncpus_per_task_train_ | 04:00:00 | Singleton requeue chain (canonical: 12 × 8 = 96 H100 binder finetune) |
Defaults come from generate_array_slurm_header / generate_train_slurm_header
in slurm_utils/slurm_helper.sh. Full per-script breakdown in
reference/slurm_workloads.md.
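
As a rough worked example of what the training row implies, the arithmetic below uses the canonical 12 × 8 layout, the 04:00:00 default walltime, and the default --num-jobs of 5. The GPU-hour figures are upper bounds that assume every job in the chain runs to its full walltime; actual usage depends on the run.

```shell
# Illustrative values from the table above and the default --num-jobs of 5.
nnodes=12
ngpus_per_node=8
walltime_hours=4
num_jobs=5

total_gpus=$((nnodes * ngpus_per_node))
gpu_hours_per_job=$((total_gpus * walltime_hours))
gpu_hours_chain=$((gpu_hours_per_job * num_jobs))

echo "GPUs per job: $total_gpus"                        # 96
echo "GPU-hours per job (max): $gpu_hours_per_job"      # 384
echo "GPU-hours for the chain (max): $gpu_hours_chain"  # 1920
```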
| Symptom | Cause | Fix |
|---|---|---|
| ssh: connect: Permission denied (publickey) | No passwordless SSH / wrong CLUSTER_SSH_KEY | ssh-copy-id, or set CLUSTER_SSH_KEY=/abs/path/key in .env |
| Directory ... already exists on remote | A prior run with the same run_name is on the cluster | ssh $CLUSTER_HOST "rm -rf $CLUSTER_ROOT_REMOTE/<run_name>", or change run_name in the YAML |
| Container sqsh file not found on cluster | CLUSTER_CONTAINER_IMAGE path wrong | Verify with ssh $CLUSTER_HOST 'ls -lh <image>', or switch to a registry URL |
| Missing required fields in <pipeline_config> | gen_njobs / eval_njobs / run_name not in YAML | Re-derive from the canonical pipeline YAML; do not edit a downstream copy |
| INVALID_QOS / account ... not allowed in partition | CLUSTER_ACCOUNT lacks access to CLUSTER_PARTITION | Pick a partition the account is bound to; ssh $CLUSTER_HOST "sshare -A $CLUSTER_ACCOUNT" |
| All array jobs PENDING (Resources) indefinitely | Partition oversubscribed or wrong reservation | Try a different partition; check sinfo -p $CLUSTER_PARTITION |
Long list (preemption, container pulls, missing checkpoints, rsync failures,
targets-file format, --on-cluster vs default) lives in
reference/troubleshooting.md.
References:
- .env Section 5 variables, defaults, what fails when each is missing.