Diagnostic and recovery guidance for swarm coordination issues. Use this skill when you encounter 'spawn failed', need to 'diagnose team', 'fix swarm', resolve 'status mismatch', perform 'recovery', troubleshoot kitty/tmux issues, or deal with session crashes, multiplexer problems, or teammate failures. Covers diagnostics, spawn failures, status mismatches, recovery procedures, and common error patterns.
This skill inherits all available tools. When active, it can use any tool Claude has access to.
This skill provides comprehensive diagnostic and recovery procedures for swarm coordination issues (a worked example is in `examples/spawn-failure-recovery.md`).
# You try to spawn a teammate
/claude-swarm:swarm-spawn "backend-dev" "backend-developer" "sonnet" "..."
# Error: Could not find a valid kitty socket
# 1. Run diagnostics to identify the issue
/claude-swarm:swarm-diagnose my-team
# Output shows: kitty socket not found at expected location
# 2. Check kitty config
grep -E 'allow_remote_control|listen_on' ~/.config/kitty/kitty.conf
# 3. Fix: Add to kitty.conf if missing
# allow_remote_control yes
# listen_on unix:/tmp/kitty-$USER
# 4. Restart kitty completely and retry spawn
# 1. Check if teammates are actually alive
/claude-swarm:swarm-verify my-team
# Output: backend-dev: not found (session crashed)
# 2. Find status mismatches
/claude-swarm:swarm-reconcile my-team
# Output: backend-dev marked active but session missing - recommend removal
# 3. Resume the team (respawns offline members)
/claude-swarm:swarm-resume my-team
# After rebooting, team config shows active but all sessions are gone
# 1. Check current state
/claude-swarm:swarm-status my-team
# Shows: 3 members active, but multiplexer shows no sessions
# 2. Reconcile to auto-detect mismatches
/claude-swarm:swarm-reconcile my-team --auto-fix
# Automatically marks offline sessions as inactive
# 3. Resume team to respawn all members
/claude-swarm:swarm-resume my-team
Quick diagnostic rule: Always start with /claude-swarm:swarm-diagnose <team> - it runs all health checks and points you to the specific issue.
When using delegation mode (default), a spawned team-lead handles coordination. This affects how you troubleshoot.
| Issue Type | Who Should Diagnose | Commands |
|---|---|---|
| Team-lead unresponsive | You (orchestrator) | /swarm-diagnose, /swarm-status |
| Worker issues | Team-lead (first), then you | Ask team-lead to run /swarm-diagnose |
| Communication failures | Team-lead (first) | Ask team-lead to check and report |
| Task management issues | Team-lead | Team-lead manages tasks |
If team-lead is working, ask them to diagnose:
/claude-swarm:swarm-message team-lead "Please run /swarm-diagnose and report any issues"
# Or be more specific:
/claude-swarm:swarm-message team-lead "Worker backend-dev seems stuck. Can you verify they're alive and check their status?"
Why delegate diagnosis? Team-lead has full context of the team state and can both diagnose and fix issues directly.
If team-lead isn't responding, diagnose directly:
# 1. Check team status
/claude-swarm:swarm-status my-team
# 2. Is team-lead alive?
# Look for "team-lead" in status output - does window exist?
# 3. Run full diagnostics
/claude-swarm:swarm-diagnose my-team
# 4. If team-lead crashed, respawn them
/claude-swarm:swarm-reconcile my-team
/claude-swarm:swarm-spawn "team-lead" "team-lead" "sonnet" "You are the team-lead. Check /swarm-inbox for context. Resume coordination."
Intervene yourself when:
Let team-lead handle when:
# View raw team state (bypassing team-lead)
/claude-swarm:swarm-status my-team
/claude-swarm:task-list
# Diagnose directly
/claude-swarm:swarm-diagnose my-team
# Message workers directly (if team-lead down)
/claude-swarm:swarm-message backend-dev "Team-lead is unresponsive. What's your current status?"
# Broadcast to all (emergency)
/claude-swarm:swarm-broadcast "Team-lead is down. Please pause work and report status."
Swarm coordination involves multiple moving parts: multiplexers (tmux/kitty), Claude Code processes, file system state, and network communication. When issues arise, systematic diagnosis is essential.
First, identify the symptom category:
Always start with diagnostics before attempting fixes:
# Comprehensive health check - runs all diagnostics
/claude-swarm:swarm-diagnose <team-name>
# Check if teammates are actually alive
/claude-swarm:swarm-verify <team-name>
# Find and report status mismatches
/claude-swarm:swarm-reconcile <team-name>
# View current team state (members, tasks, multiplexer)
/claude-swarm:swarm-status <team-name>
What these commands check:
Issue Detected
│
├─ Can't spawn teammates?
│ └─ Run: /claude-swarm:swarm-diagnose <team>
│ ├─ "Multiplexer not found" → Install tmux/kitty
│ ├─ "Socket not found" → Check kitty config, restart kitty
│ ├─ "Duplicate name" → Use unique name or check existing teammates
│ └─ "Timeout" → Check system resources, retry
│
├─ Status shows teammates but they're not responding?
│ └─ Run: /claude-swarm:swarm-verify <team>
│ └─ Shows "not found" → Sessions crashed
│ └─ Run: /claude-swarm:swarm-reconcile <team>
│ └─ Then: /claude-swarm:swarm-resume <team>
│
├─ Messages not being received?
│ └─ Check: /claude-swarm:swarm-status <team>
│ ├─ Teammate shows "offline" → Respawn teammate
│ ├─ Wrong agent name used → Check exact names
│ └─ Teammate not checking inbox → Send reminder
│
└─ Task commands failing?
└─ Run: /claude-swarm:task-list
└─ Verify task ID exists, check status values
## Common Issues
### Spawn Failures
Spawn failures are the most common issue when creating swarm teams. Understanding the spawn process helps diagnose failures quickly.
**How spawning works**:
1. Validate team name and agent name (no path traversal, special chars)
2. Detect multiplexer (kitty or tmux)
3. For kitty: Find valid socket, create window with environment variables
4. For tmux: Create new session with environment variables
5. Launch Claude Code process with model and initial prompt
6. Register window/session and update config
7. Wait for Claude Code to become responsive
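Step 2 of the list above can be sketched as follows. This is an illustrative sketch, not the plugin's actual detection code; the `SWARM_MULTIPLEXER` override and the `KITTY_PID` check are assumptions based on the variables documented later in this skill.

```shell
#!/usr/bin/env bash
# Hedged sketch of multiplexer detection (step 2 of the spawn process).
detect_multiplexer_sketch() {
  # Honor the user override documented in the environment-variable tables
  if [[ -n "${SWARM_MULTIPLEXER:-}" ]]; then
    printf '%s\n' "$SWARM_MULTIPLEXER"
    return 0
  fi
  # Prefer kitty when running inside a kitty window, else fall back to tmux
  if [[ -n "${KITTY_PID:-}" ]] && command -v kitten >/dev/null 2>&1; then
    printf 'kitty\n'
  elif command -v tmux >/dev/null 2>&1; then
    printf 'tmux\n'
  else
    printf 'none\n'
  fi
}
```

A `none` result here is exactly the "Neither tmux nor kitty is available" failure mode covered below.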
**Symptoms of spawn failure**:
- `spawn_teammate` or `/claude-swarm:swarm-spawn` returns error
- Error messages about multiplexer not found
- Session/window creation fails
- Timeout waiting for teammate to start
- Process starts but immediately crashes
**Immediate diagnostic steps**:
1. **Check error output** - The error message usually indicates root cause
2. **Run diagnostics**:
```bash
/claude-swarm:swarm-diagnose <team-name>
# For kitty users
kitten @ ls # Should list windows without error
# For tmux users
tmux list-sessions # Should list sessions without error
# Check Claude Code is working
claude --version # Should show version number
```

**Troubleshooting workflow**:
Spawn Command Fails
│
├─ Error mentions "multiplexer"?
│ └─ YES → See "Multiplexer Not Available" below
│
├─ Error mentions "socket"?
│ └─ YES → See "Kitty Socket Issues" below
│
├─ Error mentions "duplicate" or "already exists"?
│ └─ YES → See "Duplicate Agent Names" below
│
├─ Error mentions "timeout"?
│ └─ YES → See "Session Creation Timeout" below
│
├─ Error mentions "invalid" or "path traversal"?
│ └─ YES → See "Path Traversal Validation" below
│
└─ No clear error but spawn fails silently?
└─ Check: System resources, permissions, Claude Code installation
Common Causes:
Error:
Error: Neither tmux nor kitty is available
Solution:
# Install tmux (macOS)
brew install tmux
# Or install kitty
brew install --cask kitty
# Verify installation
which tmux # or: which kitty
Error:
Error: Agent name 'backend-dev' already exists in team
Solution:
# Use unique names
/claude-swarm:swarm-spawn "backend-dev-2" "backend-developer" "sonnet" "..."
# Or check existing teammates first
/claude-swarm:swarm-status <team-name>
Error (kitty):
Error: Could not find a valid kitty socket
Solution:
# 1. Verify kitty config has remote control enabled
grep -E 'allow_remote_control|listen_on' ~/.config/kitty/kitty.conf
# Should show:
# allow_remote_control yes
# listen_on unix:/tmp/kitty-$USER
# 2. Check socket exists (kitty appends -PID to path)
ls -la /tmp/kitty-$(whoami)-*
# 3. Test socket connectivity
kitten @ ls
# 4. Restart kitty completely if needed (not just reload)
# 5. Or manually set socket path
export KITTY_LISTEN_ON=unix:/tmp/kitty-$(whoami)-$KITTY_PID
Note: Kitty creates sockets at /tmp/kitty-$USER-$PID. The plugin auto-discovers the correct socket, but if you have multiple kitty instances, you may need to set KITTY_LISTEN_ON explicitly.
Deep dive on kitty socket discovery:
The spawn process tries sockets in this order:
1. `$KITTY_LISTEN_ON` environment variable (if set and valid)
2. `/tmp/kitty-$USER-$KITTY_PID` (exact match for the current kitty instance)
3. Remaining `/tmp/kitty-$USER-*` sockets (newest first)
4. `/tmp/kitty-$USER` (fallback)
5. `/tmp/mykitty` and `/tmp/kitty` (alternative locations)

Each socket is validated with `kitten @ --to $socket ls` before use. If validation fails, the search continues.
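The search order above can be sketched as a bash function. This is illustrative only: the `dir` parameter is added here for testability (the plugin searches `/tmp`), and the real discovery logic may differ in detail.

```shell
#!/usr/bin/env bash
# Hedged sketch of kitty socket discovery, following the documented order.
discover_kitty_socket() {
  local dir="${1:-/tmp}" user="${USER:-$(whoami)}" sock
  local candidates=()
  [[ -n "${KITTY_LISTEN_ON:-}" ]] && candidates+=("${KITTY_LISTEN_ON#unix:}")
  [[ -n "${KITTY_PID:-}" ]] && candidates+=("$dir/kitty-$user-$KITTY_PID")
  # Remaining per-user sockets, newest first
  while IFS= read -r sock; do
    candidates+=("$sock")
  done < <(ls -t "$dir"/kitty-"$user"-* 2>/dev/null)
  candidates+=("$dir/kitty-$user" "$dir/mykitty" "$dir/kitty")
  for sock in "${candidates[@]}"; do
    [[ -S "$sock" ]] || continue                    # must be a unix socket
    if kitten @ --to "unix:$sock" ls >/dev/null 2>&1; then
      printf 'unix:%s\n' "$sock"                    # first valid socket wins
      return 0
    fi
  done
  return 1                                          # nothing usable found
}
```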
Multiple kitty instances troubleshooting:
If you have multiple kitty windows open:
# List all kitty sockets
ls -la /tmp/kitty-$(whoami)-*
# Example output:
# /tmp/kitty-user-12345 (kitty window 1)
# /tmp/kitty-user-67890 (kitty window 2)
# Test each socket
kitten @ --to unix:/tmp/kitty-user-12345 ls
kitten @ --to unix:/tmp/kitty-user-67890 ls
# Set the correct socket for your team-lead window
export KITTY_LISTEN_ON=unix:/tmp/kitty-$(whoami)-$KITTY_PID
Configuration file location varies:
- Linux: `~/.config/kitty/kitty.conf`
- macOS: `~/.config/kitty/kitty.conf` or `~/Library/Preferences/kitty/kitty.conf`
- Confirm with: `kitty --debug-config | grep "Config file"`

Common kitty config issues:
- `listen_on` needs the `unix:` prefix: `listen_on unix:/path`, not `listen_on /path`

Example working kitty.conf:
# ~/.config/kitty/kitty.conf
allow_remote_control yes
listen_on unix:/tmp/kitty-$USER
# Note: $USER expands at kitty startup, then -$PID is appended automatically
Socket permission issues:
# Check socket permissions
ls -la /tmp/kitty-$(whoami)-*
# Should show: srw------- (socket, owner read-write only)
# If permissions are wrong:
# 1. Kill kitty completely
# 2. Remove old sockets: rm /tmp/kitty-$(whoami)-*
# 3. Restart kitty (will recreate with correct permissions)
Error:
Error: Invalid team name (path traversal detected)
Solution:
# Use simple team names without special characters
# Good: "auth-team", "feature-x", "bugfix_123"
# Bad: "../other-team", "team/name", "team..name"
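The naming rule above can be expressed as a simple check (a sketch; the plugin's actual validation may be stricter):

```shell
#!/usr/bin/env bash
# Accept only alphanumeric names with hyphens/underscores — this rejects
# path separators and traversal sequences like "../" by construction.
valid_swarm_name() {
  [[ "$1" =~ ^[A-Za-z0-9_-]+$ ]]
}
```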
Error:
Error: Timeout waiting for teammate session to start
Solution:
# Retry once (may be transient)
/claude-swarm:swarm-spawn "agent-name" ...
# Check system resources
top # Look for high CPU/memory usage
# Verify multiplexer is responsive
tmux list-sessions # or: kitten @ ls
Recovery Steps:
Symptoms:
Diagnosis:
/claude-swarm:swarm-reconcile <team-name>
This will report:
Common Causes:
Detection:
# Config shows active, but session doesn't exist
/claude-swarm:swarm-verify <team-name>
# Output: "Error: Session swarm-team-agent not found"
Solution:
# Run reconcile to update status
/claude-swarm:swarm-reconcile <team-name>
# Respawn the teammate
/claude-swarm:swarm-spawn "agent-name" "agent-type" "model" "prompt"
# Or resume the team (respawns all offline)
/claude-swarm:swarm-resume <team-name>
Detection: User manually killed tmux/kitty session outside of cleanup command
Solution:
# Reconcile will detect and fix
/claude-swarm:swarm-reconcile <team-name>
# Respawn if needed
/claude-swarm:swarm-spawn "agent-name" ...
Detection: Sessions killed but config files remain
Solution:
# Run cleanup properly
/claude-swarm:swarm-cleanup <team-name> --force
# Or manually remove config
rm ~/.claude/teams/<team-name>/config.json
Symptoms:
Diagnosis:
# Check team status
/claude-swarm:swarm-status <team-name>
# Verify teammate is alive
/claude-swarm:swarm-verify <team-name>
# Check inbox manually
cat ~/.claude/teams/<team-name>/inboxes/<agent-name>.json
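If the raw `cat` output is hard to read, a small `jq` check can confirm the inbox file is well-formed. This assumes the inbox is a top-level JSON array; the exact message schema is not documented here.

```shell
#!/usr/bin/env bash
# Sanity-check an inbox file: valid JSON and a top-level array.
inbox_message_count() {
  local inbox="$1"
  # jq -e sets a nonzero exit status on parse errors, so corruption
  # is caught before it breaks the inbox command
  jq -e 'if type == "array" then length else error("not an array") end' "$inbox"
}
```

A nonzero exit status here means the inbox needs the reset procedure described under "Inbox corruption".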
Common Causes:
Solution:
- Remind teammates to check /claude-swarm:swarm-inbox regularly

Error:
Error: Agent 'backend' not found in team
Solution:
# Check exact agent names
/claude-swarm:swarm-status <team-name>
# Use exact name from status output
/claude-swarm:swarm-message "backend-dev" "message" # Not "backend"
Symptoms: Inbox command fails or shows garbled output
Solution:
# Back up current inbox
cp ~/.claude/teams/<team-name>/inboxes/<agent>.json ~/.claude/teams/<team-name>/inboxes/<agent>.json.bak
# Reset inbox
echo '[]' > ~/.claude/teams/<team-name>/inboxes/<agent>.json
# Notify sender to resend messages
Symptoms:
Diagnosis:
# View current tasks
/claude-swarm:task-list
# Check task file directly
cat ~/.claude/tasks/<team-name>/tasks.json
Common Causes:
Error:
Error: Task #99 not found
Solution:
# List tasks to see valid IDs
/claude-swarm:task-list
# Use correct ID from list
/claude-swarm:task-update 3 --status "in-progress"
Error:
Error: Invalid status 'done'
Solution:
# Use valid status values:
# - pending
# - in-progress
# - blocked
# - in-review
# - completed
/claude-swarm:task-update 3 --status "completed" # Not "done"
Error:
Error: Agent 'frontend' not found in team
Solution:
# Check exact agent names
/claude-swarm:swarm-status <team-name>
# Use exact name
/claude-swarm:task-update 3 --assign "frontend-dev"
Symptoms:
Diagnosis:
# Check if team directory exists
ls -la ~/.claude/teams/<team-name>/
# Check permissions
ls -la ~/.claude/teams/
Common Causes:
Error:
Error: Team 'my-team' already exists
Solution:
# Choose different name
/claude-swarm:swarm-create "my-team-2" "description"
# Or cleanup old team first
/claude-swarm:swarm-cleanup "my-team" --force
Error:
Error: Permission denied creating ~/.claude/teams/my-team/
Solution:
# Fix permissions on Claude directory
chmod 700 ~/.claude/
chmod 700 ~/.claude/teams/
# Retry creation
/claude-swarm:swarm-create "my-team" "description"
Error:
Error: Invalid team name
Solution:
# Use alphanumeric with hyphens/underscores
# Good: "feature-auth", "bugfix_123", "team2"
# Bad: "../team", "team name", "team/123"
Choosing the right recovery strategy depends on the severity of the issue, how much work would be lost, and whether the team can continue working. This section provides decision-making guidance for recovery scenarios.
Problem Diagnosed
│
├─ Are teammates still working successfully?
│ └─ YES → Use Soft Recovery (minimal disruption)
│ ├─ 1-2 teammates offline → Respawn just those teammates
│ ├─ Status mismatch only → Run reconcile
│ └─ Communication issue → Fix inbox, notify teammates
│
├─ Is critical work in progress?
│ └─ YES → Evaluate data loss risk
│ ├─ Work saved to files/commits? → Safe to use Hard Recovery
│ ├─ Work only in memory/history? → Try Partial Recovery first
│ └─ Uncertain? → Ask teammates to save work first
│
├─ Is the team completely non-functional?
│ └─ YES → Assess what can be salvaged
│ ├─ Tasks/config readable? → Use Partial Recovery
│ ├─ Files corrupted? → Use Hard Recovery
│ └─ Everything broken? → Nuclear option (full reset)
│
└─ Is this a persistent/recurring issue?
└─ YES → After recovery, investigate root cause
├─ Check system resources (disk, memory, CPU)
├─ Review multiplexer logs
└─ Consider reducing team size
When to use:
What's preserved:
What's affected:
Step-by-step soft recovery:
/claude-swarm:swarm-status <team-name>
# Look for members showing "no window" with config "active"
/claude-swarm:swarm-reconcile <team-name>
# This marks offline sessions as offline in config
# Option A: Respawn specific teammate
/claude-swarm:swarm-spawn "agent-name" "agent-type" "model" "Continue where you left off: [context]"
# Option B: Resume entire team (respawns all offline)
/claude-swarm:swarm-resume <team-name>
/claude-swarm:swarm-verify <team-name>
# All teammates should show as active
# Via bash function
source "${CLAUDE_PLUGIN_ROOT}/lib/swarm-utils.sh" 1>/dev/null
broadcast_message "<team-name>" "Recovery complete. Team member [name] has been respawned. Continue your work."
Example soft recovery scenario:
Situation: 5-teammate team, 2 teammates crashed mid-work
1. $ /claude-swarm:swarm-status my-team
Output shows:
- team-lead: active (you)
- frontend-dev: active ✓
- backend-dev: active ✗ (no window)
- tester: active ✗ (no window)
- reviewer: active ✓
2. $ /claude-swarm:swarm-reconcile my-team
Output:
- Marked backend-dev as offline
- Marked tester as offline
3. $ /claude-swarm:swarm-resume my-team
Output:
- Respawning: backend-dev
- Respawning: tester
- Both spawned successfully
4. $ /claude-swarm:swarm-verify my-team
Output: All teammates active ✓
5. Message team: "backend-dev and tester were respawned after crash. Please continue your assigned tasks."
Result: Team back to full capacity in ~60 seconds, no data lost
When to use:
What's lost:
What's preserved:
Before hard recovery checklist:
# 1. Save task list for reference
/claude-swarm:task-list > tasks-backup.txt
# 2. Check for uncommitted work
git status
# 3. Ask teammates to commit their work (if any are responsive)
/claude-swarm:swarm-message "backend-dev" "Commit your work immediately, team restart needed"
# 4. Back up configs (optional)
cp ~/.claude/teams/<team-name>/config.json ~/config-backup.json
# 5. Document current state
/claude-swarm:swarm-status <team-name> > status-backup.txt
Step-by-step hard recovery:
/claude-swarm:swarm-cleanup <team-name> --force
# Check no sessions remain
tmux list-sessions | grep <team-name> # Should be empty
# or for kitty:
kitten @ ls | grep swarm-<team-name> # Should be empty
# Check team directory
ls ~/.claude/teams/<team-name>/
# Should not exist if --force was used
/claude-swarm:swarm-create <team-name> "Team description"
# Recreate each task manually
/claude-swarm:task-create "Implement API endpoints" "Full description..."
/claude-swarm:task-create "Write unit tests" "Test coverage for..."
# ... repeat for all tasks
/claude-swarm:swarm-spawn "backend-dev" "backend-developer" "sonnet" "You are the backend developer. Focus on: [task details]"
/claude-swarm:swarm-spawn "frontend-dev" "frontend-developer" "sonnet" "You are the frontend developer. Focus on: [task details]"
# ... repeat for all teammates
/claude-swarm:task-update 1 --assign "backend-dev"
/claude-swarm:task-update 2 --assign "frontend-dev"
/claude-swarm:swarm-verify <team-name>
/claude-swarm:swarm-status <team-name>
Timeline: Hard recovery typically takes 5-10 minutes for a 5-teammate team.
When to use:
Techniques:
When: Inbox file corrupted, messages malformed, inbox command errors
# Back up current inbox first
cp ~/.claude/teams/<team-name>/inboxes/<agent>.json ~/.claude/teams/<team-name>/inboxes/<agent>.json.bak
# Reset to empty inbox
echo '[]' > ~/.claude/teams/<team-name>/inboxes/<agent>.json
# Verify format
cat ~/.claude/teams/<team-name>/inboxes/<agent>.json
# Should output: []
# Notify affected teammate
/claude-swarm:swarm-message "<agent>" "Your inbox was reset due to corruption. Please check your backup if you need message history."
When: Task file has invalid status, corrupted JSON, missing fields
# Back up task file
cp ~/.claude/tasks/<team-name>/<id>.json ~/.claude/tasks/<team-name>/<id>.json.bak
# Fix manually with jq
jq '.status = "in-progress"' ~/.claude/tasks/<team-name>/<id>.json > /tmp/task-fixed.json
mv /tmp/task-fixed.json ~/.claude/tasks/<team-name>/<id>.json
# Or edit directly
# Edit the JSON file to fix the issue
# Verify task is valid
cat ~/.claude/tasks/<team-name>/<id>.json | jq '.'
# Should output valid JSON
When: One teammate crashed, others working fine
# 1. Check teammate is really offline
/claude-swarm:swarm-verify <team-name>
# 2. Update their status
/claude-swarm:swarm-reconcile <team-name>
# 3. Check their assigned tasks
/claude-swarm:task-list
# Note which tasks were assigned to this teammate
# 4. Respawn with context
/claude-swarm:swarm-spawn "<agent-name>" "<agent-type>" "<model>" "You crashed mid-work. Resume: [describe what they were doing, which files they were editing, what tasks to continue]"
# 5. Reassign their tasks
/claude-swarm:task-update <task-id> --assign "<agent-name>"
/claude-swarm:task-update <task-id> --comment "Teammate respawned, resuming work"
# 6. Notify teammate of their context
/claude-swarm:swarm-message "<agent-name>" "You were working on: [specific context]. Check Task #<id> for details."
When: Config shows wrong status, but files and sessions are fine
# Use reconcile for automatic fixing
/claude-swarm:swarm-reconcile <team-name> --auto-fix
# Or manual fix if you know the issue
# Edit config.json directly:
# 1. Back up: cp ~/.claude/teams/<team-name>/config.json ~/config-backup.json
# 2. Edit: jq '(.members[] | select(.name == "agent-name")) |= (.status = "active")' config.json > config-fixed.json
# 3. Replace: mv config-fixed.json ~/.claude/teams/<team-name>/config.json
| Symptom | Data Loss Risk | Recommended Strategy | Recovery Time |
|---|---|---|---|
| 1 teammate offline | None | Soft (respawn one) | 30 seconds |
| Multiple offline | None | Soft (resume team) | 1-2 minutes |
| Status mismatch only | None | Soft (reconcile) | 10 seconds |
| Inbox corruption | Messages lost | Partial (reset inbox) | 30 seconds |
| Task file corrupt | Comments lost | Partial (fix task) | 1-2 minutes |
| Config corrupt | History lost | Hard (recreate) | 5-10 minutes |
| Everything broken | All lost | Hard (full reset) | 10-15 minutes |
| Persistent failures | Depends | Diagnose root cause first | Varies |
Some issues require more than recovery:
Signs you need to investigate deeper:
Investigation steps:
# Check system resources
top
# Look for: high CPU usage, low free memory, swap usage
# Check disk space
df -h ~/.claude
# Ensure adequate free space (>1GB recommended)
# Check file descriptor limits
ulimit -n
# Should be >=256, ideally >=1024
# Check for zombie processes
ps aux | grep claude
# Kill any orphaned Claude Code processes
# Review system logs
# macOS: Console.app, filter for "claude" or "kitty"
# Linux: journalctl --user | grep claude
Prevention is significantly easier than recovery. Following these practices reduces issues by 80-90%.
Why this matters: Spawn failures may not be immediately obvious. A teammate might appear to spawn successfully but crash seconds later, or spawn without proper environment variables set.
Verification workflow:
# After spawning team, ALWAYS verify
/claude-swarm:swarm-verify <team-name>
# Expected output for healthy team:
# Verifying team 'my-team'...
# ✓ team-lead (team-lead) - session active
# ✓ backend-dev (backend-developer) - session active
# ✓ frontend-dev (frontend-developer) - session active
# All teammates verified successfully!
# Check detailed status
/claude-swarm:swarm-status <team-name>
What to look for:
If verification fails immediately after spawn:
# Wait 5-10 seconds for Claude Code to fully initialize
sleep 10
/claude-swarm:swarm-verify <team-name>
# If still failing, check what's wrong
/claude-swarm:swarm-diagnose <team-name>
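The wait-then-reverify pattern above can be wrapped in a small helper (a sketch; the attempt count and delay are arbitrary defaults, not plugin behavior):

```shell
#!/usr/bin/env bash
# Retry a verification command a few times with a delay, giving freshly
# spawned Claude Code processes time to initialize.
verify_with_retry() {
  local attempts="$1" delay="$2"; shift 2
  local i
  for ((i = 1; i <= attempts; i++)); do
    if "$@"; then
      return 0
    fi
    echo "Verification attempt $i/$attempts failed; retrying in ${delay}s..." >&2
    sleep "$delay"
  done
  return 1
}
# Usage (slash command shown for illustration; any command works):
# verify_with_retry 3 10 /claude-swarm:swarm-verify my-team
```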
Slash commands have built-in validation, error handling, and safer parameter parsing compared to direct bash function calls.
Comparison:
# Slash command (RECOMMENDED)
/claude-swarm:swarm-spawn "backend-dev" "backend-developer" "sonnet" "Implement API"
# Direct bash function (AVOID unless necessary)
source "${CLAUDE_PLUGIN_ROOT}/lib/swarm-utils.sh" 1>/dev/null
spawn_teammate "team" "backend-dev" "backend-developer" "sonnet" "Implement API"
Slash command advantages:
When bash functions are acceptable:
Never retry blindly - understand why it failed first:
# BAD: Blind retry loop
for i in {1..5}; do
/claude-swarm:swarm-spawn "agent" "worker" "sonnet" "prompt" && break
done
# GOOD: Diagnose then fix
if ! /claude-swarm:swarm-spawn "agent" "worker" "sonnet" "prompt"; then
echo "Spawn failed, diagnosing..."
/claude-swarm:swarm-diagnose <team-name>
# Read diagnostic output, fix the issue, then retry once
# Example: Install missing multiplexer, fix socket, etc.
# Retry after fix
/claude-swarm:swarm-spawn "agent" "worker" "sonnet" "prompt"
fi
Error handling best practices:
if ! /claude-swarm:swarm-spawn "agent" "worker" "sonnet" "prompt" 2> spawn-error.log; then
echo "Spawn failed. Error log:"
cat spawn-error.log
# Now you have error details for debugging
fi
# Don't wait forever for unresponsive operations
# (GNU coreutils `timeout`; on macOS, install coreutils first)
timeout 30s /claude-swarm:swarm-verify <team-name>
# Before spawning team, check prerequisites
# (detect_multiplexer is provided by the plugin's swarm-utils.sh)
source "${CLAUDE_PLUGIN_ROOT}/lib/swarm-utils.sh" 1>/dev/null
if [[ "$(detect_multiplexer)" == "none" ]]; then
echo "Error: No multiplexer available. Install tmux or kitty first."
exit 1
fi
For long-running teams (multiple hours or days), periodic health checks prevent gradual degradation.
Recommended check frequency:
Health check script:
#!/bin/bash
# save as: health-check.sh
TEAM="$1"
echo "=== Health Check: $TEAM ==="
echo ""
# Check for status drift
echo "Checking for status mismatches..."
/claude-swarm:swarm-reconcile "$TEAM"
# Verify all teammates
echo ""
echo "Verifying teammate sessions..."
/claude-swarm:swarm-verify "$TEAM"
# Check task progress
echo ""
echo "Task summary..."
/claude-swarm:task-list | grep -E "in-progress|blocked"
# Done
echo ""
echo "Health check complete!"
Automated monitoring (for critical/long-running teams):
# Add to cron or run in background
while true; do
/claude-swarm:swarm-verify <team-name> || {
echo "Health check failed at $(date)"
/claude-swarm:swarm-diagnose <team-name>
# Send notification, page on-call, etc.
}
sleep 900 # Check every 15 minutes
done
Why proper cleanup matters:
Cleanup best practices:
# Standard cleanup (safe, preserves files for reference)
/claude-swarm:swarm-cleanup <team-name>
# This kills sessions but leaves:
# - Config files
# - Task files
# - Inbox files
# - Logs
# Force cleanup (removes everything)
/claude-swarm:swarm-cleanup <team-name> --force
# This kills sessions AND removes:
# - ~/.claude/teams/<team-name>/
# - ~/.claude/tasks/<team-name>/
When to use each:
What NOT to do:
# NEVER manually delete while sessions are running
rm -rf ~/.claude/teams/<team-name>/ # Leaves orphaned sessions!
# NEVER kill sessions without cleanup
tmux kill-session -t swarm-<team>-<agent> # Leaves config!
# ALWAYS use cleanup commands
/claude-swarm:swarm-cleanup <team-name>
Cleanup verification:
# After cleanup, verify nothing remains
tmux list-sessions | grep <team-name> # Should be empty
ls ~/.claude/teams/<team-name>/ # Should not exist (if --force used)
Why monitoring matters: Large teams (5+ teammates) can consume significant resources. Each Claude Code process uses:
Resource monitoring:
# Check total Claude Code memory usage ([c]laude keeps grep itself out of the match)
ps aux | grep '[c]laude' | awk '{sum+=$4} END {print "Total memory: " sum "%"}'
# Count active Claude processes
pgrep -c claude
# Check file descriptor usage (lsof -p takes a comma-separated PID list)
lsof -p "$(pgrep claude | paste -sd, -)" | wc -l
# Monitor system load
uptime
# Load average should be below CPU core count
Resource limits:
| Team Size | RAM Needed | Recommended System |
|---|---|---|
| 2-3 teammates | 2-3 GB | 8GB RAM minimum |
| 4-6 teammates | 3-5 GB | 16GB RAM recommended |
| 7-10 teammates | 6-8 GB | 32GB RAM recommended |
| 10+ teammates | 10+ GB | Not recommended without testing |
When to scale back:
# Reduce team size gracefully
# 1. Finish critical tasks
# 2. Have teammates commit work
# 3. Kill non-essential teammates
/claude-swarm:swarm-cleanup <team-name> # Kills sessions; team files are preserved without --force
# 4. Consolidate work across fewer teammates
Problem: Respawned teammates don't know what they were doing
Solution: Provide comprehensive initial prompts
Bad initial prompt:
/claude-swarm:swarm-spawn "backend-dev" "backend-developer" "sonnet" "Work on the backend"
Good initial prompt:
/claude-swarm:swarm-spawn "backend-dev" "backend-developer" "sonnet" "You are the backend developer for team my-team. Your tasks: 1) Implement /api/users endpoint in src/api/users.ts, 2) Add database schema in migrations/. Current status: API routes defined, need implementation. Coordinate with frontend-dev for API contract. Check Task #3 for full requirements."
Initial prompt template:
You are the [ROLE] for team [TEAM_NAME].
Your assigned tasks:
1. [TASK_1] - [STATUS]
2. [TASK_2] - [STATUS]
Current state:
- [What's done]
- [What's in progress]
- [What's blocked/dependencies]
Key files:
- [FILE_1]: [Description]
- [FILE_2]: [Description]
Coordinate with:
- [TEAMMATE_1]: [for what]
- [TEAMMATE_2]: [for what]
First action: [Specific next step]
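A helper can fill this template programmatically when respawning (a sketch; the field names mirror the template above and are not a plugin API):

```shell
#!/usr/bin/env bash
# Build a respawn prompt from the template fields, ready to pass as the
# last argument of /claude-swarm:swarm-spawn.
build_respawn_prompt() {
  local role="$1" team="$2" tasks="$3" next="$4"
  cat <<EOF
You are the $role for team $team.
Your assigned tasks:
$tasks
First action: $next
EOF
}
```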
For teams lasting >1 hour, document the architecture:
# Create team docs
cat > ~/.claude/teams/<team-name>/README.md <<EOF
# Team: <team-name>
## Purpose
[What this team is building]
## Members
- team-lead: Orchestration, task assignment
- backend-dev: API implementation, database
- frontend-dev: UI components, styling
- tester: Test coverage, QA
## Task Breakdown
- Task #1: [Description] - assigned to backend-dev
- Task #2: [Description] - assigned to frontend-dev
- Task #3: [Description] - assigned to tester
## Dependencies
- Task #2 depends on Task #1 (API contract)
- Task #3 depends on Task #1, #2 (working features)
## Recovery Notes
- If backend-dev crashes: They were editing src/api/, check git status
- If frontend-dev crashes: They were in src/components/, state in localStorage
EOF
This documentation is invaluable for recovery scenarios.
Symptoms:
Diagnosis:
# Check Claude Code process resource usage
ps aux | grep claude | sort -k3 -r # Sort by CPU%
ps aux | grep claude | sort -k4 -r # Sort by memory%
# Check individual teammate resource usage
# Note: plain `ps aux` does not show environment variables. To find the PID
# of a specific teammate on Linux, search /proc (on macOS, try `ps -E`):
grep -la "CLAUDE_CODE_AGENT_NAME=backend-dev" /proc/*/environ 2>/dev/null
# Monitor live resource usage (macOS syntax; on Linux use: top -p <PID>)
top -pid <PID>
Common causes and solutions:
# Solution: Reduce team size, use lighter models
# Replace opus with sonnet, sonnet with haiku for non-critical tasks
/claude-swarm:swarm-spawn "tester" "tester" "haiku" "Run existing tests"
# Solution: Periodic restarts for long-lived teammates (>4 hours)
# 1. Ask teammate to commit work
# 2. Kill and respawn
# 3. Reassign tasks
# Check disk I/O
iostat -x 1 5 # Run 5 samples, 1 second apart
# Look for high %util on disk with ~/.claude
# Solution: Move ~/.claude to faster disk (SSD)
# Or reduce concurrent file operations
Kitty slowness:
# Check kitty window count
kitten @ ls | jq '[.[].tabs[].windows[]] | length'
# If >50 windows total, kitty may slow down
# Solution: Use SWARM_KITTY_MODE=os-window for separate processes
export SWARM_KITTY_MODE=os-window
/claude-swarm:swarm-spawn ...
Tmux slowness:
# Check tmux session count
tmux list-sessions | wc -l
# If >20 sessions, consider cleanup
# Solution: Clean up old swarm sessions
for session in $(tmux list-sessions -F '#{session_name}' | grep swarm-); do
  # WARNING: this loop as written kills every swarm-* session it finds;
  # verify a session does not belong to an active team before killing it
tmux kill-session -t "$session"
done
Symptoms:
Solutions:
# 1. Reduce team size to stay under rate limits
# 2. Stagger teammate spawning (wait 10s between spawns)
for agent in backend frontend tester; do
/claude-swarm:swarm-spawn "$agent" ...
sleep 10
done
# 3. Use haiku model for lightweight tasks (lower API load)
/claude-swarm:swarm-spawn "tester" "tester" "haiku" "Run unit tests"
Teammate completely frozen:
# 1. Find the teammate's process (plain ps does not show env vars;
#    on Linux, search /proc for the agent-name variable)
grep -la "CLAUDE_CODE_AGENT_NAME=backend-dev" /proc/*/environ 2>/dev/null
# 2. Send SIGTERM (graceful shutdown)
kill <PID>
# 3. If still frozen after 30s, force kill
kill -9 <PID>
# 4. Clean up and respawn
/claude-swarm:swarm-reconcile <team-name>
/claude-swarm:swarm-spawn "backend-dev" ...
Multiplexer frozen:
# Kitty frozen
# 1. Try sending command
kitten @ ls
# If hangs, kill kitty: killall kitty
# Tmux frozen
# 1. Try listing sessions
tmux list-sessions
# If hangs, kill tmux server: tmux kill-server
When to use: Everything is completely broken, no recovery methods work, starting over is the only option.
WARNING: This destroys ALL team data across ALL teams. Only use as absolute last resort.
What gets destroyed:
Before nuking:
# 1. Save what you can
tar -czf ~/swarm-backup-$(date +%Y%m%d-%H%M%S).tar.gz ~/.claude/teams/ ~/.claude/tasks/
# 2. Document current state
/claude-swarm:swarm-list-teams > ~/teams-backup.txt
for team in $(cat ~/teams-backup.txt); do
/claude-swarm:swarm-status "$team" > ~/${team}-status.txt
/claude-swarm:task-list >> ~/${team}-tasks.txt
done
# 3. Notify any responsive teammates
# (They'll lose their work context)
Full reset procedure:
# 1. Kill all swarm sessions
tmux kill-server # Kills ALL tmux sessions
# or for kitty:
for window in $(kitten @ ls | jq -r '.[].tabs[].windows[] | select(.user_vars | keys | any(startswith("swarm_"))) | .id'); do
kitten @ close-window --match "id:$window"
done
# 2. Remove all swarm data
rm -rf ~/.claude/teams/
rm -rf ~/.claude/tasks/
# 3. Verify cleanup
ls ~/.claude/teams/ # Should not exist
ls ~/.claude/tasks/ # Should not exist
# 4. Recreate directories with proper permissions
mkdir -p ~/.claude/teams/
mkdir -p ~/.claude/tasks/
chmod 700 ~/.claude/teams/
chmod 700 ~/.claude/tasks/
# 5. Start fresh with new team
/claude-swarm:swarm-create "new-team" "Fresh start after full reset"
# 6. Verify clean state
/claude-swarm:swarm-status "new-team"
After nuclear reset:
Recovery timeline: 15-30 minutes to rebuild team from scratch.
For deep investigation:
# List all tmux sessions
tmux list-sessions
# Attach to specific teammate session (view their work)
tmux attach-session -t swarm-<team>-<agent>
# Check socket status
ls -la ~/.claude/sockets/
# View raw config
cat ~/.claude/teams/<team-name>/config.json
# View raw tasks
cat ~/.claude/tasks/<team-name>/tasks.json
# View raw inbox
cat ~/.claude/teams/<team-name>/inboxes/<agent>.json
When debugging, these environment variables are set for spawned teammates:
| Variable | Description |
|---|---|
CLAUDE_CODE_TEAM_NAME | Current team name |
CLAUDE_CODE_AGENT_ID | Agent's unique UUID |
CLAUDE_CODE_AGENT_NAME | Agent name (e.g., "backend-dev") |
CLAUDE_CODE_AGENT_TYPE | Agent role type |
CLAUDE_CODE_TEAM_LEAD_ID | Team lead's UUID |
CLAUDE_CODE_AGENT_COLOR | Agent display color |
KITTY_LISTEN_ON | Kitty socket path (kitty only) |
User-configurable:
| Variable | Description | Default |
|---|---|---|
SWARM_MULTIPLEXER | Force "tmux" or "kitty" | Auto-detect |
SWARM_KITTY_MODE | Kitty spawn mode | split |
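To confirm a teammate's session actually received these variables, dump the swarm-related environment from inside that session (a sketch; the grep pattern is derived from the variable prefixes in the tables above):

```shell
#!/usr/bin/env bash
# Print swarm-related variables from the current environment, sorted.
swarm_env() {
  env | grep -E '^(CLAUDE_CODE_|KITTY_LISTEN_ON=|SWARM_)' | sort
}
```

An empty result inside a supposedly spawned teammate session is a strong sign the spawn did not set up the environment correctly.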
| Issue | Quick Fix |
|---|---|
| Spawn fails | Run /claude-swarm:swarm-diagnose |
| Status mismatch | Run /claude-swarm:swarm-reconcile |
| Session crashed | Run /claude-swarm:swarm-resume |
| Messages not received | Verify agent name, check inbox |
| Invalid task ID | Run /claude-swarm:task-list to see IDs |
| Team creation fails | Check permissions, use valid name |
| Kitty socket not found | Check listen_on in kitty.conf, restart kitty |
| Cleanup incomplete | Use --force flag |
For more detailed information, see the error-handling reference documentation.