---
name: runbooks-structure
description: Use when creating structured operational runbooks for human operators. Covers runbook organization, documentation patterns, and best practices for clear operational procedures.
allowed-tools:
---
# Runbook Structure

Creating clear, actionable runbooks for operational tasks, maintenance, and troubleshooting.

## Basic Structure

A runbook is step-by-step documentation for operational tasks:
```markdown
# Service Name: Task/Issue

## Overview
Brief description of what this runbook covers.

## Prerequisites
- Required access/permissions
- Tools needed
- Knowledge required

## Steps

### 1. First Step
Detailed instructions for first action.

### 2. Second Step
Detailed instructions for second action.

## Validation
How to verify the task was completed successfully.

## Rollback (if applicable)
How to undo the changes if needed.
```
## Complete Runbook Example

````markdown
# [Service]: [Task/Issue Title]

**Last Updated:** 2025-01-15
**Owner:** Platform Team
**Severity:** High/Medium/Low
**Estimated Time:** 15 minutes
## Overview
Brief description of the problem or task this runbook addresses.
## When to Use This Runbook
- Alert fired: `high_cpu_usage`
- Customer report: slow response times
- Scheduled maintenance window
## Prerequisites
- [ ] VPN access to production network
- [ ] AWS console access (read/write)
- [ ] kubectl configured for production cluster
- [ ] Slack access to #incidents channel
## Context
### Architecture Overview
Brief explanation of relevant system architecture.
### Common Causes
- Database connection pool exhaustion
- Memory leaks in worker processes
- Third-party API rate limiting
## Diagnosis Steps

### 1. Check System Health

```bash
# Check pod status
kubectl get pods -n production

# Expected output: All pods Running
```

**Decision Point:** If pods are CrashLooping, proceed to step 2. Otherwise, skip to step 3.

### 2. Check Application Logs

```bash
# View recent logs
kubectl logs -n production deployment/api-server --tail=100
```

Look for errors matching the common causes above (connection pool exhaustion, memory leaks, rate limiting).
### 3. Check Database Metrics

```bash
# Connect to RDS metrics (a time window and statistic are required flags)
aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name DatabaseConnections \
  --dimensions Name=DBInstanceIdentifier,Value=prod-db \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 300 \
  --statistics Average
```
## Resolution Steps

### Option 1: Restart the Unhealthy Pod

1. Delete the pod gracefully (Kubernetes recreates it automatically; note that `kubectl drain` applies to nodes, not pods):

   ```bash
   kubectl delete pod api-server-abc123 -n production
   ```

2. Verify the new pod is healthy:

   ```bash
   kubectl get pods -n production | grep api-server
   ```

3. Monitor for 5 minutes: check the Datadog dashboard for error rates.
### Option 2: Increase the Database Connection Pool

1. Edit the config:

   ```bash
   # Edit configmap
   kubectl edit configmap app-config -n production
   # Change: DB_POOL_SIZE: "10"
   # To:     DB_POOL_SIZE: "20"
   ```

2. Restart the deployment:

   ```bash
   kubectl rollout restart deployment/api-server -n production
   ```

3. Monitor the rollout:

   ```bash
   kubectl rollout status deployment/api-server -n production
   kubectl get pods -n production
   ```

## Rollback

If the fix causes issues:

```bash
# Rollback to previous version
kubectl rollout undo deployment/api-server -n production

# Verify rollback
kubectl rollout status deployment/api-server -n production
```
## Change Log

| Date | Author | Changes |
|---|---|---|
| 2025-01-15 | Alice | Added validation steps |
| 2025-01-10 | Bob | Initial version |
````
## Runbook Organization
### Directory Structure
```
runbooks/
├── README.md                    # Index of all runbooks
├── templates/
│   ├── troubleshooting.md       # Template for troubleshooting
│   ├── maintenance.md           # Template for maintenance tasks
│   └── incident-response.md     # Template for incidents
├── services/
│   ├── api-gateway/
│   │   ├── high-latency.md
│   │   ├── connection-errors.md
│   │   └── scaling.md
│   └── database/
│       ├── slow-queries.md
│       └── replication-lag.md
├── infrastructure/
│   ├── kubernetes/
│   │   ├── pod-restart.md
│   │   └── node-drain.md
│   └── networking/
│       └── firewall-rules.md
└── on-call/
    ├── first-responder.md
    ├── escalation-guide.md
    └── common-alerts.md
```
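The index in `README.md` can be kept in sync with the tree automatically. A minimal sketch, assuming each runbook starts with a `# Title` heading (the script itself is illustrative, not part of the layout above):

```bash
#!/usr/bin/env bash
# Regenerate runbooks/README.md from each runbook's first-line title.
cd runbooks
{
  echo "# Runbook Index"
  echo
  find . -name '*.md' ! -name 'README.md' | sort | while read -r f; do
    title=$(head -n 1 "$f" | sed 's/^#* *//')   # strip the leading "# " from the title line
    echo "- [${title}](${f#./})"
  done
} > README.md
```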
## Best Practices
### Write for Clarity
````markdown
# Good: Clear, specific steps

## Restart the API Server

1. Check current pod status:

   ```bash
   kubectl get pods -n production -l app=api-server
   ```

2. Delete the pod (it will auto-restart):

   ```bash
   kubectl delete pod <pod-name> -n production
   ```

3. Verify the new pod is running:

   ```bash
   kubectl get pods -n production -l app=api-server
   ```

   Expected: STATUS = Running
````
```markdown
# Bad: Vague, assumes knowledge

## Fix API Issues

1. Restart the thing
2. Check if it works
3. If not, try something else
```
### Include Decision Points

````markdown
## Diagnosis

1. Check if the service is responding:

   ```bash
   curl -f https://api.example.com/health
   ```

   - **If it succeeds:** Service is up. Check application logs (Step 3).
   - **If it fails:** Service is down. Check pod status (Step 2).

2. Check pod status:

   ```bash
   kubectl get pods -n production
   ```

   - **If CrashLooping:** Check logs for errors (Step 4).
   - **If Pending:** Check node resources (Step 5).
   - **If Running:** Service should be up. Check DNS (Step 6).
````
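The same decision tree can be scripted so a first responder runs one command. A minimal sketch, assuming the health endpoint and namespace shown above:

```bash
#!/usr/bin/env bash
# Walk the first diagnosis decision point automatically.
if curl -fsS https://api.example.com/health > /dev/null; then
  echo "Service is up -- check application logs next"
else
  echo "Service is down -- checking pod status"
  kubectl get pods -n production
fi
```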
### Provide Expected Outputs
````markdown
## Check Database Connection

```bash
psql -h prod-db.example.com -U app -c "SELECT 1"
```

Expected output:

```
 ?column?
----------
        1
(1 row)
```

If you see "connection refused":
````
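The same check can be made machine-readable for pre-flight scripts. A minimal sketch, assuming the host and user from the example above:

```bash
# Pass/fail probe version of the manual check (-tAc prints the bare value).
if psql -h prod-db.example.com -U app -tAc "SELECT 1" | grep -qx '1'; then
  echo "DB connection OK"
else
  echo "DB connection FAILED" >&2
  exit 1
fi
```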
### Use Checklists
```markdown
## Pre-Deployment Checklist

- [ ] Code reviewed and approved
- [ ] Tests passing in CI
- [ ] Database migrations tested
- [ ] Rollback plan documented
- [ ] Monitoring alerts configured
- [ ] On-call engineer notified
- [ ] Deploy window scheduled
```
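A checklist embedded in a pull request description can also gate the deploy. A minimal sketch, assuming the GitHub CLI and an illustrative PR number:

```bash
# Fail if any "- [ ]" (unchecked) items remain in the PR body (PR 123 is illustrative).
if gh pr view 123 --json body -q .body | grep -q -- '- \[ \]'; then
  echo "Unchecked checklist items remain -- do not deploy" >&2
  exit 1
fi
echo "Checklist complete"
```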
## Runbook Types

### Incident Response

````markdown
# EMERGENCY: Database Down

**Time-Sensitive:** Complete within 5 minutes

## Immediate Actions (Do First)

1. **Page database team:**
   Use PagerDuty incident: "Database outage"

2. **Notify stakeholders:**
   Post in #incidents: "Database is down. Investigating."

3. **Enable maintenance mode:**

   ```bash
   kubectl set env deployment/api-server MAINTENANCE_MODE=true
   ```

## Ongoing Communication

Update every 10 minutes in #incidents:

> [HH:MM] Database still down. Database team investigating.
> Impact: All API requests failing. ETA: Unknown.
````
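The recurring status update can be posted programmatically so the cadence is easy to keep. A minimal sketch using a Slack incoming webhook (the webhook URL is assumed to be provisioned separately):

```bash
# Post a status update to #incidents via an incoming webhook (URL is an assumption).
curl -X POST -H 'Content-type: application/json' \
  --data '{"text":"['"$(date +%H:%M)"'] Database still down. Database team investigating. Impact: All API requests failing. ETA: Unknown."}' \
  "$SLACK_WEBHOOK_URL"
```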
### Routine Maintenance
````markdown
# Monthly Database Maintenance

**Schedule:** First Sunday of each month, 2 AM UTC
**Duration:** ~2 hours
**Downtime:** None (rolling maintenance)

## Pre-Maintenance

**1 week before:**
- [ ] Announce maintenance window in #engineering
- [ ] Create calendar event
- [ ] Prepare rollback plan

**1 day before:**
- [ ] Verify backup is recent (< 24 hours)
- [ ] Test backup restoration (staging)
- [ ] Confirm on-call coverage

## Maintenance Steps

1. **Take snapshot:**

   ```bash
   aws rds create-db-snapshot \
     --db-instance-identifier prod-db \
     --db-snapshot-identifier maintenance-$(date +%Y%m%d)
   ```

2. **Apply updates:** Follow the RDS console update wizard.

3. **Monitor reboot:** Watch CloudWatch metrics for 15 minutes.
````
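Before applying updates, it is worth blocking until the snapshot is actually usable. A minimal sketch using the AWS CLI waiter (identifier matches the snapshot created above):

```bash
# Block until the snapshot reaches the "available" state before proceeding.
aws rds wait db-snapshot-available \
  --db-snapshot-identifier "maintenance-$(date +%Y%m%d)"
```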
### Knowledge Transfer
```markdown
# New Engineer Onboarding: Deploy Process

## Learning Objectives

After completing this runbook, you will:
- Understand our deployment pipeline
- Know how to deploy to staging
- Know how to roll back a bad deploy

## Step 1: Understanding the Pipeline

Our deployment flow:

GitHub → CI (GitHub Actions) → ArgoCD → Kubernetes

**Exercise:** Review a recent deploy in ArgoCD.

## Step 2: Deploying to Staging

**Shadowing:** Watch a senior engineer deploy first.

1. Create feature branch
2. Open pull request
3. Wait for CI to pass
4. Merge to main
5. Watch ArgoCD sync

**Practice:** Deploy your own change to staging.

## Step 3: Deploying to Production

**Requirements before production deploy:**
- [ ] Completed 3+ staging deploys
- [ ] Reviewed with team lead
- [ ] Read incident response runbook

**Hands-on:** Pair with team lead for first production deploy.

## Certification

- [ ] Completed 5 staging deploys
- [ ] Completed 1 production deploy (supervised)
- [ ] Can explain rollback procedure

**Certified by:** _________________ Date: _______
```
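For the ArgoCD review exercise, the CLI is an alternative to the UI. A minimal sketch, assuming a logged-in `argocd` CLI and an illustrative application name:

```bash
argocd app get api-server       # current sync status, health, and target revision
argocd app history api-server   # recent deployment revisions, useful for the rollback discussion
```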
## Common Pitfalls

### Don't Skip Context

````markdown
# Bad: No context

## Fix the API

Run this command:

```bash
kubectl delete pod api-server-abc
```

# Good: Explains when, why, and what to expect

## Fix the API

**When to use:** API returning 500 errors, logs show memory leak.

**Why this works:** Deletes the pod; Kubernetes creates a fresh pod with clean state.

```bash
kubectl delete pod api-server-abc
```

**What to expect:** ~30 seconds downtime while the new pod starts.
````
### Don't Skip Validation Steps
````markdown
# Bad: No validation

## Deploy New Version

1. Update image tag
2. Apply changes
3. Done!

# Good: Include validation

## Deploy New Version

1. Update image tag in deployment.yaml

2. Apply changes:

   ```bash
   kubectl apply -f deployment.yaml
   ```

3. Verify rollout:

   ```bash
   kubectl rollout status deployment/api-server
   ```

   Wait for "successfully rolled out".

4. Check pod health:

   ```bash
   kubectl get pods -l app=api-server
   ```

   All pods should show STATUS: Running.

5. Verify endpoints:

   ```bash
   curl https://api.example.com/health
   ```

   Should return 200 OK.
````
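The validation steps above can also run as a single gate script after every deploy. A minimal sketch, assuming the deployment name and health endpoint from the example:

```bash
#!/usr/bin/env bash
set -euo pipefail
# Post-deploy gate: the rollout must finish and the health endpoint must answer.
kubectl rollout status deployment/api-server --timeout=120s
for attempt in 1 2 3; do
  if curl -fsS https://api.example.com/health > /dev/null; then
    echo "Deploy validated"
    exit 0
  fi
  sleep 5
done
echo "Health check failed after deploy" >&2
exit 1
```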
### Don't Assume Knowledge
````markdown
# Bad: Uses jargon without explanation

## Fix Pod

Check the ingress and make sure the service mesh is working.

# Good: Explains terms

## Fix Pod Connectivity

1. **Check ingress** (the load balancer that routes traffic to pods):

   ```bash
   kubectl get ingress -n production
   ```

2. **Verify service mesh** (Istio, which manages pod-to-pod communication):

   ```bash
   kubectl get pods -n istio-system
   ```

   All Istio control plane pods should be Running.
````
## Related Skills
- **troubleshooting-guides**: Creating diagnostic procedures
- **incident-response**: Handling production incidents