Structured workflow for disaster recovery planning, implementation, and testing including RTO/RPO definition, DR strategy selection, and failover procedures.
This skill inherits all available tools. When active, it can use any tool Claude has access to.
name: ops-disaster-recovery
description: |
  Structured workflow for disaster recovery planning, implementation, and testing including RTO/RPO definition, DR strategy selection, and failover procedures.
trigger: |
skip_when: |
This skill defines the structured process for disaster recovery planning and testing. Use it for comprehensive DR strategy development and validation.
| Phase | Focus | Output |
|---|---|---|
| 1. Business Impact | Define criticality and requirements | BIA document |
| 2. Strategy Selection | Choose appropriate DR strategy | DR strategy |
| 3. Architecture Design | Design DR infrastructure | DR architecture |
| 4. Runbook Development | Document failover procedures | DR runbooks |
| 5. Testing | Validate DR capabilities | Test report |
| 6. Maintenance | Keep DR current | Update schedule |
## Phase 1: Business Impact

Classify services by business criticality:
| Tier | Definition | RTO | RPO | Example Services |
|---|---|---|---|---|
| Tier 1 | Critical - business cannot operate | <15 min | <1 min | Payment processing |
| Tier 2 | Important - significant impact | <1 hour | <15 min | Customer portal |
| Tier 3 | Standard - moderate impact | <4 hours | <1 hour | Internal tools |
| Tier 4 | Low - minimal impact | <24 hours | <24 hours | Dev environments |
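
The tier thresholds in the table can be encoded as a small lookup. The following is a minimal sketch, assuming the RTO target is given in whole minutes and that each tier's bound is treated as inclusive (so a 15-minute target maps to Tier 1). `tier_for_rto` is a hypothetical helper name, not part of any existing tooling.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map an RTO target (whole minutes) to a tier from
# the classification table. Bounds are treated as inclusive (assumption).
tier_for_rto() {
  local rto_min=$1
  if   [ "$rto_min" -le 15 ];  then echo 1   # Critical
  elif [ "$rto_min" -le 60 ];  then echo 2   # Important
  elif [ "$rto_min" -le 240 ]; then echo 3   # Standard
  else                              echo 4   # Low
  fi
}

tier_for_rto 45   # prints 2
```

If your organization reads the bounds as strict ("less than 15 minutes"), change `-le` to `-lt`.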
## Business Impact Analysis
**Assessment Date:** YYYY-MM-DD
**Assessed By:** [name]
### Service Classification
| Service | Business Function | Revenue Impact | Tier | RTO | RPO |
|---------|------------------|----------------|------|-----|-----|
| payment-api | Process transactions | $X,XXX/hour | 1 | 15 min | 1 min |
| customer-portal | Customer access | $XXX/hour | 2 | 1 hour | 15 min |
| admin-tools | Internal operations | $0/hour | 3 | 4 hours | 1 hour |
### Data Classification
| Data Type | Classification | Backup Frequency | Retention |
|-----------|---------------|------------------|-----------|
| Transaction data | Critical | Continuous | 7 years |
| Customer data | Important | Hourly | 3 years |
| Application logs | Standard | Daily | 90 days |
### Dependencies
| Service | Dependencies | DR Impact |
|---------|--------------|-----------|
| payment-api | Database, payment-gateway | All must fail over together |
| customer-portal | Database, auth-service | Sequential failover possible |
## Phase 2: Strategy Selection

| Strategy | RTO | RPO | Cost | Complexity | Best For |
|---|---|---|---|---|---|
| Backup & Restore | Hours | Hours | $ | Low | Tier 4 services |
| Pilot Light | 30-60 min | Minutes | $$ | Medium | Tier 3 services |
| Warm Standby | 10-30 min | Seconds-Minutes | $$$ | Medium-High | Tier 2 services |
| Hot Standby | <10 min | Seconds | $$$$ | High | Tier 1 services |
| Multi-Active | Near-zero | Near-zero | $$$$$ | Very High | Ultra-critical |
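
As a starting point, the "Best For" column can be read as a default mapping from tier to strategy. The sketch below simply encodes that column. `strategy_for_tier` is a hypothetical name, and Multi-Active is left out because the table ties it to "ultra-critical" services rather than to a numbered tier.

```bash
#!/usr/bin/env bash
# Hypothetical helper: default DR strategy per tier, taken from the
# "Best For" column of the strategy table.
strategy_for_tier() {
  case "$1" in
    1) echo "Hot Standby" ;;
    2) echo "Warm Standby" ;;
    3) echo "Pilot Light" ;;
    4) echo "Backup & Restore" ;;
    *) echo "unknown tier: $1" >&2; return 1 ;;
  esac
}

strategy_for_tier 2   # prints Warm Standby
```

Treat the result as a default to challenge during strategy selection, since budget and compliance requirements may justify a different choice.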
## DR Strategy Selection
### Requirements Summary
| Requirement | Value |
|-------------|-------|
| Target RTO | [X minutes/hours] |
| Target RPO | [X minutes/hours] |
| Budget | $[X,XXX]/month for DR |
| Compliance | [frameworks] |
### Strategy Decision
**Selected Strategy:** [Pilot Light / Warm Standby / Hot Standby]
**Rationale:**
1. RTO requirement of [X] achieved by [strategy]
2. RPO requirement of [X] achieved with [replication method]
3. Budget of $[X]/month supports [strategy] (~XX% of production cost)
4. Compliance requirement for [X] met with [features]
### Trade-offs Accepted
| Trade-off | Impact | Mitigation |
|-----------|--------|------------|
| Higher DR cost | +$X/month | Justified by RTO requirement |
| Manual failover steps | 5-10 min added | Automation planned Q2 |
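
The rationale template asks for DR spend as a percentage of production cost. A minimal sketch of that arithmetic follows, assuming both costs are whole-dollar monthly figures. `dr_cost_percent` is a hypothetical helper and the numbers in the usage line are illustrative only.

```bash
#!/usr/bin/env bash
# Hypothetical helper: DR monthly cost as a percentage of production
# monthly cost, using integer arithmetic (halves round up).
dr_cost_percent() {
  local dr_monthly=$1 prod_monthly=$2
  if [ "$prod_monthly" -le 0 ]; then
    echo "production cost must be positive" >&2
    return 1
  fi
  echo $(( (dr_monthly * 100 + prod_monthly / 2) / prod_monthly ))
}

dr_cost_percent 2500 10000   # prints 25
```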
## Phase 3: Architecture Design

| Component | Primary | DR | Replication |
|---|---|---|---|
| DNS | Route53 | Route53 | Global service |
| Load Balancer | ALB (us-east-1) | ALB (us-west-2) | Configuration sync |
| Compute | EKS (us-east-1) | EKS (us-west-2) | GitOps deployment |
| Database | Aurora (us-east-1) | Aurora Global (us-west-2) | Async replication |
| Storage | S3 (us-east-1) | S3 (us-west-2) | Cross-region replication |
| Secrets | Secrets Manager | Secrets Manager | Manual sync |

    Primary Region (us-east-1)          DR Region (us-west-2)
    ┌─────────────────────────┐         ┌─────────────────────────┐
    │                         │         │                         │
    │   ┌─────────────────┐   │         │   ┌─────────────────┐   │
    │   │       ALB       │   │         │   │       ALB       │   │
    │   └────────┬────────┘   │         │   └────────┬────────┘   │
    │            │            │         │            │ (standby)  │
    │   ┌────────┴────────┐   │         │   ┌────────┴────────┐   │
    │   │   EKS Cluster   │   │         │   │   EKS Cluster   │   │
    │   │    (Active)     │   │         │   │    (Standby)    │   │
    │   └────────┬────────┘   │         │   └────────┬────────┘   │
    │            │            │         │            │            │
    │   ┌────────┴────────┐   │  async  │   ┌────────┴────────┐   │
    │   │     Aurora      │───┼────────►│   │     Aurora      │   │
    │   │    (Primary)    │   │         │   │    (Replica)    │   │
    │   └─────────────────┘   │         │   └─────────────────┘   │
    │                         │         │                         │
    └─────────────────────────┘         └─────────────────────────┘
                 │                                   │
                 └─────────────────┬─────────────────┘
                                   │
                            ┌──────┴──────┐
                            │   Route53   │
                            │  (Global)   │
                            └─────────────┘

## Phase 4: Runbook Development

## Failover Runbook: [Service Name]
**Version:** 1.0
**Last Updated:** YYYY-MM-DD
**Owner:** [team]
### Pre-Conditions
- [ ] DR region healthy (check dashboard)
- [ ] Replication lag <[X seconds/minutes]
- [ ] On-call personnel available
- [ ] Communication channels ready
### Failover Decision Criteria
| Criteria | Automatic | Manual |
|----------|-----------|--------|
| Primary region unavailable >5 min | Yes | - |
| Replication lag >15 min | - | Yes |
| Data corruption detected | - | Yes |
| Planned maintenance | - | Yes |
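
The decision criteria can be expressed as a small function, which makes the automatic/manual split easy to review and test. This is a sketch under stated assumptions: inputs are whole minutes plus `yes`/`no` flags, and `failover_mode` and its argument order are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: apply the failover decision criteria.
# Args: primary_down_min lag_min corruption(yes|no) planned(yes|no)
failover_mode() {
  local primary_down_min=$1 lag_min=$2 corruption=$3 planned=$4
  if [ "$primary_down_min" -gt 5 ]; then
    echo automatic   # primary region unavailable >5 min
  elif [ "$lag_min" -gt 15 ] || [ "$corruption" = yes ] || [ "$planned" = yes ]; then
    echo manual      # requires a human decision
  else
    echo none        # no failover criterion met
  fi
}

failover_mode 0 20 no no   # prints manual
```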
### Failover Steps
1. **Verify DR Readiness** (2 min)
   ```bash
   # Check DR database status
   aws rds describe-db-clusters --region us-west-2
   # Check EKS cluster status
   kubectl --context=dr get nodes
   ```

2. **Stop Writes to Primary** (1 min)
   ```bash
   # Scale down primary services
   kubectl --context=primary scale deployment/api --replicas=0
   ```

3. **Promote DR Database** (5 min)
   ```bash
   # Promote Aurora replica (identifiers are placeholders)
   aws rds failover-global-cluster \
     --global-cluster-identifier my-global-cluster \
     --target-db-cluster-identifier dr-cluster
   ```

4. **Activate DR Services** (2 min)
   ```bash
   # Scale up DR services
   kubectl --context=dr scale deployment/api --replicas=10
   ```

5. **Update DNS** (1-5 min propagation)
   ```bash
   # Assuming xxx is the primary endpoint's health check: invert it so
   # Route53 reports it unhealthy and failover routing shifts to DR.
   # Note: --disabled would make Route53 treat the check as healthy.
   aws route53 update-health-check \
     --health-check-id xxx \
     --inverted
   ```

6. **Verify Service** (5 min)
   ```bash
   # Health check (-f fails on HTTP errors)
   curl -fsS https://api.example.com/health
   # Synthetic transaction
   ./scripts/synthetic-test.sh
   ```

### Rollback

[If failover causes issues, steps to return to primary]

### Communication

**Internal:**

DR failover initiated for [service] at [time UTC]. Estimated completion: [X minutes]. IC: [name]

**External (if customer-facing):**

We are currently experiencing issues with [service]. Our team is working to restore service. Status page: [url]
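
Runbook steps are easier to rehearse, and their durations easier to compare against the RTO, when each command goes through a small wrapper. The sketch below is one possible shape, not an established tool: `run_step` and the `DRY_RUN` variable are hypothetical, and the `kubectl` commands are the ones from the failover steps.

```bash
#!/usr/bin/env bash
# Hypothetical step runner. DRY_RUN=1 (the default here) only prints
# each command; set DRY_RUN=0 to execute it.
DRY_RUN=${DRY_RUN:-1}

run_step() {
  local name=$1 start
  shift
  start=$(date +%s)
  echo "==> ${name}"
  if [ "$DRY_RUN" = 1 ]; then
    echo "DRY RUN: $*"
  elif ! "$@"; then
    echo "FAILED: ${name}" >&2
    return 1
  fi
  echo "    finished in $(( $(date +%s) - start ))s"
}

# Example: two runbook steps, printed rather than executed.
run_step "Verify DR readiness" kubectl --context=dr get nodes
run_step "Stop writes to primary" \
  kubectl --context=primary scale deployment/api --replicas=0
```

Defaulting to dry-run is a deliberate choice for a tabletop or component test: the operator sees the exact commands before anything touches production.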
---
## Phase 5: Testing
### DR Test Types
| Test Type | Frequency | Scope | Impact |
|-----------|-----------|-------|--------|
| **Tabletop** | Quarterly | Full scenario walkthrough | None |
| **Component** | Monthly | Individual component failover | Minimal |
| **Partial** | Quarterly | Non-production failover | Low |
| **Full** | Annually | Production failover | Moderate |
### DR Test Template
```markdown
## DR Test Report
**Test Date:** YYYY-MM-DD
**Test Type:** [Tabletop/Component/Partial/Full]
**Scope:** [services tested]
### Test Objectives
1. Validate RTO of <[X minutes]
2. Validate RPO of <[X minutes]
3. Verify runbook accuracy
4. Identify gaps in DR readiness
### Test Results
| Metric | Target | Actual | Status |
|--------|--------|--------|--------|
| RTO | 15 min | 13 min | PASS |
| RPO | 1 min | 45 sec | PASS |
| Data integrity | 100% | 100% | PASS |
| Runbook accuracy | 100% | 85% | PARTIAL |
### Timeline
| Time | Action | Status |
|------|--------|--------|
| 10:00 | Test initiated | OK |
| 10:02 | Primary shutdown simulated | OK |
| 10:08 | DR database promoted | OK |
| 10:12 | DR services activated | OK |
| 10:15 | Service verified | OK |
### Issues Found
| Issue | Severity | Action Required |
|-------|----------|-----------------|
| Step 4 command incorrect | Medium | Update runbook |
| DNS propagation slower | Low | Reduce TTL |
### Lessons Learned
1. [Lesson 1]
2. [Lesson 2]
### Action Items
| Item | Owner | Due Date |
|------|-------|----------|
| Update runbook step 4 | @ops | YYYY-MM-DD |
| Reduce DNS TTL | @platform | YYYY-MM-DD |
```
## Phase 6: Maintenance

| Activity | Frequency | Owner |
|---|---|---|
| Runbook review | Quarterly | Platform team |
| DR test | Per test schedule | SRE team |
| Replication monitoring | Daily (automated) | Monitoring |
| Cost review | Monthly | FinOps |
| Architecture review | Annually | Architecture team |
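
Automated replication monitoring usually reduces to comparing observed lag against the RPO. A minimal sketch follows, assuming both values are in whole seconds. `lag_status` is a hypothetical helper, and the warning threshold at 50% of the RPO is an assumption, not something this workflow prescribes.

```bash
#!/usr/bin/env bash
# Hypothetical helper: classify replication lag against the RPO.
# Args: lag_seconds rpo_seconds
lag_status() {
  local lag_s=$1 rpo_s=$2
  if   [ "$lag_s" -ge "$rpo_s" ];          then echo BREACH
  elif [ $(( lag_s * 2 )) -ge "$rpo_s" ];  then echo WARN   # assumed: 50% of RPO
  else                                          echo OK
  fi
}

lag_status 10 60   # prints OK
```

How the lag value is obtained depends on the database and monitoring stack, so it is left outside this sketch.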

Common rationalizations to reject:

| Rationalization | Why It's WRONG | Required Action |
|---|---|---|
| "DR can be added later" | DR added later is rarely tested | DR is a day-1 requirement |
| "Backups are good enough" | Backups != DR. RTO is hours vs minutes. | Design proper DR strategy |
| "Too expensive for DR" | DR cost << outage cost | Calculate business impact |
| "We'll figure it out during incident" | Panic != good decisions | Document runbooks NOW |
| "Tested last year, still good" | Systems change constantly | Test regularly |
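
The "too expensive" objection is easiest to answer with numbers. The sketch below estimates direct revenue loss for an outage and the yearly outage time at which DR spend breaks even. Both helpers are hypothetical, all figures are illustrative, and indirect costs (SLA penalties, reputation) are deliberately ignored.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for a rough DR cost comparison (whole dollars).

# Direct revenue lost during an outage of the given length.
outage_cost() {
  local revenue_per_hour=$1 outage_min=$2
  echo $(( revenue_per_hour * outage_min / 60 ))
}

# Minutes of outage per year whose revenue loss equals annual DR spend.
breakeven_outage_min() {
  local dr_monthly=$1 revenue_per_hour=$2
  echo $(( dr_monthly * 12 * 60 / revenue_per_hour ))
}

outage_cost 5000 240             # prints 20000
breakeven_outage_min 2000 5000   # prints 288
```

With these illustrative inputs, a $2,000/month DR budget is covered once it prevents about 288 minutes of outage per year.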

For DR planning tasks, dispatch:

    Task tool:
      subagent_type: "infrastructure-architect"
      model: "opus"
      prompt: |
        DR PLANNING REQUEST
        Services: [services requiring DR]
        RTO Requirement: [target]
        RPO Requirement: [target]
        Current State: [existing DR if any]
        REQUEST: [design/review/test planning]