From agenticops
Automates incident response for CloudWatch/Prometheus alarms: classifies severity, retrieves runbooks, generates hypotheses, runs MCP diagnostics, and executes human-approved remediations. Pages on-call for SEV1.
Slash command: /agenticops:incident-response
Model: claude-opus-4-7
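- When CloudWatch Alarm or Prometheus AlertManager fires a threshold-breach alarm
- Right after autopilot-deploy's circuit breaker trips and a rollback runs
- When the continuous-eval regression gate fails twice in a row
- When /incident-response <alarm-id> is invoked to request root cause analysis

Excluded uses:

- … @latest is prohibited; PyPI versions must be pinned).
- Runbooks are kept in .omao/plans/runbooks/ as ${symptom}.md.
- Access to autopilot-deploy's state files (.omao/state/autopilot-deploy/), used to freeze deployments.

Incoming alarms are classified immediately according to the following criteria.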
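| Severity | Criteria | Human involvement |
|---|---|---|
| SEV1 | Positive toxicity/PII leakage detection, data leak, full production outage, errors on 30% or more of traffic | Page on-call immediately. The agent only diagnoses and does not execute remediation. |
| SEV2 | Partial service outage, P99 latency increase of 2× or more, circuit breaker trip, outage in a specific region | The agent prepares a drafted response, which runs only after human approval. |
| SEV3 | Quality regression (e.g., faithfulness -5pp), cost spike, rising error rate in a single agent | The agent prepares a drafted response, which runs only after human approval. |
| SEV4 | Warning-level signals (e.g., log volume increase, 15% rise in token usage) | The agent only generates a report and adds it to the weekly review queue. |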
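The table above can be read as a small policy lookup. The following is a minimal sketch of that mapping; the function name and the action labels (`page-oncall`, `draft-remediation`, and so on) are hypothetical names chosen for illustration, not identifiers defined by the skill.

```bash
# Map a severity level to the handling policy from the table above.
# Action labels are illustrative only (assumption, not part of the skill).
action_for_severity() {
  case "$1" in
    SEV1)      echo "page-oncall diagnose-only" ;;          # no remediation by the agent
    SEV2|SEV3) echo "draft-remediation await-approval" ;;   # human approves before execution
    SEV4)      echo "report-only weekly-review" ;;          # queued for weekly review
    *)         echo "unknown"; return 1 ;;                  # caller decides how to handle
  esac
}
```

A caller could branch on the first word of the output, for example to skip drafting entirely when the result starts with `page-oncall`.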
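When an alarm arrives, parse its payload to determine the severity.

```bash
ALARM_ID="$1"
ALARM=$(aws cloudwatch describe-alarms --alarm-names "$ALARM_ID" --query 'MetricAlarms[0]' --output json)
# Or via MCP
# mcp__cloudwatch__get_alarm --name "$ALARM_ID"

# describe-alarms does not return tags, so look them up by alarm ARN.
ALARM_ARN=$(jq -r '.AlarmArn' <<< "$ALARM")
SEVERITY=$(aws cloudwatch list-tags-for-resource --resource-arn "$ALARM_ARN" \
  --query "Tags[?Key=='severity'].Value | [0]" --output text)
```

If the severity cannot be determined, treat the alarm as SEV3 by default and request human confirmation.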
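The SEV3 fallback described above could be sketched as follows. The function name and the `NEEDS_CONFIRMATION` flag are assumptions made for this example; the skill itself does not specify how the confirmation request is signalled.

```bash
# Normalize a raw severity tag. Anything that is not SEV1..SEV4
# (empty, "None", a typo) falls back to SEV3 and returns non-zero
# so the caller can ask a human to confirm.
normalize_severity() {
  raw=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')
  case "$raw" in
    SEV1|SEV2|SEV3|SEV4) printf '%s\n' "$raw" ;;
    *) printf 'SEV3\n'; return 1 ;;
  esac
}

# NEEDS_CONFIRMATION is a hypothetical flag for the human-confirmation step.
SEVERITY=$(normalize_severity "${SEVERITY:-}") || NEEDS_CONFIRMATION=1
```

Returning the default on stdout and the uncertainty through the exit status keeps the two concerns separate: the pipeline always has a usable severity, and the confirmation request is never silently dropped.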
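Search .omao/plans/runbooks/ for a matching runbook based on the symptom keyword.

```bash
# Lowercase first; otherwise the sed class below turns every uppercase letter into "-".
SYMPTOM=$(jq -r '.AlarmDescription' <<< "$ALARM" | tr '[:upper:]' '[:lower:]' | sed 's/[^a-z0-9-]/-/g')
RUNBOOK=$(ls .omao/plans/runbooks/*.md 2>/dev/null | grep -iF -- "$SYMPTOM" | head -1)
if [ -z "$RUNBOOK" ]; then
  echo "No matching runbook. Proceeding with generic diagnostic flow."
fi
```

If a runbook exists, follow its steps; otherwise proceed to the Step 3 generic diagnostic flow.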
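Because runbooks are stored as `${symptom}.md`, the lookup can also be done as an exact filename match rather than a substring search. This is a sketch under that assumption; `slugify`, `find_runbook`, and the `RUNBOOK_DIR` override are hypothetical names introduced here.

```bash
# Turn a free-text symptom into a lowercase, hyphen-separated slug.
slugify() {
  printf '%s' "$1" | tr '[:upper:]' '[:lower:]' |
    sed -e 's/[^a-z0-9]\{1,\}/-/g' -e 's/^-//' -e 's/-$//'
}

# Print the runbook path for a symptom, or return non-zero when none exists.
# RUNBOOK_DIR is a hypothetical override; the default is the path from the text.
find_runbook() {
  dir=${RUNBOOK_DIR:-.omao/plans/runbooks}
  path="$dir/$(slugify "$1").md"
  [ -f "$path" ] && printf '%s\n' "$path"
}
```

For example, `find_runbook "RAG QA Error Spike"` would look for `rag-qa-error-spike.md`, and a non-zero exit status signals that the generic diagnostic flow should be used.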
Generate 3 to 5 hypotheses from the runbook's "Possible Causes" or from generic rules. Each hypothesis must be paired with an MCP query that can diagnose it.

```json
{
  "hypotheses": [
    {
      "id": "H1",
      "claim": "Retrieval index outdated after 2026-04-20 reindex job",
      "diagnostic_query": "cloudwatch: /aws/lambda/reindex-job last 24h",
      "confidence_prior": 0.4
    },
    {
      "id": "H2",
      "claim": "New model version v2.3.1 introduced context window truncation",
      "diagnostic_query": "prometheus: agent_context_truncation_total{version='v2.3.1'}",
      "confidence_prior": 0.3
    },
    {
      "id": "H3",
      "claim": "Vector DB (Milvus) slow query due to compaction backlog",
      "diagnostic_query": "prometheus: milvus_compaction_queue_length",
      "confidence_prior": 0.3
    }
  ]
}
```
Run the MCP query for each hypothesis in parallel.

```bash
# Hypothesis H1: reindex job health (date -d requires GNU date)
mcp__cloudwatch__filter_log_events \
  --log-group /aws/lambda/reindex-job \
  --start-time "$(date -u -d '-24 hours' +%s)000" \
  --filter-pattern "ERROR"

# Hypothesis H2: context truncation
mcp__prometheus__query_range \
  --query 'rate(agent_context_truncation_total{version="v2.3.1"}[5m])' \
  --start "$(date -u -d '-6 hours' +%s)" \
  --end "$(date -u +%s)"

# Hypothesis H3: Milvus compaction
mcp__prometheus__query \
  --query 'milvus_compaction_queue_length'
```
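The queries above are listed one after another; if the diagnostics are available as ordinary shell commands, running them in parallel could look like the sketch below. The `run_parallel` helper and its `ID=command` argument format are assumptions for illustration, and how MCP tools are actually invoked depends on the host.

```bash
# Run each "ID=command" spec in the background, capture output per
# hypothesis, then wait for all of them. Succeeds only if every command did.
run_parallel() {
  out_dir=$1; shift
  mkdir -p "$out_dir"
  pids=""
  for spec in "$@"; do
    id=${spec%%=*}    # text before the first "="
    cmd=${spec#*=}    # text after the first "="
    sh -c "$cmd" > "$out_dir/$id.out" 2> "$out_dir/$id.err" &
    pids="$pids $!"
  done
  failed=0
  for pid in $pids; do
    wait "$pid" || failed=$((failed + 1))
  done
  [ "$failed" -eq 0 ]
}

# Example (placeholder commands):
# run_parallel diagnostic-results 'H1=echo reindex-check' 'H3=echo milvus-check'
```

Each hypothesis gets its own `.out` and `.err` file, so a slow or failing query does not hide the results of the others.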
Update each hypothesis's posterior confidence based on the results, and take the highest-confidence hypothesis as the root cause candidate.
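The text does not specify the update rule. One common choice is a Bayesian update, where each posterior is proportional to prior times likelihood and the results are normalized to sum to 1. The sketch below assumes that rule; the likelihood values are made up for illustration and would in practice come from scoring the diagnostic results.

```bash
# Input lines:  <id> <prior> <likelihood>
# Output lines: <id> <posterior>, highest posterior first.
# posterior_i = prior_i * likelihood_i / sum_j(prior_j * likelihood_j)
update_posteriors() {
  LC_ALL=C awk '
    { id[NR] = $1; w[NR] = $2 * $3; total += w[NR] }
    END {
      if (total > 0)
        for (i = 1; i <= NR; i++) printf "%s %.2f\n", id[i], w[i] / total
    }' | LC_ALL=C sort -k2,2nr
}

# Example with the priors from the hypotheses above and illustrative likelihoods:
# printf 'H1 0.4 0.1\nH2 0.3 0.2\nH3 0.3 0.9\n' | update_posteriors
# H3 0.73
# H2 0.16
# H1 0.11
```

The first output line is the root cause candidate. If every likelihood is zero, the function prints nothing, which a caller can treat as "no hypothesis supported".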
For SEV2/3, write the remediation commands to .omao/state/incident/${id}/remediation.sh as a draft. They are not executed automatically.

```bash
INCIDENT_DIR=.omao/state/incident/sev2-20260421-1023
mkdir -p "$INCIDENT_DIR"   # the redirect below fails if the directory is missing

cat > "$INCIDENT_DIR/remediation.sh" <<'EOF'
#!/bin/bash
# Proposed remediation for SEV2 incident
# Root cause: Milvus compaction backlog (H3 confidence 0.82)
# Reviewer: please approve before execution
kubectl -n milvus exec milvus-proxy-0 -- milvus-cli \
  --command "compact -collection=agent_kb"
# Verify
kubectl -n milvus exec milvus-proxy-0 -- milvus-cli \
  --command "describe -collection=agent_kb" | grep "compaction_state"
EOF

echo "Drafted remediation at $INCIDENT_DIR/remediation.sh"
echo "Approve via: gh issue comment <issue-id> --body '/approve-remediation'"
```
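How the approval comment is detected is not described. As one possible sketch, a gate could read the issue's comment bodies on stdin and run the drafted script only when a line is exactly `/approve-remediation`. The function names are hypothetical, and the check does not verify who posted the comment, which a real gate would need to do.

```bash
# Succeeds only if some input line is exactly the approval command.
is_approved() {
  grep -qx '/approve-remediation'
}

# Read comment bodies from stdin; execute the drafted script only when approved.
run_if_approved() {
  script=$1
  if is_approved; then
    bash "$script"
  else
    echo "Not approved; refusing to run $script" >&2
    return 1
  fi
}

# Possible wiring with the GitHub CLI (ISSUE_ID is an assumed variable):
# gh issue view "$ISSUE_ID" --json comments --jq '.comments[].body' |
#   run_if_approved .omao/state/incident/sev2-20260421-1023/remediation.sh
```

Matching the whole line avoids treating a sentence that merely mentions the command as an approval.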
For SEV1:

```bash
# Page on-call immediately
curl -X POST "$PAGERDUTY_INCIDENT_URL" \
  -H "Authorization: Token token=$PD_TOKEN" \
  -H "Content-Type: application/json" \
  -d "$(jq -n --arg desc "$SEVERITY $SYMPTOM" \
    '{incident:{type:"incident",title:$desc,service:{id:"PXXXXX",type:"service_reference"},urgency:"high"}}')"

# Freeze autopilot-deploy
mkdir -p .omao/state/autopilot-deploy
echo '{"circuit_breaker_status":"tripped","reason":"SEV1 incident"}' \
  > .omao/state/autopilot-deploy/freeze.json
```
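On the consuming side, a deploy step would need to check the freeze file before proceeding. How autopilot-deploy actually reads this file is not shown, so the following is only a sketch: the function name and the `FREEZE_FILE` override are hypothetical, and the string match assumes the compact JSON form written above.

```bash
# Succeeds when the freeze file exists and marks the circuit breaker as tripped.
# Assumes the compact JSON written by the SEV1 step (no spaces around ":").
is_deploy_frozen() {
  freeze_file=${FREEZE_FILE:-.omao/state/autopilot-deploy/freeze.json}
  [ -f "$freeze_file" ] &&
    grep -q '"circuit_breaker_status":"tripped"' "$freeze_file"
}

# Example guard at the top of a deploy script:
# if is_deploy_frozen; then echo "Deploy frozen; aborting." >&2; exit 1; fi
```

A JSON-aware check (for example with `jq`) would be more robust if the file may be reformatted by other tools.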
Every incident is recorded under the .omao/state/incident/${severity}-${timestamp}/ directory.

- timeline.jsonl: start and completion timestamps for each step
- hypotheses.json: generated hypotheses and their posterior confidence
- diagnostic-results/: raw MCP query results
- remediation.sh: drafted recovery commands (SEV2/3)
- postmortem-draft.md: post-mortem draft generated automatically after the incident closes

Input: /incident-response rag-qa-error-rate-spike
Output (SEV2):
```
[12:35Z] Received alarm: rag-qa-error-rate-spike
[12:35Z] Severity: SEV2 (error rate 2.1× baseline for 5m)
[12:35Z] Runbook match: .omao/plans/runbooks/rag-qa-error-spike.md
[12:36Z] Generated 3 hypotheses
[12:38Z] Diagnostic MCP queries complete
[12:38Z] Root cause candidate: H3 (Milvus compaction backlog, confidence=0.82)
[12:39Z] Drafted remediation at .omao/state/incident/sev2-20260421-1235/remediation.sh
[12:39Z] AWAITING HUMAN APPROVAL. autopilot-deploy frozen.
```
Output (SEV1):
```
[14:02Z] Received alarm: pii-leak-detected
[14:02Z] Severity: SEV1 (PII token found in agent response)
[14:02Z] On-call paged: PagerDuty incident P-A1B2C3
[14:02Z] autopilot-deploy frozen for all agents
[14:02Z] Agent diagnosis continuing; no remediation will be auto-drafted for SEV1
[14:05Z] Diagnostic complete: see .omao/state/incident/sev1-20260421-1402/
[14:05Z] Human responder in control. Agent awaiting /release-sev1 command.
```
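The step timestamps in these outputs correspond to entries in the incident's timeline.jsonl. A minimal sketch of appending such entries is shown here; the function name and the JSON field names (`ts`, `step`, `status`) are assumptions, since the file's schema is not specified.

```bash
# Append one JSON line per step event to <incident-dir>/timeline.jsonl.
# Field names are illustrative; values are not JSON-escaped, so keep
# step and state names to plain identifiers.
log_step() {
  dir=$1 step=$2 state=$3
  mkdir -p "$dir"
  printf '{"ts":"%s","step":"%s","status":"%s"}\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$step" "$state" >> "$dir/timeline.jsonl"
}

# Example:
# log_step .omao/state/incident/sev2-20260421-1235 classify started
# log_step .omao/state/incident/sev2-20260421-1235 classify completed
```

Appending one self-contained line per event means a crash mid-incident still leaves a readable partial timeline.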
npx claudepluginhub aws-samples/sample-oh-my-aidlcops --plugin agenticops