Monitoring strategies, distributed tracing, SLI/SLO design, and alerting patterns. Use when designing monitoring infrastructure, defining service level objectives, implementing distributed tracing, creating alert rules, building dashboards, or establishing incident response procedures. Covers the three pillars of observability and production readiness.
Install with `/plugin marketplace add rsmdt/the-startup`, then `/plugin install team@the-startup`.

This skill inherits all available tools. When active, it can use any tool Claude has access to.
Reference: references/monitoring-patterns.md

You cannot fix what you cannot see. Observability is not about collecting data - it is about answering questions you have not thought to ask yet. Good observability turns every incident into a learning opportunity and every metric into actionable insight.
Metrics: Numeric measurements aggregated over time. Best for understanding system behavior at scale.
Characteristics:
Types:
| Type | Use Case | Example |
|---|---|---|
| Counter | Cumulative values that only increase | Total requests, errors, bytes sent |
| Gauge | Values that go up and down | Current memory, active connections |
| Histogram | Distribution of values in buckets | Request latency, payload sizes |
| Summary | Similar to histogram, calculated client-side | Pre-computed percentiles |
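The four types map directly onto most client libraries. A minimal sketch using the Python `prometheus_client` package; metric names, labels, and buckets here are illustrative, not prescribed by this skill:

```python
from prometheus_client import Counter, Gauge, Histogram, Summary, start_http_server

# Counter: cumulative, only increases (resets on process restart)
REQUESTS_TOTAL = Counter("http_requests_total", "Total HTTP requests", ["method", "status"])

# Gauge: tracks a current value that goes up and down
ACTIVE_CONNECTIONS = Gauge("active_connections", "Currently open connections")

# Histogram: observations counted into buckets, enabling percentile queries server-side
REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds", "Request latency",
    buckets=[0.05, 0.1, 0.2, 0.5, 1.0, 2.0],
)

# Summary: client-side aggregation of observations
PAYLOAD_SIZE = Summary("response_payload_bytes", "Response payload size")

def handle_request():
    ACTIVE_CONNECTIONS.inc()
    with REQUEST_LATENCY.time():           # observes elapsed seconds on exit
        body = b"ok"                        # ... real work here ...
    PAYLOAD_SIZE.observe(len(body))
    REQUESTS_TOTAL.labels(method="GET", status="200").inc()
    ACTIVE_CONNECTIONS.dec()

if __name__ == "__main__":
    start_http_server(8000)                 # exposes /metrics for scraping
    handle_request()
```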
Logs: Immutable records of discrete events. Best for understanding specific occurrences.
Characteristics:
Structure:
Required fields:
- timestamp: ISO 8601 format with timezone
- level: ERROR, WARN, INFO, DEBUG
- message: Human-readable description
- service: Service identifier
- trace_id: Correlation identifier
Context fields:
- user_id: Sanitized user identifier
- request_id: Request correlation
- duration_ms: Operation timing
- error_type: Classification for errors
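A structured log entry carrying these fields can be emitted with the standard library alone. A sketch; the service name, trace ID, and field values are placeholders:

```python
import json
import logging
import sys
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON line with the required fields."""
    def format(self, record):
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
            "service": "checkout-api",                     # placeholder service identifier
            "trace_id": getattr(record, "trace_id", None),
        }
        # Context fields, included only when attached to the record
        for field in ("user_id", "request_id", "duration_ms", "error_type"):
            if hasattr(record, field):
                entry[field] = getattr(record, field)
        return json.dumps(entry)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout-api")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("payment authorized",
            extra={"trace_id": "abc123", "request_id": "req-42", "duration_ms": 87})
```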
Traces: Records of request flow across distributed systems. Best for understanding causality and latency.
Characteristics:
Components:
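A minimal parent/child span sketch using the OpenTelemetry Python API and SDK. The span names and attributes are made up, and a real setup would export to a collector rather than the console:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Minimal SDK wiring: print finished spans to stdout (swap for OTLP/Jaeger in production)
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("checkout-api")

def handle_order(order_id: str):
    # Root span: one per request; its trace_id ties all child spans together
    with tracer.start_as_current_span("handle_order") as span:
        span.set_attribute("order.id", order_id)
        # Child span: captures the latency of one downstream call
        with tracer.start_as_current_span("charge_payment"):
            pass  # ... call the payment service here ...

handle_order("order-123")
```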
Service level indicators (SLIs): Quantitative measures of service behavior from the user perspective.
Common SLI categories:
| Category | Measures | Example SLI |
|---|---|---|
| Availability | Service is responding | % of successful requests |
| Latency | Response speed | % of requests < 200ms |
| Throughput | Capacity | Requests processed per second |
| Error Rate | Correctness | % of requests without errors |
| Freshness | Data currency | % of data < 1 minute old |
SLI specification:
SLI: Request Latency
Definition: Time from request received to response sent
Measurement: Server-side histogram at p50, p95, p99
Exclusions: Health checks, internal tooling
Data source: Application metrics
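One way to compute this latency SLI from raw measurements while honoring the exclusions above. A sketch: the 200 ms threshold comes from the SLO example below, and the request record shape and excluded path are assumptions:

```python
from dataclasses import dataclass

@dataclass
class Request:
    path: str
    duration_ms: float
    internal: bool = False

EXCLUDED_PATHS = {"/healthz"}            # health checks and internal tooling are excluded

def latency_sli(requests, threshold_ms=200.0):
    """Fraction of eligible requests answered within the latency threshold."""
    eligible = [r for r in requests if r.path not in EXCLUDED_PATHS and not r.internal]
    if not eligible:
        return 1.0                        # no eligible traffic counts as fully compliant
    good = sum(1 for r in eligible if r.duration_ms < threshold_ms)
    return good / len(eligible)

sample = [Request("/checkout", 120), Request("/checkout", 350), Request("/healthz", 5)]
print(latency_sli(sample))                # 0.5: one of two eligible requests was fast enough
```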
Service level objectives (SLOs): Target reliability levels for SLIs over a time window.
SLO formula:
SLO met when: (Good events / Total events) >= Target, measured over Window
Example:
99.9% of requests complete successfully in < 200ms
measured over a 30-day rolling window
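Evaluating the formula is a single ratio check; a sketch with made-up counts:

```python
def slo_met(good_events: int, total_events: int, target: float = 0.999) -> bool:
    """True when the good/total ratio over the window meets the SLO target."""
    return total_events > 0 and (good_events / total_events) >= target

# 30-day window with 1,000,000 requests, 1,200 of them slow or failed
print(slo_met(998_800, 1_000_000))   # False: 0.9988 falls short of 0.999
```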
Setting SLO targets:
Error budget: The allowed amount of unreliability within an SLO.
Calculation:
Error Budget = 1 - SLO Target
99.9% SLO = 0.1% error budget
= 43.2 minutes downtime per 30 days
= 86.4 seconds per day
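The same arithmetic in code, converting an SLO target into allowed full-outage downtime per window (a sketch):

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for a fully-down outage within the window."""
    return (1 - slo_target) * window_days * 24 * 60

print(error_budget_minutes(0.999))    # 43.2 minutes per 30 days (~86.4 seconds per day)
print(error_budget_minutes(0.99))     # 432.0 minutes per 30 days
```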
Error budget policies:
Symptom-based alerting: Alert on user-visible symptoms, not internal causes.
Good alerts:
Poor alerts:
Burn rate alerting: Detect fast burns quickly, slow burns before budget depletion.
Configuration:
Fast burn: 14.4x burn rate over 1 hour
- Fires in 1 hour if issue persists
- Catches severe incidents quickly
Slow burn: 3x burn rate over 3 days
- Fires before 30-day budget depletes
- Catches gradual degradation
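Burn rate is the ratio of the observed error rate to the rate the SLO allows. A sketch of the threshold check behind both alerts; the multipliers follow the configuration above, and the sample error ratios are invented:

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than allowed the error budget is being spent."""
    allowed = 1 - slo_target
    return error_ratio / allowed if allowed > 0 else float("inf")

def should_alert(error_ratio: float, slo_target: float, threshold: float) -> bool:
    return burn_rate(error_ratio, slo_target) >= threshold

# 99.9% SLO: fast-burn window (1 hour) uses 14.4x, slow-burn window (3 days) uses 3x
print(should_alert(error_ratio=0.02,  slo_target=0.999, threshold=14.4))  # True: 20x burn
print(should_alert(error_ratio=0.002, slo_target=0.999, threshold=3.0))   # False: 2x burn
```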
Strategies:
Alert quality checklist:
Service Health Overview:
Deep-Dive Diagnostic:
Business Metrics:
| Panel | Purpose | Audience |
|---|---|---|
| SLO Status | Current reliability vs target | Everyone |
| Error Budget | Remaining budget and burn rate | Engineering |
| Request Rate | Traffic patterns and anomalies | Operations |
| Latency Distribution | p50, p95, p99 over time | Engineering |
| Error Breakdown | Errors by type and endpoint | Engineering |
| Dependency Health | Status of upstream services | Operations |
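For the latency distribution panel, percentiles can be derived from recorded durations. A sketch using only the standard library and a nearest-rank percentile; production dashboards would usually compute this from histogram buckets, and the sample data is made up:

```python
import math

def percentile(sorted_values, p):
    """Nearest-rank percentile of an already-sorted list (p in 0..100)."""
    if not sorted_values:
        return None
    rank = max(1, math.ceil(p / 100 * len(sorted_values)))
    return sorted_values[rank - 1]

durations_ms = sorted([12, 25, 40, 55, 80, 95, 120, 180, 450, 1200])
for p in (50, 95, 99):
    print(f"p{p}: {percentile(durations_ms, p)} ms")
```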
This skill should be used when the user asks to "create an agent", "add an agent", "write a subagent", "agent frontmatter", "when to use description", "agent examples", "agent tools", "agent colors", "autonomous agent", or needs guidance on agent structure, system prompts, triggering conditions, or agent development best practices for Claude Code plugins.
This skill should be used when the user asks to "create a slash command", "add a command", "write a custom command", "define command arguments", "use command frontmatter", "organize commands", "create command with file references", "interactive command", "use AskUserQuestion in command", or needs guidance on slash command structure, YAML frontmatter fields, dynamic arguments, bash execution in commands, user interaction patterns, or command development best practices for Claude Code.
This skill should be used when the user asks to "create a hook", "add a PreToolUse/PostToolUse/Stop hook", "validate tool use", "implement prompt-based hooks", "use ${CLAUDE_PLUGIN_ROOT}", "set up event-driven automation", "block dangerous commands", or mentions hook events (PreToolUse, PostToolUse, Stop, SubagentStop, SessionStart, SessionEnd, UserPromptSubmit, PreCompact, Notification). Provides comprehensive guidance for creating and implementing Claude Code plugin hooks with focus on advanced prompt-based hooks API.