This skill should be used when the user asks to "query Prometheus", "analyze Prometheus metrics", "check Prometheus alerts", "write PromQL", "interpret Prometheus data", "fetch metrics", or mentions Prometheus querying, alerting, or metrics analysis. Provides guidance for querying and interpreting Prometheus metrics for root cause analysis.
This skill inherits all available tools. When active, it can use any tool Claude has access to.
Prometheus is a time-series metrics collection and alerting system widely used for monitoring production systems. This skill provides guidance for querying Prometheus metrics, interpreting alert data, and using metrics for root cause analysis.
Apply this skill when the user asks to query Prometheus, analyze or fetch metrics, check alerts, write PromQL, or interpret Prometheus data as part of a root cause investigation.
Counter: Cumulative value that only increases (e.g., total requests, error count). Use rate() or increase() to get a per-second rate or total increase. Example: http_requests_total
Gauge: Value that can go up or down (e.g., CPU usage, memory usage, queue depth). Query directly or smooth with avg_over_time(). Example: node_memory_usage_bytes
Histogram: Distribution of values in buckets (e.g., request durations). Exposes _sum, _count, and _bucket metrics. Example: http_request_duration_seconds
Summary: Similar to a histogram but with pre-calculated quantiles. Example: http_request_duration_seconds{quantile="0.95"}
Metrics have the format: metric_name{label1="value1", label2="value2"}
Example: http_requests_total{method="POST", status="500", service="api"}
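As a quick illustration (a sketch using the example metric names above, which may differ in a given environment), each type is typically queried like this:
# Counter: per-second rate, or total increase over a window
rate(http_requests_total[5m])
increase(http_requests_total[1h])
# Gauge: current value, or smoothed over time
avg_over_time(node_memory_usage_bytes[10m])
# Histogram: percentile derived from the _bucket series
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))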
Instant query (current value):
http_requests_total
Range query (over time):
http_requests_total[5m]
Filter by labels:
http_requests_total{status="500", service="api"}
Rate of increase (per-second rate):
rate(http_requests_total[5m])
Sum across dimensions:
sum(rate(http_requests_total[5m])) by (status)
Average:
avg(node_memory_usage_bytes) by (instance)
Max/Min:
max(http_request_duration_seconds) by (endpoint)
Count:
count(up == 0) # Count instances that are down
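Two more standard PromQL aggregation operators often useful during root cause analysis (a sketch reusing the document's example metric):
# Top 5 endpoints by request rate
topk(5, sum(rate(http_requests_total[5m])) by (endpoint))
# Bottom 5 endpoints by request rate
bottomk(5, sum(rate(http_requests_total[5m])) by (endpoint))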
Error rate percentage:
sum(rate(http_requests_total{status=~"5.."}[5m])) /
sum(rate(http_requests_total[5m])) * 100
95th percentile latency:
histogram_quantile(0.95,
rate(http_request_duration_seconds_bucket[5m])
)
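When the histogram is reported by several instances or pods, aggregate the buckets first while keeping the le label; otherwise the quantile is computed per series. A common variant of the query above:
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)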
Request rate by endpoint:
sum(rate(http_requests_total[5m])) by (endpoint)
Memory usage percentage:
(node_memory_usage_bytes / node_memory_total_bytes) * 100
Database connection pool usage:
sum(db_connection_pool_active) / sum(db_connection_pool_max) * 100
Prometheus alerts contain labels (including the alert name), annotations such as summary and description, a state (pending or firing), and the time the alert became active.
Use Prometheus API to fetch alerts:
List active alerts:
GET /api/v1/alerts
Query alert rule:
GET /api/v1/rules
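Firing and pending alerts are also exposed as an ordinary time series named ALERTS, so they can be fetched with a normal PromQL query (assuming a standard Prometheus setup):
List firing alerts via PromQL:
ALERTS{alertstate="firing"}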
When investigating an alert, start from its expression and break it down into targeted queries, as in the following example.
Alert: HighErrorRate
Expression: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
Investigation queries:
Query error rate breakdown:
rate(http_requests_total{status=~"5.."}[5m]) by (endpoint)
Query total request rate:
rate(http_requests_total[5m])
Query error rate percentage:
(sum(rate(http_requests_total{status=~"5.."}[5m])) /
sum(rate(http_requests_total[5m]))) * 100
Check for correlated latency increase:
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
Signature: Metric jumps sharply at specific time
Possible causes:
Investigation:
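One hedged starting point for a sudden jump is to compare the metric with itself just before the change using offset (shown here with the document's running example metric):
# Rate now vs. one hour ago
rate(http_requests_total[5m])
rate(http_requests_total[5m] offset 1h)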
Signature: Metric grows steadily over hours/days
Possible causes:
Investigation:
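For steady growth, the standard PromQL functions deriv() and predict_linear() quantify the trend and project where it is heading (a sketch; the metric name is illustrative):
# Per-second growth of resident memory over the last 6 hours
deriv(process_resident_memory_bytes[6h])
# Projected resident memory 4 hours (14400 seconds) from now
predict_linear(process_resident_memory_bytes[6h], 14400)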
Signature: Metric spikes at regular intervals
Possible causes:
Investigation:
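To test whether spikes follow a daily schedule, compare the series with itself shifted by one day (a sketch using the document's example metric):
# Request rate now vs. the same time yesterday
rate(http_requests_total[5m])
rate(http_requests_total[5m] offset 1d)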
Signature: Metric suddenly drops to zero or very low value
Possible causes:
Investigation:
Check whether the target is still being scraped (up metric)
Signature: Metric fluctuates wildly
Possible causes:
Investigation:
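For erratic fluctuation, stddev_over_time() (a standard over-time function) gives a rough measure of how noisy a gauge has been (the metric name is illustrative):
# Standard deviation of memory usage over the last hour
stddev_over_time(node_memory_usage_bytes[1h])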
Choose appropriate time ranges for investigation:
Incident detection (5-15 minutes):
rate(metric[5m])
Trend analysis (1-6 hours):
rate(metric[1h])
Long-term patterns (1-7 days):
avg_over_time(metric[1d])
Comparison with past:
# Current value
metric
# Value 1 week ago
metric offset 1w
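The two can also be combined into a single ratio, which makes week-over-week regressions easy to spot (same placeholder metric as above):
# Ratio of the current value to the value one week ago
metric / (metric offset 1w)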
Query related metrics together to understand full context:
# Error rate
rate(http_requests_total{status=~"5.."}[5m])
# Latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# Request rate
rate(http_requests_total[5m])
# CPU usage
rate(process_cpu_seconds_total[5m])
# Memory usage
process_resident_memory_bytes
Look for metrics that change together, such as an error-rate increase that coincides with rising latency or CPU usage.
Use rate() for counters, not raw values.
This skill works with:
The Thufir plugin includes a Prometheus MCP server for querying:
Query instant value:
Use MCP tool: prometheus_query
Query: rate(http_requests_total[5m])
Query time range:
Use MCP tool: prometheus_query_range
Query: rate(http_requests_total[5m])
Start: 2025-12-19T14:00:00Z
End: 2025-12-19T15:00:00Z
Step: 15s
Fetch active alerts:
Use MCP tool: prometheus_alerts
For detailed PromQL patterns and advanced queries:
references/promql-cookbook.md - Common PromQL queries for RCA scenarios
Error rate: rate(http_requests_total{status=~"5.."}[5m])
Latency p95: histogram_quantile(0.95, rate(duration_bucket[5m]))
CPU usage: rate(process_cpu_seconds_total[5m])
Memory: process_resident_memory_bytes
Request rate: rate(http_requests_total[5m])
Time ranges: 5m (instant), 1h (trend), 1d (baseline)
Aggregations: sum, avg, max, min, count
Filters: {label="value"}, {label=~"regex"}
Use Prometheus metrics to provide objective, time-series evidence for root cause analysis. Correlate metrics with code changes and system events to identify precise incident causes.