Kafka monitoring and observability expert. Guides Prometheus + Grafana setup, JMX metrics, alerting rules, and dashboard configuration. Activates for kafka monitoring, prometheus, grafana, kafka metrics, jmx exporter, kafka observability, monitoring setup, kafka dashboards, alerting, kafka performance monitoring, metrics collection.
Inherits all available tools
Additional assets for this skill
This skill inherits all available tools. When active, it can use any tool Claude has access to.
Expert guidance for implementing comprehensive monitoring and observability for Apache Kafka using Prometheus and Grafana.
I activate when you need help with Prometheus and Grafana setup, JMX metrics collection, consumer lag monitoring, alerting rules, dashboards, or Kafka observability in general.
This plugin provides a complete monitoring stack:
- plugins/specweave-kafka/monitoring/prometheus/kafka-jmx-exporter.yml - JMX Exporter config
- plugins/specweave-kafka/monitoring/grafana/dashboards/ - Grafana dashboard JSON files
- plugins/specweave-kafka/monitoring/grafana/provisioning/dashboards/kafka.yml - Dashboard provisioning config
- plugins/specweave-kafka/monitoring/grafana/provisioning/datasources/prometheus.yml - Prometheus datasource config
For Kafka running on VMs or bare metal (non-Kubernetes).
# Download JMX Prometheus agent JAR
cd /opt
wget https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/0.20.0/jmx_prometheus_javaagent-0.20.0.jar
# Copy JMX Exporter config
cp plugins/specweave-kafka/monitoring/prometheus/kafka-jmx-exporter.yml /opt/kafka-jmx-exporter.yml
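For reference, a JMX Exporter rules file has roughly this shape (a simplified sketch only; the shipped kafka-jmx-exporter.yml contains a much larger, tuned rule set):
# Simplified JMX Prometheus Exporter config (illustrative only; not the bundled file)
lowercaseOutputName: true
rules:
  # Example rule: expose broker topic rate MBeans as Prometheus counters
  - pattern: kafka.server<type=(.+), name=(.+)PerSec\w*><>Count
    name: kafka_server_$1_$2_total
    type: COUNTER
  # Catch-all for remaining Kafka MBeans
  - pattern: ".*"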
Add JMX exporter to Kafka startup script:
# Edit Kafka startup (e.g., /etc/systemd/system/kafka.service)
[Service]
Environment="KAFKA_OPTS=-javaagent:/opt/jmx_prometheus_javaagent-0.20.0.jar=7071:/opt/kafka-jmx-exporter.yml"
Or add to kafka-server-start.sh:
export KAFKA_OPTS="-javaagent:/opt/jmx_prometheus_javaagent-0.20.0.jar=7071:/opt/kafka-jmx-exporter.yml"
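If you manage the broker with systemd, a drop-in override survives package upgrades better than editing the unit file in place; a sketch, assuming the service is named kafka:
# Create a drop-in override for the kafka service (opens an editor)
sudo systemctl edit kafka
# Add the following to the override file:
# [Service]
# Environment="KAFKA_OPTS=-javaagent:/opt/jmx_prometheus_javaagent-0.20.0.jar=7071:/opt/kafka-jmx-exporter.yml"
sudo systemctl daemon-reload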
# Restart Kafka broker
sudo systemctl restart kafka
# Verify JMX exporter is running (port 7071)
curl localhost:7071/metrics | grep kafka_server
# Expected output: kafka_server_broker_topic_metrics_bytesin_total{...} 12345
Add Kafka brokers to Prometheus config:
# prometheus.yml
scrape_configs:
  - job_name: 'kafka'
    static_configs:
      - targets:
          - 'kafka-broker-1:7071'
          - 'kafka-broker-2:7071'
          - 'kafka-broker-3:7071'
    scrape_interval: 30s
# Reload Prometheus
sudo systemctl reload prometheus
# OR send SIGHUP
kill -HUP $(pidof prometheus)
# Verify scraping
curl http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.job=="kafka")'
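With targets up, a quick PromQL sanity check confirms per-broker data is flowing (exact metric names depend on the bundled JMX exporter rules; this uses the name shown in the verification step above):
# Per-broker inbound byte rate over the last 5 minutes (run in the Prometheus Graph tab)
sum by (instance) (rate(kafka_server_broker_topic_metrics_bytesin_total[5m]))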
For Kafka running on Kubernetes with Strimzi Operator.
# Create ConfigMap from JMX exporter config
kubectl create configmap kafka-metrics \
--from-file=kafka-metrics-config.yml=plugins/specweave-kafka/monitoring/prometheus/kafka-jmx-exporter.yml \
-n kafka
# kafka-cluster.yaml (add metricsConfig section)
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-kafka-cluster
  namespace: kafka
spec:
  kafka:
    version: 3.7.0
    replicas: 3
    # ... other config ...
    metricsConfig:
      type: jmxPrometheusExporter
      valueFrom:
        configMapKeyRef:
          name: kafka-metrics
          key: kafka-metrics-config.yml
# Apply updated Kafka CR
kubectl apply -f kafka-cluster.yaml
# Verify metrics endpoint (wait for rolling restart)
kubectl exec -it my-kafka-cluster-kafka-0 -n kafka -- curl -s localhost:9404/metrics | grep kafka_server
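If you also want consumer-group lag metrics on Strimzi, the operator can deploy Kafka Exporter for you; a sketch added under spec of the same Kafka CR (see the consumer lag troubleshooting section below for why this is needed):
# kafka-cluster.yaml (optional, under spec): have Strimzi deploy Kafka Exporter
  kafkaExporter:
    topicRegex: ".*"
    groupRegex: ".*"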
# Add Prometheus Community Helm repo
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# Install kube-prometheus-stack (Prometheus + Grafana + Alertmanager)
helm install prometheus prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--create-namespace \
--set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false \
--set prometheus.prometheusSpec.podMonitorSelectorNilUsesHelmValues=false
# kafka-podmonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: kafka-metrics
  namespace: kafka
  labels:
    app: strimzi
spec:
  selector:
    matchLabels:
      strimzi.io/kind: Kafka
  podMetricsEndpoints:
    - port: tcp-prometheus
      interval: 30s
# Apply PodMonitor
kubectl apply -f kafka-podmonitor.yaml
# Verify Prometheus is scraping Kafka
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090
# Open: http://localhost:9090/targets
# Should see kafka-metrics/* targets
If using Docker Compose for local development:
# docker-compose.yml (add to existing Kafka setup)
version: '3.8'
services:
  # ... Kafka services ...

  prometheus:
    image: prom/prometheus:v2.48.0
    ports:
      - "9090:9090"
    volumes:
      - ./monitoring/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'

  grafana:
    image: grafana/grafana:10.2.0
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - ./monitoring/grafana/provisioning:/etc/grafana/provisioning
      - ./monitoring/grafana/dashboards:/var/lib/grafana/dashboards
      - grafana-data:/var/lib/grafana

volumes:
  prometheus-data:
  grafana-data:
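The mounted ./monitoring/prometheus/prometheus.yml needs a scrape job pointing at the broker's JMX exporter; a minimal sketch, assuming the Kafka container is reachable as kafka and exposes port 7071:
# ./monitoring/prometheus/prometheus.yml (minimal sketch)
global:
  scrape_interval: 30s
scrape_configs:
  - job_name: 'kafka'
    static_configs:
      - targets: ['kafka:7071']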
# Start monitoring stack
docker-compose up -d prometheus grafana
# Access Grafana
# URL: http://localhost:3000
# Username: admin
# Password: admin
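The mounted provisioning directory is what makes Grafana pick up the datasource and dashboards automatically; sketches of the two shipped files (names from the asset list above, contents illustrative):
# ./monitoring/grafana/provisioning/datasources/prometheus.yml (sketch; assumes
# the Prometheus container is reachable as "prometheus" on port 9090)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true

# ./monitoring/grafana/provisioning/dashboards/kafka.yml (sketch; loads JSON
# dashboards from the mounted dashboards directory)
apiVersion: 1
providers:
  - name: kafka
    type: file
    options:
      path: /var/lib/grafana/dashboards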
Dashboards are auto-provisioned if using kube-prometheus-stack:
# Create a ConfigMap for each dashboard and label it for Grafana auto-discovery
for dashboard in plugins/specweave-kafka/monitoring/grafana/dashboards/*.json; do
  name=$(basename "$dashboard" .json)
  kubectl create configmap "kafka-dashboard-$name" \
    --from-file="$dashboard" \
    -n monitoring \
    --dry-run=client -o yaml | kubectl apply -f -
  # kubectl label does not expand wildcards in resource names, so label each ConfigMap here
  kubectl label configmap "kafka-dashboard-$name" -n monitoring grafana_dashboard=1 --overwrite
done
# Grafana will auto-import dashboards (wait 30-60 seconds)
# Access Grafana
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80
# URL: http://localhost:3000
# Username: admin
# Password: prom-operator (default kube-prometheus-stack password)
If auto-provisioning doesn't work:
# 1. Access Grafana UI
# 2. Go to: Dashboards → Import
# 3. Upload JSON files from:
# plugins/specweave-kafka/monitoring/grafana/dashboards/
# Or use Grafana API
for dashboard in plugins/specweave-kafka/monitoring/grafana/dashboards/*.json; do
curl -X POST http://admin:admin@localhost:3000/api/dashboards/db \
-H "Content-Type: application/json" \
-d @"$dashboard"
done
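Note that the /api/dashboards/db endpoint expects the dashboard JSON wrapped in an envelope; if the shipped files are raw dashboard exports rather than pre-wrapped payloads, a variant like this works:
# Wrap each raw dashboard JSON in the envelope the Grafana API expects
for dashboard in plugins/specweave-kafka/monitoring/grafana/dashboards/*.json; do
  jq '{dashboard: ., overwrite: true}' "$dashboard" | \
    curl -X POST http://admin:admin@localhost:3000/api/dashboards/db \
      -H "Content-Type: application/json" \
      -d @-
done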
The plugin ships five dashboards:
- kafka-cluster-overview.json - High-level cluster health; use when checking overall cluster health.
- kafka-broker-metrics.json - Per-broker performance; use when investigating broker performance issues.
- kafka-consumer-lag.json - Consumer lag monitoring; use when troubleshooting slow consumers or lag spikes.
- kafka-topic-metrics.json - Topic-level metrics; use when analyzing topic throughput and hotspots.
- kafka-jvm-metrics.json - JVM health monitoring; use when investigating memory leaks or GC pauses.
Create Prometheus alerting rules for critical Kafka metrics:
# kafka-alerts.yml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kafka-alerts
  namespace: monitoring
spec:
  groups:
    - name: kafka.rules
      interval: 30s
      rules:
        # CRITICAL: Under-Replicated Partitions
        - alert: KafkaUnderReplicatedPartitions
          expr: sum(kafka_server_replica_manager_under_replicated_partitions) > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Kafka has under-replicated partitions"
            description: "{{ $value }} partitions are under-replicated. Data loss risk!"

        # CRITICAL: Offline Partitions
        - alert: KafkaOfflinePartitions
          expr: kafka_controller_offline_partitions_count > 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "Kafka has offline partitions"
            description: "{{ $value }} partitions are offline. Service degradation!"

        # CRITICAL: No Active Controller
        - alert: KafkaNoActiveController
          expr: kafka_controller_active_controller_count == 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "No active Kafka controller"
            description: "Cluster has no active controller. Cannot perform administrative operations!"

        # WARNING: High Consumer Lag
        - alert: KafkaConsumerLagHigh
          expr: sum by (consumergroup) (kafka_consumergroup_lag) > 10000
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Consumer group {{ $labels.consumergroup }} has high lag"
            description: "Lag is {{ $value }} messages. Consumers may be slow."

        # WARNING: High CPU Usage
        - alert: KafkaBrokerHighCPU
          expr: os_process_cpu_load{job="kafka"} > 0.8
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Broker {{ $labels.instance }} has high CPU usage"
            description: "CPU usage is {{ $value | humanizePercentage }}. Consider scaling."

        # WARNING: Low Heap Memory
        - alert: KafkaBrokerLowHeapMemory
          expr: jvm_memory_heap_used_bytes{job="kafka"} / jvm_memory_heap_max_bytes{job="kafka"} > 0.9
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Broker {{ $labels.instance }} has low heap memory"
            description: "Heap usage is {{ $value | humanizePercentage }}. Risk of OOM!"

        # WARNING: High GC Time
        - alert: KafkaBrokerHighGCTime
          expr: rate(jvm_gc_collection_time_ms_total{job="kafka"}[5m]) > 500
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Broker {{ $labels.instance }} spending too much time in GC"
            description: "GC time is {{ $value }}ms/sec. Application pauses likely."
# Apply alerts (Kubernetes)
kubectl apply -f kafka-alerts.yml
# Verify alerts loaded
kubectl get prometheusrules -n monitoring
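For a Prometheus instance not managed by the operator (for example the VM setup above), the same alerts load via rule_files instead of a PrometheusRule; a sketch, with an example rules-file path:
# prometheus.yml: reference a plain rules file
rule_files:
  - /etc/prometheus/rules/kafka-alerts.yml

# /etc/prometheus/rules/kafka-alerts.yml: keep only the contents of spec.groups
groups:
  - name: kafka.rules
    rules:
      - alert: KafkaUnderReplicatedPartitions
        expr: sum(kafka_server_replica_manager_under_replicated_partitions) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Kafka has under-replicated partitions"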
Symptoms: No Kafka metrics in Prometheus
Fix:
# 1. Verify JMX exporter is running
curl http://kafka-broker:7071/metrics
# 2. Check Prometheus targets
curl http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.job=="kafka")'
# 3. Check Prometheus logs
kubectl logs -n monitoring prometheus-kube-prometheus-prometheus-0
# Common issues:
# - Firewall blocking port 7071
# - Incorrect scrape config
# - Kafka broker not running
Symptoms: Dashboards show "No data"
Fix:
# 1. Verify Prometheus datasource
# Grafana UI → Configuration → Data Sources → Prometheus → Test
# 2. Check if Kafka metrics exist in Prometheus
# Prometheus UI → Graph → Enter: kafka_server_broker_topic_metrics_bytesin_total
# 3. Verify dashboard queries match your Prometheus job name
# Dashboard panels use job="kafka" by default
# If your job name is different, update dashboard JSON
Symptoms: Consumer lag dashboard empty
Fix: Consumer lag metrics require Kafka Exporter (separate from JMX Exporter):
# Install Kafka Exporter (Kubernetes)
helm install kafka-exporter prometheus-community/prometheus-kafka-exporter \
--namespace monitoring \
--set kafkaServer={kafka-bootstrap:9092}
# Or run as Docker container
docker run -d -p 9308:9308 \
danielqsj/kafka-exporter \
--kafka.server=kafka:9092 \
--web.listen-address=:9308
# Add to Prometheus scrape config
scrape_configs:
  - job_name: 'kafka-exporter'
    static_configs:
      - targets: ['kafka-exporter:9308']
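Once Kafka Exporter is being scraped, lag becomes queryable and the consumer lag dashboard and the KafkaConsumerLagHigh alert start working; a quick check:
# Total lag per consumer group (metric comes from Kafka Exporter, not the JMX agent)
sum by (consumergroup) (kafka_consumergroup_lag)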
# Check JMX exporter metrics
curl http://localhost:7071/metrics | grep -E "(kafka_server|kafka_controller)"
# Prometheus query examples
curl -g 'http://localhost:9090/api/v1/query?query=kafka_server_replica_manager_under_replicated_partitions'
# Grafana dashboard export
curl http://admin:admin@localhost:3000/api/dashboards/uid/kafka-cluster-overview | jq .dashboard > backup.json
# Reload Prometheus config
kill -HUP $(pidof prometheus)
# Check Prometheus targets
curl http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.job=="kafka")'
Next Steps After Monitoring Setup: