Kafka Operations Troubleshooting

A quick troubleshooting guide for common Kafka operational issues: how to check for each one, its likely causes, and how to resolve it.


Quick Diagnosis

Health Check

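#!/bin/bash
# quick-health-check.sh
# Usage: quick-health-check.sh [bootstrap-server]   (default: localhost:9092)

BOOTSTRAP=${1:-"localhost:9092"}

echo "=== Kafka Quick Health Check ==="

# Connectivity: OK if a broker answers the API versions request
echo -n "Connectivity: "
kafka-broker-api-versions.sh --bootstrap-server "$BOOTSTRAP" > /dev/null 2>&1 \
  && echo "OK" || echo "FAILED"

# The counts below are also 0 when the command itself fails,
# so trust them only if the connectivity check passed.

# Offline partitions (no available leader)
OFFLINE=$(kafka-topics.sh --bootstrap-server "$BOOTSTRAP" \
  --describe --unavailable-partitions 2>/dev/null | grep -c "Topic:")
echo "Offline partitions: $OFFLINE"

# Under-replicated partitions (ISR smaller than the replica set)
URP=$(kafka-topics.sh --bootstrap-server "$BOOTSTRAP" \
  --describe --under-replicated-partitions 2>/dev/null | grep -c "Topic:")
echo "Under-replicated: $URP"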

Common Issues

Issue: Under-Replicated Partitions

Symptoms: UnderReplicatedPartitions > 0

Quick Check:

kafka-topics.sh --bootstrap-server kafka:9092 \
  --describe --under-replicated-partitions

Common Causes:

| Cause | Check | Resolution |
|---|---|---|
| Broker down | Broker connectivity | Restart broker |
| Slow broker | Disk I/O, CPU | Address bottleneck |
| Network issues | ping, netstat | Fix connectivity |
| Large messages | Message size | Increase replica.fetch.max.bytes |
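
Most of the causes above come down to one broker falling behind, so it helps to see which broker IDs are missing from the ISR. The following is a minimal sketch, not an official tool: `lagging_replicas` is a hypothetical helper name, and it assumes each partition line of the `--describe` output carries `Replicas: a,b,c` and `Isr: a,b` fields separated by whitespace.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `kafka-topics.sh --describe` output on stdin and
# prints "<broker-id> <count>" for every broker that is listed as a replica
# but is absent from the ISR, sorted by broker ID.
lagging_replicas() {
  awk '
    /Replicas:/ && /Isr:/ {
      replicas = ""; isr = ""
      # Pick up the value that follows each field label.
      for (i = 1; i < NF; i++) {
        if ($i == "Replicas:") replicas = $(i + 1)
        if ($i == "Isr:")      isr = $(i + 1)
      }
      n = split(replicas, r, ",")
      m = split(isr, s, ",")
      split("", in_isr)                      # clear the lookup table
      for (j = 1; j <= m; j++) in_isr[s[j]] = 1
      for (j = 1; j <= n; j++) if (!(r[j] in in_isr)) missing[r[j]]++
    }
    END { for (b in missing) print b, missing[b] }
  ' | sort -n
}
```

Usage: pipe the Quick Check command into it, for example `kafka-topics.sh --bootstrap-server kafka:9092 --describe --under-replicated-partitions | lagging_replicas`. A single broker ID with a high count suggests a problem on that broker; counts spread across many brokers suggest a cluster-wide or network cause.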

Issue: Consumer Lag Growing

Symptoms: Lag increasing over time

Quick Check:

kafka-consumer-groups.sh --bootstrap-server kafka:9092 \
  --describe --group my-group

Common Causes:

| Cause | Check | Resolution |
|---|---|---|
| Slow processing | Consumer metrics | Optimize processing |
| Too few consumers | Consumer count | Add consumers |
| Rebalancing | Group state | Fix rebalance storms |
| External dependency | External latency | Add buffering |
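
To tell whether lag is growing, it helps to reduce the per-partition output to one number per topic and compare it across runs. The following is a minimal sketch under stated assumptions: `lag_by_topic` is a hypothetical helper, and it assumes the `--describe` output has a header row containing `TOPIC` and `LAG` columns, with `-` shown where no offset has been committed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `kafka-consumer-groups.sh --describe --group ...`
# output on stdin and prints "<topic> <total-lag>" per topic.
lag_by_topic() {
  awk '
    # Until the header row is found, look for the TOPIC and LAG columns.
    !lag_col {
      for (i = 1; i <= NF; i++) {
        if ($i == "TOPIC") topic_col = i
        if ($i == "LAG")   lag_col = i
      }
      next
    }
    # Data rows: only sum numeric lag values (skips "-" and repeated headers).
    NF >= lag_col && $lag_col ~ /^[0-9]+$/ {
      lag[$topic_col] += $lag_col
    }
    END { for (t in lag) print t, lag[t] }
  ' | sort
}
```

Usage: `kafka-consumer-groups.sh --bootstrap-server kafka:9092 --describe --group my-group | lag_by_topic`. Running it a few minutes apart shows whether the total is trending up or is merely high but stable.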

Issue: Producer Timeouts

Symptoms: TimeoutException in producer

Quick Check:

# Check broker health
kafka-broker-api-versions.sh --bootstrap-server kafka:9092

# Check partition leaders
kafka-topics.sh --bootstrap-server kafka:9092 \
  --describe --topic my-topic

Common Causes:

| Cause | Check | Resolution |
|---|---|---|
| Broker unavailable | Broker status | Restart broker |
| No leader | Partition describe | Wait for election |
| ISR too small | min.insync.replicas | Fix broker health |
| Network issues | Connectivity | Fix network |

Issue: Frequent Rebalances

Symptoms: Consumers constantly rebalancing

Quick Check:

kafka-consumer-groups.sh --bootstrap-server kafka:9092 \
  --describe --group my-group --state

Common Causes:

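| Cause | Check | Resolution |
|---|---|---|
| Poll interval exceeded (slow processing) | Processing time per poll | Increase max.poll.interval.ms or lower max.poll.records |
| Consumer crashes | Consumer logs | Fix stability |
| Heartbeat issues (session timeout) | Network | Check connectivity |
| GC pauses | GC logs | Tune JVM |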

Resolution:

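# Consumer configuration

# Increase timeouts: more time between poll() calls, and more time
# before missed heartbeats remove the consumer from the group
max.poll.interval.ms=600000
session.timeout.ms=45000
heartbeat.interval.ms=3000

# Use cooperative rebalancing
partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor

# Static membership (the ID must be unique per consumer instance and stable across restarts)
group.instance.id=consumer-1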
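
When raising these timeouts, the heartbeat interval should stay well below the session timeout. The Kafka consumer documentation suggests setting it to no more than one third of `session.timeout.ms`. A minimal sketch of that check, where `check_heartbeat_ratio` is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Hypothetical helper: returns 0 when heartbeat.interval.ms is at most one
# third of session.timeout.ms, and 1 otherwise.
check_heartbeat_ratio() {
  local session_ms=$1 heartbeat_ms=$2
  if [ $((heartbeat_ms * 3)) -le "$session_ms" ]; then
    echo "OK: heartbeat ${heartbeat_ms}ms <= 1/3 of session ${session_ms}ms"
  else
    echo "WARN: heartbeat ${heartbeat_ms}ms > 1/3 of session ${session_ms}ms"
    return 1
  fi
}
```

For the values above, `check_heartbeat_ratio 45000 3000` passes, since 3 × 3000 = 9000 is well under 45000.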

Issue: High Latency

Symptoms: Request latency exceeds SLA

Quick Check:

# Check broker request latency over JMX, per request type:
# kafka.network:type=RequestMetrics,name=TotalTimeMs,request={Produce|FetchConsumer|FetchFollower}

Common Causes:

| Cause | Check | Resolution |
|---|---|---|
| Disk I/O | iostat | Faster disks |
| Network | netstat, ping | Fix network |
| GC pauses | GC logs | Tune JVM |
| Request queue | RequestQueueTimeMs | More request handler threads (num.io.threads) |

Issue: Disk Full

Symptoms: No space left on device

Quick Check:

df -h /var/kafka-logs
du -sh /var/kafka-logs/*

Resolution:

# Reduce retention (86400000 ms = 24 hours)
kafka-configs.sh --bootstrap-server kafka:9092 \
  --entity-type topics --entity-name big-topic \
  --alter --add-config retention.ms=86400000

# Delete unused topics (irreversible)
kafka-topics.sh --bootstrap-server kafka:9092 \
  --delete --topic unused-topic

# Add storage
# Mount an additional disk and add its path to log.dirs
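
`retention.ms` is given in milliseconds, which is easy to get wrong by a factor of 1000. A small sketch of the conversion, where `retention_ms` is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Hypothetical helper: convert a retention period in hours to milliseconds.
retention_ms() {
  local hours=$1
  echo $((hours * 60 * 60 * 1000))
}
```

For example, `--add-config retention.ms=$(retention_ms 24)` produces the 86400000 used above, and `retention_ms 168` gives the value for seven days.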

Log Analysis

Find Errors

# Recent errors
grep "ERROR" /var/log/kafka/server.log | tail -50

# Specific exception
grep "OutOfMemoryError" /var/log/kafka/server.log

# Time-based search
grep "2024-01-15 10:" /var/log/kafka/server.log | grep ERROR

Key Patterns

| Pattern | Meaning | Action |
|---|---|---|
| OutOfMemoryError | Heap exhausted | Increase heap |
| KafkaStorageException | Disk failure | Check disk |
| NotLeaderOrFollower | Stale metadata | Usually transient |
| ISR shrunk | Replica fell behind | Check broker |
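
To see at a glance which of these patterns dominate a log, the individual `grep` calls can be wrapped in a loop. This is a minimal sketch: `count_log_patterns` is a hypothetical helper, and the patterns are passed as arguments because the exact wording of broker log messages varies between versions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "<pattern><TAB><matching line count>" for each
# pattern given after the log file path.
count_log_patterns() {
  local log_file=$1
  shift
  local pattern
  for pattern in "$@"; do
    # grep -c prints 0 when nothing matches; `--` protects patterns
    # that start with a dash.
    printf '%s\t%s\n' "$pattern" "$(grep -c -- "$pattern" "$log_file")"
  done
}
```

Usage: `count_log_patterns /var/log/kafka/server.log OutOfMemoryError KafkaStorageException NotLeaderOrFollower`.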

Emergency Procedures

Broker Won't Start

# Check for port conflict
netstat -tlnp | grep 9092

# Check log for errors
tail -100 /var/log/kafka/server.log | grep -E "ERROR|FATAL"

# Verify disk space
df -h /var/kafka-logs

# Check the open file descriptor limit
# (shows the limit for the current shell; run it as the user that starts Kafka)
ulimit -n

Offline Partitions

# Identify offline partitions
kafka-topics.sh --bootstrap-server kafka:9092 \
  --describe --unavailable-partitions

# Check replica status
kafka-topics.sh --bootstrap-server kafka:9092 \
  --describe --topic affected-topic

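# If the broker is recoverable, restart it so a clean election can happen.
# If not, consider an unclean election as a last resort (data loss risk:
# an out-of-sync replica becomes leader and unreplicated messages are lost).
kafka-leader-election.sh --bootstrap-server kafka:9092 \
  --election-type unclean \
  --topic affected-topic --partition 0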

Consumer Group Stuck

# Check group state
kafka-consumer-groups.sh --bootstrap-server kafka:9092 \
  --describe --group stuck-group --state

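# Delete the group (only succeeds once all its consumers are stopped;
# its committed offsets are deleted with it)
kafka-consumer-groups.sh --bootstrap-server kafka:9092 \
  --delete --group stuck-group

# The group is recreated when its consumers restart; reset offsets first if needed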
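
An offset reset can be previewed before it is applied, because `kafka-consumer-groups.sh --reset-offsets` accepts either `--dry-run` or `--execute`. The sketch below wraps that in a function that defaults to the preview. `reset_group_offsets` and the `KAFKA_CONSUMER_GROUPS` variable are hypothetical names, and `--to-latest` is only one of the available reset targets (`--to-earliest` is another).

```shell
#!/usr/bin/env bash
# Path to the CLI; overridable so the wrapper can be tried without a cluster.
KAFKA_CONSUMER_GROUPS=${KAFKA_CONSUMER_GROUPS:-kafka-consumer-groups.sh}

# Hypothetical wrapper: previews a reset to the latest offsets unless the
# fourth argument is "execute". The group must have no active consumers.
reset_group_offsets() {
  local bootstrap=$1 group=$2 topic=$3 mode=${4:-dry-run}
  local action="--dry-run"
  if [ "$mode" = "execute" ]; then
    action="--execute"
  fi
  "$KAFKA_CONSUMER_GROUPS" --bootstrap-server "$bootstrap" \
    --group "$group" --topic "$topic" \
    --reset-offsets --to-latest "$action"
}
```

Usage: run `reset_group_offsets kafka:9092 stuck-group my-topic` to review the planned offsets, then repeat with `execute` as the last argument to apply them. Resetting to latest skips any unprocessed messages, so pick the target that matches how the consumers should resume.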