Kafka Operations
Operational procedures for managing Apache Kafka clusters in production environments.
Operations Overview
Kafka operations encompass cluster management, monitoring, performance optimization, and maintenance activities required to run Kafka reliably at scale.

Operations Categories
| Category | Activities |
| --- | --- |
| Cluster Management | Broker lifecycle, partition management, configuration |
| Monitoring | Metrics collection, alerting, dashboards |
| Performance | Tuning, capacity planning, benchmarking |
| Maintenance | Upgrades, rolling restarts, log management |
| Security | Authentication, authorization, encryption |
| Backup/DR | Replication, disaster recovery, data migration |
Cluster Management
Broker Operations
| Operation | Command/Procedure |
| --- | --- |
| Start broker | kafka-server-start.sh config/server.properties |
| Stop broker | kafka-server-stop.sh (sends SIGTERM, which triggers a controlled shutdown) |
| Check broker status | kafka-broker-api-versions.sh --bootstrap-server host:9092 |
| List brokers | kafka-metadata.sh --snapshot /path/to/metadata --command "brokers" (KRaft) |
Adding a Broker
- Configure the new broker with a unique broker.id (node.id in KRaft mode)
- Start the broker; it joins the cluster automatically but receives no existing partitions
- Reassign partitions to include the new broker:
# Generate reassignment plan (prints current and proposed assignments; save the proposed JSON as reassignment.json)
kafka-reassign-partitions.sh --bootstrap-server kafka:9092 \
--topics-to-move-json-file topics.json \
--broker-list "1,2,3,4" \
--generate
# Execute reassignment
kafka-reassign-partitions.sh --bootstrap-server kafka:9092 \
--reassignment-json-file reassignment.json \
--execute
# Verify progress
kafka-reassign-partitions.sh --bootstrap-server kafka:9092 \
--reassignment-json-file reassignment.json \
--verify
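The --topics-to-move-json-file argument expects a small JSON document naming the topics to rebalance. A minimal sketch follows, assuming a single topic called orders (borrowed from the examples elsewhere in this guide; substitute your own topic names).

```shell
# Write the input file for the --generate step.
# "orders" is an illustrative topic name; list every topic that should be moved.
cat > topics.json <<'EOF'
{
  "version": 1,
  "topics": [
    { "topic": "orders" }
  ]
}
EOF
```

The --generate step prints both the current assignment and a proposed one. Keeping a copy of the current assignment is a reasonable precaution, since it can serve as the input for a rollback.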
Removing a Broker
- Reassign all partitions away from broker
- Verify no partitions remain on broker
- Stop broker
- (KRaft) Unregister broker:
kafka-cluster.sh unregister --bootstrap-server kafka:9092 \
--id 4
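The "verify no partitions remain" step can be checked by scanning the replica lists in the output of kafka-topics.sh --describe. The sketch below is one way to do it, not an official tool: partitions_on_broker is a hypothetical helper name, and it assumes the usual per-partition line layout (Topic: ... Partition: ... Replicas: ...).

```shell
# partitions_on_broker BROKER_ID
# Reads `kafka-topics.sh --describe` output on stdin and prints "topic-partition"
# for every partition whose replica list still contains BROKER_ID.
# Hypothetical helper; assumes the standard "Replicas: 1,2,3" field layout.
partitions_on_broker() {
  awk -v id="$1" '
    /Replicas:/ {
      topic = ""; part = ""; replicas = ""
      for (i = 1; i < NF; i++) {
        if ($i == "Topic:") topic = $(i + 1)
        if ($i == "Partition:") part = $(i + 1)
        if ($i == "Replicas:") replicas = $(i + 1)
      }
      n = split(replicas, r, ",")
      for (j = 1; j <= n; j++) if (r[j] == id) print topic "-" part
    }'
}

# Usage (broker 4 is safe to stop only when this prints nothing):
#   kafka-topics.sh --bootstrap-server kafka:9092 --describe | partitions_on_broker 4
```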
Rolling Restart

Rolling restart script pattern:
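BOOTSTRAP=kafka:9092

# Succeeds only if the query itself works AND reports no under-replicated partitions.
# (The command prints nothing when every partition is fully replicated.)
cluster_healthy() {
  local urp
  urp=$(kafka-topics.sh --bootstrap-server "$BOOTSTRAP" \
    --describe --under-replicated-partitions) || return 1
  [ -z "$urp" ]
}

for broker in broker1 broker2 broker3; do
  echo "Restarting $broker"
  # Check cluster health; never stop a broker while the cluster is already degraded
  cluster_healthy || { echo "Cluster unhealthy; aborting before $broker" >&2; exit 1; }
  # Stop broker
  ssh "$broker" "kafka-server-stop.sh"
  # Wait for controlled shutdown
  sleep 30
  # Start broker
  ssh "$broker" "kafka-server-start.sh -daemon config/server.properties"
  # Wait for broker to rejoin
  sleep 60
  # Verify ISR: wait until no partition is under-replicated
  until cluster_healthy; do
    sleep 10
  done
  echo "$broker restarted successfully"
done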
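A wait loop without an upper bound can hang an unattended restart indefinitely if a broker never catches up. The sketch below shows one way to bound the wait; wait_until and no_urp are hypothetical helper names, and the 600-second budget in the usage comment is an arbitrary illustration, not a recommendation.

```shell
# wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds; returns 1 once TIMEOUT_SECONDS have elapsed.
wait_until() {
  local timeout=$1 interval=$2
  shift 2
  local deadline=$(( $(date +%s) + timeout ))
  until "$@"; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      return 1
    fi
    sleep "$interval"
  done
}

# True only when the query succeeds and lists no under-replicated partitions.
no_urp() {
  local out
  out=$(kafka-topics.sh --bootstrap-server kafka:9092 \
    --describe --under-replicated-partitions) || return 1
  [ -z "$out" ]
}

# Usage inside the restart loop (600 s is an illustrative budget):
#   wait_until 600 10 no_urp || { echo "ISR did not recover" >&2; exit 1; }
```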
→ Cluster Management Guide
Essential Commands
| Tool | Purpose |
| --- | --- |
| kafka-topics.sh | Topic management |
| kafka-configs.sh | Configuration management |
| kafka-consumer-groups.sh | Consumer group management |
| kafka-reassign-partitions.sh | Partition reassignment |
| kafka-acls.sh | ACL management |
| kafka-metadata.sh | KRaft metadata inspection |
| kafka-dump-log.sh | Log segment inspection |
kafka-topics.sh
# List topics
kafka-topics.sh --bootstrap-server kafka:9092 --list
# Describe topic
kafka-topics.sh --bootstrap-server kafka:9092 \
--describe --topic orders
# Create topic
kafka-topics.sh --bootstrap-server kafka:9092 \
--create --topic orders \
--partitions 12 --replication-factor 3
# Delete topic
kafka-topics.sh --bootstrap-server kafka:9092 \
--delete --topic orders
# Find problematic partitions
kafka-topics.sh --bootstrap-server kafka:9092 \
--describe --under-replicated-partitions
kafka-topics.sh --bootstrap-server kafka:9092 \
--describe --unavailable-partitions
kafka-consumer-groups.sh
# List consumer groups
kafka-consumer-groups.sh --bootstrap-server kafka:9092 --list
# Describe group
kafka-consumer-groups.sh --bootstrap-server kafka:9092 \
--describe --group order-processor
# Reset offsets (requires group to be inactive)
kafka-consumer-groups.sh --bootstrap-server kafka:9092 \
--group order-processor \
--topic orders \
--reset-offsets --to-earliest --execute
# Reset to specific offset
kafka-consumer-groups.sh --bootstrap-server kafka:9092 \
--group order-processor \
--topic orders:0 \
--reset-offsets --to-offset 1000 --execute
# Reset to timestamp
kafka-consumer-groups.sh --bootstrap-server kafka:9092 \
--group order-processor \
--all-topics \
--reset-offsets --to-datetime 2024-01-15T10:00:00.000 --execute
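The --describe output includes a LAG column per partition, which is the usual starting point before deciding whether an offset reset is needed. The sketch below sums that column; total_lag is a hypothetical helper name, and it locates the column by its header because the column layout differs between Kafka versions.

```shell
# total_lag
# Reads `kafka-consumer-groups.sh --describe --group ...` output on stdin and
# prints the summed LAG across partitions. Rows whose lag is "-" (no committed
# offset) are skipped. Hypothetical helper; finds the LAG column by header name.
total_lag() {
  awk '
    lagcol == 0 { for (i = 1; i <= NF; i++) if ($i == "LAG") lagcol = i; next }
    NF >= lagcol && $lagcol ~ /^[0-9]+$/ { sum += $lagcol }
    END { print sum + 0 }'
}

# Usage:
#   kafka-consumer-groups.sh --bootstrap-server kafka:9092 \
#     --describe --group order-processor | total_lag
```

The reset-offsets action also accepts --dry-run in place of --execute, which prints the planned offsets without committing them.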
kafka-configs.sh
# Describe broker config
kafka-configs.sh --bootstrap-server kafka:9092 \
--entity-type brokers --entity-name 1 --describe
# Alter broker config
kafka-configs.sh --bootstrap-server kafka:9092 \
--entity-type brokers --entity-name 1 \
--alter --add-config log.cleaner.threads=4
# Describe topic config
kafka-configs.sh --bootstrap-server kafka:9092 \
--entity-type topics --entity-name orders --describe
# Alter topic config
kafka-configs.sh --bootstrap-server kafka:9092 \
--entity-type topics --entity-name orders \
--alter --add-config retention.ms=86400000
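The retention.ms value above (86400000) is one day expressed in milliseconds. A small conversion helper avoids miscounting zeros when setting longer retention periods; days_to_ms is a hypothetical name used only for this sketch.

```shell
# days_to_ms DAYS
# Converts whole days to milliseconds for retention.ms.
days_to_ms() {
  echo $(( $1 * 24 * 60 * 60 * 1000 ))
}

# Usage (7-day retention on the orders topic):
#   kafka-configs.sh --bootstrap-server kafka:9092 \
#     --entity-type topics --entity-name orders \
#     --alter --add-config "retention.ms=$(days_to_ms 7)"
```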
→ CLI Tools Reference
Monitoring
Key Metrics Categories
| Category | Metrics |
| --- | --- |
| Throughput | Messages in/out, bytes in/out per broker/topic |
| Latency | Request latency (produce, fetch, metadata) |
| Availability | Under-replicated partitions, offline partitions |
| Consumer health | Consumer lag, commit rate |
| Resource utilization | CPU, memory, disk, network |
Critical Alerts
| Metric | Condition | Severity |
| --- | --- | --- |
| Under-replicated partitions | > 0 for 5 min | Warning |
| Offline partitions | > 0 | Critical |
| Controller count | ≠ 1 | Critical |
| Consumer lag | Growing continuously | Warning |
| Request queue time | > 100ms | Warning |
| Disk usage | > 80% | Warning |
| Disk usage | > 90% | Critical |
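As an illustration of how the disk usage rows translate into a check, the sketch below maps a usage percentage to the severities in the table. disk_severity is a hypothetical helper, and the /var/kafka-logs path in the usage comment is assumed from the diagnostic examples in this guide; in practice these thresholds normally live in the alerting system rather than in a script.

```shell
# disk_severity PERCENT_USED
# Maps an integer disk usage percentage to the severity levels in the table:
# above 90 is critical, above 80 is warning, otherwise ok.
disk_severity() {
  local pct=$1
  if [ "$pct" -gt 90 ]; then
    echo critical
  elif [ "$pct" -gt 80 ]; then
    echo warning
  else
    echo ok
  fi
}

# Usage (POSIX df output; column 5 is the capacity percentage):
#   pct=$(df -P /var/kafka-logs | awk 'NR == 2 { sub("%", "", $5); print $5 }')
#   disk_severity "$pct"
```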
JMX Metrics
# Broker metrics
kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec
kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec
kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec
# Request metrics
kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce
kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchConsumer
# Partition metrics
kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions
kafka.controller:type=KafkaController,name=OfflinePartitionsCount
kafka.controller:type=KafkaController,name=ActiveControllerCount
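# Replica fetcher lag (follower replication lag; despite the name, this is not consumer group lag)
kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=*,topic=*,partition=*
# Consumer group lag is not exposed as a broker MBean; use kafka-consumer-groups.sh --describe,
# the consumer-side records-lag-max metric, or an external lag exporter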
→ Monitoring Guide
Capacity Planning

Network Bandwidth Calculation
Inbound = Producer throughput (per broker)
Outbound = (Replication factor - 1) × Inbound # Replication
+ Consumer groups × Inbound # Consumption (each group reads the full stream)
+ MirrorMaker × Inbound # If applicable
+ Connect × relevant throughput # If applicable
Total broker bandwidth = Inbound + Outbound
Example:
- Ingest: 100 MB/s per broker
- Replication factor: 3
- Consumer groups: 5
Outbound = (3-1) × 100 + 5 × 100 = 200 + 500 = 700 MB/s
Total per broker = 100 + 700 = 800 MB/s = 6.4 Gbps (800 × 8 / 1000)
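The arithmetic in the example above can be scripted so that different ingest rates and consumer group counts are easy to compare. This is a sketch of the same formula without the optional MirrorMaker and Connect terms; bandwidth_estimate is a hypothetical helper name.

```shell
# bandwidth_estimate INBOUND_MBPS REPLICATION_FACTOR CONSUMER_GROUPS
# Applies: outbound = (RF - 1) * inbound + consumer_groups * inbound
# Prints outbound and total in MB/s, plus the total converted to Gbps.
bandwidth_estimate() {
  local inbound=$1 rf=$2 consumer_groups=$3
  local outbound=$(( (rf - 1) * inbound + consumer_groups * inbound ))
  local total=$(( inbound + outbound ))
  local gbps
  gbps=$(awk -v mbps="$total" 'BEGIN { printf "%.1f", mbps * 8 / 1000 }')
  echo "outbound=${outbound} total=${total} gbps=${gbps}"
}

bandwidth_estimate 100 3 5   # prints: outbound=700 total=800 gbps=6.4
```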
Storage Calculation
Storage per broker = (Daily ingest × Retention days × RF) / Broker count
+ Compaction overhead (if applicable)
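A worked instance of the storage formula, as a sketch: the inputs below are hypothetical (8.64 TB/day corresponds to a sustained 100 MB/s, with 7 days of retention, replication factor 3, and 6 brokers), and storage_per_broker is an illustrative helper name. Compaction overhead and free-space headroom are not included.

```shell
# storage_per_broker DAILY_INGEST RETENTION_DAYS REPLICATION_FACTOR BROKER_COUNT
# Applies: (daily ingest * retention days * RF) / broker count
# The result is in the same unit as DAILY_INGEST (TB in the example below).
storage_per_broker() {
  awk -v ingest="$1" -v days="$2" -v rf="$3" -v brokers="$4" \
    'BEGIN { printf "%.2f\n", ingest * days * rf / brokers }'
}

storage_per_broker 8.64 7 3 6   # prints: 30.24
```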
Tuning Parameters
| Component | Key Parameters |
| --- | --- |
| Broker | num.io.threads, num.network.threads, socket.send.buffer.bytes |
| Producer | batch.size, linger.ms, compression.type, buffer.memory |
| Consumer | fetch.min.bytes, fetch.max.wait.ms, max.poll.records |
| OS | vm.swappiness=1, net.core.rmem_max, file descriptor limits |
| JVM | Heap size, GC settings (G1GC recommended) |
→ Performance Guide
Maintenance
Routine Tasks
| Task | Frequency | Purpose |
| --- | --- | --- |
| Monitor disk usage | Continuous | Prevent disk full |
| Check under-replicated partitions | Continuous | Detect issues early |
| Review consumer lag | Continuous | Ensure consumers keep up |
| Log rotation | Daily | Manage log files |
| Certificate renewal | Before expiry | Maintain TLS |
| Configuration backup | Weekly | Disaster recovery |
Upgrade Procedure
- Prepare
  - Review release notes
  - Test in non-production
  - Plan rollback strategy
- Rolling Upgrade
# For each broker:
# 1. Stop broker
# 2. Upgrade binaries
# 3. Update configuration if needed
# 4. Start broker
# 5. Wait for ISR recovery
# 6. Verify cluster health
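- Upgrade Protocol Version (after all brokers are upgraded and the cluster is stable)
# ZooKeeper mode: set in server.properties, then perform another rolling restart
inter.broker.protocol.version=3.6
# log.message.format.version is deprecated since Kafka 3.0 and is ignored once
# inter.broker.protocol.version is 3.0 or higher, so it no longer needs to be set
# KRaft mode: finalize metadata.version instead, for example:
# kafka-features.sh --bootstrap-server kafka:9092 upgrade --metadata 3.6
- Verify
  - Check cluster health
  - Verify producer/consumer functionality
  - Monitor for issues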
→ Maintenance Guide
Backup and Disaster Recovery
Backup Strategies
| Strategy | Method | RPO | RTO |
| --- | --- | --- | --- |
| MirrorMaker 2 | Active replication to DR site | Near-zero | Minutes |
| Topic backup | Consume and store to object storage | Hours | Hours |
| Filesystem backup | Snapshot log directories | Hours | Hours |
| Configuration backup | Export configs and ACLs | N/A | Minutes |
MirrorMaker 2
# mm2.properties
clusters=source,target
source.bootstrap.servers=source-kafka:9092
target.bootstrap.servers=target-kafka:9092
source->target.enabled=true
source->target.topics=.*
# Replication settings
replication.factor=3
checkpoints.topic.replication.factor=3
heartbeats.topic.replication.factor=3
offset-syncs.topic.replication.factor=3
# Consumer offset sync
sync.group.offsets.enabled=true
Disaster Recovery Planning
| Consideration | Recommendation |
| --- | --- |
| RPO requirement | Determines replication lag tolerance |
| RTO requirement | Determines failover automation level |
| Data consistency | Understand potential message loss during failover |
| Consumer offset handling | Plan for offset translation or reset |
| Testing | Regular DR drills |
→ Backup/Restore Guide
Troubleshooting
Common Issues
| Issue | Symptoms | Investigation |
| --- | --- | --- |
| Under-replicated partitions | ISR < RF | Check broker health, network, disk I/O |
| Consumer lag growing | Lag increasing | Check consumer health, throughput, processing time |
| High produce latency | Slow acks | Check acks setting, ISR health, disk I/O |
| Broker OOM | Broker crashes | Review heap size, page cache usage |
| Disk full | Write failures | Check retention, add storage, rebalance |
| Connection failures | Client errors | Check network, authentication, quotas |
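The "ISR < RF" symptom can be read directly from kafka-topics.sh --describe output by comparing the size of each partition's Isr list with its Replicas list. The sketch below does that comparison; isr_shortfall is a hypothetical helper name, and it assumes the usual "Replicas: ... Isr: ..." line layout. The built-in --under-replicated-partitions filter reports the same partitions; this version adds the counts, which can help when judging how degraded a partition is.

```shell
# isr_shortfall
# Reads `kafka-topics.sh --describe` output on stdin and prints
# "topic-partition isr=N/M" for every partition with fewer in-sync replicas (N)
# than assigned replicas (M). Hypothetical helper for illustration.
isr_shortfall() {
  awk '
    /Replicas:/ {
      topic = part = replicas = isr = ""
      for (i = 1; i <= NF; i++) {
        # A label followed by another label (or nothing) has an empty value
        val = (i < NF && $(i + 1) !~ /:$/) ? $(i + 1) : ""
        if ($i == "Topic:") topic = val
        else if ($i == "Partition:") part = val
        else if ($i == "Replicas:") replicas = val
        else if ($i == "Isr:") isr = val
      }
      nr = split(replicas, r, ",")
      ni = split(isr, s, ",")
      if (ni < nr) print topic "-" part " isr=" ni "/" nr
    }'
}

# Usage:
#   kafka-topics.sh --bootstrap-server kafka:9092 --describe | isr_shortfall
```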
Diagnostic Commands
# Check cluster health
kafka-topics.sh --bootstrap-server kafka:9092 \
--describe --under-replicated-partitions
kafka-topics.sh --bootstrap-server kafka:9092 \
--describe --unavailable-partitions
# Check consumer groups
kafka-consumer-groups.sh --bootstrap-server kafka:9092 \
--describe --group my-group
# Inspect log segments
kafka-dump-log.sh --files /var/kafka-logs/orders-0/00000000000000000000.log \
--print-data-log
# Check controller quorum (KRaft): reports the active controller (LeaderId) and current voters
kafka-metadata-quorum.sh --bootstrap-server kafka:9092 \
describe --status
→ Troubleshooting Guide