KRaft: Kafka Raft Consensus¶
KRaft (Kafka Raft) is Kafka's built-in consensus protocol that replaced Apache ZooKeeper for metadata management. Introduced in Kafka 2.8 and production-ready since Kafka 3.3, KRaft simplifies Kafka's architecture by eliminating the external ZooKeeper dependency.
Why KRaft Replaced ZooKeeper¶
Architectural Simplification¶
In ZooKeeper mode, operators ran two distributed systems side by side: the Kafka brokers and a separate ZooKeeper ensemble holding all cluster metadata. KRaft folds metadata management into Kafka itself, with a quorum of controllers maintaining that metadata in a replicated log.
Benefits of KRaft¶
| Aspect | ZooKeeper Mode | KRaft Mode |
|---|---|---|
| Operational complexity | Two systems to manage | Single system |
| Scaling | ZK limits (~200K partitions) | Millions of partitions |
| Recovery time | Minutes (controller failover) | Seconds |
| Metadata propagation | Controller pushes full state via RPCs | Brokers fetch ordered log deltas, faster |
| Security | Separate auth for ZK | Unified Kafka security |
| Monitoring | Two metric systems | Single metric system |
Raft Consensus Protocol¶
KRaft implements the Raft consensus algorithm for leader election and log replication among controllers.
Raft Fundamentals¶
Raft keeps a replicated log consistent by electing a single leader per term; all writes go through the leader and become durable once a majority of the quorum has persisted them.
Leader Election¶
When the active controller fails, the remaining controllers elect a new leader.
Election rules:
- Term — Logical clock incremented on each election
- Vote — Each controller votes once per term
- Log completeness — Only vote for candidates with up-to-date logs
- Majority — A candidate needs ⌊n/2⌋ + 1 votes from the n voters to become leader (see the sketch below)
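These rules compose into a small state machine. Below is a minimal, illustrative Python sketch; the `Controller` and `VoteRequest` types and the `handle_vote_request` name are inventions for this example, not Kafka classes. KRaft compares a log's end offset and epoch the same way textbook Raft compares last index and term.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class VoteRequest:
    term: int             # candidate's term (logical clock)
    candidate_id: int
    last_log_epoch: int   # epoch of candidate's final log entry
    last_log_offset: int  # offset of candidate's final log entry

@dataclass
class Controller:
    current_term: int = 0
    voted_for: Optional[int] = None
    last_log_epoch: int = 0
    last_log_offset: int = 0

    def handle_vote_request(self, req: VoteRequest) -> bool:
        if req.term < self.current_term:
            return False                      # stale term: reject
        if req.term > self.current_term:
            self.current_term = req.term      # newer term: adopt it,
            self.voted_for = None             # no vote cast yet this term
        if self.voted_for not in (None, req.candidate_id):
            return False                      # one vote per term
        # Log completeness: only vote for a candidate whose log is at
        # least as up to date as ours (compare epoch first, then offset).
        if (req.last_log_epoch, req.last_log_offset) < \
           (self.last_log_epoch, self.last_log_offset):
            return False
        self.voted_for = req.candidate_id
        return True

def majority(n: int) -> int:
    # A candidate wins with a majority of the n voters.
    return n // 2 + 1
```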
Log Replication¶
The leader replicates metadata log entries to followers.
Commit rules:
- Entry is committed when replicated to a majority of controllers
- Only entries from the current term can be committed directly
- Committing an entry implicitly commits all prior entries (see the sketch below)
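The commit rules reduce to an offset-counting routine on the leader. A hedged sketch follows; the function name and data layout are invented for illustration, and Kafka's real logic lives in its `raft` module.

```python
def advance_commit(match_offsets: list[int], entry_terms: dict[int, int],
                   leader_term: int, commit_offset: int) -> int:
    """Compute the new commit offset on the leader.

    match_offsets: highest offset known replicated on each controller
                   (the leader counts itself)
    entry_terms:   term in which the entry at each offset was written
    """
    quorum = len(match_offsets) // 2 + 1
    for offset in sorted(set(match_offsets), reverse=True):
        if offset <= commit_offset:
            break  # nothing new to commit
        replicated = sum(1 for m in match_offsets if m >= offset)
        # Only entries from the leader's current term may be committed
        # directly; committing one entry commits everything before it.
        if replicated >= quorum and entry_terms.get(offset) == leader_term:
            return offset
    return commit_offset

# 3 controllers: leader at offset 10, one follower at 10, one behind at 7.
print(advance_commit([10, 10, 7], {10: 4, 7: 4},
                     leader_term=4, commit_offset=7))
# -> 10 (offset 10 is on a majority and was written in the current term)
```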
Quorum and Fault Tolerance¶
| Controllers | Quorum | Tolerated Failures |
|---|---|---|
| 1 | 1 | 0 (no fault tolerance) |
| 3 | 2 | 1 |
| 5 | 3 | 2 |
| 7 | 4 | 3 |
Recommendation: Use 3 controllers for most deployments. Use 5 for large clusters requiring higher availability.
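The table follows directly from the majority rule; a few lines of Python reproduce it (pure arithmetic, no Kafka APIs involved):

```python
def quorum_size(voters: int) -> int:
    return voters // 2 + 1           # majority of the voter set

def tolerated_failures(voters: int) -> int:
    return voters - quorum_size(voters)

for n in (1, 3, 5, 7):
    print(f"{n} controllers: quorum={quorum_size(n)}, "
          f"tolerates {tolerated_failures(n)} failure(s)")
# 1 controllers: quorum=1, tolerates 0 failure(s)
# 3 controllers: quorum=2, tolerates 1 failure(s)
# 5 controllers: quorum=3, tolerates 2 failure(s)
# 7 controllers: quorum=4, tolerates 3 failure(s)
```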
Metadata Log¶
All cluster metadata is stored in a replicated log, not in an external system.
Log Structure¶
The metadata log is an ordered sequence of typed records stored in the single-partition internal topic __cluster_metadata; controllers and brokers replay it to build their view of the cluster.
Record Types¶
| Category | Record Types |
|---|---|
| Cluster | FeatureLevelRecord, ZkMigrationStateRecord |
| Brokers | RegisterBrokerRecord, UnregisterBrokerRecord, BrokerRegistrationChangeRecord |
| Topics | TopicRecord, RemoveTopicRecord |
| Partitions | PartitionRecord, PartitionChangeRecord, RemovePartitionRecord |
| Configuration | ConfigRecord, RemoveConfigRecord |
| Security | ClientQuotaRecord, UserScramCredentialRecord, AccessControlEntryRecord |
| Producers | ProducerIdsRecord |
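Controllers and brokers materialize cluster state by replaying these records in log order. The toy Python replay loop below uses plain dicts where Kafka uses immutable `MetadataImage`/`MetadataDelta` classes; the record field names are simplified assumptions, only the record type names come from the table above.

```python
def apply_record(image: dict, record: dict) -> None:
    """Apply one metadata record to an in-memory cluster image."""
    rtype, data = record["type"], record["data"]
    if rtype == "RegisterBrokerRecord":
        image["brokers"][data["brokerId"]] = data
    elif rtype == "UnregisterBrokerRecord":
        image["brokers"].pop(data["brokerId"], None)
    elif rtype == "TopicRecord":
        image["topics"][data["topicId"]] = {"name": data["name"],
                                            "partitions": {}}
    elif rtype == "RemoveTopicRecord":
        image["topics"].pop(data["topicId"], None)
    elif rtype == "PartitionRecord":
        topic = image["topics"][data["topicId"]]
        topic["partitions"][data["partitionId"]] = data
    elif rtype == "ConfigRecord":
        key = (data["resourceType"], data["resourceName"], data["name"])
        image["configs"][key] = data["value"]
    # ...the remaining record types follow the same pattern

def replay(records: list[dict]) -> dict:
    image = {"brokers": {}, "topics": {}, "configs": {}}
    for rec in records:          # strictly in log order
        apply_record(image, rec)
    return image
```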
Snapshots¶
To prevent unbounded log growth, controllers periodically create snapshots of the metadata state.
Snapshot configuration:
```properties
# Bytes of new metadata records appended since the last snapshot
# before another snapshot is generated (default: 20 MiB)
metadata.log.max.record.bytes.between.snapshots=20971520
# Maximum time between snapshots (default: 1 hour)
metadata.log.max.snapshot.interval.ms=3600000
```
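The two settings combine into an either/or trigger. A sketch of the decision, assuming the straightforward semantics (snapshot when enough new bytes have accumulated or enough time has passed):

```python
import time

MAX_BYTES_BETWEEN_SNAPSHOTS = 20 * 1024 * 1024  # 20 MiB, as configured above
MAX_SNAPSHOT_INTERVAL_S = 3600                   # 1 hour, as configured above

def should_snapshot(bytes_since_last: int, last_snapshot_at: float) -> bool:
    """True when either threshold has been crossed."""
    return (bytes_since_last >= MAX_BYTES_BETWEEN_SNAPSHOTS
            or time.time() - last_snapshot_at >= MAX_SNAPSHOT_INTERVAL_S)
```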
Storage Location¶
```
# Metadata log directory structure
/var/kafka-logs/__cluster_metadata-0/
├── 00000000000000000000.log                    # Log segment
├── 00000000000000000000.index                  # Offset index
├── 00000000000000000000.timeindex              # Time index
├── 00000000000000001000.log                    # Next segment
├── 00000000000000001000-0000000042.checkpoint  # Snapshot, named <endOffset>-<epoch>.checkpoint
└── leader-epoch-checkpoint
```
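Given that naming scheme, finding the newest snapshot is a filename sort. A small helper (a hypothetical utility for illustration, not a tool shipped with Kafka):

```python
import re
from pathlib import Path
from typing import Optional

SNAPSHOT_RE = re.compile(r"^(\d+)-(\d+)\.checkpoint$")

def latest_snapshot(log_dir: str) -> Optional[tuple[int, int]]:
    """Return (end_offset, epoch) of the newest snapshot, if any."""
    best = None
    for entry in Path(log_dir).iterdir():
        m = SNAPSHOT_RE.match(entry.name)
        if m:
            key = (int(m.group(1)), int(m.group(2)))
            if best is None or key > best:
                best = key
    return best

print(latest_snapshot("/var/kafka-logs/__cluster_metadata-0"))
```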
Controller Communication¶
Controller-to-Controller (Raft)¶
Controllers communicate using the Raft protocol over a dedicated listener:
```properties
# Controller listener configuration
controller.listener.names=CONTROLLER
listeners=CONTROLLER://0.0.0.0:9093
# Inter-controller security
listener.security.protocol.map=CONTROLLER:SSL
```
Controller-to-Broker (Fetch)¶
Unlike ZooKeeper mode, where the controller pushed full metadata state to brokers over RPCs, KRaft brokers fetch the metadata log from the active controller and apply the deltas in log order.
Key difference from ZooKeeper:
| Aspect | ZooKeeper Mode | KRaft Mode |
|---|---|---|
| Propagation | Controller-pushed RPCs, async | Broker-fetched log deltas, controlled |
| Consistency | Eventual | Offset-based, ordered |
| Latency | Variable | Predictable |
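The offset-based ordering falls out of a simple fetch loop on the broker side. An illustrative sketch: `controller.fetch` stands in for the fetch request a broker sends to the active controller, and `apply_record` is the replay helper sketched in the Record Types section; neither name is Kafka's own.

```python
def metadata_fetch_loop(controller, apply_record, image: dict,
                        applied_offset: int) -> None:
    """Toy version of a KRaft broker's metadata fetch loop.

    controller.fetch(offset) is assumed to long-poll the active
    controller and return (offset, record) pairs strictly after
    `applied_offset`, in log order.
    """
    while True:
        for offset, record in controller.fetch(applied_offset):
            apply_record(image, record)  # apply deltas in order
            applied_offset = offset      # strictly increasing
        # applied_offset is what broker metadata-offset metrics report;
        # lag = controller's committed offset - applied_offset
```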
Quorum Configuration¶
Controller Quorum Voters¶
```properties
# Define the controller quorum
controller.quorum.voters=1@controller1:9093,2@controller2:9093,3@controller3:9093
# Format: node.id@host:port
# All controllers must have the same voter list
```
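The `node.id@host:port` format is simple enough to validate by hand. A helper that mirrors the parsing (illustrative only; Kafka performs its own validation at startup):

```python
def parse_quorum_voters(value: str) -> dict[int, tuple[str, int]]:
    """Parse 'id@host:port,...' into {node_id: (host, port)}."""
    voters = {}
    for entry in value.split(","):
        node_id, endpoint = entry.split("@", 1)
        host, port = endpoint.rsplit(":", 1)  # rsplit tolerates ':' in hosts
        voters[int(node_id)] = (host, int(port))
    return voters

print(parse_quorum_voters(
    "1@controller1:9093,2@controller2:9093,3@controller3:9093"))
# {1: ('controller1', 9093), 2: ('controller2', 9093), 3: ('controller3', 9093)}
```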
Deployment Modes¶
Combined Mode (Development/Small Clusters)¶
Controllers and brokers run in the same JVM:
```properties
# server.properties for combined mode
process.roles=broker,controller
node.id=1
controller.quorum.voters=1@node1:9093,2@node2:9093,3@node3:9093
```
Isolated Mode (Production/Large Clusters)¶
Dedicated controller nodes separate from brokers:
```properties
# controller.properties (controller-only nodes)
process.roles=controller
node.id=1
controller.quorum.voters=1@ctrl1:9093,2@ctrl2:9093,3@ctrl3:9093
```

```properties
# broker.properties (broker-only nodes)
process.roles=broker
node.id=101
controller.quorum.voters=1@ctrl1:9093,2@ctrl2:9093,3@ctrl3:9093
```
Controller Sizing¶
| Cluster Size | Controller Count | Controller Resources |
|---|---|---|
| Development | 1 (no HA) | 1 CPU, 1GB RAM |
| Small (< 10 brokers) | 3 (combined mode) | 2 CPU, 4GB RAM |
| Medium (10-50 brokers) | 3 (isolated) | 4 CPU, 8GB RAM |
| Large (50+ brokers) | 5 (isolated) | 8 CPU, 16GB RAM |
Failover Behavior¶
Controller Leader Failure¶
Failover characteristics:
| Metric | Typical Value |
|---|---|
| Detection time | 1-2 heartbeat intervals (2-4 seconds) |
| Election time | 50-200 milliseconds |
| Total failover | 2-5 seconds |
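Those numbers derive from the quorum timeouts. A back-of-the-envelope calculation using the relevant settings follows; `controller.quorum.fetch.timeout.ms` and `controller.quorum.election.backoff.max.ms` are real Kafka configurations (defaults shown), while the vote round-trip figure is an assumption.

```python
# Defaults for the quorum timeouts (milliseconds)
FETCH_TIMEOUT_MS = 2000           # controller.quorum.fetch.timeout.ms
ELECTION_BACKOFF_MAX_MS = 1000    # controller.quorum.election.backoff.max.ms
VOTE_ROUND_TRIP_MS = 100          # assumed network round trip for votes

def rough_failover_ms() -> int:
    # Detection (no successful fetch from the leader) + worst-case
    # election backoff + one round of vote requests.
    return FETCH_TIMEOUT_MS + ELECTION_BACKOFF_MAX_MS + VOTE_ROUND_TRIP_MS

print(rough_failover_ms())  # 3100 ms, within the 2-5 s range quoted above
```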
Broker Behavior During Failover¶
Key point: Broker data operations (produce/consume) continue during controller failover. Only metadata operations (topic creation, leader election) are temporarily blocked.
Split-Brain Prevention¶
Raft's quorum requirement prevents split-brain: any two majorities of the same voter set must overlap, so at most one leader can be elected in a given term, and a network partition leaves at most one side with a working quorum. The small demonstration below makes this concrete.
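```python
def can_elect(partition_nodes: int, total_voters: int) -> bool:
    """A partition can elect a leader only if it holds a majority."""
    return partition_nodes >= total_voters // 2 + 1

# A 5-controller quorum split 3/2 by a network partition:
for side in (3, 2):
    print(f"partition of {side}: can elect leader = {can_elect(side, 5)}")
# partition of 3: can elect leader = True
# partition of 2: can elect leader = False
# Two disjoint majorities of the same voter set cannot exist, so two
# leaders in the same term are impossible.
```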
Migration from ZooKeeper¶
Migration Modes¶
Migration (KIP-866) proceeds through phases: the controller quorum first copies the existing metadata out of ZooKeeper, then runs in a dual-write mode in which metadata changes are written to both the KRaft log and ZooKeeper, and finally drops ZooKeeper when the migration is finalized.
Migration Steps¶
- Deploy controller quorum. The controllers must be configured with zookeeper.metadata.migration.enable=true and zookeeper.connect so they can copy metadata out of ZooKeeper, and their storage must be formatted with the existing cluster ID:
```bash
# Format controller storage with the EXISTING cluster ID (read it from
# a broker's meta.properties; do not generate a new random UUID when
# migrating, or the controllers will not match the ZooKeeper cluster)
kafka-storage.sh format -t <existing-cluster-id> -c controller.properties
```
- Enable migration mode on the brokers:
```properties
# In broker server.properties (brokers still in ZooKeeper mode)
zookeeper.metadata.migration.enable=true
controller.quorum.voters=1@ctrl1:9093,2@ctrl2:9093,3@ctrl3:9093
controller.listener.names=CONTROLLER
```
- Start controllers and migrate: the controllers copy the existing metadata out of ZooKeeper. Monitor progress through the controller logs or the kafka.controller:type=KafkaController,name=ZkMigrationState JMX metric.
- Restart brokers in KRaft mode:
```properties
# Enable KRaft and drop the ZooKeeper client configuration
process.roles=broker
controller.quorum.voters=1@ctrl1:9093,2@ctrl2:9093,3@ctrl3:9093
controller.listener.names=CONTROLLER
# Remove: zookeeper.connect=...
# Remove: zookeeper.metadata.migration.enable=true
```
- Finalize migration: once every broker runs in KRaft mode, remove zookeeper.metadata.migration.enable and zookeeper.connect from the controller configurations and restart the controllers. After that the migration is final and ZooKeeper can be decommissioned.
Rollback Capability¶
During migration, rollback is possible until finalization:
| Phase | Rollback Possible | Data Safe |
|---|---|---|
| Controllers deployed | Yes | Yes |
| Dual-write mode | Yes | Yes |
| Brokers migrated | Yes (restart with ZK) | Yes |
| Migration finalized | No | N/A |
Troubleshooting¶
Common Issues¶
| Issue | Symptom | Resolution |
|---|---|---|
| No leader elected | LEADER_NOT_AVAILABLE errors | Check controller connectivity, verify quorum voters |
| Metadata out of sync | Brokers have stale topic info | Check broker fetch lag from controllers |
| Controller OOM | Controller crashes | Increase heap, check for partition explosion |
| Slow elections | Long failover time | Check network latency between controllers |
Diagnostic Commands¶
```bash
# Check controller quorum status (shows the current leader, epoch,
# and high watermark)
kafka-metadata-quorum.sh --bootstrap-server broker1:9092 describe --status

# Check metadata log lag of each voter and observer
kafka-metadata-quorum.sh --bootstrap-server broker1:9092 describe --replication

# Decode the metadata log directly
kafka-dump-log.sh --cluster-metadata-decoder \
  --files /var/kafka-logs/__cluster_metadata-0/00000000000000000000.log

# Verify broker registration
kafka-broker-api-versions.sh --bootstrap-server broker1:9092
```
Key Metrics¶
| Metric | Description | Alert Condition |
|---|---|---|
| kafka.controller:type=KafkaController,name=ActiveControllerCount | Active controllers | ≠ 1 |
| kafka.controller:type=ControllerEventManager,name=EventQueueSize | Pending controller events | > 1000 |
| kafka.server:type=MetadataLoader,name=CurrentMetadataOffset | Broker metadata offset | Lag > 1000 |
| kafka.raft:type=RaftMetrics,name=CommitLatencyAvg | Raft commit latency | > 100 ms |
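The alert conditions are trivial to encode in a monitoring check. A sketch follows; metric collection itself (JMX, a Prometheus exporter, etc.) is out of scope here, and the short keys are invented aliases for the MBeans above.

```python
# Alert thresholds mirroring the table above; keys are shorthand
# aliases for the full MBean names.
ALERT_RULES = {
    "ActiveControllerCount": lambda v: v != 1,
    "EventQueueSize":        lambda v: v > 1000,
    "MetadataOffsetLag":     lambda v: v > 1000,
    "CommitLatencyAvgMs":    lambda v: v > 100,
}

def evaluate(samples: dict[str, float]) -> list[str]:
    """Return the names of all alert rules that fired."""
    return [name for name, rule in ALERT_RULES.items()
            if name in samples and rule(samples[name])]

print(evaluate({"ActiveControllerCount": 2, "CommitLatencyAvgMs": 12}))
# ['ActiveControllerCount']
```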
Summary¶
| Aspect | Detail |
|---|---|
| What | Built-in consensus replacing ZooKeeper |
| Protocol | Raft (leader election, log replication) |
| Storage | __cluster_metadata topic |
| Quorum | 3 or 5 controllers recommended |
| Failover | 2-5 seconds typical |
| Deployment | Combined (small) or isolated (large) mode |
Related Documentation¶
- Cluster Management - Controller operations
- Fault Tolerance - Failure handling
- Brokers - Broker internals
- Architecture Overview - System architecture