KRaft: Kafka Raft Consensus¶
KRaft (Kafka Raft) is Kafka's built-in consensus protocol that replaced Apache ZooKeeper for metadata management. Introduced in Kafka 2.8 and production-ready since Kafka 3.3, KRaft simplifies Kafka's architecture by eliminating the external ZooKeeper dependency.
Why KRaft Replaced ZooKeeper¶
Architectural Simplification¶
In ZooKeeper mode, operators ran two distributed systems side by side: the Kafka brokers and a separate ZooKeeper ensemble holding all cluster metadata. KRaft folds metadata management into Kafka itself, with a quorum of controllers maintaining that metadata in a replicated log.
Benefits of KRaft¶
| Aspect | ZooKeeper Mode | KRaft Mode |
|---|---|---|
| Operational complexity | Two systems to manage | Single system |
| Scaling | ZK limits (~200K partitions) | Millions of partitions |
| Recovery time | Minutes (controller failover) | Seconds |
| Metadata propagation | Controller pushes full state via RPCs | Brokers fetch ordered log deltas, faster |
| Security | Separate auth for ZK | Unified Kafka security |
| Monitoring | Two metric systems | Single metric system |
Raft Consensus Protocol¶
KRaft implements the Raft consensus algorithm for leader election and log replication among controllers.
Raft Fundamentals¶
Raft keeps a replicated log consistent by electing a single leader per term; all writes go through the leader and become durable once a majority of the quorum has persisted them.
Leader Election¶
When the active controller fails, the remaining controllers elect a new leader.
Election rules:
- Term — Logical clock incremented on each election
- Vote — Each controller votes once per term
- Log completeness — Only vote for candidates with up-to-date logs
- Majority — A candidate needs ⌊n/2⌋ + 1 votes from the n voters to become leader (see the sketch below)
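These rules compose into a small state machine. Below is a minimal, illustrative Python sketch; the `Controller` and `VoteRequest` types and the `handle_vote_request` name are inventions for this example, not Kafka classes. KRaft compares a log's end offset and epoch the same way textbook Raft compares last index and term.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class VoteRequest:
    term: int             # candidate's term (logical clock)
    candidate_id: int
    last_log_epoch: int   # epoch of candidate's final log entry
    last_log_offset: int  # offset of candidate's final log entry

@dataclass
class Controller:
    current_term: int = 0
    voted_for: Optional[int] = None
    last_log_epoch: int = 0
    last_log_offset: int = 0

    def handle_vote_request(self, req: VoteRequest) -> bool:
        if req.term < self.current_term:
            return False                      # stale term: reject
        if req.term > self.current_term:
            self.current_term = req.term      # newer term: adopt it,
            self.voted_for = None             # no vote cast yet this term
        if self.voted_for not in (None, req.candidate_id):
            return False                      # one vote per term
        # Log completeness: only vote for a candidate whose log is at
        # least as up to date as ours (compare epoch first, then offset).
        if (req.last_log_epoch, req.last_log_offset) < \
           (self.last_log_epoch, self.last_log_offset):
            return False
        self.voted_for = req.candidate_id
        return True

def majority(n: int) -> int:
    # A candidate wins with a majority of the n voters.
    return n // 2 + 1
```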
Log Replication¶
The leader replicates metadata log entries to followers.
Commit rules:
- Entry is committed when replicated to a majority of controllers
- Only entries from the current term can be committed directly
- Committing an entry implicitly commits all prior entries (see the sketch below)
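The commit rules reduce to an offset-counting routine on the leader. A hedged sketch follows; the function name and data layout are invented for illustration, and Kafka's real logic lives in its `raft` module.

```python
def advance_commit(match_offsets: list[int], entry_terms: dict[int, int],
                   leader_term: int, commit_offset: int) -> int:
    """Compute the new commit offset on the leader.

    match_offsets: highest offset known replicated on each controller
                   (the leader counts itself)
    entry_terms:   term in which the entry at each offset was written
    """
    quorum = len(match_offsets) // 2 + 1
    for offset in sorted(set(match_offsets), reverse=True):
        if offset <= commit_offset:
            break  # nothing new to commit
        replicated = sum(1 for m in match_offsets if m >= offset)
        # Only entries from the leader's current term may be committed
        # directly; committing one entry commits everything before it.
        if replicated >= quorum and entry_terms.get(offset) == leader_term:
            return offset
    return commit_offset

# 3 controllers: leader at offset 10, one follower at 10, one behind at 7.
print(advance_commit([10, 10, 7], {10: 4, 7: 4},
                     leader_term=4, commit_offset=7))
# -> 10 (offset 10 is on a majority and was written in the current term)
```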
Quorum and Fault Tolerance¶
| Controllers | Quorum | Tolerated Failures |
|---|---|---|
| 1 | 1 | 0 (no fault tolerance) |
| 3 | 2 | 1 |
| 5 | 3 | 2 |
| 7 | 4 | 3 |
Recommendation: Use 3 controllers for most deployments. Use 5 for large clusters requiring higher availability.
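The table follows directly from the majority rule; a few lines of Python reproduce it (pure arithmetic, no Kafka APIs involved):

```python
def quorum_size(voters: int) -> int:
    return voters // 2 + 1           # majority of the voter set

def tolerated_failures(voters: int) -> int:
    return voters - quorum_size(voters)

for n in (1, 3, 5, 7):
    print(f"{n} controllers: quorum={quorum_size(n)}, "
          f"tolerates {tolerated_failures(n)} failure(s)")
# 1 controllers: quorum=1, tolerates 0 failure(s)
# 3 controllers: quorum=2, tolerates 1 failure(s)
# 5 controllers: quorum=3, tolerates 2 failure(s)
# 7 controllers: quorum=4, tolerates 3 failure(s)
```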
Metadata Log¶
All cluster metadata is stored in a replicated log, not in an external system.
Log Structure¶
The metadata log is an ordered sequence of typed records stored in the single-partition internal topic __cluster_metadata; controllers and brokers replay it to build their view of the cluster.
Record Types¶
| Category | Record Types |
|---|---|
| Cluster | FeatureLevelRecord, ZkMigrationStateRecord |
| Brokers | RegisterBrokerRecord, UnregisterBrokerRecord, BrokerRegistrationChangeRecord |
| Topics | TopicRecord, RemoveTopicRecord |
| Partitions | PartitionRecord, PartitionChangeRecord, RemovePartitionRecord |
| Configuration | ConfigRecord, RemoveConfigRecord |
| Security | ClientQuotaRecord, UserScramCredentialRecord, AccessControlEntryRecord |
| Producers | ProducerIdsRecord |
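Controllers and brokers materialize cluster state by replaying these records in log order. The toy Python replay loop below uses plain dicts where Kafka uses immutable `MetadataImage`/`MetadataDelta` classes; the record field names are simplified assumptions, only the record type names come from the table above.

```python
def apply_record(image: dict, record: dict) -> None:
    """Apply one metadata record to an in-memory cluster image."""
    rtype, data = record["type"], record["data"]
    if rtype == "RegisterBrokerRecord":
        image["brokers"][data["brokerId"]] = data
    elif rtype == "UnregisterBrokerRecord":
        image["brokers"].pop(data["brokerId"], None)
    elif rtype == "TopicRecord":
        image["topics"][data["topicId"]] = {"name": data["name"],
                                            "partitions": {}}
    elif rtype == "RemoveTopicRecord":
        image["topics"].pop(data["topicId"], None)
    elif rtype == "PartitionRecord":
        topic = image["topics"][data["topicId"]]
        topic["partitions"][data["partitionId"]] = data
    elif rtype == "ConfigRecord":
        key = (data["resourceType"], data["resourceName"], data["name"])
        image["configs"][key] = data["value"]
    # ...the remaining record types follow the same pattern

def replay(records: list[dict]) -> dict:
    image = {"brokers": {}, "topics": {}, "configs": {}}
    for rec in records:          # strictly in log order
        apply_record(image, rec)
    return image
```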
Snapshots¶
To prevent unbounded log growth, controllers periodically create snapshots of the metadata state.
Snapshot configuration:
```properties
# Bytes of new metadata records appended since the last snapshot
# before another snapshot is generated (default: 20 MiB)
metadata.log.max.record.bytes.between.snapshots=20971520
# Maximum time between snapshots (default: 1 hour)
metadata.log.max.snapshot.interval.ms=3600000
```
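The two settings combine into an either/or trigger. A sketch of the decision, assuming the straightforward semantics (snapshot when enough new bytes have accumulated or enough time has passed):

```python
import time

MAX_BYTES_BETWEEN_SNAPSHOTS = 20 * 1024 * 1024  # 20 MiB, as configured above
MAX_SNAPSHOT_INTERVAL_S = 3600                   # 1 hour, as configured above

def should_snapshot(bytes_since_last: int, last_snapshot_at: float) -> bool:
    """True when either threshold has been crossed."""
    return (bytes_since_last >= MAX_BYTES_BETWEEN_SNAPSHOTS
            or time.time() - last_snapshot_at >= MAX_SNAPSHOT_INTERVAL_S)
```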
Storage Location¶
```
# Metadata log directory structure
/var/kafka-logs/__cluster_metadata-0/
├── 00000000000000000000.log                    # Log segment
├── 00000000000000000000.index                  # Offset index
├── 00000000000000000000.timeindex              # Time index
├── 00000000000000001000.log                    # Next segment
├── 00000000000000001000-0000000042.checkpoint  # Snapshot, named <endOffset>-<epoch>.checkpoint
└── leader-epoch-checkpoint
```
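Given that naming scheme, finding the newest snapshot is a filename sort. A small helper (a hypothetical utility for illustration, not a tool shipped with Kafka):

```python
import re
from pathlib import Path
from typing import Optional

SNAPSHOT_RE = re.compile(r"^(\d+)-(\d+)\.checkpoint$")

def latest_snapshot(log_dir: str) -> Optional[tuple[int, int]]:
    """Return (end_offset, epoch) of the newest snapshot, if any."""
    best = None
    for entry in Path(log_dir).iterdir():
        m = SNAPSHOT_RE.match(entry.name)
        if m:
            key = (int(m.group(1)), int(m.group(2)))
            if best is None or key > best:
                best = key
    return best

print(latest_snapshot("/var/kafka-logs/__cluster_metadata-0"))
```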
Controller Communication¶
Controller-to-Controller (Raft)¶
Controllers communicate using the Raft protocol over a dedicated listener:
```properties
# Controller listener configuration
controller.listener.names=CONTROLLER
listeners=CONTROLLER://0.0.0.0:9093
# Inter-controller security
listener.security.protocol.map=CONTROLLER:SSL
```
Controller-to-Broker (Fetch)¶
Unlike ZooKeeper mode, where the controller pushed full metadata state to brokers over RPCs, KRaft brokers fetch the metadata log from the active controller and apply the deltas in log order.
Key difference from ZooKeeper:
| Aspect | ZooKeeper Mode | KRaft Mode |
|---|---|---|
| Propagation | Controller-pushed RPCs, async | Broker-fetched log deltas, controlled |
| Consistency | Eventual | Offset-based, ordered |
| Latency | Variable | Predictable |
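The offset-based ordering falls out of a simple fetch loop on the broker side. An illustrative sketch: `controller.fetch` stands in for the fetch request a broker sends to the active controller, and `apply_record` is the replay helper sketched in the Record Types section; neither name is Kafka's own.

```python
def metadata_fetch_loop(controller, apply_record, image: dict,
                        applied_offset: int) -> None:
    """Toy version of a KRaft broker's metadata fetch loop.

    controller.fetch(offset) is assumed to long-poll the active
    controller and return (offset, record) pairs strictly after
    `applied_offset`, in log order.
    """
    while True:
        for offset, record in controller.fetch(applied_offset):
            apply_record(image, record)  # apply deltas in order
            applied_offset = offset      # strictly increasing
        # applied_offset is what broker metadata-offset metrics report;
        # lag = controller's committed offset - applied_offset
```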
Quorum Configuration¶
Controller Quorum Voters¶
```properties
# Define the controller quorum
controller.quorum.voters=1@controller1:9093,2@controller2:9093,3@controller3:9093
# Format: node.id@host:port
# All controllers must have the same voter list
```
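The `node.id@host:port` format is simple enough to validate by hand. A helper that mirrors the parsing (illustrative only; Kafka performs its own validation at startup):

```python
def parse_quorum_voters(value: str) -> dict[int, tuple[str, int]]:
    """Parse 'id@host:port,...' into {node_id: (host, port)}."""
    voters = {}
    for entry in value.split(","):
        node_id, endpoint = entry.split("@", 1)
        host, port = endpoint.rsplit(":", 1)  # rsplit tolerates ':' in hosts
        voters[int(node_id)] = (host, int(port))
    return voters

print(parse_quorum_voters(
    "1@controller1:9093,2@controller2:9093,3@controller3:9093"))
# {1: ('controller1', 9093), 2: ('controller2', 9093), 3: ('controller3', 9093)}
```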
Deployment Modes¶
Combined Mode (Development/Small Clusters)¶
Controllers and brokers run in the same JVM:
```properties
# server.properties for combined mode
process.roles=broker,controller
node.id=1
controller.quorum.voters=1@node1:9093,2@node2:9093,3@node3:9093
```
Isolated Mode (Production/Large Clusters)¶
Dedicated controller nodes separate from brokers:
```properties
# controller.properties (controller-only nodes)
process.roles=controller
node.id=1
controller.quorum.voters=1@ctrl1:9093,2@ctrl2:9093,3@ctrl3:9093
```

```properties
# broker.properties (broker-only nodes)
process.roles=broker
node.id=101
controller.quorum.voters=1@ctrl1:9093,2@ctrl2:9093,3@ctrl3:9093
```
Controller Sizing¶
| Cluster Size | Controller Count | Controller Resources |
|---|---|---|
| Development | 1 (no HA) | 1 CPU, 1GB RAM |
| Small (< 10 brokers) | 3 (combined mode) | 2 CPU, 4GB RAM |
| Medium (10-50 brokers) | 3 (isolated) | 4 CPU, 8GB RAM |
| Large (50+ brokers) | 5 (isolated) | 8 CPU, 16GB RAM |
Failover Behavior¶
Controller Leader Failure¶
Failover characteristics:
| Metric | Typical Value |
|---|---|
| Detection time | 1-2 heartbeat intervals (2-4 seconds) |
| Election time | 50-200 milliseconds |
| Total failover | 2-5 seconds |
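Those numbers derive from the quorum timeouts. A back-of-the-envelope calculation using the relevant settings follows; `controller.quorum.fetch.timeout.ms` and `controller.quorum.election.backoff.max.ms` are real Kafka configurations (defaults shown), while the vote round-trip figure is an assumption.

```python
# Defaults for the quorum timeouts (milliseconds)
FETCH_TIMEOUT_MS = 2000           # controller.quorum.fetch.timeout.ms
ELECTION_BACKOFF_MAX_MS = 1000    # controller.quorum.election.backoff.max.ms
VOTE_ROUND_TRIP_MS = 100          # assumed network round trip for votes

def rough_failover_ms() -> int:
    # Detection (no successful fetch from the leader) + worst-case
    # election backoff + one round of vote requests.
    return FETCH_TIMEOUT_MS + ELECTION_BACKOFF_MAX_MS + VOTE_ROUND_TRIP_MS

print(rough_failover_ms())  # 3100 ms, within the 2-5 s range quoted above
```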
Broker Behavior During Failover¶
Key point: Broker data operations (produce/consume) continue during controller failover. Only metadata operations (topic creation, leader election) are temporarily blocked.
Split-Brain Prevention¶
Raft's quorum requirement prevents split-brain: any two majorities of the same voter set must overlap, so at most one leader can be elected in a given term, and a network partition leaves at most one side with a working quorum. The small demonstration below makes this concrete.
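```python
def can_elect(partition_nodes: int, total_voters: int) -> bool:
    """A partition can elect a leader only if it holds a majority."""
    return partition_nodes >= total_voters // 2 + 1

# A 5-controller quorum split 3/2 by a network partition:
for side in (3, 2):
    print(f"partition of {side}: can elect leader = {can_elect(side, 5)}")
# partition of 3: can elect leader = True
# partition of 2: can elect leader = False
# Two disjoint majorities of the same voter set cannot exist, so two
# leaders in the same term are impossible.
```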
Migration from ZooKeeper¶
Migration Modes¶
Migration (KIP-866) proceeds through phases: the controller quorum first copies the existing metadata out of ZooKeeper, then runs in a dual-write mode in which metadata changes are written to both the KRaft log and ZooKeeper, and finally drops ZooKeeper when the migration is finalized.
Migration Steps¶
- Deploy controller quorum. The controllers must be configured with zookeeper.metadata.migration.enable=true and zookeeper.connect so they can copy metadata out of ZooKeeper, and their storage must be formatted with the existing cluster ID:
```bash
# Format controller storage with the EXISTING cluster ID (read it from
# a broker's meta.properties; do not generate a new random UUID when
# migrating, or the controllers will not match the ZooKeeper cluster)
kafka-storage.sh format -t <existing-cluster-id> -c controller.properties
```
- Enable migration mode on the brokers:
```properties
# In broker server.properties (brokers still in ZooKeeper mode)
zookeeper.metadata.migration.enable=true
controller.quorum.voters=1@ctrl1:9093,2@ctrl2:9093,3@ctrl3:9093
controller.listener.names=CONTROLLER
```
- Start controllers and migrate: the controllers copy the existing metadata out of ZooKeeper. Monitor progress through the controller logs or the kafka.controller:type=KafkaController,name=ZkMigrationState JMX metric.
- Restart brokers in KRaft mode:
```properties
# Enable KRaft and drop the ZooKeeper client configuration
process.roles=broker
controller.quorum.voters=1@ctrl1:9093,2@ctrl2:9093,3@ctrl3:9093
controller.listener.names=CONTROLLER
# Remove: zookeeper.connect=...
# Remove: zookeeper.metadata.migration.enable=true
```
- Finalize migration: once every broker runs in KRaft mode, remove zookeeper.metadata.migration.enable and zookeeper.connect from the controller configurations and restart the controllers. After that the migration is final and ZooKeeper can be decommissioned.
Rollback Capability¶
During migration, rollback is possible until finalization:
| Phase | Rollback Possible | Data Safe |
|---|---|---|
| Controllers deployed | Yes | Yes |
| Dual-write mode | Yes | Yes |
| Brokers migrated | Yes (restart with ZK) | Yes |
| Migration finalized | No | N/A |
Troubleshooting¶
Common Issues¶
| Issue | Symptom | Resolution |
|---|---|---|
| No leader elected | LEADER_NOT_AVAILABLE errors | Check controller connectivity, verify quorum voters |
| Metadata out of sync | Brokers have stale topic info | Check broker fetch lag from controllers |
| Controller OOM | Controller crashes | Increase heap, check for partition explosion |
| Slow elections | Long failover time | Check network latency between controllers |
Diagnostic Commands¶
```bash
# Check controller quorum status (shows the current leader, epoch,
# and high watermark)
kafka-metadata-quorum.sh --bootstrap-server broker1:9092 describe --status

# Check metadata log lag of each voter and observer
kafka-metadata-quorum.sh --bootstrap-server broker1:9092 describe --replication

# Decode the metadata log directly
kafka-dump-log.sh --cluster-metadata-decoder \
  --files /var/kafka-logs/__cluster_metadata-0/00000000000000000000.log

# Verify broker registration
kafka-broker-api-versions.sh --bootstrap-server broker1:9092
```
Key Metrics¶
| Metric | Description | Alert Condition |
|---|---|---|
| kafka.controller:type=KafkaController,name=ActiveControllerCount | Active controllers | ≠ 1 |
| kafka.controller:type=ControllerEventManager,name=EventQueueSize | Pending controller events | > 1000 |
| kafka.server:type=MetadataLoader,name=CurrentMetadataOffset | Broker metadata offset | Lag > 1000 |
| kafka.raft:type=RaftMetrics,name=CommitLatencyAvg | Raft commit latency | > 100 ms |
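The alert conditions are trivial to encode in a monitoring check. A sketch follows; metric collection itself (JMX, a Prometheus exporter, etc.) is out of scope here, and the short keys are invented aliases for the MBeans above.

```python
# Alert thresholds mirroring the table above; keys are shorthand
# aliases for the full MBean names.
ALERT_RULES = {
    "ActiveControllerCount": lambda v: v != 1,
    "EventQueueSize":        lambda v: v > 1000,
    "MetadataOffsetLag":     lambda v: v > 1000,
    "CommitLatencyAvgMs":    lambda v: v > 100,
}

def evaluate(samples: dict[str, float]) -> list[str]:
    """Return the names of all alert rules that fired."""
    return [name for name, rule in ALERT_RULES.items()
            if name in samples and rule(samples[name])]

print(evaluate({"ActiveControllerCount": 2, "CommitLatencyAvgMs": 12}))
# ['ActiveControllerCount']
```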
Summary¶
| Aspect | Detail |
|---|---|
| What | Built-in consensus replacing ZooKeeper |
| Protocol | Raft (leader election, log replication) |
| Storage | __cluster_metadata topic |
| Quorum | 3 or 5 controllers recommended |
| Failover | 2-5 seconds typical |
| Deployment | Combined (small) or isolated (large) mode |
Related Documentation¶
- Cluster Management - Controller operations
- Fault Tolerance - Failure handling
- Brokers - Broker internals
- Architecture Overview - System architecture