# Kafka Architecture
Apache Kafka is a distributed commit log designed for high-throughput, fault-tolerant, real-time data streaming.
## Architectural Overview
Kafka's architecture consists of brokers forming a cluster, with data organized into topics and partitions. Producers write to topic partitions, and consumers read from them.
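The sketch below illustrates this flow with the Java client: a producer appends a record to a partition of a topic, and a consumer in a group reads it back. The broker address `localhost:9092`, the `orders` topic, and the group name are placeholders for illustration, not part of any real deployment.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class QuickstartClients {
    public static void main(String[] args) {
        // Producer: writes a record to one partition of the "orders" topic.
        Properties p = new Properties();
        p.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        p.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        p.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(p)) {
            // Records with the same key land on the same partition, preserving per-key order.
            producer.send(new ProducerRecord<>("orders", "order-42", "created"));
        }

        // Consumer: joins a consumer group and reads records back from the topic.
        Properties c = new Properties();
        c.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        c.put(ConsumerConfig.GROUP_ID_CONFIG, "order-processors");
        c.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        c.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        c.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(c)) {
            consumer.subscribe(List.of("orders"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            for (ConsumerRecord<String, String> r : records) {
                System.out.printf("partition=%d offset=%d key=%s value=%s%n",
                        r.partition(), r.offset(), r.key(), r.value());
            }
        }
    }
}
```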
### Core Components
| Component | Description |
|---|---|
| Broker | Server that stores data and serves client requests |
| Controller | Broker responsible for cluster coordination (leader election, partition assignment) |
| Topic | Named category of records; logical grouping of related events |
| Partition | Ordered, immutable sequence of records within a topic |
| Replica | Copy of a partition for fault tolerance |
| Producer | Client that publishes records to topics |
| Consumer | Client that subscribes to topics and processes records |
| Consumer Group | Set of consumers that coordinate to consume a topic |
## Brokers

Brokers are the servers that form a Kafka cluster; the sketch after this list shows how to enumerate them with the admin API. Each broker:
- Stores partition data on disk
- Handles produce and fetch requests from clients
- Replicates data to other brokers
- Participates in cluster coordination
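Cluster membership is visible through the Java AdminClient. A minimal sketch that lists the brokers currently registered in the cluster, assuming a reachable bootstrap broker at the placeholder address `localhost:9092`:

```java
import java.util.Properties;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.common.Node;

public class ListBrokers {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            // Every broker that is currently part of the cluster.
            for (Node broker : admin.describeCluster().nodes().get()) {
                System.out.printf("broker id=%d host=%s:%d%n",
                        broker.id(), broker.host(), broker.port());
            }
        }
    }
}
```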
### Broker Architecture

A broker's main subsystems map directly onto the configurations below: network threads accept client and replication connections, I/O threads handle requests and disk access, and partition data is persisted under the configured log directories.
### Key Broker Configurations

| Configuration | Default | Description |
|---|---|---|
| `broker.id` | -1 (auto) | Unique identifier for this broker |
| `log.dirs` | /tmp/kafka-logs | Directories for partition data |
| `num.network.threads` | 3 | Threads for network I/O |
| `num.io.threads` | 8 | Threads for disk I/O |
| `socket.send.buffer.bytes` | 102400 | Socket send buffer |
| `socket.receive.buffer.bytes` | 102400 | Socket receive buffer |
| `socket.request.max.bytes` | 104857600 | Maximum request size (100 MB) |
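These settings are defined in each broker's `server.properties`, but the effective values can also be read back over the admin API. A minimal sketch, assuming a broker with id `0` and the placeholder bootstrap address `localhost:9092`:

```java
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.common.config.ConfigResource;

public class ShowBrokerConfigs {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            // Broker configs are addressed by broker id (here "0"; adjust for your cluster).
            ConfigResource broker0 = new ConfigResource(ConfigResource.Type.BROKER, "0");
            Config config = admin.describeConfigs(List.of(broker0)).all().get().get(broker0);
            for (String name : List.of("log.dirs", "num.network.threads", "num.io.threads")) {
                System.out.println(name + " = " + config.get(name).value());
            }
        }
    }
}
```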
## Controller

The controller is a broker elected to manage cluster-wide operations. In ZooKeeper mode, a single broker is elected controller; in KRaft mode, a Raft quorum of controller nodes shares this role, with one controller active at a time.
### Controller Responsibilities
| Responsibility | Description |
|---|---|
| Leader election | Elect new partition leaders when leaders fail |
| Partition assignment | Assign partitions to brokers |
| Broker registration | Track broker membership in cluster |
| Topic management | Create and delete topics |
| Metadata propagation | Distribute cluster metadata to all brokers |
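The admin API also reports which node is currently acting as the active controller, which is useful when diagnosing coordination issues. A minimal sketch, again using a placeholder bootstrap address:

```java
import java.util.Properties;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.common.Node;

public class ShowController {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            // The active controller as reported by the cluster metadata.
            Node controller = admin.describeCluster().controller().get();
            System.out.printf("active controller: id=%d host=%s:%d%n",
                    controller.id(), controller.host(), controller.port());
        }
    }
}
```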
### KRaft vs ZooKeeper
Kafka is transitioning from ZooKeeper to KRaft (Kafka Raft) for cluster coordination.
| Aspect | ZooKeeper Mode | KRaft Mode |
|---|---|---|
| External dependency | ZooKeeper cluster required | None |
| Metadata storage | Split (ZK + broker logs) | Unified (__cluster_metadata topic) |
| Failover time | Seconds to minutes | Milliseconds to seconds |
| Partition limit | ~200K practical limit | Millions of partitions |
| Operational complexity | Two systems to manage | Single system |
| Version | All versions | Kafka 3.3+ (production ready) |
## Partitions and Replication

### Partitions
A topic is divided into partitions—ordered, append-only logs. Partitions enable:
- Parallelism: Multiple consumers can read different partitions concurrently
- Ordering: Records within a partition maintain strict order
- Scalability: Partitions can be distributed across brokers
### Replication

When the replication factor is greater than one, each partition has replicas distributed across brokers. One replica is the leader; the others are followers.
#### Replication Concepts
| Concept | Description |
|---|---|
| Leader | Replica that handles all reads and writes for a partition |
| Follower | Replica that replicates from leader; can become leader if current leader fails |
| ISR (In-Sync Replicas) | Set of replicas that are fully caught up with leader |
| High Watermark | Offset up to which all ISR have replicated; consumers can only read up to HW |
| LEO (Log End Offset) | Latest offset in the leader's log |
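The per-partition leader, replica set, and ISR can all be inspected with the admin API. A minimal sketch, assuming a hypothetical `orders` topic and a placeholder bootstrap address:

```java
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

public class ShowReplicaState {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            TopicDescription topic =
                    admin.describeTopics(List.of("orders")).allTopicNames().get().get("orders");
            // For each partition: which replica leads, the full replica set, and the ISR.
            for (TopicPartitionInfo partition : topic.partitions()) {
                System.out.printf("partition=%d leader=%d replicas=%s isr=%s%n",
                        partition.partition(),
                        partition.leader().id(),
                        partition.replicas(),
                        partition.isr());
            }
        }
    }
}
```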
#### Replication Factor
The replication factor determines how many copies of each partition exist:
| RF | Behavior | Use Case |
|---|---|---|
| 1 | No redundancy; data loss on broker failure | Development only |
| 2 | Survives 1 broker failure | Limited production use |
| 3 | Survives 2 broker failures; quorum possible | Production standard |
| 4+ | Higher durability; rarely needed | Critical data |
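Both the partition count and the replication factor are chosen when a topic is created. A sketch of creating a topic with 6 partitions and replication factor 3 via the admin API; the topic name and broker address are placeholders:

```java
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateReplicatedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            // 6 partitions for consumer parallelism, replication factor 3 for fault tolerance.
            NewTopic orders = new NewTopic("orders", 6, (short) 3);
            admin.createTopics(List.of(orders)).all().get();
        }
    }
}
```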
## Storage Engine
Kafka stores data in log segments on disk. The storage design prioritizes sequential I/O for maximum throughput.
### Log Segment Files

| File | Purpose |
|---|---|
| `.log` | Message data (key, value, headers, metadata) |
| `.index` | Offset-to-position index for efficient seeking |
| `.timeindex` | Timestamp-to-offset index for time-based seeking |
| `.txnindex` | Transaction index (for transactional messages) |
| `.snapshot` | Producer state snapshots |
### Retention Policies

| Policy | Configuration | Behavior |
|---|---|---|
| Time-based | `retention.ms` | Delete segments older than the threshold |
| Size-based | `retention.bytes` | Delete oldest segments when the partition exceeds the size limit |
| Compaction | `cleanup.policy=compact` | Keep only the latest value per key |
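Retention settings are per-topic configurations and can be changed at runtime. The sketch below sets a 7-day time limit and a roughly 1 GiB size limit on a hypothetical `orders` topic via the admin API; the values and names are illustrative only.

```java
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class SetRetention {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "orders");
            Collection<AlterConfigOp> ops = List.of(
                    // Keep segments for 7 days (604,800,000 ms)...
                    new AlterConfigOp(new ConfigEntry("retention.ms", "604800000"),
                            AlterConfigOp.OpType.SET),
                    // ...or until the partition reaches ~1 GiB, whichever limit is hit first.
                    new AlterConfigOp(new ConfigEntry("retention.bytes", "1073741824"),
                            AlterConfigOp.OpType.SET));
            admin.incrementalAlterConfigs(Map.of(topic, ops)).all().get();
        }
    }
}
```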
## Performance Architecture
Kafka achieves high throughput through several design choices:
### Sequential I/O

Writes append to the end of the active log segment, and reads typically scan forward from an offset, so most disk access is sequential rather than random, a pattern both spinning disks and SSDs handle efficiently.
### Zero-Copy Transfers

Kafka uses the sendfile() system call to transfer data from the page cache directly to the network socket, bypassing user-space copies.
**TLS disables zero-copy.** When TLS encryption is enabled, zero-copy is not possible because data must be encrypted in user space. This can reduce throughput by 30-50% depending on CPU and workload.
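The underlying mechanism is exposed in Java as `FileChannel.transferTo`, which maps to `sendfile(2)` on Linux. The sketch below is a generic illustration of the technique, not Kafka's internal code; the file path, host, and port are placeholders, and it assumes something is listening on that port.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ZeroCopyDemo {
    // Send a file over a socket without copying it through user-space buffers.
    // On Linux, FileChannel.transferTo maps to sendfile(2), the same primitive
    // Kafka's plaintext path relies on.
    static void sendFile(Path path, SocketChannel socket) throws IOException {
        try (FileChannel file = FileChannel.open(path, StandardOpenOption.READ)) {
            long position = 0;
            long remaining = file.size();
            while (remaining > 0) {
                long sent = file.transferTo(position, remaining, socket);
                position += sent;
                remaining -= sent;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Placeholder peer and file path, purely for illustration.
        try (SocketChannel socket = SocketChannel.open(new InetSocketAddress("localhost", 9999))) {
            sendFile(Path.of("/tmp/kafka-logs/orders-0/00000000000000000000.log"), socket);
        }
    }
}
```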
### Batching

Producers batch messages before sending, and consumers fetch in batches:

| Batching Point | Configuration | Benefit |
|---|---|---|
| Producer | `batch.size`, `linger.ms` | Amortize network overhead, enable compression |
| Broker | Internal batching | Efficient disk writes |
| Consumer | `fetch.min.bytes`, `fetch.max.wait.ms` | Reduce fetch requests |
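A sketch of how these settings are passed to the Java clients; the specific values are illustrative, not tuning recommendations.

```java
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.producer.ProducerConfig;

public class BatchingConfigs {
    // Producer-side batching: fill batches up to 64 KB or wait up to 10 ms,
    // and compress whole batches with lz4.
    static Properties producerBatching() {
        Properties p = new Properties();
        p.put(ProducerConfig.BATCH_SIZE_CONFIG, 65536);
        p.put(ProducerConfig.LINGER_MS_CONFIG, 10);
        p.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");
        return p;
    }

    // Consumer-side batching: a fetch returns once the broker has at least 64 KB
    // available, or after 500 ms, whichever comes first.
    static Properties consumerBatching() {
        Properties c = new Properties();
        c.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, 65536);
        c.put(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, 500);
        return c;
    }
}
```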
### OS Page Cache
Kafka relies on the OS page cache rather than managing its own cache:
| Benefit | Description |
|---|---|
| Automatic management | OS handles cache eviction |
| Warm restarts | Cache survives broker restarts |
| Memory efficiency | Avoids double-buffering in JVM |
| Read-ahead | OS prefetches sequential reads |
## Fault Tolerance
Kafka survives failures at multiple levels:
| Failure | Kafka Response |
|---|---|
| Single broker | Leader election; ISR continues serving |
| Multiple brokers | Service continues if enough replicas remain |
| Rack failure | Rack-aware placement ensures cross-rack replicas |
| Network partition | `min.insync.replicas` prevents split-brain writes |
| Disk failure | With multiple log directories (JBOD), the broker takes partitions on the failed disk offline and leadership moves to replicas on other brokers |
### Data Durability Configuration

| Setting | Behavior | Effect |
|---|---|---|
| `acks=all` | Producer waits for all in-sync replicas to acknowledge | Strongest durability |
| `min.insync.replicas=2` | Require at least 2 in-sync replicas to accept a write | Prevents single-replica writes |
| `unclean.leader.election.enable=false` | Only ISR members can become leader | Prevents data loss on failover |
## Related Documentation
- Topics and Partitions - Partitions, leaders, ISR, replication
- Transaction Coordinator - Exactly-once semantics, two-phase commit
- Brokers - Broker architecture and configuration
- KRaft - KRaft consensus mode
- Replication - ISR, leader election, durability
- Storage Engine - Log segments, indexes, compaction
- Performance Internals - Zero-copy, batching, tuning
- Fault Tolerance - Failure scenarios and recovery
- Topology - Rack awareness, network design