Topics and Partitions¶
Topics and partitions form the foundation of Kafka's distributed architecture. A topic is a logical channel for records; partitions provide horizontal scalability and fault tolerance through distributed, replicated logs.
Topic Architecture¶
A topic is a named, append-only log that organizes related records. Unlike traditional message queues where messages are deleted after consumption, Kafka topics retain records based on configurable retention policies.
Topic Properties¶
| Property | Description |
|---|---|
| Name | Unique identifier within the cluster; immutable after creation |
| Partition count | Number of partitions; can be increased but not decreased |
| Replication factor | Number of replicas per partition; set at creation |
| Retention | How long records are kept (time or size based) |
| Cleanup policy | Delete old segments or compact by key |
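As an illustration of these properties, the sketch below creates a topic with the Java `AdminClient`, setting partition count, replication factor, and retention at creation time. The topic name, counts, and bootstrap address are hypothetical; this is a minimal sketch, not a production setup.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.Map;
import java.util.Properties;
import java.util.Set;

public class CreateTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // hypothetical broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // 12 partitions, replication factor 3 -- both chosen purely for illustration
            NewTopic topic = new NewTopic("orders", 12, (short) 3)
                    .configs(Map.of(
                            "retention.ms", "604800000",   // 7 days, time-based retention
                            "cleanup.policy", "delete"));  // delete old segments rather than compact

            admin.createTopics(Set.of(topic)).all().get(); // block until the controller confirms
        }
    }
}
```

Note that the partition count and replication factor must be chosen carefully here: the replication factor is fixed at creation, and the partition count can only grow later.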
Topic Naming Constraints¶
| Constraint | Rule |
|---|---|
| Length | 1-249 characters |
| Characters | [a-zA-Z0-9._-] only |
| Reserved | Names starting with __ are internal topics |
| Collision | `.` and `_` collide in metric names; avoid creating topics whose names differ only by these characters |
Internal Topics
Topics prefixed with __ are managed by Kafka internally and must not be modified directly:
- `__consumer_offsets` - Consumer group offset storage
- `__transaction_state` - Transaction coordinator state
- `__cluster_metadata` - KRaft metadata (KRaft mode only)
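A minimal sketch of client-side validation against the naming rules above; this is not Kafka's own validator, just an illustration of the constraints in the table.

```java
import java.util.regex.Pattern;

public class TopicNameCheck {
    // Rules from the constraints table above: 1-249 characters, [a-zA-Z0-9._-] only
    private static final Pattern LEGAL = Pattern.compile("[a-zA-Z0-9._-]{1,249}");

    /** Returns true if the name is legal and not reserved for internal use. */
    static boolean isAcceptable(String name) {
        return LEGAL.matcher(name).matches() && !name.startsWith("__");
    }

    public static void main(String[] args) {
        System.out.println(isAcceptable("orders.v1"));          // true
        System.out.println(isAcceptable("__consumer_offsets")); // false: internal prefix
        System.out.println(isAcceptable("orders/v1"));          // false: '/' not allowed
    }
}
```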
Why Partitions Exist¶
Partitions exist because a single machine has fundamental physical limitations. Understanding these limitations explains why distributed systems like Kafka partition data across multiple nodes.
The Single-Broker Bottleneck¶
A Kafka broker running on a single server faces hard limits imposed by hardware:
Disk I/O Throughput
Kafka writes are sequential (append-only), which is optimal for disk performance. However, even sequential I/O has limits:
| Storage Type | Sequential Write | Sequential Read | Notes |
|---|---|---|---|
| HDD (7200 RPM) | 80-160 MB/s | 80-160 MB/s | Seek time negligible for sequential; still limited by rotational speed |
| SATA SSD | 400-550 MB/s | 500-550 MB/s | Limited by SATA interface (6 Gbps theoretical max) |
| NVMe SSD | 2,000-7,000 MB/s | 3,000-7,000 MB/s | PCIe bandwidth limited; enterprise drives sustain higher throughput |
| NVMe (RAID 0) | 10,000+ MB/s | 15,000+ MB/s | Multiple drives; reliability trade-offs |
With a replication factor of 3, each produced record must be written to three brokers: the leader writes to its own disk and each follower writes to its own. Slow follower disk I/O delays high watermark advancement and can push followers out of the ISR.
Network Bandwidth
Network interfaces impose hard ceilings on data transfer:
| Interface | Theoretical Max | Practical Throughput | Notes |
|---|---|---|---|
| 1 GbE | 125 MB/s | 100-110 MB/s | Common in older deployments; often the bottleneck |
| 10 GbE | 1,250 MB/s | 1,000-1,100 MB/s | Standard for modern Kafka deployments |
| 25 GbE | 3,125 MB/s | 2,500-2,800 MB/s | High-performance deployments |
| 100 GbE | 12,500 MB/s | 10,000-11,000 MB/s | Large-scale deployments; requires matching infrastructure |
With replication factor 3, producing 100 MB/s to a topic generates approximately:
- 100 MB/s inbound to the leader (producer writes)
- 200 MB/s outbound from the leader (two followers fetching)
- 100 MB/s inbound to each follower
- Consumer fetch traffic on top of replication
A single 10 GbE interface can become saturated with moderate production rates when accounting for replication overhead and consumer traffic.
CPU Processing
CPU becomes a bottleneck primarily in these scenarios:
| Operation | CPU Cost | When It Matters |
|---|---|---|
| Compression | High | Producer-side compression (LZ4, Snappy, Zstd) reduces network/disk but costs CPU |
| Decompression | Medium-High | Broker decompresses batches when validation, message-format down-conversion, or recompression requires it |
| TLS encryption | High | TLS 1.3 handshakes and bulk encryption; 20-40% throughput reduction typical |
| CRC32 checksums | Low | Every record batch verified; hardware-accelerated on modern CPUs |
| Zero-copy disabled | High | TLS prevents zero-copy (transferTo); data copied through userspace |
TLS Performance Impact
TLS encryption typically reduces Kafka throughput by 20-40% compared to plaintext. However, raw encryption speed is not the bottleneck on modern hardware.
Modern CPUs with AES-NI achieve 4-7 GB/s AES-256-GCM throughput per core—far exceeding typical Kafka broker throughput. The actual overhead comes from architectural changes required for encryption:
- Zero-copy bypass: Without TLS, Kafka uses the Linux `sendfile()` system call to transfer data directly from the page cache to the NIC without copying through userspace. TLS requires data to be copied to userspace for encryption, then back to kernel space for transmission. This adds 2-3 memory copies per request.
- Memory bandwidth pressure: The extra copies consume memory bandwidth that would otherwise be available for actual data transfer.
- Syscall overhead: More context switches between kernel and userspace for each data transfer.
- Handshake overhead: TLS session establishment for new connections (mitigated by connection pooling and session resumption).
The solution is not faster CPUs but reducing the copy overhead. Kernel TLS (kTLS, Linux 4.13+) can offload TLS to the kernel, restoring partial zero-copy capability. For latency-sensitive workloads, evaluate network-layer encryption (IPsec, WireGuard, encrypted overlay networks) which preserves application-layer zero-copy.
Memory Constraints
Each partition consumes broker memory for:
| Resource | Per-Partition Cost | Notes |
|---|---|---|
| Page cache utilization | Variable | OS caches recent segments; partitions compete for cache space |
| Index files (mmap) | ~10 MB typical | .index and .timeindex mapped into memory |
| Log segment overhead | ~1-2 MB | Buffers, file handles, metadata structures |
| Replica fetcher threads | Shared | Each source broker requires a fetcher thread |
With thousands of partitions per broker, memory overhead becomes significant. Kafka's recommendation is to limit partitions to approximately 4,000 per broker (though this varies with hardware).
The Single-Consumer Bottleneck¶
Within a consumer group, Kafka assigns each partition to exactly one consumer. This design provides ordering guarantees but creates a parallelism ceiling:
Consumer throughput limits:
| Bottleneck | Typical Limit | Mitigation |
|---|---|---|
| Processing logic | Varies | Optimize code; async I/O; batch processing |
| Network fetch | 100+ MB/s | Increase fetch.max.bytes; tune max.partition.fetch.bytes |
| Deserialization | Varies | Use efficient formats (Avro, Protobuf); avoid JSON for high throughput |
| Downstream writes | Often the limit | Database insertion, API calls often slower than Kafka reads |
| Commit overhead | Minor | Async commits; less frequent commits for higher throughput |
If a single consumer can process 50 MB/s but the topic receives 200 MB/s, four partitions (and four consumers) are needed to keep up. The partition count sets the maximum consumer parallelism.
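A minimal consumer sketch follows, assuming a topic named `orders` and a group id `orders-workers` (both hypothetical). Running several copies of this process with the same `group.id` causes Kafka to split the topic's partitions among them, up to one consumer per partition, which is exactly the parallelism ceiling described above.

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class OrdersWorker {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // hypothetical broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "orders-workers");          // shared group id: partitions are divided among members
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));
            while (true) {
                // Each instance only receives records from the partitions assigned to it
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d key=%s%n",
                            record.partition(), record.offset(), record.key());
                }
            }
        }
    }
}
```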
How Partitions Solve These Problems¶
| Physical Limitation | How Partitions Help |
|---|---|
| Disk I/O ceiling | Each partition can be on a different broker with its own disks; total throughput = sum of all brokers |
| Network bandwidth ceiling | Traffic distributed across brokers; no single NIC handles all data |
| CPU ceiling | Compression, TLS, and CRC work distributed across brokers |
| Memory ceiling | Partition working sets distributed; page cache pressure spread |
| Consumer processing ceiling | More partitions = more consumers in parallel |
| Storage capacity | Each broker contributes its disk capacity to the cluster |
| Single point of failure | Replicas on different brokers; partition remains available if one broker fails |
Partition Structure¶
Each partition is stored on disk as a directory named `<topic>-<partition>` (for example, `orders-0` for partition 0 of topic `orders`), containing one or more log segments and their associated index files.
Partition Files¶
| File Extension | Purpose |
|---|---|
| `.log` | Record data (keys, values, headers, timestamps) |
| `.index` | Sparse offset-to-file-position index |
| `.timeindex` | Sparse timestamp-to-offset index |
| `.txnindex` | Transaction abort index |
| `.snapshot` | Producer state snapshot for idempotence |
| `leader-epoch-checkpoint` | Leader epoch history |
Offset Semantics¶
Offsets are 64-bit integers assigned sequentially within each partition:
| Concept | Description |
|---|---|
| Base offset | First offset in a segment (used in filename) |
| Log End Offset (LEO) | Next offset to be assigned (last offset + 1) |
| High Watermark (HW) | Last offset replicated to all ISR; consumers read up to HW |
| Last Stable Offset (LSO) | HW excluding uncommitted transactions |
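As a rough client-side illustration of these offsets, the sketch below (topic, partition, and broker address are hypothetical) asks a consumer for a partition's beginning and end offsets. With the default `read_uncommitted` isolation level, `endOffsets` returns the high watermark; with `isolation.level=read_committed` it returns the last stable offset.

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

import java.util.Map;
import java.util.Properties;
import java.util.Set;

public class OffsetInspection {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // hypothetical broker
        props.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_uncommitted"); // "read_committed" would report the LSO instead
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());

        TopicPartition tp = new TopicPartition("orders", 0); // hypothetical topic and partition

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            // beginningOffsets: earliest offset still retained; endOffsets: HW (or LSO for read_committed)
            Map<TopicPartition, Long> start = consumer.beginningOffsets(Set.of(tp));
            Map<TopicPartition, Long> end = consumer.endOffsets(Set.of(tp));
            System.out.printf("start=%d end=%d%n", start.get(tp), end.get(tp));
        }
    }
}
```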
Partition Leaders and Replicas¶
Each partition has one leader and zero or more follower replicas. The leader handles all read and write requests; followers replicate data from the leader.
Leader Responsibilities¶
| Responsibility | Description |
|---|---|
| Handle produces | Accept and persist records from producers |
| Handle fetches | Serve records to consumers and followers |
| Maintain ISR | Track which followers are in-sync |
| Advance HW | Update high watermark as followers catch up |
Follower Responsibilities¶
| Responsibility | Description |
|---|---|
| Fetch from leader | Continuously replicate new records |
| Maintain LEO | Track local log end offset |
| Report position | Include LEO in fetch requests for HW calculation |
Replica States¶
In-Sync Replicas (ISR)¶
The ISR is the set of replicas that are fully caught up with the leader. Only ISR members are eligible to become leader if the current leader fails.
| Concept | Description |
|---|---|
| ISR membership | Replicas within replica.lag.time.max.ms of the leader |
| min.insync.replicas | Minimum ISR size required for acks=all produces |
| ISR shrinkage | Slow replicas removed; affects write availability |
For detailed ISR mechanics, membership criteria, and configuration, see Replication: In-Sync Replicas.
Leader Election¶
When a partition leader fails, Kafka elects a new leader from the ISR. The controller (or KRaft quorum) coordinates this process.
| Election Type | Description |
|---|---|
| Clean election | New leader chosen from ISR; no data loss |
| Unclean election | Out-of-sync replica becomes leader; potential data loss |
| Preferred election | Leadership rebalanced to preferred replica |
For the complete leader election protocol, leader epochs, and configuration, see Replication: Leader Election.
High Watermark¶
The high watermark (HW) is the offset up to which all ISR replicas have replicated. Consumers can only read records up to the HW.
| Concept | Description |
|---|---|
| High Watermark (HW) | Last offset replicated to all ISR members |
| Log End Offset (LEO) | Next offset to be written (leader's latest) |
| Consumer visibility | Consumers read only up to HW |
Read-Your-Writes
A client that produces a record and immediately tries to consume it may not see it yet. The record becomes visible only after the HW advances (all ISR members have replicated it).
For high watermark advancement mechanics and the replication protocol, see Replication: High Watermark.
Partition Assignment¶
When topics are created or partitions are added, the controller assigns partitions to brokers.
Assignment Goals¶
| Goal | Strategy |
|---|---|
| Balance | Distribute partitions evenly across brokers |
| Rack awareness | Place replicas in different racks |
| Minimize movement | Prefer keeping existing assignments |
Rack-Aware Assignment¶
When brokers advertise a `broker.rack` value, the controller spreads each partition's replicas across racks so that the loss of a single rack does not take all replicas of a partition offline.
Configuration¶
| Configuration | Default | Description |
|---|---|---|
| `broker.rack` | null | Rack identifier for this broker |
| `default.replication.factor` | 1 | Default RF for auto-created topics |
| `num.partitions` | 1 | Default partition count for auto-created topics |
Partition Count Selection¶
Choosing the right partition count requires understanding workload requirements and system constraints.
Calculating Required Partitions¶
For throughput requirements:
Partitions needed = Target throughput / Per-partition throughput
Per-partition throughput depends on the bottleneck:
| Component | Typical Per-Partition Limit | Determining Factors |
|---|---|---|
| Producer to single partition | 10-50 MB/s | Batch size, linger.ms, compression, network RTT |
| Consumer from single partition | 50-100+ MB/s | Consumer processing speed, fetch size, downstream latency |
| Broker disk I/O | Shared across partitions | Total broker throughput / partition count |
Example calculation:
- Target throughput: 500 MB/s production
- Single producer batch throughput to one partition: ~30 MB/s (with compression, acks=all)
- Minimum partitions for throughput: 500 / 30 ≈ 17 partitions
However, consumer parallelism may require more:
- Consumer processing rate: 25 MB/s per consumer
- Consumers needed: 500 / 25 = 20 consumers
- Partitions needed: at least 20 (one per consumer)
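A small helper capturing this arithmetic, purely illustrative; the throughput figures are the assumptions from the example above, not measurements.

```java
public class PartitionSizing {

    /**
     * Partitions needed to satisfy both the producer-side throughput target
     * and the consumer-side parallelism requirement.
     */
    static int partitionsNeeded(double targetMBps,
                                double perPartitionProducerMBps,
                                double perConsumerMBps) {
        int forThroughput = (int) Math.ceil(targetMBps / perPartitionProducerMBps); // 500 / 30 -> 17
        int forConsumers  = (int) Math.ceil(targetMBps / perConsumerMBps);          // 500 / 25 -> 20
        return Math.max(forThroughput, forConsumers);
    }

    public static void main(String[] args) {
        System.out.println(partitionsNeeded(500, 30, 25)); // prints 20
    }
}
```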
For ordering requirements:
Records with the same key are routed to the same partition. If strict ordering across a key space is required:
| Ordering Scope | Partition Strategy |
|---|---|
| Per-entity ordering (e.g., per user) | Use entity ID as key; Kafka guarantees order within partition |
| Global ordering (all records) | Single partition only; limits throughput to single-partition maximum |
| No ordering requirement | Partition for throughput; use round-robin or random keys |
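To illustrate per-entity ordering, the sketch below uses a user ID as the record key. The default partitioner hashes the key, so every record for the same user lands on the same partition and is consumed in produce order. Topic name, key, and broker address are hypothetical.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class UserEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // hypothetical broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "all"); // wait for the full ISR before acknowledging

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String userId = "user-42"; // hypothetical entity key
            // Same key => same partition => per-user ordering is preserved
            producer.send(new ProducerRecord<>("user-events", userId, "logged-in"));
            producer.send(new ProducerRecord<>("user-events", userId, "added-to-cart"));
            producer.flush();
        }
    }
}
```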
Trade-offs of Partition Count¶
| Factor | More Partitions | Fewer Partitions |
|---|---|---|
| Maximum throughput | Higher (parallel I/O, more consumers) | Lower (single-broker ceiling) |
| Consumer parallelism | Higher (one consumer per partition max) | Lower |
| Ordering granularity | Finer (per-partition ordering) | Coarser (broader ordering scope) |
| End-to-end latency | Can increase (more batching coordination) | Can decrease (simpler path) |
| Broker memory | Higher (~1-2 MB per partition-replica) | Lower |
| File handles | Higher (3 file handles per segment per partition) | Lower |
| Controller overhead | Higher (more metadata, slower elections) | Lower |
| Rebalance time | Longer (more partition movements) | Shorter |
| Recovery time | Longer (more partitions to recover) | Shorter |
| Availability during failures | Higher (smaller blast radius per partition) | Lower (more data affected per partition) |
Guidelines by Workload¶
| Workload Characteristics | Suggested Approach |
|---|---|
| Low throughput (<10 MB/s), ordering important | 3-6 partitions; focus on key distribution |
| Medium throughput (10-100 MB/s) | 6-30 partitions; balance throughput and operational overhead |
| High throughput (100-500 MB/s) | 30-100 partitions; ensure sufficient consumers |
| Very high throughput (500+ MB/s) | 100+ partitions; may require multiple clusters for extreme scale |
| Strict global ordering | 1 partition; accept throughput ceiling |
| Unknown future growth | Start with more partitions (can't reduce); 12-30 is often reasonable |
Partition Count Cannot Be Decreased
Once a topic is created, the partition count can only be increased, not decreased. Increasing partitions also breaks key-based ordering guarantees for existing keys (keys may hash to different partitions).
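Increasing the count later is possible through the Admin API, as in the hedged sketch below (topic name and target count are hypothetical); keep in mind that existing keys may hash to different partitions afterward.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewPartitions;

import java.util.Map;
import java.util.Properties;

public class IncreasePartitions {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // hypothetical broker

        try (AdminClient admin = AdminClient.create(props)) {
            // Raise the total partition count of "orders" to 24.
            // This only adds partitions; requesting fewer than the current count fails.
            admin.createPartitions(Map.of("orders", NewPartitions.increaseTo(24)))
                 .all()
                 .get();
        }
    }
}
```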
Partition Count Upper Bounds
Kafka recommends limiting partitions per broker to approximately 4,000 and partitions per cluster to approximately 200,000 (as of Kafka 3.x). These limits relate to:
- Controller metadata management overhead
- ZooKeeper watch overhead (ZooKeeper mode) or metadata log propagation (KRaft mode)
- Leader election time during broker failures
- Memory consumption for indexes and buffers
See KIP-578 for partition limit configuration.
Topic and Partition Metadata¶
Kafka maintains metadata about topics and partitions in the controller. Clients fetch this metadata to discover partition leaders.
Metadata Contents¶
| Metadata | Description |
|---|---|
| Topic list | All topics in cluster |
| Partition count | Number of partitions per topic |
| Replica assignment | Which brokers host each partition |
| Leader | Current leader for each partition |
| ISR | In-sync replicas for each partition |
| Controller | Current controller broker |
Metadata Refresh¶
| Trigger | Description |
|---|---|
| `metadata.max.age.ms` | Periodic refresh (default: 5 minutes) |
| `NOT_LEADER_OR_FOLLOWER` error | Immediate refresh |
| `UNKNOWN_TOPIC_OR_PARTITION` error | Immediate refresh |
| New producer/consumer | Initial fetch |
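A sketch of inspecting this metadata with the Admin API (topic name and broker address are hypothetical); the result exposes each partition's leader, replicas, and ISR as described above. `allTopicNames()` assumes a recent 3.x client.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;

import java.util.Properties;
import java.util.Set;

public class DescribeTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // hypothetical broker

        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription description = admin.describeTopics(Set.of("orders"))
                    .allTopicNames().get()
                    .get("orders");

            // Print leader and ISR per partition
            description.partitions().forEach(p ->
                    System.out.printf("partition=%d leader=%s isr=%s%n",
                            p.partition(), p.leader(), p.isr()));
        }
    }
}
```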
Performance Benchmarks¶
Actual throughput varies significantly based on hardware, configuration, and workload. The following provides reference points from common deployment scenarios:
Single-Broker Throughput (Reference)¶
| Configuration | Production Throughput | Notes |
|---|---|---|
| HDD, 1 GbE, no TLS | 50-80 MB/s | Network often the bottleneck |
| SSD, 10 GbE, no TLS | 200-400 MB/s | Disk and CPU become factors |
| NVMe, 10 GbE, no TLS | 300-500 MB/s | CPU and network limits |
| NVMe, 25 GbE, no TLS | 500-800 MB/s | High-end configuration |
| Any config + TLS | 20-40% reduction | TLS overhead; varies with hardware AES support |
| Any config + compression | CPU-dependent | LZ4/Snappy: minor overhead; Zstd: higher CPU, better ratio |
These figures assume:
- Replication factor 3 with `acks=all`
- Reasonable batch sizes (16-64 KB)
- Modern server-class hardware
- No other significant workloads on the brokers
Factors That Reduce Throughput¶
| Factor | Impact | Mitigation |
|---|---|---|
| Small messages (<100 bytes) | Overhead dominates; lower MB/s | Batch more aggressively; consider message aggregation |
| High partition count | More memory, more file handles | Stay within broker limits; balance across cluster |
| `acks=all` with slow followers | Latency increases; throughput may drop | Ensure followers on fast storage; monitor ISR |
| TLS without AES-NI | Severe CPU bottleneck | Use hardware with AES-NI support |
| Page cache pressure | More disk reads | Add RAM; reduce partition count per broker |
| Cross-datacenter replication | RTT affects `acks=all` latency | Use async replication (MirrorMaker 2) for cross-DC |
Version Compatibility¶
| Feature | Minimum Version |
|---|---|
| Rack-aware assignment | 0.10.0 |
| Leader epoch | 0.11.0 |
| Follower fetching (KIP-392) | 2.4.0 |
| Tiered storage | 3.6.0 (early access) |
| KRaft mode | 3.3.0 (production) |
Configuration Reference¶
Topic Configuration¶
| Configuration | Default | Description |
|---|---|---|
| `retention.ms` | 604800000 (7d) | Time-based retention |
| `retention.bytes` | -1 (unlimited) | Size-based retention per partition |
| `segment.bytes` | 1073741824 (1 GB) | Log segment size |
| `segment.ms` | 604800000 (7d) | Time before rolling a new segment |
| `cleanup.policy` | delete | delete, compact, or both |
| `min.insync.replicas` | 1 | Minimum ISR size for acks=all |
| `unclean.leader.election.enable` | false | Allow non-ISR leader election |
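As an illustration of changing these settings on an existing topic, the sketch below alters `retention.ms` through the Admin API; the topic name, value, and broker address are hypothetical.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

import java.util.List;
import java.util.Map;
import java.util.Properties;

public class AlterRetentionExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // hypothetical broker

        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "orders");

            // Set retention.ms to 3 days (259200000 ms) on this topic only
            AlterConfigOp setRetention = new AlterConfigOp(
                    new ConfigEntry("retention.ms", "259200000"), AlterConfigOp.OpType.SET);

            admin.incrementalAlterConfigs(Map.of(topic, List.of(setRetention))).all().get();
        }
    }
}
```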
Broker Configuration¶
| Configuration | Default | Description |
|---|---|---|
| `num.partitions` | 1 | Default partitions for new topics |
| `default.replication.factor` | 1 | Default RF for new topics |
| `replica.lag.time.max.ms` | 30000 | Max follower lag before ISR removal |
| `replica.fetch.max.bytes` | 1048576 | Max bytes per replica fetch request |
| `broker.rack` | null | Rack identifier |
Related Documentation¶
- Replication - ISR mechanics, leader election protocol, high watermark, and acknowledgment levels
- Storage Engine - Log segments and compaction
- Fault Tolerance - Failure scenarios
- Brokers - Broker architecture and configuration