Skip to content

Kafka Performance Internals

Deep dive into Kafka's performance architecture and optimization techniques.


Performance Design Principles

uml diagram


Sequential I/O

Write Path

Kafka appends all writes to the end of log segments, achieving sequential disk access.

uml diagram

Performance Comparison

Access Pattern HDD Performance SSD Performance
Random 4KB ~100 IOPS ~100K IOPS
Sequential ~100 MB/s ~500 MB/s
Kafka advantage 1000x better 5x better

Zero-Copy Transfers

Traditional Copy Path

uml diagram

Zero-Copy Path (sendfile)

uml diagram

Implementation

// Kafka uses FileChannel.transferTo()
// which maps to sendfile() system call
fileChannel.transferTo(position, count, socketChannel);

Limitations

Condition Zero-Copy Available
Plaintext Yes
TLS/SSL enabled No (encryption requires user-space)
Compression Only for already-compressed data

TLS Impact

Enabling TLS disables zero-copy, potentially reducing throughput by 30-50%.


Batching

Producer Batching

uml diagram

Batching Configuration

Parameter Default Effect
batch.size 16384 Maximum bytes per batch
linger.ms 0 Time to wait for more records
buffer.memory 33554432 Total memory for batching

Batching Benefits

Without batching (1000 messages):
  1000 network round-trips
  1000 small disk writes
  High overhead per message

With batching (1000 messages in 10 batches):
  10 network round-trips
  10 larger disk writes
  Amortized overhead

Consumer Fetch Batching

# Minimum data to fetch
fetch.min.bytes=1

# Maximum time to wait
fetch.max.wait.ms=500

# Maximum data per request
fetch.max.bytes=52428800

Compression

Compression Algorithms

Algorithm Compression Ratio CPU Usage Speed
none 1.0x None Fastest
gzip ~5-8x High Slow
snappy ~2-3x Low Fast
lz4 ~3-4x Low Fastest
zstd ~4-6x Medium Fast

Compression Flow

uml diagram

Configuration

# Producer compression
compression.type=lz4

# Topic-level compression (broker will recompress if different)
compression.type=producer  # Keep producer compression
compression.type=gzip      # Force specific compression

Compression Selection Guide

Use Case Recommended
High throughput, low latency lz4
Balanced zstd
Maximum compression gzip
Minimal CPU snappy
Already compressed data none

Request Pipelining

In-Flight Requests

uml diagram

# Maximum in-flight requests
max.in.flight.requests.per.connection=5

Ordering Considerations

Setting Ordering Throughput
max.in.flight=1 Guaranteed Lower
max.in.flight=5 May reorder on retry Higher
max.in.flight=5 + idempotent Guaranteed Higher

Thread Model

Broker Threads

uml diagram

Thread Configuration

Parameter Default Recommendation
num.network.threads 3 CPU cores / 4
num.io.threads 8 CPU cores
num.replica.fetchers 1 2-4 for high partition counts
num.recovery.threads.per.data.dir 1 2-4 for faster recovery

Page Cache Optimization

Warm Cache Benefits

uml diagram

Optimal Memory Allocation

Total RAM: 64 GB

JVM Heap: 6 GB (enough for metadata)
OS/System: 2 GB
Page Cache: 56 GB (for log data)

If hourly throughput = 50 GB
Page cache covers ~1 hour of data

Benchmarking

Producer Performance Test

# Throughput test
kafka-producer-perf-test.sh \
  --topic test-topic \
  --num-records 10000000 \
  --record-size 1024 \
  --throughput -1 \
  --producer-props \
    bootstrap.servers=kafka:9092 \
    batch.size=65536 \
    linger.ms=10 \
    compression.type=lz4

Consumer Performance Test

# Throughput test
kafka-consumer-perf-test.sh \
  --bootstrap-server kafka:9092 \
  --topic test-topic \
  --messages 10000000 \
  --threads 4

End-to-End Latency Test

kafka-run-class.sh kafka.tools.EndToEndLatency \
  kafka:9092 \
  test-topic \
  10000 \
  all \
  1024

Performance Metrics

Key Metrics

Metric Healthy Range
RequestsPerSec Depends on workload
TotalTimeMs (P99) < 100 ms
RequestQueueTimeMs < 10 ms
LocalTimeMs < 50 ms
RemoteTimeMs < 50 ms
ThrottleTimeMs 0

Bottleneck Identification

Symptom Likely Bottleneck
High RequestQueueTimeMs Network threads saturated
High LocalTimeMs Disk I/O or I/O threads saturated
High RemoteTimeMs Replication lag
High ResponseQueueTimeMs Network threads saturated