Kafka Cluster Scaling¶
This document covers the architecture and mechanisms for scaling Apache Kafka clusters. Kafka scales horizontally through broker addition, partition expansion, and partition reassignment, while consumer groups rebalance dynamically as members join and leave.
Scaling Overview¶
Scalability Dimensions¶
| Dimension | Scaling Mechanism | Considerations |
|---|---|---|
| Storage | Add brokers, expand disks | Requires partition reassignment |
| Produce throughput | Add partitions | May affect ordering guarantees |
| Consume throughput | Add partitions + consumers | Max consumers per group = partition count |
| Message rate | Add brokers + partitions | Rebalancing overhead |
Horizontal Scaling Model¶
Broker Scaling¶
Scaling Characteristics¶
| Operation | Load Redistributes Automatically | Data Movement | Impact |
|---|---|---|---|
| Add broker | ❌ | Required | None until reassignment |
| Remove broker | ❌ | Required | Must drain first |
| Add partitions | ✅ | None | Immediate |
| Add consumers | ✅ | None | Triggers rebalance |
Broker Addition
Adding a broker does not automatically rebalance partitions. Explicit partition reassignment is required to distribute load to the new broker.
Partition Scaling¶
Adding Partitions¶
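For example, to grow the orders topic (used throughout this document) to 12 partitions; the count can only be increased:
# Increase partition count (cannot be decreased later)
kafka-topics.sh --bootstrap-server kafka:9092 \
  --alter \
  --topic orders \
  --partitions 12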
Partition Scaling Constraints¶
| Constraint | Description |
|---|---|
| Increase only | Partition count cannot be decreased |
| Key distribution | Adding partitions changes key routing |
| Empty partitions | New partitions start with no data |
| Consumer scaling | More partitions enable more consumers |
Key Distribution Impact
Adding partitions changes the key-to-partition mapping. Messages with the same key may route to different partitions after expansion. This affects ordering guarantees for keyed messages.
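A sketch of why this happens, assuming the default partitioner (murmur2 hash of the key modulo partition count; the hash value below is illustrative):
partition = murmur2(key) % partition_count
# 6 partitions: murmur2("user-42") = 15  ->  15 % 6 = partition 3
# 8 partitions: murmur2("user-42") = 15  ->  15 % 8 = partition 7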
Partition Count Formula¶
optimal_partitions = max(
target_throughput / per_partition_throughput,
max_expected_consumers
)
| Factor | Guideline |
|---|---|
| Per-partition throughput | ~10 MB/s typical |
| Consumer parallelism | 1 partition per consumer thread |
| Broker capacity | ~4000 partitions per broker (depending on hardware) |
| Cluster limit | Monitor total partition count |
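A worked example with illustrative numbers: a 100 MB/s target at ~10 MB/s per partition needs 10 partitions, but an expected 12 consumer threads dominate:
optimal_partitions = max(100 / 10, 12)
                   = max(10, 12)
                   = 12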
Broker Lifecycle¶
Adding a Broker¶
Broker Addition Process¶
1. Deploy new broker - configure with a unique broker.id
2. Start broker - the broker registers with the controller
3. Verify registration - confirm the broker appears in the cluster
4. Generate reassignment plan - compute the partition moves
5. Execute reassignment - move partitions with throttling
6. Verify balance - confirm even distribution
# Step 1: Start new broker (broker.id=4)
kafka-server-start.sh config/server.properties

# Step 2: Verify registration (the broker listing shows "id: 4")
kafka-broker-api-versions.sh --bootstrap-server kafka:9092 | grep "id: 4"

# Step 3: Generate reassignment plan
cat > topics.json << 'EOF'
{"topics": [{"topic": "orders"}, {"topic": "events"}], "version": 1}
EOF
kafka-reassign-partitions.sh --bootstrap-server kafka:9092 \
  --topics-to-move-json-file topics.json \
  --broker-list "1,2,3,4" \
  --generate
# The tool prints the current assignment followed by the proposed one;
# save the proposed JSON block as reassignment.json before executing.

# Step 4: Execute with a replication throttle (bytes/sec; 50000000 = 50 MB/s)
kafka-reassign-partitions.sh --bootstrap-server kafka:9092 \
  --reassignment-json-file reassignment.json \
  --throttle 50000000 \
  --execute

# Step 5: Monitor progress; once complete, --verify also removes the throttle
kafka-reassign-partitions.sh --bootstrap-server kafka:9092 \
  --reassignment-json-file reassignment.json \
  --verify
Removing a Broker¶
Broker Removal Process¶
# Step 1: Generate a plan moving all partitions off broker 4
# (all-topics.json lists every topic in the cluster, same format as topics.json above)
kafka-reassign-partitions.sh --bootstrap-server kafka:9092 \
  --topics-to-move-json-file all-topics.json \
  --broker-list "1,2,3" \
  --generate
# Save the proposed assignment block as removal-plan.json before executing.

# Step 2: Execute the reassignment (throttle in bytes/sec; 100000000 = 100 MB/s)
kafka-reassign-partitions.sh --bootstrap-server kafka:9092 \
  --reassignment-json-file removal-plan.json \
  --throttle 100000000 \
  --execute

# Step 3: Wait for completion (also clears the throttle when done)
kafka-reassign-partitions.sh --bootstrap-server kafka:9092 \
  --reassignment-json-file removal-plan.json \
  --verify

# Step 4: Verify no partitions remain on broker 4 (expect no output)
kafka-topics.sh --bootstrap-server kafka:9092 --describe | \
  awk '{for (i=1; i<NF; i++) if (($i=="Leader:" && $(i+1)=="4") || ($i=="Replicas:" && ","$(i+1)"," ~ /,4,/)) print}'

# Step 5: Stop the broker (on broker 4's host)
kafka-server-stop.sh
Data Loss Risk
Stopping a broker before completing partition reassignment results in data loss if the broker holds the only replica of any partition.
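A quick pre-check, assuming the standard --describe output format, that lists any partition whose replica set is a single broker:
# Partitions with exactly one replica (no redundancy); expect no output
kafka-topics.sh --bootstrap-server kafka:9092 --describe | \
  awk '{for (i=1; i<NF; i++) if ($i=="Replicas:" && $(i+1) !~ /,/) print}'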
Scaling Patterns¶
Capacity Planning¶
Metrics for Scaling Decisions¶
| Metric | Threshold | Action |
|---|---|---|
| Disk utilization | > 70% | Add brokers or storage |
| CPU utilization | > 70% sustained | Add brokers |
| Network throughput | > 70% capacity | Add brokers |
| Partition count per broker | > 4000 | Add brokers |
| Consumer lag | Increasing | Add partitions + consumers |
| Request latency (p99) | > SLA | Scale horizontally |
Scaling Strategy Decision Tree¶
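In outline, following the metrics above:
- Disk, CPU, or network saturation - add brokers, then reassign partitions.
- Rising consumer lag with fewer consumers than partitions - add consumers.
- Rising consumer lag with consumers already at the partition count - add partitions, then consumers.
- Partition count per broker above ~4000 - add brokers.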
Rack-Aware Scaling¶
Rack Distribution¶
Rack-Aware Configuration¶
# Broker configuration (set per broker; each broker declares its own rack)
broker.rack=rack-a

# Topic creation with rack awareness
kafka-topics.sh --bootstrap-server kafka:9092 \
  --create \
  --topic orders \
  --partitions 6 \
  --replication-factor 3
# With broker.rack set on every broker, replicas are spread across racks automatically
Scaling with Rack Awareness¶
| Consideration | Recommendation |
|---|---|
| New broker placement | Add to least-populated rack |
| Minimum racks | At least as many racks as the replication factor for full fault tolerance |
| Balanced racks | Equal brokers per rack |
| Reassignment | Maintain rack diversity |
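To confirm each broker's rack assignment (the rack appears in each broker's listing line):
kafka-broker-api-versions.sh --bootstrap-server kafka:9092 | grep "rack"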
Scaling Automation¶
Auto-Scaling Considerations¶
| Aspect | Challenge |
|---|---|
| Partition reassignment | Requires data movement |
| Consumer rebalancing | Temporary pause |
| State migration | Kafka Streams applications |
| Connection overhead | Client reconnection |
Kubernetes Scaling¶
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: kafka-cluster
spec:
  kafka:
    replicas: 3  # Scale by changing this
    storage:
      type: persistent-claim
      size: 100Gi
    config:
      auto.create.topics.enable: false
      default.replication.factor: 3
      min.insync.replicas: 2
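One way to apply such a change, assuming the kafka-cluster resource above and kubectl access to its namespace:
# Scale to 4 brokers by patching the replicas field
kubectl patch kafka kafka-cluster --type merge \
  -p '{"spec":{"kafka":{"replicas":4}}}'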
Kubernetes Scaling
Kubernetes-based Kafka operators (Strimzi, Confluent for Kubernetes) automate broker addition but still require a separate partition rebalance, either manual reassignment or an integrated tool such as Cruise Control, before the new broker takes on load.
Scaling Best Practices¶
Pre-Scaling Checklist¶
| Check | Purpose |
|---|---|
| Current partition distribution | Identify imbalance |
| Under-replicated partitions | Ensure cluster healthy |
| Active reassignments | Wait for completion |
| Consumer lag | Note current state |
| Disk space on all brokers | Ensure room for data movement |
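A few of these checks as commands (the data directory path is deployment-specific):
# Under-replicated partitions: expect no output on a healthy cluster
kafka-topics.sh --bootstrap-server kafka:9092 \
  --describe --under-replicated-partitions

# Active reassignments (Kafka 2.6+): expect none before starting
kafka-reassign-partitions.sh --bootstrap-server kafka:9092 --list

# Disk headroom on each broker (log directory path varies by deployment)
df -h /var/lib/kafka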
During Scaling¶
| Practice | Rationale |
|---|---|
| Use throttling | Prevent saturation during reassignment |
| Scale incrementally | Add one broker at a time |
| Monitor continuously | Detect issues early |
| Test rollback | Ensure recovery path exists |
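To change an in-flight throttle, rerun --execute with the same plan file and a new rate; the tool updates the throttle without restarting the reassignment:
kafka-reassign-partitions.sh --bootstrap-server kafka:9092 \
  --reassignment-json-file reassignment.json \
  --throttle 100000000 \
  --execute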
Post-Scaling Verification¶
# Verify leader balance (count of partition leaders per broker)
kafka-topics.sh --bootstrap-server kafka:9092 --describe | \
  awk '{for (i=1; i<NF; i++) if ($i=="Leader:") leaders[$(i+1)]++} END {for (b in leaders) print "broker "b": "leaders[b]" leaders"}'
# Verify no under-replicated partitions
kafka-topics.sh --bootstrap-server kafka:9092 \
--describe --under-replicated-partitions
# Verify consumer group health
kafka-consumer-groups.sh --bootstrap-server kafka:9092 \
--describe --all-groups
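Reassignment can leave leadership skewed even when replicas are balanced; a preferred leader election (Kafka 2.4+) restores it:
# Move leadership back to each partition's preferred (first-listed) replica
kafka-leader-election.sh --bootstrap-server kafka:9092 \
  --election-type PREFERRED \
  --all-topic-partitions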
Related Documentation¶
- Partition Reassignment - Detailed reassignment procedures
- Consumer Rebalancing - Consumer scaling
- Cluster Management - Cluster operations
- Performance Internals - Performance tuning