Kafka Consumer Group Rebalancing

Consumer group rebalancing redistributes partition assignments among group members. This document covers the rebalance protocol, assignment strategies, and mechanisms to minimize rebalance impact.


Rebalancing Overview

Why Rebalancing Occurs

Trigger              Description
Consumer join        New consumer joins the group
Consumer leave       Consumer leaves gracefully
Consumer failure     Heartbeat timeout
Subscription change  Topic subscription modified
Partition change     Partitions added to a subscribed topic
Session timeout      Consumer missed the heartbeat deadline
Max poll exceeded    Processing time exceeded max.poll.interval.ms

Rebalance Impact

Impact                Description
Processing pause      All consumers stop processing during rebalance
Increased latency     Messages delayed during rebalance
Duplicate processing  At-least-once semantics may cause reprocessing
State loss            In-memory state may need rebuilding

Rebalance Protocol

Eager Rebalancing (Traditional)

Every consumer revokes its entire assignment, stops processing, rejoins the group, and waits for a full reassignment before resuming (stop-the-world).

Cooperative Rebalancing (Incremental)

Only the partitions that must move are revoked; consumers keep processing their remaining partitions while the reassignment completes, possibly over more than one round.

Protocol Comparison

Aspect           Eager                       Cooperative
Revocation       All partitions revoked      Only moved partitions revoked
Processing       Full stop during rebalance  Continues for stable partitions
Rebalance count  Single rebalance            May require multiple rounds
Complexity       Simple                      More complex
Kafka version    All versions                2.4+

Rebalance Triggers

Heartbeat Mechanism

A background heartbeat thread sends heartbeats to the group coordinator every heartbeat.interval.ms; if the coordinator receives none within session.timeout.ms, it removes the consumer and starts a rebalance.

Timeout Configuration

Configuration          Default  Description
session.timeout.ms     45000    Time to detect consumer failure
heartbeat.interval.ms  3000     Heartbeat frequency
max.poll.interval.ms   300000   Max time between poll() calls

Relationship:

session.timeout.ms >= 3 * heartbeat.interval.ms

Rule of thumb: set heartbeat.interval.ms to no more than one third of session.timeout.ms, so several heartbeats can be missed before the session expires.
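
A minimal sketch of these timeouts set on a Java consumer; the values mirror the defaults above and should be tuned per workload:

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;

Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");
// Failure detection window: the coordinator evicts the consumer if no heartbeat arrives in time
props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "45000");
// Heartbeat at most one third of the session timeout
props.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, "3000");
// Maximum gap between poll() calls before the consumer is considered stuck
props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, "300000");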

Timeout Scenarios

Scenario           Trigger         Result
Network partition  No heartbeat    Consumer removed after session.timeout.ms
Long processing    poll() delayed  Consumer removed after max.poll.interval.ms
Consumer crash     No heartbeat    Consumer removed after session.timeout.ms
Graceful shutdown  LeaveGroup      Immediate rebalance

Assignment Strategies

Range Assignor

Assigns contiguous ranges of partitions to consumers, computed per topic. When partitions do not divide evenly, the first consumers in sorted order receive one extra partition per topic, which can skew load when many topics are subscribed.

RoundRobin Assignor

Distributes partitions one by one across consumers, spanning all subscribed topics. This gives the most even spread when every consumer in the group subscribes to the same topics.

Sticky Assignor

Preserves existing assignments while balancing.

Characteristic  Description
Stickiness      Minimizes partition movement across rebalances
Balance         Ensures even distribution
Protocol        Eager; use CooperativeStickyAssignor for incremental rebalancing

# Configuration
partition.assignment.strategy=org.apache.kafka.clients.consumer.StickyAssignor

CooperativeSticky Assignor

Combines sticky assignment with cooperative rebalancing.

# Recommended for Kafka 2.4+
partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor
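
The same setting in Java, as a sketch using the ConsumerConfig constant:

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.CooperativeStickyAssignor;

// Select the cooperative-sticky assignor by class name
props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
          CooperativeStickyAssignor.class.getName());

When migrating a live group from an eager assignor, list both the old assignor and the cooperative-sticky assignor during one rolling restart, then remove the old assignor in a second rolling restart so all members agree on a common protocol.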

Strategy Comparison

Strategy           Balance  Stickiness  Cooperative  Use Case
Range              Good     No          No           Simple, few topics
RoundRobin         Best     No          No           Many topics
Sticky             Good     Yes         No           Stateful processing
CooperativeSticky  Good     Yes         Yes          Production default

Static Membership

Overview

Static membership assigns a persistent identity to consumers, reducing unnecessary rebalances.

Configuration

# Consumer configuration
group.instance.id=consumer-instance-1
session.timeout.ms=300000  # 5 minutes (can be longer with static membership)
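
In Java, the instance id only needs to be unique within the group and stable across restarts; deriving it from an environment variable such as a pod or host name is one common approach (POD_NAME below is just an example, not a required setting):

import org.apache.kafka.clients.consumer.ConsumerConfig;

// Stable, unique id per consumer instance (e.g., the Kubernetes pod name)
String instanceId = System.getenv().getOrDefault("POD_NAME", "consumer-instance-1");
props.put(ConsumerConfig.GROUP_INSTANCE_ID_CONFIG, instanceId);
// Static members tolerate longer session timeouts without triggering rebalances
props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "300000");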

Static vs Dynamic Membership

Aspect            Dynamic                       Static
Consumer restart  Triggers rebalance            No rebalance (within session timeout)
Session timeout   Short (45 s typical)          Can be longer (5-30 min)
Member ID         Assigned by coordinator       Derived from group.instance.id
Deployment        Rolling restarts cause churn  Smooth rolling restarts

Static Membership Benefits

Benefit              Description
Reduced rebalances   Transient failures don't trigger a rebalance
Faster recovery      Same assignment on rejoin
Rolling deployments  One consumer at a time without a full rebalance
Stable state         Kafka Streams state stores remain local

Rebalance Optimization

Minimize Rebalance Frequency

# Increase session timeout for static membership
session.timeout.ms=300000

# Use cooperative rebalancing
partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor

# Static membership
group.instance.id=my-consumer-instance-1

Minimize Rebalance Duration

# Faster heartbeats: consumers learn that a rebalance is in progress sooner
heartbeat.interval.ms=1000

# max.poll.interval.ms also serves as the rebalance (JoinGroup) timeout;
# reduce it only if processing is genuinely fast
max.poll.interval.ms=30000

# Pre-warm the consumer before the first poll
# (initialize resources before subscribing)

Handle Long Processing

// Pattern: pause partitions during long processing so that poll() can still be
// called within max.poll.interval.ms without fetching additional records
ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
if (!records.isEmpty()) {
    // Stop fetching on all assigned partitions
    consumer.pause(consumer.assignment());

    for (ConsumerRecord<String, String> record : records) {
        // Long processing
        processRecord(record);

        // Keep the consumer live in the group; returns no records while paused
        consumer.poll(Duration.ZERO);
    }

    // Resume fetching once the batch is processed
    consumer.resume(consumer.assignment());
}
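
Pausing alone does not prevent a rebalance; it is the continued poll() calls that keep the consumer within max.poll.interval.ms, while the background heartbeat thread satisfies session.timeout.ms.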

Rebalance Listener

ConsumerRebalanceListener

consumer.subscribe(Arrays.asList("orders"), new ConsumerRebalanceListener() {
    @Override
    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
        // Called BEFORE rebalance
        // Commit offsets for revoked partitions
        Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
        for (TopicPartition partition : partitions) {
            offsets.put(partition, new OffsetAndMetadata(currentOffset(partition)));
        }
        consumer.commitSync(offsets);

        // Flush any buffered data
        flushState(partitions);
    }

    @Override
    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        // Called AFTER rebalance
        // Initialize state for new partitions
        for (TopicPartition partition : partitions) {
            initializeState(partition);
        }
    }

    @Override
    public void onPartitionsLost(Collection<TopicPartition> partitions) {
        // Called when partitions were lost without an orderly revocation
        // (e.g., the member was fenced or timed out)
        // State may already be invalid
        log.warn("Partitions lost: {}", partitions);
    }
});

Cooperative Rebalance Listener

// With cooperative rebalancing
@Override
public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
    // Only revoked partitions, not all assigned
    // Other partitions continue processing
    commitOffsetsForPartitions(partitions);
}

@Override
public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
    // Only newly assigned partitions
    // Existing partitions unchanged
    initializeStateForPartitions(partitions);
}

Monitoring Rebalances

Key Metrics

Metric                      Description                 Alert
rebalance-latency-avg       Average rebalance duration  > 30s
rebalance-latency-max       Maximum rebalance duration  > 60s
rebalance-total             Total rebalance count       Increasing rapidly
last-rebalance-seconds-ago  Time since last rebalance   -
failed-rebalance-total      Failed rebalances           > 0
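
These are exposed by the Java consumer as client metrics; a sketch of reading them in process (consumer-coordinator-metrics is the group name the client reports them under):

// Print the coordinator metrics related to rebalancing
consumer.metrics().forEach((name, metric) -> {
    if (name.group().equals("consumer-coordinator-metrics")
            && name.name().contains("rebalance")) {
        System.out.printf("%s = %s%n", name.name(), metric.metricValue());
    }
});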

Diagnosing Rebalance Issues

# Check consumer group state
kafka-consumer-groups.sh --bootstrap-server kafka:9092 \
  --describe --group my-group

# Monitor group membership changes
kafka-consumer-groups.sh --bootstrap-server kafka:9092 \
  --describe --group my-group --members

# Check for rebalance in logs
grep -i "rebalance\|revoke\|assign" /var/log/kafka/consumer.log

Common Issues

Issue                Symptom                 Solution
Frequent rebalances  High rebalance-total    Enable static membership
Long rebalances      High rebalance-latency  Reduce group size, use cooperative protocol
Consumer timeouts    Members leaving         Increase max.poll.interval.ms
Uneven assignment    Imbalanced lag          Check assignment strategy

Scaling Consumers

Adding Consumers

Adding a consumer triggers a rebalance that spreads the existing partitions across the larger group; with the cooperative protocol only the partitions that move are paused.

Scaling Limits

Constraint                  Description
Max consumers = partitions  Consumers beyond the partition count sit idle
Adding partitions           Changes key-to-partition distribution
Consumer group size         Larger groups mean longer rebalances

Scaling Best Practices

Practice                         Rationale
Plan partition count for growth  Avoid adding partitions later
Use static membership            Smooth scaling operations
Scale gradually                  One consumer at a time
Monitor during scale             Detect issues early

Consumer Group States

State Machine

A group moves Empty → PreparingRebalance → CompletingRebalance → Stable; any membership change sends it back to PreparingRebalance, and a deleted group ends in Dead.

State Descriptions

State                Description
Empty                No active members
PreparingRebalance   Waiting for members to join
CompletingRebalance  Waiting for SyncGroup
Stable               Normal operation
Dead                 Group being deleted
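
A sketch of checking the state programmatically with the Java Admin client; the group id and bootstrap address are placeholders:

import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ConsumerGroupDescription;

Properties adminProps = new Properties();
adminProps.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");

try (Admin admin = Admin.create(adminProps)) {
    ConsumerGroupDescription group = admin
        .describeConsumerGroups(List.of("my-group"))
        .describedGroups().get("my-group")
        .get();
    // Prints one of: Empty, PreparingRebalance, CompletingRebalance, Stable, Dead
    System.out.println(group.state());
}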

Version Compatibility

Feature                  Kafka Version
Basic rebalancing        0.9.0+
Sticky assignor          0.11.0+
Static membership        2.3.0+
Cooperative rebalancing  2.4.0+
Incremental cooperative  2.5.0+