AxonOps Kafka Consumer Groups Dashboard Metrics Mapping

Overview

The Kafka Consumer Groups Dashboard provides monitoring of consumer group lag across topics and partitions. This dashboard is essential for tracking consumer performance, identifying consumption bottlenecks, and ensuring consumers are keeping up with producers.

Metrics Mapping

Dashboard Metric | Description | Attributes
Consumer Group Metrics
kaf_consumer_group | Consumer group lag per partition | client-id={client-id}

Note: Consumer group metrics are typically collected from Kafka's consumer group command-line tools or APIs rather than JMX, as they represent cluster-wide state rather than individual broker metrics.
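For orientation, lag is conventionally defined per partition as the log-end offset minus the group's last committed offset. The sketch below applies that definition to two offset maps. It is an illustration only: the function name `partition_lag` and the input shapes are hypothetical, and this page does not describe how AxonOps itself collects `kaf_consumer_group`.

```python
from typing import Dict, Optional, Tuple

TopicPartition = Tuple[str, int]  # (topic, partition)


def partition_lag(
    end_offsets: Dict[TopicPartition, int],
    committed_offsets: Dict[TopicPartition, int],
) -> Dict[TopicPartition, Optional[int]]:
    """Return lag per partition as log-end offset minus committed offset."""
    lag: Dict[TopicPartition, Optional[int]] = {}
    for tp, end in end_offsets.items():
        committed = committed_offsets.get(tp)
        if committed is None:
            # The group has never committed for this partition: lag is undefined.
            lag[tp] = None
        else:
            # Clamp at zero: the two offsets are read at slightly different times.
            lag[tp] = max(0, end - committed)
    return lag
```

Partitions without a committed offset are reported as `None` rather than as zero, so they are not mistaken for fully caught-up partitions.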

Query Examples

Consumer Group Lag

// Consumer group lag by group, topic, and partition
sum(kaf_consumer_group{Topic='$topic',GroupID='$groupid'}) by (GroupID, Topic, Partition)

// Total lag for a consumer group across all partitions
sum(kaf_consumer_group{GroupID='$groupid'}) by (GroupID)

// Lag for a specific topic
sum(kaf_consumer_group{Topic='$topic'}) by (GroupID, Partition)

Panel Organization

Overview Section

  • Empty row used only for spacing/organization (contains no panels)

Consumer Groups

  • Consumer Group Lag (detailed view by GroupID, Topic, and Partition)

Filters

  • groupid: Filter by specific consumer group ID(s)

  • topic: Filter by specific topic(s)

Best Practices

Lag Monitoring

  • Monitor lag trends over time, not just absolute values
  • Set alerts for increasing lag trends
  • Expect temporary lag during consumer restarts and treat it as normal
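One way to turn "trend, not absolute value" into a number is to fit a straight line through recent lag samples and look at its slope. The following is a minimal sketch using an ordinary least-squares fit; the function name `lag_slope` and the `(timestamp_seconds, lag)` sample format are assumptions made for illustration, not part of the dashboard.

```python
from typing import List, Tuple


def lag_slope(samples: List[Tuple[float, float]]) -> float:
    """Least-squares slope of lag over time, in messages per second.

    `samples` is a list of (timestamp_seconds, lag) pairs.
    A positive result means lag is growing over the window.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to estimate a trend")
    mean_t = sum(t for t, _ in samples) / n
    mean_lag = sum(lag for _, lag in samples) / n
    variance_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if variance_t == 0:
        raise ValueError("all samples share the same timestamp")
    covariance = sum((t - mean_t) * (lag - mean_lag) for t, lag in samples)
    return covariance / variance_t
```

A short spike during a restart has little effect on the slope over a long enough window, whereas a consumer that is steadily falling behind produces a persistently positive value.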

Performance Analysis

  • Sustained high lag indicates consumers can't keep up with producers
  • Compare lag across partitions to identify imbalances
  • Monitor lag spikes during peak traffic
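Comparing lag across partitions can be sketched as a check against the median: a partition whose lag is far above the typical value for the same group and topic is a candidate hot spot. The function name `hot_partitions`, the `factor` multiplier, and the `min_lag` floor are all hypothetical choices for illustration and would need tuning per workload.

```python
from statistics import median
from typing import Dict, List


def hot_partitions(
    lag_by_partition: Dict[int, int],
    factor: float = 3.0,
    min_lag: int = 1000,
) -> List[int]:
    """Return partitions whose lag is well above the median for the topic."""
    if not lag_by_partition:
        return []
    typical = median(lag_by_partition.values())
    return sorted(
        partition
        for partition, lag in lag_by_partition.items()
        # Ignore small lags entirely, then require a clear gap to the median.
        if lag >= min_lag and lag > factor * typical
    )
```

Uniformly high lag returns an empty list here, which is intentional: that pattern points to overall throughput rather than to an imbalance between partitions.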

Consumer Group Health

  • Zero lag doesn't always mean healthy consumption (it also appears when producers have stopped writing)
  • Check for stalled consumers (lag not changing)
  • Monitor consumer group state (Kafka reports states such as Stable, PreparingRebalance, CompletingRebalance, Empty, and Dead)
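The "lag not changing" check can be sketched as a test for a non-zero lag that stays identical across the most recent samples. The function name `is_stalled` and the `min_samples` window are assumptions for illustration. A flat lag can also occur when consumers exactly match the producer rate, so a result of `True` is a prompt to investigate rather than proof of a stall.

```python
from typing import Sequence


def is_stalled(lag_samples: Sequence[int], min_samples: int = 3) -> bool:
    """Return True if lag is non-zero and unchanged over the latest samples."""
    if len(lag_samples) < min_samples:
        # Not enough history to call it a stall.
        return False
    recent = lag_samples[-min_samples:]
    return recent[0] > 0 and all(value == recent[0] for value in recent)
```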

Troubleshooting High Lag

  • Check consumer processing time
  • Verify consumer parallelism matches partition count (consumers beyond the partition count sit idle; too few consumers limits throughput)
  • Look for rebalancing issues
  • Check for consumer errors or failures

Capacity Planning

  • Use lag trends for scaling decisions
  • Add consumers when lag consistently increases
  • Monitor lag during traffic peaks

Partition Assignment

  • Ensure even distribution of partitions to consumers
  • Monitor for partition ownership changes
  • Check for idle consumers (no partitions assigned)
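The assignment checks above can be sketched as a small summary over a member-to-partitions map. The function name `assignment_summary` and the input shape are hypothetical; in practice the map would be built from whatever tool or API you use to describe the consumer group.

```python
from typing import Dict, List, Tuple


def assignment_summary(assignment: Dict[str, List[int]]) -> Tuple[List[str], int]:
    """Return (idle members, spread) for a consumer group assignment.

    `assignment` maps each member ID to the partitions it owns.
    Spread is the difference between the largest and smallest partition
    count per member; 0 or 1 suggests an even distribution.
    """
    idle = sorted(member for member, parts in assignment.items() if not parts)
    counts = [len(parts) for parts in assignment.values()]
    spread = max(counts) - min(counts) if counts else 0
    return idle, spread
```

Comparing successive summaries over time is one simple way to notice partition ownership changes between rebalances.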

Alert Configuration

  • Alert on lag threshold (e.g., > 100k messages)
  • Alert on lag growth rate
  • Alert on consumer group state changes
  • Use different thresholds for different topics/groups
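Per-group thresholds with a shared fallback can be sketched as a lookup keyed by group and topic. Everything here is illustrative: the function name `lag_alerts`, the message format, and the group and topic names in the usage note are hypothetical, and the 100,000 default simply mirrors the example threshold above. This is not AxonOps alert configuration syntax.

```python
from typing import Dict, List, Tuple

GroupTopic = Tuple[str, str]  # (consumer group ID, topic)

DEFAULT_THRESHOLD = 100_000  # mirrors the "> 100k messages" example


def lag_alerts(
    lag_by_group_topic: Dict[GroupTopic, int],
    thresholds: Dict[GroupTopic, int],
    default: int = DEFAULT_THRESHOLD,
) -> List[str]:
    """Return one message per group/topic whose lag exceeds its threshold."""
    alerts: List[str] = []
    for (group, topic), lag in sorted(lag_by_group_topic.items()):
        # Fall back to the default when no specific threshold is configured.
        limit = thresholds.get((group, topic), default)
        if lag > limit:
            alerts.append(f"{group}/{topic}: lag {lag} exceeds {limit}")
    return alerts
```

For example, a latency-sensitive group such as a hypothetical `billing` consumer on `payments` might be given a threshold of 5,000, while a batch analytics group keeps the default.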