AxonOps Kafka Overview Dashboard Metrics Mapping

This document maps the metrics used in the AxonOps Kafka Overview dashboard.

Dashboard Overview

The Kafka Overview dashboard provides a comprehensive view of Kafka cluster health, including controller status, partition health, replication status, network throughput, and consumer group coordination. It serves as the primary dashboard for monitoring overall Kafka cluster performance and health.

Metrics Mapping

Controller Metrics

| Dashboard Metric | Description | Attributes |
|---|---|---|
| kaf_KafkaController_ActiveControllerCount | Number of active controllers in the cluster (should be 1) | rack, host_id |
| kaf_KafkaController_OfflinePartitionsCount | Number of partitions without an active leader | dc, rack, host_id |
| kaf_KafkaController_PreferredReplicaImbalanceCount | Number of partitions where the preferred replica is not the leader | dc, rack, host_id |
| kaf_ControllerStats_UncleanLeaderElectionsPerSec | Rate of unclean leader elections | function (MeanRate), rack, host_id |

Replica Manager Metrics

| Dashboard Metric | Description | Attributes |
|---|---|---|
| kaf_ReplicaManager_UnderMinIsrPartitionCount | Partitions with fewer than the minimum number of in-sync replicas | dc, rack, host_id |
| kaf_ReplicaManager_UnderReplicatedPartitions | Number of under-replicated partitions | dc, rack, host_id |
| kaf_ReplicaManager_PartitionCount | Total number of partitions on the broker | dc, rack, host_id |
| kaf_ReplicaManager_LeaderCount | Number of partitions for which this broker is the leader | rack, host_id |
| kaf_ReplicaManager_IsrShrinksPerSec | Rate of ISR shrinks | function (MeanRate), rack, host_id |
| kaf_ReplicaManager_IsrExpandsPerSec | Rate of ISR expansions | function (MeanRate), rack, host_id |

Broker Topic Metrics

| Dashboard Metric | Description | Attributes |
|---|---|---|
| kaf_BrokerTopicMetrics_BytesInPerSec | Incoming byte rate | axonfunction (rate), rack, host_id, topic, node_type |
| kaf_BrokerTopicMetrics_BytesOutPerSec | Outgoing byte rate | axonfunction (rate), rack, host_id, topic, node_type |
| kaf_BrokerTopicMetrics_MessagesInPerSec | Incoming message rate | axonfunction (rate), rack, host_id, topic |

Network Metrics

| Dashboard Metric | Description | Attributes |
|---|---|---|
| kaf_socket_server_metrics_ | Socket server connection metrics | function (connection_count), rack, host_id |
| kaf_KafkaRequestHandlerPool_RequestHandlerAvgIdlePercent | Request handler idle percentage | function (OneMinuteRate), rack, host_id |

Group Coordinator Metrics

| Dashboard Metric | Description | Attributes |
|---|---|---|
| kaf_GroupMetadataManager_NumGroups | Total number of consumer groups | rack, host_id |
| kaf_GroupMetadataManager_NumGroupsStable | Number of stable consumer groups | rack, host_id |
| kaf_GroupMetadataManager_NumGroupsPreparingRebalance | Groups preparing to rebalance | rack, host_id |
| kaf_GroupMetadataManager_NumGroupsDead | Number of dead consumer groups | rack, host_id |
| kaf_GroupMetadataManager_NumGroupsCompletingRebalance | Groups completing rebalance | rack, host_id |
| kaf_GroupMetadataManager_NumGroupsEmpty | Number of empty consumer groups | rack, host_id |

Request Metrics

| Dashboard Metric | Description | Attributes |
|---|---|---|
| kaf_RequestMetrics_RequestsPerSec | Request rate per second | axonfunction (rate), function (Count), request, rack, host_id |
| kaf_RequestMetrics_TotalTimeMs | Total request processing time | request (Fetch), function (percentiles), rack, host_id |

Query Examples

Healthcheck Queries

// Active Controllers (should be 1)
sum(kaf_KafkaController_ActiveControllerCount{host_id!=""})

// Under min insync replicas partitions
kaf_ReplicaManager_UnderMinIsrPartitionCount{dc=~'$dc',rack=~'$rack',host_id=~'$host_id'}

// Under Replicated Partitions
kaf_ReplicaManager_UnderReplicatedPartitions{dc=~'$dc',rack=~'$rack',host_id=~'$host_id'}

// Offline Partitions
kaf_KafkaController_OfflinePartitionsCount{dc=~'$dc',rack=~'$rack',host_id=~'$host_id'}

Network Throughput

// Cluster network throughput - Bytes in
sum(kaf_BrokerTopicMetrics_BytesInPerSec{axonfunction='rate',rack=~'$rack',host_id=~'$host_id', topic!='',node_type='$node_type'})

// Cluster network throughput - Bytes out
sum(kaf_BrokerTopicMetrics_BytesOutPerSec{axonfunction='rate',rack=~'$rack',host_id=~'$host_id', topic!='',node_type='$node_type'})

// Incoming Messages
sum(kaf_BrokerTopicMetrics_MessagesInPerSec{axonfunction='rate',rack=~'$rack',host_id=~'$host_id', topic=''})

Group Coordinator

// Consumer groups per coordinator
kaf_GroupMetadataManager_NumGroups{rack=~'$rack',host_id=~'$host_id'}

// Consumer groups by state
sum(kaf_GroupMetadataManager_NumGroupsStable{rack=~'$rack',host_id=~'$host_id'})
sum(kaf_GroupMetadataManager_NumGroupsPreparingRebalance{rack=~'$rack',host_id=~'$host_id'})
sum(kaf_GroupMetadataManager_NumGroupsDead{rack=~'$rack',host_id=~'$host_id'})

Request Rates

// Total Request Per Sec
sum(kaf_RequestMetrics_RequestsPerSec{axonfunction='rate',function='Count',rack=~'$rack',host_id=~'$host_id'}) by (host_id)

// Metadata Request Per Sec
sum(kaf_RequestMetrics_RequestsPerSec{axonfunction='rate',function='Count',request='Metadata',rack=~'$rack',host_id=~'$host_id'}) by (host_id)
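
The `sum(...) by (host_id)` aggregation in the request-rate queries groups the matching series by broker and adds their values. A minimal Python sketch of that grouping follows. It assumes each series is a (labels, value) pair; the sample brokers and rates are hypothetical and this is not an AxonOps API.

```python
from collections import defaultdict


def sum_by(series, label):
    """Sum series values grouped by one label, mirroring `sum(...) by (label)`."""
    totals = defaultdict(float)
    for labels, value in series:
        # A series without the label is grouped under the empty string.
        totals[labels.get(label, "")] += value
    return dict(totals)


# Hypothetical samples: per-request-type rates reported by two brokers.
samples = [
    ({"host_id": "broker-1", "request": "Metadata"}, 12.0),
    ({"host_id": "broker-1", "request": "Fetch"}, 30.0),
    ({"host_id": "broker-2", "request": "Metadata"}, 8.0),
]

per_host = sum_by(samples, "host_id")
```

Each broker ends up with one total, which is what lets the panel draw one line per `host_id`.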

Panel Organization

Healthcheck Section

  • Active Controllers - Counter showing cluster controller status

  • Brokers Online - Number of active brokers

  • Online Partitions - Total partition count

  • Offline Partitions - Partitions without leaders

  • Preferred Replica Imbalance - Leader distribution health

  • Under Replicated Partitions - Replication lag indicator

  • Connections - Total client connections

  • Under min insync replicas partitions - Critical replication status

  • Unclean Leader Election Rate - Data loss risk indicator

  • Cluster network throughput - Overall I/O performance

  • Incoming Messages - Message ingestion rate

  • Cluster Connections - Connection trend

General Section

  • Broker Count - Total brokers in cluster

  • Active Controller - Controller assignment over time

  • Request Handler Avg Idle Percent - Request handler capacity

  • Under Replicated Partitions - Replication health trends

  • Unclean Leader Elections Per Sec - Data integrity monitoring

  • In-sync replicas Shrinks vs Expands - ISR stability

Group Coordinator Section

  • Consumer groups number per coordinator - Group distribution

  • No consumer groups per state - Number of consumer groups in each state (group lifecycle monitoring)

Request Rate Section

  • Total Request Per Sec - Overall request load

  • Metadata Request Per Sec - Metadata request patterns

Filters

  • rack - Filter by rack location

  • node (host_id) - Filter by specific Kafka broker

  • topic - Filter by Kafka topic

  • node type - Filter by node type

  • percentile - Select latency percentile (for request metrics)

  • groupBy - Dynamic grouping (topic, host_id)

Understanding the Metrics

Critical Health Indicators

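  • Active Controllers: Must be exactly 1 across the cluster. 0 means no broker is acting as controller; more than 1 points to a split-brain condition

  • Offline Partitions: Should be 0. Any value > 0 means those partitions have no leader and cannot serve reads or writes

  • Under Replicated Partitions: Should be 0. A nonzero value means at least one follower replica has fallen out of the in-sync replica set

  • Under Min ISR: Should be 0. Critical: affected partitions have fewer in-sync replicas than min.insync.replicas, so producers using acks=all are rejected and durability is at risk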

Performance Indicators

  • Network Throughput: Monitor for capacity planning

  • Request Handler Idle %: Lower values indicate high load

  • ISR Shrinks/Expands: Frequent changes indicate instability

Consumer Group Health

Group States:

  • Stable: Normal operating state
  • Rebalancing (PreparingRebalance, CompletingRebalance): Temporary state during membership changes
  • Dead: Groups that need cleanup
  • Empty: Groups without active members
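
The four states above can be derived from the per-state group metrics. The sketch below is illustrative only: it assumes that "Rebalancing" covers both the PreparingRebalance and CompletingRebalance counts, and the input dict keyed by metric suffix is a hypothetical shape, not an AxonOps API.

```python
def summarize_group_states(counts):
    """Collapse per-state group counts into the four states listed above.

    `counts` maps a metric suffix (e.g. "NumGroupsStable") to its current value.
    Missing metrics are treated as 0.
    """
    get = lambda name: counts.get(name, 0)
    return {
        "Stable": get("NumGroupsStable"),
        # Assumption: both rebalance phases are reported as one "Rebalancing" state.
        "Rebalancing": get("NumGroupsPreparingRebalance")
        + get("NumGroupsCompletingRebalance"),
        "Dead": get("NumGroupsDead"),
        "Empty": get("NumGroupsEmpty"),
    }


# Hypothetical values for one coordinator.
summary = summarize_group_states(
    {
        "NumGroupsStable": 18,
        "NumGroupsPreparingRebalance": 1,
        "NumGroupsCompletingRebalance": 2,
        "NumGroupsEmpty": 4,
    }
)
```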

Best Practices

Monitoring Guidelines

Set Alerts for:

  • Active Controllers ≠ 1
  • Offline Partitions > 0
  • Under Replicated Partitions > 0
  • Unclean Leader Elections > 0
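
As a rough illustration, the four alert conditions above can be written as a small rule table. The keys of the sample dict are hypothetical names for the latest value of each metric; how those values are fetched and how alerts are delivered is out of scope here.

```python
import operator

# (alert name, key in the sample dict, comparison, threshold)
RULES = [
    ("Active Controllers != 1", "active_controllers", operator.ne, 1),
    ("Offline Partitions > 0", "offline_partitions", operator.gt, 0),
    ("Under Replicated Partitions > 0", "under_replicated_partitions", operator.gt, 0),
    ("Unclean Leader Elections > 0", "unclean_leader_elections", operator.gt, 0),
]


def firing_alerts(sample):
    """Return the names of all rules whose condition holds for `sample`."""
    return [
        name
        for name, key, compare, threshold in RULES
        if compare(sample[key], threshold)
    ]


healthy = {
    "active_controllers": 1,
    "offline_partitions": 0,
    "under_replicated_partitions": 0,
    "unclean_leader_elections": 0,
}
# Hypothetical degraded cluster: no controller and three leaderless partitions.
degraded = dict(healthy, active_controllers=0, offline_partitions=3)
```

Keeping the conditions in a table makes it easy to see that Active Controllers is the only rule that fires on values both above and below its target.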

Regular Checks:

  • Network throughput trends
  • Consumer group stability
  • Request rate patterns

Troubleshooting

No Active Controller:

  • Check ZooKeeper connectivity (ZooKeeper-based clusters)
  • Review controller logs
  • Check for network partitions between brokers

High Under-Replicated Partitions:

  • Check broker health
  • Verify network bandwidth
  • Review replica lag settings (for example replica.lag.time.max.ms)

Consumer Group Issues:

  • Monitor rebalance frequency
  • Check consumer lag
  • Verify coordinator load

Data Resolution

  • Most metrics use low resolution for efficiency
  • Rate metrics use axonfunction='rate' for accurate per-second calculations
  • Percentile metrics available for latency measurements

Units

  • Bytes: Network throughput (bytes/sec)

  • short: Counts and rates

  • percent: Utilization metrics (0-100)

  • rps: Requests per second

Notes

  • Empty topic filter (topic='') shows aggregate metrics
  • host_id!="" ensures only active brokers are counted
  • The node_type filter allows monitoring mixed clusters
  • ISR metrics use MeanRate for smoothed values
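
The topic-filter note can be illustrated with a small Python sketch. It assumes each series is a (labels, value) pair and that a broker reports one aggregate series with an empty topic label alongside one series per topic. The `select` helper and the sample data are hypothetical.

```python
def select(series, **matchers):
    """Keep series whose labels satisfy every matcher.

    A matcher value is either a plain string (equality, like topic='')
    or a ("!=", string) tuple (inequality, like topic!='').
    A missing label is treated as the empty string.
    """
    def matches(labels):
        for name, want in matchers.items():
            have = labels.get(name, "")
            if isinstance(want, tuple):
                if have == want[1]:
                    return False
            elif have != want:
                return False
        return True

    return [(labels, value) for labels, value in series if matches(labels)]


samples = [
    ({"host_id": "broker-1", "topic": ""}, 150.0),  # aggregate across topics
    ({"host_id": "broker-1", "topic": "orders"}, 100.0),
    ({"host_id": "broker-1", "topic": "payments"}, 50.0),
]

aggregate = select(samples, topic="")          # like topic=''
per_topic = select(samples, topic=("!=", ""))  # like topic!=''
```

In this sample, summing with no topic matcher at all would count the same traffic twice, which is why the queries pin the topic label one way or the other.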