AxonOps Kafka Overview Dashboard Metrics Mapping¶
This document maps the metrics used in the AxonOps Kafka Overview dashboard.
Dashboard Overview¶
The Kafka Overview dashboard provides a comprehensive view of Kafka cluster health, including controller status, partition health, replication status, network throughput, and consumer group coordination. It serves as the primary dashboard for monitoring overall Kafka cluster performance and health.
Metrics Mapping¶
Controller Metrics¶
| Dashboard Metric | Description | Attributes |
|---|---|---|
| `kaf_KafkaController_ActiveControllerCount` | Number of active controllers in the cluster (should be 1) | rack, host_id |
| `kaf_KafkaController_OfflinePartitionsCount` | Number of partitions without an active leader | dc, rack, host_id |
| `kaf_KafkaController_PreferredReplicaImbalanceCount` | Number of partitions where preferred replica is not the leader | dc, rack, host_id |
| `kaf_ControllerStats_UncleanLeaderElectionsPerSec` | Rate of unclean leader elections | function (MeanRate), rack, host_id |
Replica Manager Metrics¶
| Dashboard Metric | Description | Attributes |
|---|---|---|
| `kaf_ReplicaManager_UnderMinIsrPartitionCount` | Partitions with fewer than minimum in-sync replicas | dc, rack, host_id |
| `kaf_ReplicaManager_UnderReplicatedPartitions` | Number of under-replicated partitions | dc, rack, host_id |
| `kaf_ReplicaManager_PartitionCount` | Total number of partitions on the broker | dc, rack, host_id |
| `kaf_ReplicaManager_LeaderCount` | Number of partitions for which this broker is the leader | rack, host_id |
| `kaf_ReplicaManager_IsrShrinksPerSec` | Rate of ISR shrinks | function (MeanRate), rack, host_id |
| `kaf_ReplicaManager_IsrExpandsPerSec` | Rate of ISR expansions | function (MeanRate), rack, host_id |
Broker Topic Metrics¶
| Dashboard Metric | Description | Attributes |
|---|---|---|
| `kaf_BrokerTopicMetrics_BytesInPerSec` | Incoming byte rate | axonfunction (rate), rack, host_id, topic, node_type |
| `kaf_BrokerTopicMetrics_BytesOutPerSec` | Outgoing byte rate | axonfunction (rate), rack, host_id, topic, node_type |
| `kaf_BrokerTopicMetrics_MessagesInPerSec` | Incoming message rate | axonfunction (rate), rack, host_id, topic |
Network Metrics¶
| Dashboard Metric | Description | Attributes |
|---|---|---|
| `kaf_socket_server_metrics_` | Socket server connection metrics | function (connection_count), rack, host_id |
| `kaf_KafkaRequestHandlerPool_RequestHandlerAvgIdlePercent` | Request handler idle percentage | function (OneMinuteRate), rack, host_id |
Group Coordinator Metrics¶
| Dashboard Metric | Description | Attributes |
|---|---|---|
| `kaf_GroupMetadataManager_NumGroups` | Total number of consumer groups | rack, host_id |
| `kaf_GroupMetadataManager_NumGroupsStable` | Number of stable consumer groups | rack, host_id |
| `kaf_GroupMetadataManager_NumGroupsPreparingRebalance` | Groups preparing to rebalance | rack, host_id |
| `kaf_GroupMetadataManager_NumGroupsDead` | Number of dead consumer groups | rack, host_id |
| `kaf_GroupMetadataManager_NumGroupsCompletingRebalance` | Groups completing rebalance | rack, host_id |
| `kaf_GroupMetadataManager_NumGroupsEmpty` | Number of empty consumer groups | rack, host_id |
Request Metrics¶
| Dashboard Metric | Description | Attributes |
|---|---|---|
| `kaf_RequestMetrics_RequestsPerSec` | Request rate per second | axonfunction (rate), function (Count), request, rack, host_id |
| `kaf_RequestMetrics_TotalTimeMs` | Total request processing time | request (Fetch), function (percentiles), rack, host_id |
Query Examples¶
Healthcheck Queries¶
// Active Controllers (should be 1)
sum(kaf_KafkaController_ActiveControllerCount{host_id!=""})
// Under min insync replicas partitions
kaf_ReplicaManager_UnderMinIsrPartitionCount{dc=~'$dc',rack=~'$rack',host_id=~'$host_id'}
// Under Replicated Partitions
kaf_ReplicaManager_UnderReplicatedPartitions{dc=~'$dc',rack=~'$rack',host_id=~'$host_id'}
// Offline Partitions
kaf_KafkaController_OfflinePartitionsCount{dc=~'$dc',rack=~'$rack',host_id=~'$host_id'}
Network Throughput¶
// Cluster network throughput - Bytes in
sum(kaf_BrokerTopicMetrics_BytesInPerSec{axonfunction='rate',rack=~'$rack',host_id=~'$host_id', topic!='',node_type='$node_type'})
// Cluster network throughput - Bytes out
sum(kaf_BrokerTopicMetrics_BytesOutPerSec{axonfunction='rate',rack=~'$rack',host_id=~'$host_id', topic!='',node_type='$node_type'})
// Incoming Messages
sum(kaf_BrokerTopicMetrics_MessagesInPerSec{axonfunction='rate',rack=~'$rack',host_id=~'$host_id', topic=''})
Group Coordinator¶
// Consumer groups per coordinator
kaf_GroupMetadataManager_NumGroups{rack=~'$rack',host_id=~'$host_id'}
// Consumer groups by state
sum(kaf_GroupMetadataManager_NumGroupsStable{rack=~'$rack',host_id=~'$host_id'})
sum(kaf_GroupMetadataManager_NumGroupsPreparingRebalance{rack=~'$rack',host_id=~'$host_id'})
sum(kaf_GroupMetadataManager_NumGroupsDead{rack=~'$rack',host_id=~'$host_id'})
Request Rates¶
// Total Request Per Sec
sum(kaf_RequestMetrics_RequestsPerSec{axonfunction='rate',function='Count',rack=~'$rack',host_id=~'$host_id'}) by (host_id)
// Metadata Request Per Sec
sum(kaf_RequestMetrics_RequestsPerSec{axonfunction='rate',function='Count',request='Metadata',rack=~'$rack',host_id=~'$host_id'}) by (host_id)
Panel Organization¶
Healthcheck Section¶
- Active Controllers - Counter showing cluster controller status
- Brokers Online - Number of active brokers
- Online Partitions - Total partition count
- Offline Partitions - Partitions without leaders
- Preferred Replica Imbalance - Leader distribution health
- Under Replicated Partitions - Replication lag indicator
- Connections - Total client connections
- Under min insync replicas partitions - Critical replication status
- Unclean Leader Election Rate - Data loss risk indicator
- Cluster network throughput - Overall I/O performance
- Incoming Messages - Message ingestion rate
- Cluster Connections - Connection trend
General Section¶
- Broker Count - Total brokers in cluster
- Active Controller - Controller assignment over time
- Request Handler Avg Idle Percent - Request handler capacity
- Under Replicated Partitions - Replication health trends
- Unclean Leader Elections Per Sec - Data integrity monitoring
- In-sync replicas Shrinks vs Expands - ISR stability
Group Coordinator Section¶
- Consumer groups number per coordinator - Group distribution
- No consumer groups per state - Group lifecycle monitoring
Request Rate Section¶
- Total Request Per Sec - Overall request load
- Metadata Request Per Sec - Metadata request patterns
Filters¶
- rack - Filter by rack location
- node (`host_id`) - Filter by specific Kafka broker
- topic - Filter by Kafka topic
- node type - Filter by node type
- percentile - Select latency percentile (for request metrics)
- groupBy - Dynamic grouping (topic, host_id)
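
The filters appear in the dashboard queries as `$`-prefixed variables such as `$rack` and `$host_id`. The following Python sketch illustrates how such a substitution could work. It is not AxonOps code: the `render_query` helper is hypothetical, and falling back to `.*` (match everything) for an unset filter is an assumption made for illustration.

```python
import re


def render_query(template: str, filters: dict[str, str]) -> str:
    """Replace each $name placeholder with the selected filter value.

    Hypothetical helper: unset filters fall back to '.*' (match all),
    which is an assumption, not documented AxonOps behaviour.
    """
    def substitute(match: re.Match) -> str:
        return filters.get(match.group(1), ".*")

    return re.sub(r"\$(\w+)", substitute, template)


# Query text copied from the Group Coordinator examples in this document.
template = "kaf_GroupMetadataManager_NumGroups{rack=~'$rack',host_id=~'$host_id'}"

# Only the rack filter is set; host_id falls back to the match-all pattern.
print(render_query(template, {"rack": "rack1"}))
```

Because the dashboard queries use the regex matcher `=~`, a match-all pattern leaves a label unconstrained, which is why it is a convenient stand-in for "no filter selected" in this sketch.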
Understanding the Metrics¶
Critical Health Indicators¶
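- Active Controllers: Must be exactly 1 across the cluster. 0 means no controller is elected; more than 1 points to a split-brain condition
- Offline Partitions: Should be 0. Any value > 0 means those partitions have no leader and are unavailable for reads and writes
- Under Replicated Partitions: Should be 0. A sustained non-zero value indicates replication lag or an unavailable broker
- Under Min ISR: Should be 0. Critical: affected partitions have fewer in-sync replicas than `min.insync.replicas`, so producers using `acks=all` are rejected and the risk of data loss increases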
Performance Indicators¶
- Network Throughput: Monitor for capacity planning
- Request Handler Idle %: Lower values indicate high load
- ISR Shrinks/Expands: Frequent changes indicate instability
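
One way to read the ISR panel is to compare the shrink and expand rates over the same window. The Python sketch below is a rough interpretation aid, not part of the dashboard: the `classify_isr` function, its labels, and the tolerance are all hypothetical. Since the dashboard uses `MeanRate`, which is a smoothed value, treat the result as a coarse signal only.

```python
def classify_isr(shrink_rate: float, expand_rate: float,
                 tolerance: float = 1e-9) -> str:
    """Classify ISR behaviour from shrink/expand rates for the same window.

    The labels are illustrative, not AxonOps or Kafka terminology.
    """
    if shrink_rate <= tolerance and expand_rate <= tolerance:
        return "stable"          # no ISR changes at all
    diff = shrink_rate - expand_rate
    if diff > tolerance:
        return "shrinking"       # replicas left the ISR and have not all rejoined
    if diff < -tolerance:
        return "recovering"      # replicas are rejoining faster than they leave
    return "flapping"            # replicas leave and rejoin at the same rate


print(classify_isr(0.0, 0.0))
print(classify_isr(0.5, 0.1))
```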
Consumer Group Health¶
Group States:
- Stable: Normal operating state
- Rebalancing: Temporary during membership changes
- Dead: Groups that need cleanup
- Empty: Groups without active members
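
To make the state breakdown concrete, the Python sketch below turns per-state group counts into shares of the total. The state keys mirror the suffixes of the `kaf_GroupMetadataManager_NumGroups*` metrics listed above; the `summarize_groups` helper and the sample counts are hypothetical. Folding `PreparingRebalance` and `CompletingRebalance` into a single "Rebalancing" bucket is an assumption based on the state list above.

```python
def summarize_groups(counts: dict[str, int]) -> dict[str, float]:
    """Return the share of consumer groups in each lifecycle bucket."""
    total = sum(counts.values())
    if total == 0:
        return {}
    # Both rebalance phases are reported as one "Rebalancing" bucket.
    rebalancing = (counts.get("PreparingRebalance", 0)
                   + counts.get("CompletingRebalance", 0))
    return {
        "Stable": counts.get("Stable", 0) / total,
        "Rebalancing": rebalancing / total,
        "Dead": counts.get("Dead", 0) / total,
        "Empty": counts.get("Empty", 0) / total,
    }


# Illustrative values only, not taken from a real cluster.
sample = {"Stable": 8, "PreparingRebalance": 1, "CompletingRebalance": 1,
          "Dead": 0, "Empty": 10}
print(summarize_groups(sample))
```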
Best Practices¶
Monitoring Guidelines¶
Set Alerts for:
- Active Controllers ≠ 1
- Offline Partitions > 0
- Under Replicated Partitions > 0
- Unclean Leader Elections > 0
Regular Checks:
- Network throughput trends
- Consumer group stability
- Request rate patterns
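
The alert conditions above can be sketched as simple threshold checks. This Python example is illustrative only: the `Sample` type and `firing_alerts` function are hypothetical, and in practice these rules would be configured in the alerting system rather than in application code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One cluster-wide reading of the four alert metrics (hypothetical type)."""
    active_controllers: int
    offline_partitions: int
    under_replicated_partitions: int
    unclean_leader_elections: float


def firing_alerts(sample: Sample) -> list[str]:
    """Return the names of the alert conditions that the sample violates."""
    alerts = []
    if sample.active_controllers != 1:
        alerts.append("Active Controllers != 1")
    if sample.offline_partitions > 0:
        alerts.append("Offline Partitions > 0")
    if sample.under_replicated_partitions > 0:
        alerts.append("Under Replicated Partitions > 0")
    if sample.unclean_leader_elections > 0:
        alerts.append("Unclean Leader Elections > 0")
    return alerts


# Illustrative readings: a healthy cluster and one with two problems.
healthy = Sample(1, 0, 0, 0.0)
degraded = Sample(0, 0, 12, 0.0)
print(firing_alerts(healthy))
print(firing_alerts(degraded))
```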
Troubleshooting¶
No Active Controller:
- Check ZooKeeper connectivity (or the controller quorum on KRaft-based clusters)
- Review controller logs
- Check for network partitions
High Under-Replicated Partitions:
- Check broker health
- Verify network bandwidth
- Review replica lag settings (for example `replica.lag.time.max.ms`)
Consumer Group Issues:
- Monitor rebalance frequency
- Check consumer lag
- Verify coordinator load
Data Resolution¶
- Most metrics use `low` resolution for efficiency
- Rate metrics use `axonfunction='rate'` for accurate per-second calculations
- Percentile metrics available for latency measurements
Units¶
- Bytes: Network throughput (bytes/sec)
- short: Counts and rates
- percent: Utilization metrics (0-100)
- rps: Requests per second
Notes¶
- Empty topic filter (`topic=''`) shows aggregate metrics
- `host_id!=""` ensures only active brokers are counted
- The `node_type` filter allows monitoring mixed clusters
- ISR metrics use `MeanRate` for smoothed values
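
The topic filter matters when summing, because the aggregate series (`topic=''`) and the per-topic series (`topic!=''`) describe the same traffic. The Python sketch below uses made-up sample values and a hypothetical `total` helper to show why a sum with no topic filter would count traffic twice. It assumes the aggregate series equals the sum of the per-topic series, which this document does not state explicitly.

```python
# Illustrative samples: one aggregate series (empty topic) plus two topics.
samples = [
    {"topic": "", "bytes_in_per_sec": 300.0},
    {"topic": "orders", "bytes_in_per_sec": 100.0},
    {"topic": "payments", "bytes_in_per_sec": 200.0},
]


def total(rows, keep) -> float:
    """Sum bytes_in_per_sec over the rows whose topic label passes `keep`."""
    return sum(row["bytes_in_per_sec"] for row in rows if keep(row["topic"]))


aggregate = total(samples, lambda topic: topic == "")    # like topic=''
per_topic = total(samples, lambda topic: topic != "")    # like topic!=''
unfiltered = total(samples, lambda topic: True)          # no topic filter

print(aggregate, per_topic, unfiltered)  # 300.0 300.0 600.0
```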