AxonOps Kafka ZooKeeper Dashboard Metrics Mapping
Overview
The Kafka ZooKeeper Dashboard monitors the health and performance of the ZooKeeper ensemble that Kafka uses for cluster coordination in non-KRaft mode. It tracks connections, request latency, node statistics, and session management so that you can confirm ZooKeeper is functioning properly.
Metrics Mapping
| Dashboard Metric | Description | Attributes |
|---|---|---|
| ZooKeeper Health Metrics | | |
| zk_NumAliveConnections | Number of active client connections | port={port} |
| zk_NodeCount | Total number of znodes | port={port} |
| zk_WatchCount | Total number of watches | port={port} |
| zk_OutstandingRequests | Number of queued requests | port={port} |
| Request Latency Metrics | | |
| zk_MinRequestLatency | Minimum request latency | port={port} |
| zk_AvgRequestLatency | Average request latency | port={port} |
| zk_MaxRequestLatency | Maximum request latency | port={port} |
| Packet Metrics | | |
| zk_PacketsSent | Number of packets sent | port={port} |
| zk_PacketsReceived | Number of packets received | port={port} |
| Kafka-Reported ZooKeeper Metrics | | |
| kaf_ZooKeeperClientMetrics_ZooKeeperRequestLatencyMs | ZooKeeper request latency from Kafka perspective | - |
| kaf_SessionExpireListener_ZooKeeperExpiresPerSec | Rate of ZooKeeper session expirations | - |
| kaf_SessionExpireListener_ZooKeeperAuthFailuresPerSec | Rate of ZooKeeper authentication failures | - |
| kaf_SessionExpireListener_ZooKeeperSyncConnectsPerSec | Rate of ZooKeeper connections | - |
| kaf_SessionExpireListener_ZooKeeperDisconnectsPerSec | Rate of ZooKeeper disconnections | - |
Query Examples
Health Check Metrics
// Alive connections
zk_NumAliveConnections{rack=~'$rack',host_id=~'$host_id'}
// Total znode count
sum(zk_NodeCount{host_id=~'$host_id',type='kafka',node_type='zookeeper'})
// Total watch count
sum(zk_WatchCount{host_id=~'$host_id',type='kafka',node_type='zookeeper'})
// Outstanding requests
sum(zk_OutstandingRequests{host_id=~'$host_id',type='kafka',node_type='zookeeper'})
Request Latency
// Minimum request latency
zk_MinRequestLatency{host_id=~'$host_id',type='kafka',node_type='zookeeper'}
// Average request latency
zk_AvgRequestLatency{host_id=~'$host_id',node_type='zookeeper',type='kafka'}
// Maximum request latency
zk_MaxRequestLatency{host_id=~'$host_id',node_type='zookeeper',type='kafka'}
// Kafka-reported ZooKeeper latency
kaf_ZooKeeperClientMetrics_ZooKeeperRequestLatencyMs{rack=~'$rack',host_id=~'$host_id'}
Traffic Metrics
// Packets sent rate
sum(zk_PacketsSent{host_id=~'$host_id', axonfunction='rate', type='kafka',node_type='zookeeper'})
// Packets received rate
sum(zk_PacketsReceived{host_id=~'$host_id', axonfunction='rate', type='kafka',node_type='zookeeper'})
// Znode creation rate (rate of change of the znode count)
avg(zk_NodeCount{host_id=~'$host_id', axonfunction='rate',type='kafka',node_type='zookeeper'})
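The queries above select `axonfunction='rate'` to view a value as a per-second rate. As a rough illustration of that arithmetic only (the source does not describe how AxonOps computes it), the sketch below derives a rate from two samples. The `per_second_rate` helper, its `allow_decrease` flag, and the reset handling are assumptions made for this example.

```python
def per_second_rate(prev, curr, allow_decrease=False):
    """Return the per-second change between two (timestamp_s, value) samples.

    Hypothetical helper, not part of AxonOps. For cumulative counters such as
    packets sent, a drop in value usually means the counter was reset, so the
    sample pair is discarded (None). For gauges such as the znode count, pass
    allow_decrease=True so that a negative rate is returned instead.
    """
    (t0, v0), (t1, v1) = prev, curr
    elapsed = t1 - t0
    if elapsed <= 0:
        return None  # no time elapsed, rate is undefined
    if v1 < v0 and not allow_decrease:
        return None  # treat as a counter reset
    return (v1 - v0) / elapsed


# 600 packets over 60 seconds
print(per_second_rate((0, 1000), (60, 1600)))  # 10.0
```

With `allow_decrease=True`, a shrinking znode count yields a negative rate rather than being dropped.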
Connection Management
// Session expiration rate
kaf_SessionExpireListener_ZooKeeperExpiresPerSec{axonfunction='rate',rack=~'$rack',host_id=~'$host_id'}
// Authentication failure rate
kaf_SessionExpireListener_ZooKeeperAuthFailuresPerSec{axonfunction='rate',rack=~'$rack',host_id=~'$host_id'}
// Connection rate
kaf_SessionExpireListener_ZooKeeperSyncConnectsPerSec{axonfunction='rate',rack=~'$rack',host_id=~'$host_id'}
// Disconnection rate
kaf_SessionExpireListener_ZooKeeperDisconnectsPerSec{axonfunction='rate',rack=~'$rack',host_id=~'$host_id'}
Panel Organization
Overview Section
- Empty row for spacing/organization
Health Check
- Alive Connections
- Outstanding Requests
- Number of Watchers
- Number of ZNodes
Request Latency
- Packets (sent/received rates)
- Znode Creation Rate
- Request Latency - Minimum
- Request Latency - Average
- Request Latency - Maximum
- Kafka Reported Request Latency
Connections
- Zookeeper expired connections per sec
- Zookeeper auth failures per sec
- Zookeeper disconnect per sec
- Zookeeper connections per sec
Filters
- host_id: Filter by specific ZooKeeper node
- rack: Filter by rack location
Best Practices
Health Monitoring
- Monitor alive connections for capacity planning
- Outstanding requests should remain low
- High watch count may impact performance
- Monitor znode count growth
Latency Analysis
- Average latency should stay well below tickTime (ZooKeeper's basic time unit, configured in milliseconds)
- High max latency indicates potential issues
- Compare ZK-reported vs Kafka-reported latency
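The latency checks above can be sketched as a small function. This is a minimal illustration, assuming all three inputs are in milliseconds; the function name, the default `tick_time_ms` of 2000 (the value used in ZooKeeper's sample configuration), and the `max_multiple` threshold are assumptions for this example, not values taken from the dashboard.

```python
def latency_findings(avg_ms, max_ms, kafka_ms, tick_time_ms=2000.0, max_multiple=5.0):
    """Apply the latency guidelines to one set of readings.

    avg_ms / max_ms: zk_AvgRequestLatency / zk_MaxRequestLatency readings.
    kafka_ms: the Kafka-reported ZooKeeper request latency.
    Returns (findings, gap_ms), where gap_ms is the Kafka-reported latency
    minus the ZooKeeper-reported average.
    """
    findings = []
    if avg_ms >= tick_time_ms:
        findings.append("average latency is at or above tickTime")
    # max_multiple is an arbitrary illustrative threshold for "high" max latency
    if avg_ms > 0 and max_ms >= max_multiple * avg_ms:
        findings.append("maximum latency is far above the average")
    gap_ms = kafka_ms - avg_ms
    return findings, gap_ms


print(latency_findings(avg_ms=3.0, max_ms=10.0, kafka_ms=8.0))  # ([], 5.0)
```

A persistently large gap is worth investigating, since it is time the server-side average does not account for. The code makes no claim about its cause.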
Connection Management
- Monitor session expirations for client issues
- Auth failures indicate misconfigured credentials or unauthorized connection attempts
- High disconnect rate suggests network issues
Performance Tuning
- Adjust tickTime based on latency requirements
- Monitor packet rates for network saturation
- Balance connections across ensemble members
Troubleshooting
- High outstanding requests: Check ZK performance
- Session expirations: Review session timeout settings
- Auth failures: Check SASL/ACL configurations
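The troubleshooting list above maps naturally onto a lookup from observed values to suggested actions. The sketch below is one way to encode it. The metric keys and default thresholds are illustrative assumptions, not AxonOps alert defaults, and should be tuned to your environment.

```python
# Hints taken from the troubleshooting list; the keys and thresholds below
# are hypothetical and chosen only for illustration.
HINTS = {
    "outstanding_requests": "High outstanding requests: check ZK performance",
    "session_expirations_per_sec": "Session expirations: review session timeout settings",
    "auth_failures_per_sec": "Auth failures: check SASL/ACL configurations",
}

DEFAULT_THRESHOLDS = {
    "outstanding_requests": 10.0,        # assumed "low" ceiling
    "session_expirations_per_sec": 0.0,  # any expiration is flagged
    "auth_failures_per_sec": 0.0,        # any auth failure is flagged
}


def troubleshooting_hints(observed, thresholds=None):
    """Return the hints whose observed value exceeds its threshold."""
    limits = {**DEFAULT_THRESHOLDS, **(thresholds or {})}
    return [HINTS[name] for name in HINTS if observed.get(name, 0.0) > limits[name]]


for hint in troubleshooting_hints({"outstanding_requests": 25, "auth_failures_per_sec": 0.2}):
    print(hint)
```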
Capacity Planning
- Monitor znode growth rate
- Track connection count trends
- Plan for watch count scaling
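For capacity planning, a simple linear projection is often enough to turn a growth trend into a rough time horizon. The sketch below assumes you have exported `(day, value)` samples of a metric such as the znode count. The function names and the planning limit are hypothetical, and real growth may not be linear.

```python
def growth_per_day(samples):
    """Average daily growth between the first and last (day, value) sample."""
    (d0, v0), (d1, v1) = samples[0], samples[-1]
    if d1 <= d0:
        raise ValueError("need at least two samples spanning a positive interval")
    return (v1 - v0) / (d1 - d0)


def days_until(samples, limit):
    """Days until the metric reaches `limit` at the current growth rate.

    Returns 0.0 if the limit is already reached and None if the metric is
    flat or shrinking (the limit is never reached under a linear model).
    """
    slope = growth_per_day(samples)
    current = samples[-1][1]
    if current >= limit:
        return 0.0
    if slope <= 0:
        return None
    return (limit - current) / slope


# Illustrative numbers only: 10,000 znodes growing to 12,000 over 10 days
samples = [(0, 10_000), (10, 12_000)]
print(growth_per_day(samples))      # 200.0
print(days_until(samples, 20_000))  # 40.0
```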
ZooKeeper Ensemble Health
- Ensure all ensemble members are responsive
- Monitor for leader elections
- Check fsync latency on ZK data directory
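Outside the dashboard, ensemble responsiveness and leadership can be spot-checked with ZooKeeper's `mntr` four-letter command, which returns tab-separated key/value lines including `zk_server_state`. The sketch below assumes `mntr` is enabled on each server (newer ZooKeeper releases require it to be listed in `4lw.commands.whitelist`) and that the client port is 2181. The helper names are hypothetical.

```python
import socket


def fetch_mntr(host, port=2181, timeout=5.0):
    """Send `mntr` to one server and return its raw text reply.

    Raises OSError if the server is unreachable or does not answer in time,
    which can be treated as "member not responsive".
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"mntr")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_mntr(text):
    """Parse tab-separated `mntr` lines into a dict of strings."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition("\t")
        if sep:
            stats[key.strip()] = value.strip()
    return stats


def leader_count(mntr_by_host):
    """Count how many servers report themselves as leader."""
    return sum(
        1
        for text in mntr_by_host.values()
        if parse_mntr(text).get("zk_server_state") == "leader"
    )
```

A healthy multi-node ensemble should report exactly one leader. A count that changes between polls suggests a leader election took place.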