AxonOps Kafka Performance Dashboard Metrics Mapping
Overview
The Kafka Performance Dashboard shows how each Kafka broker is performing: request processing times, throughput, thread utilization, queue sizes, and purgatory sizes. Use it to locate performance bottlenecks and to guide cluster tuning.
Metrics Mapping
| Dashboard Metric | Description | Attributes |
|---|---|---|
| Request Timing Metrics | | |
| kaf_RequestMetrics_TotalTimeMs | Total time to process requests (produce/fetch) | request={Produce,Fetch,FetchFollower} |
| kaf_RequestMetrics_RequestQueueTimeMs | Time requests spend in request queue | request={Fetch,FetchFollower} |
| Throughput Metrics | | |
| kaf_BrokerTopicMetrics_MessagesInPerSec | Rate of messages received per second | topic={topic} |
| kaf_BrokerTopicMetrics_BytesInPerSec | Rate of bytes received per second | topic={topic} |
| kaf_BrokerTopicMetrics_BytesOutPerSec | Rate of bytes sent per second | topic={topic} |
| Queue Metrics | | |
| kaf_RequestChannel_RequestQueueSize | Current size of request queue | - |
| kaf_RequestChannel_ResponseQueueSize | Current size of response queue | processor={id} |
| Thread Utilization Metrics | | |
| kaf_SocketServer_NetworkProcessorAvgIdlePercent | Average idle percentage of network threads | - |
| kaf_KafkaRequestHandlerPool_RequestHandlerAvgIdlePercent | Average idle percentage of request handler threads | - |
| Purgatory Metrics | | |
| kaf_DelayedOperationPurgatory_PurgatorySize | Number of delayed operations in purgatory | delayedOperation={Produce,Fetch} |
Query Examples
Request Processing Time
// Total time for produce requests (selected percentile)
kaf_RequestMetrics_TotalTimeMs{request='Produce',function='$percentile',rack=~'$rack',host_id=~'$host_id'}
// Total time for fetch requests
kaf_RequestMetrics_TotalTimeMs{request='Fetch',function='$percentile',rack=~'$rack',host_id=~'$host_id'}
// Total time for follower fetch requests
kaf_RequestMetrics_TotalTimeMs{request='FetchFollower',function='$percentile',rack=~'$rack',host_id=~'$host_id'}
Message Throughput
// Messages per second per broker
sum(kaf_BrokerTopicMetrics_MessagesInPerSec{axonfunction='rate',rack=~'$rack',host_id=~'$host_id',topic=~'$topic'}) by (host_id)
// Bytes in per second per broker
sum(kaf_BrokerTopicMetrics_BytesInPerSec{axonfunction='rate',rack=~'$rack',host_id=~'$host_id',topic=~'$topic'}) by (host_id)
// Bytes out per second per broker
sum(kaf_BrokerTopicMetrics_BytesOutPerSec{axonfunction='rate',rack=~'$rack',host_id=~'$host_id',topic=~'$topic'}) by (host_id)
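The `sum(...) by (host_id)` aggregation in these queries collapses the per-topic series into one value per broker. A minimal Python sketch of that grouping follows; the broker names, topics, and rates are hypothetical sample data, not values returned by AxonOps.

```python
from collections import defaultdict

# Hypothetical per-topic samples: (host_id, topic, messages per second).
samples = [
    ("broker-1", "orders", 120.0),
    ("broker-1", "payments", 30.0),
    ("broker-2", "orders", 80.0),
]


def sum_by_host(rows):
    """Group per-topic rates by broker and add them up."""
    totals = defaultdict(float)
    for host_id, _topic, rate in rows:
        totals[host_id] += rate
    return dict(totals)


per_broker = sum_by_host(samples)
print(per_broker)
```

Each broker ends up with a single series, which is what the per-broker throughput panels plot.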
Thread Utilization
// Network processor idle percentage
kaf_SocketServer_NetworkProcessorAvgIdlePercent{rack=~'$rack',host_id=~'$host_id'} * 100
// Request handler idle percentage
kaf_KafkaRequestHandlerPool_RequestHandlerAvgIdlePercent{function='OneMinuteRate',rack=~'$rack',host_id=~'$host_id'} * 100
Queue Sizes
// Request queue size
kaf_RequestChannel_RequestQueueSize{rack=~'$rack',host_id=~'$host_id'}
// Response queue size
kaf_RequestChannel_ResponseQueueSize{processor='',rack=~'$rack',host_id=~'$host_id'}
Request Queue Time
// Fetch request queue time
kaf_RequestMetrics_RequestQueueTimeMs{request='Fetch',function=~'$percentile',rack=~'$rack',host_id=~'$host_id'}
// Follower fetch request queue time
kaf_RequestMetrics_RequestQueueTimeMs{request='FetchFollower',function=~'$percentile',rack=~'$rack',host_id=~'$host_id'}
Purgatory Sizes
// Producer purgatory size
kaf_DelayedOperationPurgatory_PurgatorySize{delayedOperation='Produce',rack=~'$rack',host_id=~'$host_id'}
// Fetch purgatory size
kaf_DelayedOperationPurgatory_PurgatorySize{delayedOperation='Fetch',rack=~'$rack',host_id=~'$host_id'}
Panel Organization
Overview Section
- Empty row for spacing/organization
Throughput
- Total time (produce/fetch) by percentile
- Messages In Per Broker
- Bytes In Per Broker
- Bytes Out Per Broker
Purgatory
- Producer Purgatory Size
- Fetch Purgatory Size
Request Queue
- Request Queue Fetch Follower Requests Time
- Request Queue Fetch Requests Time
Thread Utilization
- Request Queue Size
- Response Queue Size
- Network Processor Avg Idle Percent
- Request Handler Avg Idle Percent
Filters
- rack: Filter by rack location
- host_id: Filter by specific host/broker
- percentile: Select percentile for latency metrics (50th, 95th, 99th, etc.)
- topic: Filter metrics by specific topics
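The filters appear in the query examples as `$rack`, `$host_id`, `$percentile`, and `$topic` placeholders. The sketch below shows one way such placeholders could be filled in. It assumes that multiple selected values are joined with `|` for the `=~` regex matchers and that the percentile value is named like `99thPercentile`; both are assumptions for illustration, not confirmed AxonOps behaviour, and the rack and host names are hypothetical.

```python
from string import Template

# Query text copied from the Request Processing Time examples.
QUERY = Template(
    "kaf_RequestMetrics_TotalTimeMs{request='Produce',"
    "function='$percentile',rack=~'$rack',host_id=~'$host_id'}"
)


def render(percentile, racks, host_ids):
    """Fill the dashboard variables; several values become a regex alternation (assumed)."""
    return QUERY.substitute(
        percentile=percentile,
        rack="|".join(racks),
        host_id="|".join(host_ids),
    )


rendered = render("99thPercentile", ["rack1", "rack2"], ["host-a"])
print(rendered)
```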
Best Practices
Request Latency Monitoring
- Monitor 99th percentile latencies to catch outliers
- A rising total time means requests take longer end to end; check request queue time and thread idle percentages to see where the time is spent
- Compare produce vs fetch latencies
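To illustrate why the 99th percentile catches outliers that the median hides, here is a small sketch using the nearest-rank percentile definition. The latency values are hypothetical, and the percentile Kafka reports may be computed differently (for example from a sampled histogram).

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in ms: 98 fast requests and 2 slow ones.
latencies_ms = [5.0] * 98 + [500.0] * 2

p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
mean = sum(latencies_ms) / len(latencies_ms)
print(p50, p99, mean)
```

With this sample the median stays at the fast-request latency while the 99th percentile lands on the slow requests, so only the high percentile reveals the problem.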
Throughput Monitoring
- Balance bytes in/out across brokers
- Monitor message rates for capacity planning
- Identify hot partitions or uneven load distribution
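One simple way to spot uneven load is to compare each broker's rate with the cluster mean. The sketch below flags brokers above a chosen multiple of the mean; the 1.5 ratio, the function name `hot_brokers`, and the sample rates are illustrative assumptions rather than AxonOps defaults.

```python
def hot_brokers(bytes_in_per_sec, ratio=1.5):
    """Return brokers whose rate exceeds `ratio` times the cluster mean."""
    mean = sum(bytes_in_per_sec.values()) / len(bytes_in_per_sec)
    return sorted(
        host for host, rate in bytes_in_per_sec.items() if rate > ratio * mean
    )


# Hypothetical per-broker BytesInPerSec values.
rates = {"broker-1": 100.0, "broker-2": 110.0, "broker-3": 390.0}
flagged = hot_brokers(rates)
print(flagged)
```

A flagged broker is a starting point for checking partition leadership and topic placement, not proof of a fault.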
Queue Monitoring
- Request queue size should remain low
- A persistently high request queue size indicates that the request handler threads cannot keep up with incoming requests
- Monitor queue time to identify bottlenecks
Thread Utilization
- Network processor idle % should be > 30%
- Request handler idle % should be > 30%
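- Persistently low idle percentages suggest the broker needs more threads (num.network.threads for network processors, num.io.threads for request handlers)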
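The 30% guideline can be checked programmatically. This sketch assumes the raw idle metrics are fractions between 0 and 1, as the `* 100` in the Thread Utilization queries implies; the broker names and values are hypothetical.

```python
IDLE_THRESHOLD_PERCENT = 30.0  # guideline from the list above


def low_idle(idle_fractions, threshold=IDLE_THRESHOLD_PERCENT):
    """Convert 0-1 idle fractions to percent and return brokers below the threshold."""
    return {
        host: round(fraction * 100, 1)
        for host, fraction in idle_fractions.items()
        if fraction * 100 < threshold
    }


# Hypothetical RequestHandlerAvgIdlePercent values per broker.
handler_idle = {"broker-1": 0.82, "broker-2": 0.12}
saturated = low_idle(handler_idle)
print(saturated)
```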
Purgatory Monitoring
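- Purgatory holds requests that cannot be completed immediately; persistently high sizes mean many requests are waiting
- Producer purgatory: produce requests waiting for replicas to acknowledge the write
- Fetch purgatory: fetch requests waiting for enough data to become available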
Performance Tuning
- Adjust thread pools based on utilization
- Optimize batch sizes for better throughput
- Monitor and tune request timeouts