Skip to content

AxonOps Kafka Performance Dashboard Metrics Mapping

Overview

The Kafka Performance Dashboard provides detailed insights into Kafka broker performance, including request processing times, throughput metrics, thread utilization, and queue sizes. This dashboard is essential for identifying performance bottlenecks and optimizing Kafka cluster performance.

Metrics Mapping

Dashboard Metric Description Attributes
Request Timing Metrics
kaf_RequestMetrics_TotalTimeMs Total time to process requests (produce/fetch) request={Produce,Fetch,FetchFollower}
kaf_RequestMetrics_RequestQueueTimeMs Time requests spend in request queue request={Fetch,FetchFollower}
Throughput Metrics
kaf_BrokerTopicMetrics_MessagesInPerSec Rate of messages received per second topic={topic}
kaf_BrokerTopicMetrics_BytesInPerSec Rate of bytes received per second topic={topic}
kaf_BrokerTopicMetrics_BytesOutPerSec Rate of bytes sent per second topic={topic}
Queue Metrics
kaf_RequestChannel_RequestQueueSize Current size of request queue -
kaf_RequestChannel_ResponseQueueSize Current size of response queue processor={id}
Thread Utilization Metrics
kaf_SocketServer_NetworkProcessorAvgIdlePercent Average idle percentage of network threads -
kaf_KafkaRequestHandlerPool_RequestHandlerAvgIdlePercent Average idle percentage of request handler threads -
Purgatory Metrics
kaf_DelayedOperationPurgatory_PurgatorySize Number of delayed operations in purgatory delayedOperation={Produce,Fetch}

Query Examples

Request Processing Time

// Total time for produce requests (selected percentile)
kaf_RequestMetrics_TotalTimeMs{request='Produce',function='$percentile',rack=~'$rack',host_id=~'$host_id'}

// Total time for fetch requests
kaf_RequestMetrics_TotalTimeMs{request='Fetch',function='$percentile',rack=~'$rack',host_id=~'$host_id'}

// Total time for follower fetch requests
kaf_RequestMetrics_TotalTimeMs{request='FetchFollower',function='$percentile',rack=~'$rack',host_id=~'$host_id'}

Message Throughput

// Messages per second per broker
sum(kaf_BrokerTopicMetrics_MessagesInPerSec{axonfunction='rate',rack=~'$rack',host_id=~'$host_id',topic=~'$topic'}) by (host_id)

// Bytes in per second per broker
sum(kaf_BrokerTopicMetrics_BytesInPerSec{axonfunction='rate',rack=~'$rack',host_id=~'$host_id',topic=~'$topic'}) by (host_id)

// Bytes out per second per broker
sum(kaf_BrokerTopicMetrics_BytesOutPerSec{axonfunction='rate',rack=~'$rack',host_id=~'$host_id',topic=~'$topic'}) by (host_id)

Thread Utilization

// Network processor idle percentage
kaf_SocketServer_NetworkProcessorAvgIdlePercent{rack=~'$rack',host_id=~'$host_id'} * 100

// Request handler idle percentage
kaf_KafkaRequestHandlerPool_RequestHandlerAvgIdlePercent{function='OneMinuteRate',rack=~'$rack',host_id=~'$host_id'} * 100

Queue Sizes

// Request queue size
kaf_RequestChannel_RequestQueueSize{rack=~'$rack',host_id=~'$host_id'}

// Response queue size
kaf_RequestChannel_ResponseQueueSize{processor='', rack=~'$rack',host_id=~'$host_id'}

Request Queue Time

// Fetch request queue time
kaf_RequestMetrics_RequestQueueTimeMs{request='Fetch',function=~'$percentile',rack=~'$rack',host_id=~'$host_id'}

// Follower fetch request queue time
kaf_RequestMetrics_RequestQueueTimeMs{request='FetchFollower',function=~'$percentile',rack=~'$rack',host_id=~'$host_id'}

Purgatory Sizes

// Producer purgatory size
kaf_DelayedOperationPurgatory_PurgatorySize{delayedOperation='Produce',rack=~'$rack',host_id=~'$host_id'}

// Fetch purgatory size
kaf_DelayedOperationPurgatory_PurgatorySize{delayedOperation='Fetch', rack=~'$rack',host_id=~'$host_id'}

Panel Organization

Overview Section

  • Empty row for spacing/organization

Throughput

  • Total time (produce/fetch) by percentile
  • Messages In Per Broker
  • Bytes In Per Broker
  • Bytes Out Per Broker

Purgatory

  • Producer Purgatory Size
  • Fetch Purgatory Size

Request Queue

  • Request Queue Fetch Follower Requests Time
  • Request Queue Fetch Requests Time

Thread Utilization

  • Request Queue Size
  • Response Queue Size
  • Network Processor Avg Idle Percent
  • Request Handler Avg Idle Percent

Filters

  • rack: Filter by rack location

  • host_id: Filter by specific host/broker

  • percentile: Select percentile for latency metrics (50th, 95th, 99th, etc.)

  • topic: Filter metrics by specific topics

Best Practices

Request Latency Monitoring

  • Monitor 99th percentile latencies to catch outliers
  • High total time indicates performance issues
  • Compare produce vs fetch latencies

Throughput Monitoring

  • Balance bytes in/out across brokers
  • Monitor message rates for capacity planning
  • Identify hot partitions or uneven load distribution

Queue Monitoring

  • Request queue size should remain low
  • High queue sizes indicate thread pool saturation
  • Monitor queue time to identify bottlenecks

Thread Utilization

  • Network processor idle % should be > 30%
  • Request handler idle % should be > 30%
  • Low idle percentages indicate need for more threads

Purgatory Monitoring

  • High purgatory sizes indicate delayed operations
  • Producer purgatory: waiting for replication
  • Fetch purgatory: waiting for data availability

Performance Tuning

  • Adjust thread pools based on utilization
  • Optimize batch sizes for better throughput
  • Monitor and tune request timeouts