AxonOps Kafka System Dashboard Metrics Mapping¶
Overview¶
The Kafka System Dashboard provides comprehensive monitoring of system-level resources across your Kafka cluster. It tracks CPU utilization, disk I/O, memory usage, JVM performance, and network statistics to ensure optimal cluster health and performance.
Metrics Mapping¶
| Dashboard Metric | Description | Attributes |
|---|---|---|
| CPU Metrics | ||
host_CPU_Percent_Merge |
Overall CPU utilization percentage | time='real' |
host_CPU (mode='iowait') |
CPU time waiting for I/O operations | mode='iowait' |
host_load15 |
15-minute load average | - |
| Disk Metrics | ||
host_Disk_UsedPercent |
Percentage of disk space used | - |
host_Disk_Used |
Absolute disk space used in bytes | - |
host_Disk_SectorsRead |
Disk read throughput in bytes/sec | - |
host_Disk_SectorsWrite |
Disk write throughput in bytes/sec | - |
host_Disk_IOCount |
Input/output operations per second | - |
host_Disk_avgqsz |
Average disk queue size | - |
host_Disk_WeightedIO |
Weighted I/O time in milliseconds | - |
host_Disk_IoTime |
Time spent on disk I/O operations | - |
host_filefd_allocated |
Number of allocated file descriptors | - |
host_filefd_max |
Maximum available file descriptors | - |
| Memory Metrics | ||
host_Memory_Used |
Used system memory in bytes | - |
host_Memory_Cached |
Cached memory in bytes | - |
host_Memory_UsedPercent |
Memory usage percentage | - |
| JVM Metrics | ||
jvm_Threading_ |
JVM thread count | type=Threading |
jvm_GarbageCollector_G1_Young_Generation |
G1 GC statistics | name=G1 Young Generation |
jvm_GarbageCollector_ZGC |
ZGC statistics | name=ZGC |
jvm_GarbageCollector_Shenandoah_Cycles |
Shenandoah GC statistics | name=Shenandoah |
jvm_GarbageCollector_ConcurrentMarkSweep |
CMS GC statistics | name=ConcurrentMarkSweep |
jvm_GarbageCollector_ParNew |
ParNew GC statistics | name=ParNew |
jvm_Memory_ |
JVM heap and non-heap memory usage | type=Memory |
| Network Metrics | ||
host_netIOCounters_BytesRecv |
Network bytes received per second | - |
host_netIOCounters_BytesSent |
Network bytes sent per second | - |
host_ntp_offset_seconds |
NTP time offset in seconds | - |
Query Examples¶
CPU Usage by Rack¶
avg(host_CPU_Percent_Merge{time='real',rack=~'$rack',host_id=~'$host_id', node_type='$node_type', type='kafka'}) by (rack)
I/O Wait Percentage¶
avg(host_CPU{axonfunction='rate',mode='iowait',rack=~'$rack',host_id=~'$host_id',node_type='$node_type', type='kafka'}) by (host_id) * 100
Disk I/O Throughput¶
// Read throughput
host_Disk_SectorsRead{axonfunction='rate',rack=~'$rack',host_id=~'$host_id',partition=~'$partition',node_type='$node_type', type='kafka'}
// Write throughput
host_Disk_SectorsWrite{axonfunction='rate',rack=~'$rack',host_id=~'$host_id',partition=~'$partition', node_type='$node_type', type='kafka'}
File Descriptor Usage¶
(host_filefd_allocated{rack=~'$rack',host_id=~'$host_id'} / host_filefd_max{rack=~'$rack',host_id=~'$host_id', node_type='$node_type', type='kafka'})*100
JVM Heap Memory Usage¶
jvm_Memory_{function='used',scope='HeapMemoryUsage',rack=~'$rack',host_id=~'$host_id', node_type='$node_type', type='kafka'}
GC Rate¶
jvm_GarbageCollector_G1_Young_Generation{axonfunction='rate',function='CollectionCount',rack=~'$rack',host_id=~'$host_id', node_type='$node_type', type='kafka'}
Panel Organization¶
Overview Section
- Average CPU Usage per Rack
CPU and Load
- CPU usage per host
- Load Average (15m)
- Avg IO wait CPU per Host
Disk Statistics
- Disk % Usage by mount point
- Used Disk Space Per Node
- Bytes Read/Write Per Second
- IOPS
- Disk avgqsz
- Disk WeightedIO time
- % File Descriptors Allocated
- Time Spent on Disk IO
Memory Statistics
- Used memory
- Cached memory
- Used Memory Percentage
- JVM Thread Count
- GC Count per sec
- GC Duration
- JVM Utilization (Heap/Non-Heap)
- JVM Heap Utilization
Network Statistics
- Network Received (bytes)
- Network Transmitted (bytes)
- NTP offset (milliseconds)
Filters¶
-
node_type: Filter by node type (broker, controller, etc.)
-
rack: Filter by rack location
-
host_id: Filter by specific host/node
-
mountpoint: Filter by disk mount point
-
partition: Filter by disk partition
-
Interface: Filter by network interface
Best Practices¶
CPU Monitoring
- Monitor CPU usage to ensure it stays below 80% for production workloads
- High I/O wait indicates disk bottlenecks
- Monitor load average relative to CPU core count
Disk Monitoring
- Keep disk usage below 85% to prevent performance degradation
- Monitor IOPS and queue size for disk saturation
- Track weighted I/O time for disk latency issues
Memory Monitoring
- Monitor both system and JVM memory usage
- Ensure adequate memory for page cache (cached memory)
- Track JVM heap usage to prevent OutOfMemory errors
GC Monitoring
- Monitor GC frequency and duration
- Excessive GC activity indicates memory pressure
- Different GC algorithms (G1, ZGC, Shenandoah) have different characteristics
Network Monitoring
- Track network throughput for replication and client traffic
- Monitor NTP offset to ensure time synchronization
- High network usage may indicate replication storms
File Descriptors
- Monitor file descriptor usage to prevent "too many open files" errors
- Kafka requires many file descriptors for log segments and network connections