AxonOps Kafka System Dashboard Metrics Mapping

Overview

The Kafka System Dashboard monitors system-level resources across your Kafka cluster. It tracks CPU utilization, disk I/O, memory usage, JVM performance, and network statistics so that you can spot resource bottlenecks before they affect cluster health.

Metrics Mapping

Dashboard Metric	Description	Attributes
CPU Metrics
`host_CPU_Percent_Merge`	Overall CPU utilization percentage	time='real'
`host_CPU` (mode='iowait')	CPU time waiting for I/O operations	mode='iowait'
`host_load15`	15-minute load average	-
Disk Metrics
`host_Disk_UsedPercent`	Percentage of disk space used	-
`host_Disk_Used`	Absolute disk space used in bytes	-
`host_Disk_SectorsRead`	Disk read throughput in bytes/sec	-
`host_Disk_SectorsWrite`	Disk write throughput in bytes/sec	-
`host_Disk_IOCount`	Input/output operations per second	-
`host_Disk_avgqsz`	Average disk queue size	-
`host_Disk_WeightedIO`	Weighted I/O time in milliseconds	-
`host_Disk_IoTime`	Time spent on disk I/O operations	-
`host_filefd_allocated`	Number of allocated file descriptors	-
`host_filefd_max`	Maximum available file descriptors	-
Memory Metrics
`host_Memory_Used`	Used system memory in bytes	-
`host_Memory_Cached`	Cached memory in bytes	-
`host_Memory_UsedPercent`	Memory usage percentage	-
JVM Metrics
`jvm_Threading_`	JVM thread count	type=Threading
`jvm_GarbageCollector_G1_Young_Generation`	G1 GC statistics	name=G1 Young Generation
`jvm_GarbageCollector_ZGC`	ZGC statistics	name=ZGC
`jvm_GarbageCollector_Shenandoah_Cycles`	Shenandoah GC statistics	name=Shenandoah
`jvm_GarbageCollector_ConcurrentMarkSweep`	CMS GC statistics	name=ConcurrentMarkSweep
`jvm_GarbageCollector_ParNew`	ParNew GC statistics	name=ParNew
`jvm_Memory_`	JVM heap and non-heap memory usage	type=Memory
Network Metrics
`host_netIOCounters_BytesRecv`	Network bytes received per second	-
`host_netIOCounters_BytesSent`	Network bytes sent per second	-
`host_ntp_offset_seconds`	NTP time offset in seconds	-

Query Examples

CPU Usage by Rack

avg(host_CPU_Percent_Merge{time='real',rack=~'$rack',host_id=~'$host_id', node_type='$node_type', type='kafka'}) by (rack)

I/O Wait Percentage

avg(host_CPU{axonfunction='rate',mode='iowait',rack=~'$rack',host_id=~'$host_id',node_type='$node_type', type='kafka'}) by (host_id) * 100

Disk I/O Throughput

// Read throughput
host_Disk_SectorsRead{axonfunction='rate',rack=~'$rack',host_id=~'$host_id',partition=~'$partition',node_type='$node_type', type='kafka'}

// Write throughput
host_Disk_SectorsWrite{axonfunction='rate',rack=~'$rack',host_id=~'$host_id',partition=~'$partition', node_type='$node_type', type='kafka'}

File Descriptor Usage

(host_filefd_allocated{rack=~'$rack',host_id=~'$host_id', node_type='$node_type', type='kafka'} / host_filefd_max{rack=~'$rack',host_id=~'$host_id', node_type='$node_type', type='kafka'})*100

JVM Heap Memory Usage

jvm_Memory_{function='used',scope='HeapMemoryUsage',rack=~'$rack',host_id=~'$host_id', node_type='$node_type', type='kafka'}

GC Rate

jvm_GarbageCollector_G1_Young_Generation{axonfunction='rate',function='CollectionCount',rack=~'$rack',host_id=~'$host_id', node_type='$node_type', type='kafka'}
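Several of the queries above select axonfunction='rate', which turns a cumulative counter such as CollectionCount into a per-second value. How AxonOps computes this internally is not described on this page. The Python sketch below shows only the conventional calculation from two samples; the name per_second_rate and the (timestamp, value) sample layout are hypothetical and chosen for illustration.

```python
from typing import Optional, Tuple

# (unix timestamp in seconds, cumulative counter value); hypothetical layout
Sample = Tuple[float, float]


def per_second_rate(earlier: Sample, later: Sample) -> Optional[float]:
    """Average per-second increase of a counter between two samples.

    Returns None when the interval is empty or the counter went backwards
    (for example after a JVM restart), since no meaningful rate exists then.
    """
    (t0, v0), (t1, v1) = earlier, later
    elapsed = t1 - t0
    if elapsed <= 0 or v1 < v0:
        return None
    return (v1 - v0) / elapsed


# 30 young-generation collections observed over 60 seconds
print(per_second_rate((0.0, 100.0), (60.0, 130.0)))  # 0.5
```

Returning None on a counter reset is one possible design choice; a real collector might instead restart the calculation from the first sample after the reset.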

Panel Organization

Overview Section

Average CPU Usage per Rack

CPU and Load

CPU usage per host
Load Average (15m)
Avg IO wait CPU per Host

Disk Statistics

Disk % Usage by mount point
Used Disk Space Per Node
Bytes Read/Write Per Second
IOPS
Disk avgqsz
Disk WeightedIO time
% File Descriptors Allocated
Time Spent on Disk IO

Memory Statistics

Used memory
Cached memory
Used Memory Percentage
JVM Thread Count
GC Count per sec
GC Duration
JVM Utilization (Heap/Non-Heap)
JVM Heap Utilization

Network Statistics

Network Received (bytes)
Network Transmitted (bytes)
NTP offset (milliseconds)

Filters

node_type: Filter by node type (broker, controller, etc.)
rack: Filter by rack location
host_id: Filter by specific host/node
mountpoint: Filter by disk mount point
partition: Filter by disk partition
Interface: Filter by network interface

Best Practices

CPU Monitoring

Monitor CPU usage to ensure it stays below 80% for production workloads
High I/O wait can indicate a disk bottleneck
Compare load average with the CPU core count: a sustained load above the core count suggests the host is saturated
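The load-average guidance can be checked directly on a host. The Python sketch below assumes a Unix-like system, since os.getloadavg() is not available on Windows. The function names and the threshold of 1.0 per core are illustrative rules of thumb, not AxonOps defaults.

```python
import os


def load_per_core(load_avg: float, cores: int) -> float:
    """Normalize a load average by the number of CPU cores."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return load_avg / cores


def is_saturated(load_avg: float, cores: int, threshold: float = 1.0) -> bool:
    """True when the per-core load exceeds the (assumed) threshold."""
    return load_per_core(load_avg, cores) > threshold


if __name__ == "__main__" and hasattr(os, "getloadavg"):
    try:
        load15 = os.getloadavg()[2]  # third element is the 15-minute average
    except OSError:
        load15 = None  # load average could not be obtained on this host
    if load15 is not None:
        cores = os.cpu_count() or 1
        print(f"load15={load15:.2f} cores={cores} "
              f"saturated={is_saturated(load15, cores)}")
```

A 16-core broker with a 15-minute load of 8.0 yields 0.5 per core, which this check treats as healthy.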

Disk Monitoring

Keep disk usage below 85% to prevent performance degradation
Monitor IOPS and queue size for disk saturation
Track weighted I/O time for disk latency issues
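As a local cross-check of the 85% guideline, the following Python sketch computes the used percentage for a mount point with the standard library's shutil.disk_usage. It is an approximation: host_Disk_UsedPercent may be derived differently (for example, by excluding blocks reserved for root), and the helper names and default path are hypothetical.

```python
import shutil

DISK_USAGE_LIMIT_PERCENT = 85.0  # guideline taken from this page


def disk_used_percent(total_bytes: int, used_bytes: int) -> float:
    """Used space as a percentage of total space."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return used_bytes * 100 / total_bytes


def exceeds_limit(mount_point: str = "/",
                  limit: float = DISK_USAGE_LIMIT_PERCENT) -> bool:
    """True when the filesystem holding mount_point is above the limit."""
    usage = shutil.disk_usage(mount_point)
    return disk_used_percent(usage.total, usage.used) > limit


if __name__ == "__main__":
    print("root filesystem above limit:", exceeds_limit("/"))
```

For Kafka, the path passed in would typically be the mount point that holds the broker's log directories.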

Memory Monitoring

Monitor both system and JVM memory usage
Ensure adequate memory for page cache (cached memory)
Track JVM heap usage to prevent OutOfMemory errors

GC Monitoring

Monitor GC frequency and duration
Excessive GC activity indicates memory pressure
Different GC algorithms (G1, ZGC, Shenandoah) have different pause-time and throughput characteristics, so interpret GC counts and durations relative to the collector in use

Network Monitoring

Track network throughput for replication and client traffic
Monitor NTP offset to ensure time synchronization
High network usage may indicate replication storms
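The metrics table lists host_ntp_offset_seconds in seconds, while the dashboard panel is labelled in milliseconds. The Python sketch below shows that unit conversion together with a hypothetical tolerance check. The 100 ms default is an illustrative value, not a Kafka or AxonOps requirement.

```python
def offset_ms(offset_seconds: float) -> float:
    """Convert an NTP offset from seconds to milliseconds."""
    return offset_seconds * 1000.0


def clock_in_sync(offset_seconds: float, tolerance_ms: float = 100.0) -> bool:
    """True when the absolute offset is within the (assumed) tolerance."""
    return abs(offset_ms(offset_seconds)) <= tolerance_ms


print(offset_ms(0.5))        # 500.0
print(clock_in_sync(-0.5))   # False
```

The absolute value matters because the offset can be negative when the local clock runs ahead of the reference.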

File Descriptors

Monitor file descriptor usage to prevent "too many open files" errors
Kafka requires many file descriptors for log segments and network connections
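The allocated-to-maximum ratio can also be inspected on a Linux host. The Python sketch below parses /proc/sys/fs/file-nr, whose three fields are the allocated handles, the allocated-but-unused handles, and the system-wide maximum. That host_filefd_allocated and host_filefd_max are read from this file is an assumption, and the function names are hypothetical.

```python
from pathlib import Path
from typing import Tuple

FILE_NR = Path("/proc/sys/fs/file-nr")  # Linux only


def parse_file_nr(text: str) -> Tuple[int, int]:
    """Return (allocated, maximum) from the contents of file-nr."""
    allocated, _unused, maximum = (int(field) for field in text.split())
    return allocated, maximum


def fd_used_percent(allocated: int, maximum: int) -> float:
    """Allocated file handles as a percentage of the system maximum."""
    if maximum <= 0:
        raise ValueError("maximum must be positive")
    return allocated * 100 / maximum


if __name__ == "__main__" and FILE_NR.exists():
    allocated, maximum = parse_file_nr(FILE_NR.read_text())
    print(f"{fd_used_percent(allocated, maximum):.2f}% of file handles allocated")
```

This reports the system-wide limit only. The per-process limit that applies to the Kafka broker (ulimit -n) is separate and can trigger "too many open files" well before the system-wide maximum is reached.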