Cassandra Key Metrics Reference¶

This guide covers the essential metrics to monitor for a healthy Cassandra cluster.

Metrics Overview¶

The Four Golden Signals¶

Signal	Cassandra Metrics	Why It Matters
Latency	Read/Write p99	User experience
Traffic	Requests/second	Capacity planning
Errors	Timeouts, Unavailables	Service reliability
Saturation	CPU, Disk, Memory	Resource headroom

Metric Categories¶

Category	Description
Client Request Metrics	Read/Write latency, throughput, errors
Thread Pool Metrics	Active, Pending, Blocked, Completed
Storage Metrics	SSTable count, Disk usage, Compaction
JVM Metrics	Heap usage, GC pauses, Off-heap
System Metrics	CPU, Memory, Disk I/O, Network

Critical Metrics¶

1. Read/Write Latency¶

JMX Path:

org.apache.cassandra.metrics:type=ClientRequest,scope=Read,name=Latency
org.apache.cassandra.metrics:type=ClientRequest,scope=Write,name=Latency

nodetool:

nodetool proxyhistograms

Thresholds: | Percentile | Read | Write | |------------|------|-------| | p50 | < 5ms | < 2ms | | p99 | < 50ms | < 20ms | | p999 | < 200ms | < 100ms |

Recommended Alerts:

Alert	Condition	Duration	Severity
High Read Latency	p99 > 100ms	5 min	Warning
Critical Read Latency	p99 > 500ms	5 min	Critical
High Write Latency	p99 > 50ms	5 min	Warning
Critical Write Latency	p99 > 200ms	5 min	Critical

2. Request Throughput¶

JMX Path:

org.apache.cassandra.metrics:type=ClientRequest,scope=Read,name=Latency/Count
org.apache.cassandra.metrics:type=ClientRequest,scope=Write,name=Latency/Count

What to look for: - Sudden drops (node issues, network problems) - Unexpected spikes (traffic surge, retry storms) - Uneven distribution across nodes (hot spots)

3. Dropped Messages¶

JMX Path:

org.apache.cassandra.metrics:type=DroppedMessage,scope=READ,name=Dropped
org.apache.cassandra.metrics:type=DroppedMessage,scope=MUTATION,name=Dropped
org.apache.cassandra.metrics:type=DroppedMessage,scope=RANGE_SLICE,name=Dropped

nodetool:

nodetool tpstats | grep -E "Message|Dropped"

Thresholds: | Metric | Warning | Critical | |--------|---------|----------| | Any dropped | > 0 | > 100/min |

What drops mean: - Messages exceeded timeout while queued - System overloaded or GC paused too long - Need capacity increase or query optimization

4. Pending Tasks¶

JMX Path:

org.apache.cassandra.metrics:type=ThreadPools,path=request,scope=*,name=PendingTasks

nodetool:

nodetool tpstats

Key thread pools: | Pool | Warning | Critical | |------|---------|----------| | MutationStage | > 15 | > 50 | | ReadStage | > 15 | > 50 | | CompactionExecutor | > 32 | > 64 | | MemtableFlushWriter | > 4 | > 8 |

Storage Metrics¶

5. SSTable Count¶

JMX Path:

org.apache.cassandra.metrics:type=Table,name=LiveSSTableCount

nodetool:

nodetool tablestats my_keyspace.my_table | grep "SSTable count"

Thresholds: | Level | Count | Action | |-------|-------|--------| | Normal | < 20 | None | | Warning | 20-50 | Check compaction | | Critical | > 50 | Investigate |

High SSTable count indicates: - Compaction falling behind - Write-heavy workload - Inappropriate compaction strategy

6. Pending Compactions¶

JMX Path:

org.apache.cassandra.metrics:type=Compaction,name=PendingTasks

nodetool:

nodetool compactionstats

Thresholds: | Level | Pending | Action | |-------|---------|--------| | Normal | < 10 | None | | Warning | 10-50 | Monitor | | Critical | > 50 | Increase throughput |

7. Disk Usage¶

nodetool:

nodetool status  # Shows Load per node
df -h /var/lib/cassandra

Thresholds: | Level | Usage | Action | |-------|-------|--------| | Normal | < 50% | None | | Warning | 50-70% | Plan expansion | | Critical | > 70% | Urgent expansion |

Important: Leave 50% free for compaction operations.

JVM Metrics¶

8. Heap Usage¶

JMX Path:

java.lang:type=Memory/HeapMemoryUsage

nodetool:

nodetool info | grep "Heap Memory"

Thresholds: | Level | Usage | Action | |-------|-------|--------| | Normal | < 60% | None | | Warning | 60-80% | Monitor GC | | Critical | > 80% | Risk of OOM |

9. GC Pause Time¶

JMX Path:

java.lang:type=GarbageCollector,name=G1 Young Generation/CollectionTime
java.lang:type=GarbageCollector,name=G1 Old Generation/CollectionTime

Log analysis:

grep -E "GC pause" /var/log/cassandra/gc.log | tail -20

Thresholds: | Level | Pause | Frequency | |-------|-------|-----------| | Normal | < 200ms | Occasional | | Warning | 200-500ms | Frequent | | Critical | > 500ms | Any |

Error Metrics¶

10. Timeouts¶

JMX Path:

org.apache.cassandra.metrics:type=ClientRequest,scope=Read,name=Timeouts
org.apache.cassandra.metrics:type=ClientRequest,scope=Write,name=Timeouts

Recommended Alerts:

Alert	Condition	Severity
Read Timeouts	Any timeouts occurring	Warning
Write Timeouts	Any timeouts occurring	Warning
Sustained Timeouts	> 10 timeouts/min for 5 min	Critical

11. Unavailables¶

JMX Path:

org.apache.cassandra.metrics:type=ClientRequest,scope=Read,name=Unavailables
org.apache.cassandra.metrics:type=ClientRequest,scope=Write,name=Unavailables

Recommended Alerts:

Alert	Condition	Severity
Read Unavailables	Any unavailable errors	Critical
Write Unavailables	Any unavailable errors	Critical

Unavailables Indicate Serious Issues

Unavailable errors mean insufficient replicas are accessible to satisfy the consistency level. This typically indicates multiple node failures or network partitions requiring immediate investigation.

12. Exceptions¶

JMX Path:

org.apache.cassandra.metrics:type=Storage,name=Exceptions

nodetool:

nodetool info | grep "Exceptions"

Per-Table Metrics¶

Read/Write Latency per Table¶

JMX Path:

org.apache.cassandra.metrics:type=Table,keyspace=my_ks,scope=my_table,name=ReadLatency
org.apache.cassandra.metrics:type=Table,keyspace=my_ks,scope=my_table,name=WriteLatency

nodetool:

nodetool tablehistograms my_keyspace my_table

Partition Size¶

nodetool:

nodetool tablehistograms my_keyspace my_table | grep "Partition Size"

Thresholds: | Percentile | Warning | Critical | |------------|---------|----------| | p99 | > 50MB | > 100MB |

Tombstone Metrics¶

JMX Path:

org.apache.cassandra.metrics:type=Table,name=TombstoneScannedHistogram

nodetool:

nodetool tablestats my_keyspace | grep -i tombstone

Monitoring Quick Reference¶

nodetool Commands¶

# Overall health
nodetool status
nodetool info
nodetool describecluster

# Performance
nodetool tpstats
nodetool proxyhistograms
nodetool tablestats my_keyspace

# Storage
nodetool compactionstats
nodetool tablehistograms my_keyspace my_table

# Operations
nodetool netstats
nodetool gossipinfo

AxonOps Dashboard¶

AxonOps provides purpose-built dashboards for Cassandra monitoring, displaying all key metrics in a unified interface without requiring manual dashboard configuration.

Pre-built Cassandra dashboards include:

Cluster Overview — Node status, schema agreement, cluster health at a glance
Latency & Throughput — Read/write latency percentiles, request rates, error rates
Resource Utilization — Heap usage, CPU, disk I/O, network across all nodes
Compaction & Storage — Pending compactions, SSTable counts, disk usage trends
Per-Table Metrics — Table-level latency, partition sizes, tombstone counts
Thread Pool Status — Pending tasks, blocked threads, dropped messages

Key advantages over manual monitoring:

Aspect	Manual (nodetool/JMX)	AxonOps
Setup time	Hours to days	Minutes
Historical data	Not retained	Full retention
Cross-node correlation	Manual comparison	Automatic
Alerting	Separate configuration	Integrated
Query analysis	Not available	Slow query detection

See AxonOps Monitoring for dashboard features and configuration.

Alert Configuration¶

Critical Alerts (Page Immediately)¶

Alert	Condition	Duration	Response
Node Down	Node unreachable	1 min	Check process, network, hardware
Heap Critical	Heap usage > 85%	5 min	Investigate memory pressure, potential OOM
Disk Critical	Disk usage > 80%	5 min	Clear snapshots, add capacity
Dropped Messages	Any messages dropped	1 min	Check thread pools, timeouts, capacity
Unavailable Errors	Any unavailables	Immediate	Check replica availability

Warning Alerts (Review Soon)¶

Alert	Condition	Duration	Response
High Latency	p99 read > 100ms	10 min	Check compaction, GC, disk I/O
Compaction Backlog	Pending > 30	15 min	Check throughput, consider tuning
Heap Warning	Heap usage > 70%	10 min	Monitor trend, prepare mitigation
Hints Growing	Hints > 1000	10 min	Check target node health
Schema Disagreement	Multiple versions	5 min	Check for stuck migrations

AxonOps Alert Configuration¶

AxonOps provides pre-configured alerts for all critical Cassandra metrics. Alerts can be customized through the AxonOps dashboard:

Threshold adjustment — Modify alert thresholds based on workload characteristics
Notification routing — Route alerts to Slack, PagerDuty, email, or webhooks
Alert suppression — Configure maintenance windows to suppress expected alerts
Escalation policies — Define escalation paths for unacknowledged alerts

See Setup Alert Rules for detailed configuration instructions.

Next Steps¶

AxonOps Monitoring — Purpose-built Cassandra dashboards and alerting
Alerting Guide — Configure alert thresholds and notifications
JMX Reference — Complete JMX metrics reference
Troubleshooting — Diagnose and resolve issues