Monitoring Operations¶
Effective Cassandra operations require continuous monitoring of cluster health, performance metrics, and resource utilization. This guide covers what to monitor, how to interpret metrics, and how to respond to alerts.
Proactive vs Reactive Operations
The goal of monitoring is to detect and resolve issues before they impact users. Establish baselines during normal operation, set alerts on deviations, and investigate anomalies promptly.
Monitoring Architecture¶
Data Collection Layers¶
Metrics flow in from several layers, listed below, and are aggregated by whichever monitoring stack you run, whether AxonOps, Prometheus-style exporters, or custom scripts.
Metric Sources¶
| Source | Type | Access Method |
|---|---|---|
| JMX MBeans | Performance metrics | JMX client, exporters |
| nodetool | Operational commands | CLI |
| System tables | Internal state | CQL queries |
| OS metrics | Resource utilization | Node exporter |
| Logs | Events, errors | Log aggregation |
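Each of these sources can be sampled directly from the shell for a quick spot check. A minimal sketch, assuming default log paths and that cqlsh and iostat are available on the node (JMX access is covered under Querying JMX below):
# nodetool: operational view of the node and cluster
nodetool status
# System tables: internal state via CQL
cqlsh -e "SELECT peer, release_version FROM system.peers;"
# OS metrics: CPU, I/O wait, and device utilization
iostat -x 1 1
# Logs: count errors in the main operational log
grep -c ERROR /var/log/cassandra/system.log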
Critical Metrics¶
Cluster Health Metrics¶
Must-monitor metrics for cluster stability:
| Metric | JMX Path | Healthy Range | Alert Threshold |
|---|---|---|---|
| Live nodes | StorageService.LiveNodes | All nodes | Any node down |
| Unreachable nodes | StorageService.UnreachableNodes | Empty | Any node unreachable |
| Schema versions | StorageService.SchemaVersion | Single version | Multiple versions >5 min |
| Pending compactions | Compaction.PendingTasks | <50 | >100 sustained |
| Dropped messages | DroppedMessage.Dropped | 0 | Any drops |
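Two of the rows above can be turned into a simple automated check. This is a sketch only; the grep/awk parsing assumes current nodetool output formats, and the thresholds mirror the table:
#!/bin/bash
# cluster-health-alert.sh (illustrative)
# Down nodes have a state code starting with D (DN, DL, DJ, DM)
if nodetool status | grep -Eq '^D[NLJM] '; then
    echo "CRITICAL: one or more nodes are reported down"
fi
# Pending compactions above the sustained-alert threshold
pending=$(nodetool compactionstats | awk '/pending tasks/ {print $3; exit}')
if [ "${pending:-0}" -gt 100 ]; then
    echo "WARNING: ${pending} pending compactions"
fi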
Read Performance¶
| Metric | Description | Healthy Range | Alert |
|---|---|---|---|
| Read latency (P99) | 99th percentile read time | <50ms | >100ms |
| Read timeouts | Timed out read requests | 0 | >0 |
| Key cache hit rate | Cache efficiency | >80% | <50% |
| Row cache hit rate | Row cache efficiency | >90% (if enabled) | <70% |
| Tombstone scans | Tombstones per read | <1000 | >5000 |
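Latency percentiles, cache efficiency, and tombstone activity can all be checked per node without a monitoring stack. A quick sketch (keyspace and table names are placeholders):
# Read/write latency percentiles (including P99) and partition size histograms for one table
nodetool tablehistograms <keyspace> <table>
# Key cache hit rate is reported in the node summary
nodetool info | grep -i "key cache"
# Tombstones scanned per read for one table
nodetool tablestats <keyspace>.<table> | grep -i "tombstones"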
Write Performance¶
| Metric | Description | Healthy Range | Alert |
|---|---|---|---|
| Write latency (P99) | 99th percentile write time | <20ms | >50ms |
| Write timeouts | Timed out write requests | 0 | >0 |
| Memtable size | Memory used by memtables | <heap/3 | >heap/2 |
| Commit log size | Pending commit log | <1GB | >2GB |
| Hints stored | Pending hints | 0 | >1000 |
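On the write path, hint accumulation, commit log growth, and memtable pressure are all visible from the node itself. A sketch assuming default data directories:
# Pending hints accumulated for unreachable replicas
du -sh /var/lib/cassandra/hints
# Commit log size on disk
du -sh /var/lib/cassandra/commitlog
# Memtable size and flush activity for a specific table
nodetool tablestats <keyspace>.<table> | grep -i "memtable"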
Resource Utilization¶
| Metric | Source | Healthy Range | Alert |
|---|---|---|---|
| Heap usage | JMX | <70% | >85% |
| GC pause time | JMX | <500ms | >1s |
| GC frequency | JMX | <5/min | >10/min |
| Disk usage | OS | <70% | >80% |
| Disk I/O wait | OS | <20% | >40% |
| CPU usage | OS | <70% | >85% |
| Network throughput | OS | Within capacity | Near saturation |
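Most of these resource figures are available both from the OS and from the node itself; a quick spot check might look like this (data directory path is the default):
# Heap and off-heap usage as reported by the node
nodetool info | grep -i "heap"
# GC pause counts and elapsed time since the last call
nodetool gcstats
# Disk usage and I/O wait on the data volume
df -h /var/lib/cassandra
iostat -x 1 3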
nodetool Monitoring Commands¶
Quick Health Check¶
#!/bin/bash
# daily-health-check.sh
echo "=== Cluster Status ==="
nodetool status
echo -e "\n=== Schema Agreement ==="
nodetool describecluster | grep -A 5 "Schema versions"
echo -e "\n=== Pending Compactions ==="
nodetool compactionstats | head -20
echo -e "\n=== Thread Pool Status ==="
nodetool tpstats | grep -v "^$"
echo -e "\n=== Dropped Messages ==="
nodetool tpstats | grep -i dropped
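Scheduling the script keeps a consistent daily record to compare against, for example with a crontab entry like the following (path and schedule are illustrative; note that % must be escaped in crontab):
0 7 * * * /opt/scripts/daily-health-check.sh > /var/log/cassandra/health-$(date +\%F).log 2>&1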
Detailed Performance Analysis¶
# Table statistics for specific keyspace
nodetool tablestats <keyspace>
# Per-table read/write latencies
nodetool tablestats <keyspace>.<table> | grep -E "latency|Bloom"
# Compaction throughput
nodetool compactionstats
# GC statistics
nodetool gcstats
# Streaming status
nodetool netstats
# Client connections
nodetool clientstats
Ring and Token Information¶
# Token distribution
nodetool ring
# Endpoints for a key
nodetool getendpoints <keyspace> <table> <key>
# Ownership percentages (status, address, Owns; the Load value spans two awk fields, so Owns is $6)
nodetool status | awk '{print $1, $2, $6}'
JMX Metrics Reference¶
Key MBean Paths¶
Cluster metrics:
org.apache.cassandra.metrics:type=Storage,name=Load
org.apache.cassandra.metrics:type=Storage,name=Exceptions
org.apache.cassandra.metrics:type=ClientRequest,scope=Read,name=Latency
org.apache.cassandra.metrics:type=ClientRequest,scope=Write,name=Latency
Table metrics:
org.apache.cassandra.metrics:type=Table,keyspace=<ks>,scope=<table>,name=ReadLatency
org.apache.cassandra.metrics:type=Table,keyspace=<ks>,scope=<table>,name=WriteLatency
org.apache.cassandra.metrics:type=Table,keyspace=<ks>,scope=<table>,name=LiveSSTableCount
org.apache.cassandra.metrics:type=Table,keyspace=<ks>,scope=<table>,name=TombstoneScannedHistogram
Thread pool metrics:
org.apache.cassandra.metrics:type=ThreadPools,path=request,scope=ReadStage,name=PendingTasks
org.apache.cassandra.metrics:type=ThreadPools,path=request,scope=MutationStage,name=PendingTasks
org.apache.cassandra.metrics:type=ThreadPools,path=internal,scope=CompactionExecutor,name=PendingTasks
Compaction metrics:
org.apache.cassandra.metrics:type=Compaction,name=PendingTasks
org.apache.cassandra.metrics:type=Compaction,name=TotalCompactionsCompleted
org.apache.cassandra.metrics:type=Compaction,name=BytesCompacted
Querying JMX¶
# Using jmxterm
java -jar jmxterm.jar -l localhost:7199
> domain org.apache.cassandra.metrics
> bean type=ClientRequest,scope=Read,name=Latency
> get 99thPercentile
# Using jconsole (GUI)
jconsole localhost:7199
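jmxterm can also be driven non-interactively, which makes ad-hoc JMX reads scriptable. The jar name and the -n flag below assume a recent jmxterm release:
# Read the P99 read latency without opening an interactive session
echo "get -b org.apache.cassandra.metrics:type=ClientRequest,scope=Read,name=Latency 99thPercentile" \
  | java -jar jmxterm.jar -l localhost:7199 -n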
Log Monitoring¶
Log Levels and Locations¶
| Log | Location | Purpose |
|---|---|---|
| system.log | /var/log/cassandra/system.log | Main operational log |
| debug.log | /var/log/cassandra/debug.log | Detailed debugging |
| gc.log | /var/log/cassandra/gc.log | GC activity |
Critical Log Patterns¶
# Errors requiring immediate attention
grep -E "ERROR|FATAL" /var/log/cassandra/system.log | tail -50
# OutOfMemory events
grep -i "OutOfMemory\|OOM" /var/log/cassandra/system.log
# Compaction issues
grep -i "compaction" /var/log/cassandra/system.log | grep -i "error\|fail"
# Streaming problems
grep -i "stream" /var/log/cassandra/system.log | grep -i "error\|fail"
# Gossip issues
grep -i "gossip" /var/log/cassandra/system.log | grep -i "error\|fail"
# Dropped messages
grep -i "dropped" /var/log/cassandra/system.log
# Slow queries (if enabled; by default these are logged to debug.log)
grep -i "operations were slow" /var/log/cassandra/debug.log
Enabling Slow Query Logging¶
# cassandra.yaml (Cassandra 3.11/4.0 name; in 4.1+ this setting is slow_query_log_timeout: 500ms)
slow_query_log_timeout_in_ms: 500
Alert Configuration¶
Alert Severity Levels¶
| Severity | Response Time | Examples |
|---|---|---|
| Critical | Immediate | Node down, disk full, OOM |
| Warning | Within 1 hour | High latency, compaction backlog |
| Info | Next business day | Elevated tombstones, GC time increase |
Recommended Alerts¶
Critical Alerts (Page immediately):
| Alert | Condition | Response |
|---|---|---|
| Node Down | Any node unreachable | Investigate immediately, check network/process |
| Disk Full | Disk usage >85% | Add capacity or clean up snapshots |
| OOM/Frequent GC | Full GC >5 times in 5 min | Investigate heap usage, potential memory leak |
| Schema Disagreement | Multiple schema versions >5 min | Check for stuck schema migrations |
Warning Alerts:
| Alert | Condition | Response |
|---|---|---|
| High Read Latency | P99 >100ms sustained | Check compaction, tombstones, GC |
| Compaction Backlog | Pending >100 for 30 min | Increase throughput or investigate blockers |
| Dropped Messages | Any message drops | Check thread pools, network, timeouts |
| Hints Growing | >1000 hints stored | Check target node health |
AxonOps provides pre-configured alerts for these conditions. See Setup Alert Rules for configuration details.
Dashboard Design¶
Essential Dashboard Panels¶
Cluster Overview:
- Node status (up/down) per DC
- Total cluster load
- Request rates (reads/writes per second)
- Error rates
Performance:
- P50/P95/P99 read latency
- P50/P95/P99 write latency
- Requests per second (by node)
- Timeouts per second
Resources:
- Heap usage per node
- Disk usage per node
- CPU usage per node
- Network I/O per node
Operations:
- Pending compactions
- SSTable count
- Tombstone ratios
- Hint storage
AxonOps Dashboards¶
AxonOps provides pre-built dashboards for Cassandra monitoring:
- Cluster Overview: Node status, load distribution, request rates across all nodes
- Node Details: Per-node metrics including heap, disk, CPU, and thread pools
- Table Metrics: Per-table read/write latency, SSTable counts, partition sizes
- Compaction: Pending tasks, throughput, history across the cluster
- Repair: Repair coverage, progress, and scheduling status
See Metrics Dashboard for dashboard usage and customization.
System Table Queries¶
Cluster State¶
-- Node status from system tables
SELECT peer, data_center, rack, release_version, tokens
FROM system.peers;
-- Local node info
SELECT cluster_name, data_center, rack, release_version
FROM system.local;
-- Schema versions
SELECT schema_version, peer FROM system.peers;
Size and Distribution¶
-- Approximate table sizes (per-range estimates)
SELECT keyspace_name, table_name,
       mean_partition_size,
       partitions_count
FROM system.size_estimates;
-- Compaction history
SELECT keyspace_name, columnfamily_name, compacted_at, bytes_in, bytes_out
FROM system.compaction_history
WHERE compacted_at > '2024-01-01'
ALLOW FILTERING;
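The same queries can be run non-interactively with cqlsh, which makes them easy to fold into scheduled checks. For example, a quick schema-agreement check (node address and any credentials are assumptions):
# Count distinct schema versions reported by peers; anything other than 1 warrants investigation
cqlsh <node_ip> -e "SELECT schema_version FROM system.peers;" \
  | grep -Eo '[0-9a-f]{8}-[0-9a-f-]{27}' | sort -u | wc -l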
Baseline and Capacity Planning¶
Establishing Baselines¶
Record metrics during normal operation periods:
#!/bin/bash
# baseline-capture.sh
DATE=$(date +%Y%m%d_%H%M)
OUTPUT="baseline_${DATE}.txt"
echo "Capturing baseline at $(date)" > $OUTPUT
echo -e "\n=== Table Stats ===" >> $OUTPUT
nodetool tablestats >> $OUTPUT
echo -e "\n=== Thread Pools ===" >> $OUTPUT
nodetool tpstats >> $OUTPUT
echo -e "\n=== GC Stats ===" >> $OUTPUT
nodetool gcstats >> $OUTPUT
echo -e "\n=== Compaction Stats ===" >> $OUTPUT
nodetool compactionstats >> $OUTPUT
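Comparing two captures side by side is often enough to spot drift in SSTable counts, latencies, and thread pool backlogs (file names below are examples):
diff --side-by-side --suppress-common-lines baseline_20240101_0700.txt baseline_20240301_0700.txt | less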
Capacity Metrics¶
Track these for capacity planning:
| Metric | Purpose | Growth Trigger |
|---|---|---|
| Disk usage | Storage capacity | >60% |
| Data per node | Node sizing | >500GB |
| Write rate | Throughput capacity | Near limits |
| P99 latency | Performance capacity | >SLA threshold |
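Growth triggers are only useful with a history behind them. One low-effort approach is to append per-node load and disk usage to a CSV on a schedule; a sketch, assuming default paths and unwrapped df output:
#!/bin/bash
# capacity-trend.sh (illustrative)
CSV=/var/log/cassandra/capacity-trend.csv
LOAD=$(nodetool info | awk -F': ' '/^Load/ {print $2}')
DISK=$(df -Ph /var/lib/cassandra/data | awk 'NR==2 {print $5}')
echo "$(date +%F),$(hostname),${LOAD},${DISK}" >> "$CSV"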
AxonOps Monitoring Platform¶
AxonOps provides purpose-built monitoring for Apache Cassandra, eliminating the complexity of assembling custom monitoring stacks.
Key Capabilities¶
| Capability | Description |
|---|---|
| Zero-configuration collection | Agent automatically discovers and collects all relevant Cassandra metrics |
| Pre-built dashboards | Production-tested dashboards for cluster, node, and table views |
| Historical analysis | Long-term metric storage with efficient compression |
| Cross-cluster visibility | Monitor multiple clusters from a single interface |
| Intelligent alerting | Pre-configured alerts with anomaly detection |
| Centralized logging | Aggregate and analyze logs from all nodes |
Operational Integration¶
AxonOps extends beyond metrics collection:
- Repair monitoring: Track repair progress and coverage across the cluster
- Backup monitoring: Verify backup completion and health status
- Capacity forecasting: Predict when resources will be exhausted
- Performance analysis: Identify slow queries and hot partitions
Getting Started¶
- AxonOps Cloud Setup - Quick start with AxonOps Cloud
- Agent Installation - Deploy the AxonOps agent
- Metrics Dashboards - Using the monitoring dashboards
- Alert Configuration - Configure alerting rules
Troubleshooting with Metrics¶
High Read Latency Investigation¶
# 1. Check if specific tables affected
nodetool tablestats | grep -A 10 "Table: problem_table"
# 2. Check tombstone counts
nodetool tablestats <ks>.<table> | grep -i tombstone
# 3. Check SSTable count
nodetool tablestats <ks>.<table> | grep "SSTable count"
# 4. Check compaction pending
nodetool compactionstats
# 5. Check GC activity
nodetool gcstats
High Write Latency Investigation¶
# 1. Check commit log disk
df -h /var/lib/cassandra/commitlog
# 2. Check memtable flush status
nodetool tpstats | grep -i memtable
# 3. Check mutation stage
nodetool tpstats | grep -i mutation
# 4. Check hints
nodetool tpstats | grep -i hint
# 5. Check disk I/O
iostat -x 1 5
Dropped Messages Investigation¶
# 1. Identify which message types dropped
nodetool tpstats | grep -i dropped
# 2. Check thread pool queues
nodetool tpstats | grep -i pending
# 3. Check whether drops are cluster-wide or isolated
#    (run the commands above on each node and compare)
# 4. Check network connectivity
ping -c 5 <other_node>
nc -zv <other_node> 7000
Best Practices¶
Monitoring Strategy¶
- Start with cluster-level metrics: Node count, total throughput, overall latency
- Drill down on anomalies: Identify affected nodes, tables, operations
- Correlate across metrics: High latency often correlates with GC, compaction, or disk I/O
- Keep historical data: Compare current vs baseline
Alert Hygiene¶
- Alert on symptoms, not causes: Alert on high latency, not high CPU (unless CPU is the issue)
- Avoid alert fatigue: Too many alerts lead to ignoring alerts
- Include runbook links: Every alert should link to resolution steps
- Review and tune regularly: Adjust thresholds based on experience
Documentation¶
- Document normal ranges: What does "healthy" look like for this cluster?
- Record incidents: What happened, how it was detected, how it was resolved
- Maintain runbooks: Step-by-step procedures for common alerts
Related Documentation¶
- Cluster Management - Node operations to monitor
- Repair Operations - Repair progress monitoring
- Compaction Management - Compaction metrics
- Maintenance - Scheduled maintenance monitoring