Cassandra Troubleshooting Guide¶
Most Cassandra problems have a small number of root causes. High read latency? Probably tombstones or large partitions—not CPU, despite what instincts might suggest. Out of memory? Usually a partition that grew too large for the heap to handle during compaction, not a sign that more RAM is needed. Timeouts? Check if compaction is falling behind.
The diagnosis process is consistent: check the logs for warnings, look at the metrics (especially p99 latency, pending compactions, and thread pool stats), and use nodetool to inspect specific tables. The symptoms point to the cause for those who know where to look.
This guide provides systematic procedures for diagnosing and resolving common issues.
Troubleshooting Framework¶
We use the SDRR Framework for consistent problem resolution:
- Symptoms: Observable behaviors that indicate the problem
- Diagnosis: Steps to identify the root cause
- Resolution: Step-by-step fix procedures
- Recovery: Verification and prevention
Quick Reference: Common Issues¶
Performance Issues¶
| Symptom | Likely Cause | Quick Action |
|---|---|---|
| High read latency | Data model, tombstones | Check nodetool tablestats |
| High write latency | Disk I/O, compaction | Check nodetool compactionstats |
| Request timeouts | Overload, disk issues | Check nodetool tpstats |
| Slow startup | Large commitlog | Check commitlog size |
Availability Issues¶
| Symptom | Likely Cause | Quick Action |
|---|---|---|
| Node down | OOM, disk full, crash | Check logs, systemctl status |
| Nodes not joining | Network, schema | Check gossip, firewalls |
| Inconsistent data | Repair needed | Run nodetool repair |
Resource Issues¶
| Symptom | Likely Cause | Quick Action |
|---|---|---|
| High CPU | GC, compaction | Check GC logs, compaction |
| High memory | Heap settings | Review JVM configuration |
| Disk full | Large tables, snapshots | Clean snapshots, add capacity |
First Response Checklist¶
When an issue occurs, gather this information first:
1. Check Node Status¶
# Cluster overview
nodetool status
# Node info
nodetool info
# Thread pool stats
nodetool tpstats
# Compaction status
nodetool compactionstats
2. Check Logs¶
# Recent errors
tail -100 /var/log/cassandra/system.log | grep -i error
# Warnings
tail -100 /var/log/cassandra/system.log | grep -i warn
# GC issues (pause wording varies by JVM version, so match case-insensitively)
grep -i "pause" /var/log/cassandra/gc.log | tail -20
3. Check Resources¶
# Disk space
df -h /var/lib/cassandra
# Memory
free -h
# CPU
top -b -n 1 | head -20
# I/O
iostat -x 1 5
4. Check Metrics¶
# Table stats for specific keyspace
nodetool tablestats my_keyspace
# Pending compactions
nodetool compactionstats | grep pending
# Dropped messages
nodetool tpstats | grep -i dropped
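During an incident it helps to capture the checklist output in one place so it can be compared later or attached to a support request. The following is a minimal sketch, not an official tool: the function names `capture` and `first_response`, the output directory, and the file names are all illustrative choices. A command that is not installed is recorded as missing rather than aborting the run.

```shell
# Run a command and save its output to <outdir>/<label>.txt.
# If the command is not installed, record that instead of failing.
capture() {
  local outdir="$1"
  local label="$2"
  shift 2
  if command -v "$1" >/dev/null 2>&1; then
    "$@" >"$outdir/$label.txt" 2>&1 || echo "exit status: $?" >>"$outdir/$label.txt"
  else
    echo "command not found: $1" >"$outdir/$label.txt"
  fi
}

# Collect the first-response checklist into one directory (hypothetical layout).
first_response() {
  local outdir="$1"
  mkdir -p "$outdir"
  capture "$outdir" status          nodetool status
  capture "$outdir" info            nodetool info
  capture "$outdir" tpstats         nodetool tpstats
  capture "$outdir" compactionstats nodetool compactionstats
  capture "$outdir" disk            df -h /var/lib/cassandra
  capture "$outdir" memory          free -h
  capture "$outdir" io              iostat -x 1 5
}

# Usage (not run automatically):
#   first_response "/tmp/cassandra-triage-$(date +%Y%m%d-%H%M%S)"
```

Writing each command to its own file keeps a failing or missing tool from hiding the output of the others.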
Diagnostic Commands Reference¶
nodetool Commands¶
| Command | Purpose |
|---|---|
| nodetool status | Cluster and node health |
| nodetool info | Node information |
| nodetool tpstats | Thread pool statistics |
| nodetool tablestats <ks> | Table statistics |
| nodetool compactionstats | Compaction status |
| nodetool netstats | Network status |
| nodetool gossipinfo | Gossip state |
| nodetool describecluster | Cluster description |
| nodetool proxyhistograms | Request latencies |
| nodetool tablehistograms <ks> <table> | Table latencies |
Log Locations¶
| Log | Location | Purpose |
|---|---|---|
| System log | /var/log/cassandra/system.log | Main application log |
| Debug log | /var/log/cassandra/debug.log | Detailed debug info |
| GC log | /var/log/cassandra/gc.log | Garbage collection |
| Audit log | /var/log/cassandra/audit/audit.log | Security audit |
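A quick count of entries per log level shows whether a log is dominated by errors or warnings before reading it in detail. This is a sketch under one assumption: that each entry starts with its level (for example `WARN` or `ERROR`) as the first whitespace-separated field, which matches the common default log pattern but may differ if the logging configuration was customized. The function name `count_log_levels` is hypothetical.

```shell
# Count log entries per level in a log file.
# Assumes the level is the first field of each entry; continuation lines
# such as stack-trace frames do not match and are ignored.
count_log_levels() {
  awk '$1 ~ /^(ERROR|WARN|INFO|DEBUG|TRACE)$/ { n[$1]++ }
       END { for (l in n) print l, n[l] }' "$1" | sort
}

# Usage (not run automatically):
#   count_log_levels /var/log/cassandra/system.log
```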
Key Metrics to Check¶
# JMX metrics via nodetool
nodetool gcstats # GC statistics
nodetool getlogginglevels # Current log levels
nodetool statusbinary # CQL port status
nodetool statusgossip # Gossip status
Issue Categories¶
Timeout Issues¶
Requests failing due to time limits:
| Error | Default Timeout | Configuration |
|---|---|---|
| Read timeout | 5000ms | read_request_timeout_in_ms |
| Write timeout | 2000ms | write_request_timeout_in_ms |
| Range timeout | 10000ms | range_request_timeout_in_ms |
| Counter timeout | 5000ms | counter_write_request_timeout_in_ms |
| Truncate timeout | 60000ms | truncate_request_timeout_in_ms |
Common causes: Overload, slow disks, network issues, data model problems.
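Before comparing observed timeouts against the defaults in the table, it is worth confirming what the node is actually configured with. The sketch below prints the request-timeout settings that are explicitly set in a configuration file. The function name `show_timeouts` is hypothetical, and the `/etc/cassandra/cassandra.yaml` path in the usage line is an assumption that depends on how Cassandra was installed. The pattern is deliberately loose about the suffix so that setting names without `_in_ms` would also match.

```shell
# Print request-timeout settings that are explicitly set in a cassandra.yaml.
# Commented-out lines are skipped, so defaults left untouched do not appear.
show_timeouts() {
  grep -E '^[a-z_]*request_timeout[a-z_]*:' "$1"
}

# Usage (not run automatically; the path is an assumption):
#   show_timeouts /etc/cassandra/cassandra.yaml
```

An empty result means none of the settings are set explicitly in that file, in which case the defaults apply.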
Consistency Issues¶
Data inconsistency between replicas:
| Symptom | Cause | Action |
|---|---|---|
| Different data per query | Repair needed | Run repair |
| Unavailable errors | Insufficient replicas | Check node status |
| Timeout with QUORUM | Multiple nodes slow | Check all replicas |
Compaction Issues¶
Compaction falling behind:
| Symptom | Cause | Action |
|---|---|---|
| High pending compactions | Write-heavy, slow disk | Raise compaction throughput (nodetool setcompactionthroughput) |
| Large SSTable files | No compaction | Check strategy |
| High read latency | Many SSTables | Force compaction |
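To watch for compaction falling behind, the pending count can be pulled out of the compactionstats output and compared against a threshold. This is a rough sketch: it assumes the output contains a line beginning with `pending tasks:` followed by the total, the function names are hypothetical, and the default threshold of 20 is an arbitrary illustrative value, not a recommendation.

```shell
# Read compactionstats output on stdin and print the total pending count.
# Assumes a line of the form "pending tasks: <number>".
pending_compactions() {
  awk -F': *' '/^pending tasks:/ { print $2; exit }'
}

# Report whether the pending count on stdin exceeds a threshold.
# Returns 0 when OK, 1 when over the threshold, 2 when no count was found.
check_pending() {
  local threshold="${1:-20}"
  local pending
  pending="$(pending_compactions)"
  if [ -z "$pending" ]; then
    echo "no 'pending tasks' line found" >&2
    return 2
  fi
  if [ "$pending" -gt "$threshold" ]; then
    echo "WARN: $pending pending compactions (threshold $threshold)"
    return 1
  fi
  echo "OK: $pending pending compactions"
}

# Usage (not run automatically):
#   nodetool compactionstats | check_pending 50
```

A single reading says little on its own; a count that keeps rising across repeated checks is the stronger signal that compaction is not keeping up.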
Memory Issues¶
Heap exhaustion or GC pressure:
| Symptom | Cause | Action |
|---|---|---|
| OOM errors | Heap too small, large partitions | Increase heap, fix data model |
| Long GC pauses | Heap too large | Reduce heap size |
| Off-heap OOM | Bloom filters, compression | Adjust off-heap settings |
Emergency Procedures¶
Node Unresponsive¶
# 1. Check if process is running
ps aux | grep cassandra
# 2. Check for OOM kills
dmesg | grep -i "killed process"
# 3. Check system log
tail -100 /var/log/cassandra/system.log
# 4. If hung, get thread dump
jstack $(pgrep -f CassandraDaemon) > /tmp/threaddump.txt
# 5. If necessary, restart
sudo systemctl restart cassandra
Disk Full¶
# 1. Check disk usage
df -h /var/lib/cassandra
# 2. Find large files
du -sh /var/lib/cassandra/*
# 3. Clear snapshots
nodetool clearsnapshot --all
# 4. If still full, list table directories to find candidates for cleanup
ls -la /var/lib/cassandra/data/<keyspace>/<table>*/
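Catching a filling disk before it is full is easier than recovering afterwards. The sketch below wraps `df` in a threshold check. The function names are hypothetical and the default limit of 80 percent is only an illustrative value; the right limit depends on the compaction strategy and how much headroom it needs.

```shell
# Print the used percentage (digits only) of the filesystem holding a path.
disk_used_pct() {
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Warn when usage reaches a limit (percent). Returns 1 when at or over it.
disk_check() {
  local path="$1"
  local limit="${2:-80}"
  local used
  used="$(disk_used_pct "$path")"
  if [ "$used" -ge "$limit" ]; then
    echo "WARN: $path is ${used}% full (limit ${limit}%)"
    return 1
  fi
  echo "OK: $path is ${used}% full"
}

# Usage (not run automatically):
#   disk_check /var/lib/cassandra 75
```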
Cluster Partition¶
# 1. Check gossip state on both sides
nodetool gossipinfo
# 2. Verify network connectivity
nc -zv <other-node-ip> 7000
nc -zv <other-node-ip> 9042
# 3. Check firewall rules
sudo iptables -L -n
# 4. If network is fine, look for nodes that are down (DN), joining (UJ), or leaving (UL)
nodetool status | grep -E "DN|UJ|UL"
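When comparing the view from both sides of a suspected partition, a compact list of nodes that are not Up/Normal is easier to diff than the full status output. This is a sketch, assuming each node line begins with a two-letter state code followed by the address; the function name `unhealthy_nodes` is hypothetical.

```shell
# Read status output on stdin; print state and address of every node
# whose state code is anything other than UN (Up/Normal).
unhealthy_nodes() {
  awk '$1 ~ /^[UD][NLJM]$/ && $1 != "UN" { print $1, $2 }'
}

# Usage (not run automatically), run on a node from each side and compare:
#   nodetool status | unhealthy_nodes
```

Matching on the first field avoids false hits from a bare grep, which would also match those letter pairs anywhere else on a line.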
Getting Help¶
Information to Collect¶
When seeking help, gather:
- Cassandra version: nodetool version
- Cluster configuration: nodetool describecluster
- Node status: nodetool status
- Recent logs: Last 100 lines of system.log
- Error messages: Exact exception text
- Recent changes: What changed before the issue
Resources¶
- Apache Cassandra Slack - Community chat
- AxonOps Community - Professional support
- Stack Overflow - Q&A
Next Steps¶
- Monitoring Guide - Proactive monitoring
- Performance Tuning - Performance optimization