Cassandra Troubleshooting Guide

Most Cassandra problems have a small number of root causes. High read latency? Probably tombstones or large partitions—not CPU, despite what instincts might suggest. Out of memory? Usually a partition that grew too large for the heap to handle during compaction, not a sign that more RAM is needed. Timeouts? Check if compaction is falling behind.

The diagnosis process is consistent: check the logs for warnings, look at the metrics (especially p99 latency, pending compactions, and thread pool stats), and use nodetool to inspect specific tables. The symptoms point to the cause for those who know where to look.

This guide provides systematic procedures for diagnosing and resolving common issues.

Troubleshooting Framework

We use the SDRR Framework for consistent problem resolution:

  1. Symptoms: What observable behaviors indicate the problem?
  2. Diagnosis: How to identify the root cause
  3. Resolution: Step-by-step fix procedures
  4. Recovery: Verification and prevention

Quick Reference: Common Issues

Performance Issues

| Symptom | Likely Cause | Quick Action |
|---|---|---|
| High read latency | Data model, tombstones | Check nodetool tablestats |
| High write latency | Disk I/O, compaction | Check nodetool compactionstats |
| Request timeouts | Overload, disk issues | Check nodetool tpstats |
| Slow startup | Large commitlog | Check commitlog size |
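
For example, to see whether tombstones or oversized partitions are behind high read latency (the keyspace and table names below are placeholders):

# Tombstone counts and compacted partition sizes for a suspect table
nodetool tablestats my_keyspace.my_table | grep -Ei "tombstone|compacted partition"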

Availability Issues

| Symptom | Likely Cause | Quick Action |
|---|---|---|
| Node down | OOM, disk full, crash | Check logs, systemctl status |
| Nodes not joining | Network, schema | Check gossip, firewalls |
| Inconsistent data | Repair needed | Run nodetool repair |
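
When nodes refuse to join or stay stuck joining, schema disagreement is a common culprit; it shows up directly in describecluster:

# All nodes should report the same schema version
nodetool describecluster | grep -A 5 "Schema versions"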

Resource Issues

| Symptom | Likely Cause | Quick Action |
|---|---|---|
| High CPU | GC, compaction | Check GC logs, compaction |
| High memory | Heap settings | Review JVM configuration |
| Disk full | Large tables, snapshots | Clean snapshots, add capacity |
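
Snapshots are a frequent source of surprise disk usage because they hard-link SSTables that compaction has already replaced; checking and clearing them is usually the fastest way to reclaim space:

# See how much space snapshots occupy, then remove them
nodetool listsnapshots
nodetool clearsnapshot --all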


First Response Checklist

When an issue occurs, gather this information first:

1. Check Node Status

# Cluster overview
nodetool status

# Node info
nodetool info

# Thread pool stats
nodetool tpstats

# Compaction status
nodetool compactionstats

2. Check Logs

# Recent errors
tail -100 /var/log/cassandra/system.log | grep -i error

# Warnings
tail -100 /var/log/cassandra/system.log | grep -i warn

# GC issues
grep "GC pause" /var/log/cassandra/gc.log | tail -20

3. Check Resources

# Disk space
df -h /var/lib/cassandra

# Memory
free -h

# CPU
top -b -n 1 | head -20

# I/O
iostat -x 1 5

4. Check Metrics

# Table stats for specific keyspace
nodetool tablestats my_keyspace

# Pending compactions
nodetool compactionstats | grep pending

# Dropped messages
nodetool tpstats | grep -i dropped

Diagnostic Commands Reference

nodetool Commands

| Command | Purpose |
|---|---|
| nodetool status | Cluster and node health |
| nodetool info | Node information |
| nodetool tpstats | Thread pool statistics |
| nodetool tablestats <ks> | Table statistics |
| nodetool compactionstats | Compaction status |
| nodetool netstats | Network status |
| nodetool gossipinfo | Gossip state |
| nodetool describecluster | Cluster description |
| nodetool proxyhistograms | Request latencies |
| nodetool tablehistograms <ks> <table> | Table latencies |
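
For latency investigations, the two histogram commands are usually the most revealing (keyspace and table names below are placeholders):

# Coordinator-level read/write/range latency percentiles
nodetool proxyhistograms

# Per-table percentiles: latency, partition size, SSTables per read
nodetool tablehistograms my_keyspace my_table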

Log Locations

| Log | Location | Purpose |
|---|---|---|
| System log | /var/log/cassandra/system.log | Main application log |
| Debug log | /var/log/cassandra/debug.log | Detailed debug info |
| GC log | /var/log/cassandra/gc.log | Garbage collection |
| Audit log | /var/log/cassandra/audit/audit.log | Security audit |

Key Metrics to Check

# JMX metrics via nodetool
nodetool gcstats                    # GC statistics
nodetool getlogginglevels           # Current log levels
nodetool statusbinary               # CQL port status
nodetool statusgossip               # Gossip status

Issue Categories

Timeout Issues

Requests failing due to time limits:

| Error | Default Timeout | Configuration |
|---|---|---|
| Read timeout | 5000ms | read_request_timeout_in_ms |
| Write timeout | 2000ms | write_request_timeout_in_ms |
| Range timeout | 10000ms | range_request_timeout_in_ms |
| Counter timeout | 5000ms | counter_write_request_timeout_in_ms |
| Truncate timeout | 60000ms | truncate_request_timeout_in_ms |

Common causes: Overload, slow disks, network issues, data model problems.
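
These settings live in cassandra.yaml (the _in_ms names apply through Cassandra 4.0; 4.1 switched to duration-valued names such as read_request_timeout). Raising a timeout only hides the underlying problem, so treat it as a stopgap. The config file path below assumes a package install:

# Inspect the configured values
grep -E "request_timeout" /etc/cassandra/cassandra.yaml

# On recent versions, read the effective values at runtime
nodetool gettimeout read
nodetool gettimeout write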

Consistency Issues

Data inconsistency between replicas:

| Symptom | Cause | Action |
|---|---|---|
| Different data per query | Repair needed | Run repair |
| Unavailable errors | Insufficient replicas | Check node status |
| Timeout with QUORUM | Multiple nodes slow | Check all replicas |
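
A typical repair run, scoped to keep load manageable (keyspace and table names are placeholders):

# Repair only the ranges this node is primary for (run on every node in turn)
nodetool repair -pr my_keyspace

# Repair a single table if the inconsistency is isolated
nodetool repair my_keyspace my_table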

Compaction Issues

Compaction falling behind:

| Symptom | Cause | Action |
|---|---|---|
| High pending compactions | Write-heavy, slow disk | Increase throughput |
| Large SSTable files | No compaction | Check strategy |
| High read latency | Many SSTables | Force compaction |
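
Compaction throughput can be adjusted at runtime without a restart; a forced major compaction is a last resort because, under SizeTieredCompactionStrategy, it can leave one very large SSTable (names below are placeholders):

# Current throughput cap in MB/s (0 means unthrottled)
nodetool getcompactionthroughput

# Raise the cap temporarily while compaction catches up
nodetool setcompactionthroughput 128

# Force a major compaction on one table (last resort)
nodetool compact my_keyspace my_table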

Memory Issues

Heap exhaustion or GC pressure:

| Symptom | Cause | Action |
|---|---|---|
| OOM errors | Heap too small, large partitions | Increase heap, fix data model |
| Long GC pauses | Heap too large | Reduce heap size |
| Off-heap OOM | Bloom filters, compression | Adjust off-heap settings |
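
To see current heap usage and where it is configured (the jvm*.options path assumes a package install):

# Current heap usage versus the configured maximum
nodetool info | grep -i heap

# GC pause statistics since the last time this command ran
nodetool gcstats

# Heap size flags (often commented out, meaning defaults are in effect)
grep -E "Xm[sx]" /etc/cassandra/jvm*.options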

Emergency Procedures

Node Unresponsive

# 1. Check if process is running
ps aux | grep cassandra

# 2. Check for OOM kills
dmesg | grep -i "killed process"

# 3. Check system log
tail -100 /var/log/cassandra/system.log

# 4. If hung, get thread dump
jstack $(pgrep -f CassandraDaemon) > /tmp/threaddump.txt

# 5. If necessary, restart
sudo systemctl restart cassandra

Disk Full

# 1. Check disk usage
df -h /var/lib/cassandra

# 2. Find large files
du -sh /var/lib/cassandra/*

# 3. Clear snapshots
nodetool clearsnapshot --all

# 4. If still full, consider temporary cleanup
ls -la /var/lib/cassandra/data/<keyspace>/<table>*/

Cluster Partition

# 1. Check gossip state on both sides
nodetool gossipinfo

# 2. Verify network connectivity
nc -zv <other-node-ip> 7000
nc -zv <other-node-ip> 9042

# 3. Check firewall rules
sudo iptables -L -n

# 4. If network is fine, check for zombie nodes
nodetool status | grep -E "DN|UJ|UL"
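
If a node shows as DN and is confirmed to be permanently gone, it can be removed from the ring. Both commands below are destructive, so verify the node is really dead first:

# Remove a dead node by its host ID (shown in nodetool status)
nodetool removenode <host-id>

# Only if removenode hangs: forcibly purge the node's gossip state
nodetool assassinate <node-ip>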

Getting Help

Information to Collect

When seeking help, gather:

  1. Cassandra version: nodetool version
  2. Cluster configuration: nodetool describecluster
  3. Node status: nodetool status
  4. Recent logs: Last 100 lines of system.log
  5. Error messages: Exact exception text
  6. Recent changes: What changed before the issue
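
A minimal sketch of collecting the items above into one archive; the output path and file names are arbitrary choices, not a standard tool:

# Hypothetical helper: bundle diagnostics for a support ticket
OUT=/tmp/cassandra-diag-$(hostname)-$(date +%Y%m%d%H%M)
mkdir -p "$OUT"
nodetool version         > "$OUT/version.txt"
nodetool describecluster > "$OUT/describecluster.txt"
nodetool status          > "$OUT/status.txt"
nodetool tpstats         > "$OUT/tpstats.txt"
tail -100 /var/log/cassandra/system.log > "$OUT/system-tail.log"
tar czf "$OUT.tar.gz" -C /tmp "$(basename "$OUT")"
echo "Attach $OUT.tar.gz when asking for help"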
