Troubleshooting Guide¶
This guide provides systematic approaches to diagnosing and resolving common Cassandra operational issues.
Systematic Troubleshooting
Follow the pattern: Observe (gather data) → Hypothesize (form theory) → Test (verify theory) → Fix (implement solution) → Verify (confirm resolution).
Quick Diagnostics¶
First Response Commands¶
Run these commands to quickly assess cluster health:
#!/bin/bash
# quick-diagnostic.sh
echo "=== Node Status ==="
nodetool status
echo -e "\n=== Ring Health ==="
nodetool describecluster | head -30
echo -e "\n=== Thread Pool Status ==="
nodetool tpstats | grep -E "Pool|Active|Pending|Blocked"
echo -e "\n=== Dropped Messages ==="
nodetool tpstats | grep -i dropped | grep -v "^0"
echo -e "\n=== Compaction Status ==="
nodetool compactionstats | head -20
echo -e "\n=== Recent Errors ==="
tail -100 /var/log/cassandra/system.log | grep -i "error\|exception\|warn" | tail -20
Cluster State Quick Check¶
| Check | Command | Healthy State |
|---|---|---|
| All nodes up | nodetool status | All UN (Up Normal) |
| Schema agreement | nodetool describecluster | Single schema version |
| No dropped messages | nodetool tpstats | All zeros |
| Compaction healthy | nodetool compactionstats | <50 pending |
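The same four checks can be scripted into a quick pass/fail summary. A minimal sketch, assuming default nodetool output formats and the threshold values from the table above:
#!/bin/bash
# health-check.sh - pass/fail summary of the quick checks above
# All nodes up: count node lines whose state is not UN
down=$(nodetool status | grep -cE '^(DN|DL|DJ|DM|UL|UJ|UM)\s')
[ "$down" -eq 0 ] && echo "PASS: all nodes UN" || echo "FAIL: $down node(s) not UN"
# Schema agreement: exactly one version listed under "Schema versions"
versions=$(nodetool describecluster | grep -A 20 "Schema versions" | grep -c '\[')
[ "$versions" -le 1 ] && echo "PASS: single schema version" || echo "FAIL: $versions schema versions"
# Dropped messages: any non-zero counter in the "Message type" section
dropped=$(nodetool tpstats | awk '/Message type/{f=1;next} f && $2+0>0' | wc -l)
[ "$dropped" -eq 0 ] && echo "PASS: no dropped messages" || echo "FAIL: $dropped message type(s) with drops"
# Compaction: fewer than 50 pending tasks
pending=$(nodetool compactionstats | awk '/pending tasks/{print $3; exit}')
[ "${pending:-0}" -lt 50 ] && echo "PASS: $pending pending compactions" || echo "FAIL: $pending pending compactions"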
Node Issues¶
Node Won't Start¶
Symptoms: Cassandra process fails to start or crashes immediately
Diagnostic steps:
# Check system log for startup errors
tail -200 /var/log/cassandra/system.log | grep -i "error\|exception\|fatal"
# Check if port is already in use
netstat -tlnp | grep -E "7000|9042|7199"
# Check disk space
df -h /var/lib/cassandra
# Check file permissions
ls -la /var/lib/cassandra/
ls -la /var/log/cassandra/
# Check the JVM can actually reserve the configured heap (match -Xms/-Xmx to your settings)
java -Xms16G -Xmx16G -version 2>&1
Common causes and solutions:
| Cause | Log Pattern | Solution |
|---|---|---|
| Port in use | "Address already in use" | Kill existing process or change ports |
| Insufficient disk | "No space left on device" | Free disk space or add storage |
| Permission denied | "Permission denied" | Fix ownership: chown -R cassandra:cassandra /var/lib/cassandra |
| Corrupt commit log | "CommitLog" + "corrupt" | Remove corrupt segment (data loss risk) |
| Heap allocation failure | "Could not reserve enough space" | Reduce heap or add memory |
| Schema corruption | "Schema" + "cannot" | Restore from backup or repair schema |
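The log patterns in this table can be checked in one pass. A sketch, assuming bash 4+ and the default package log location:
#!/bin/bash
# startup-triage.sh - look for the startup failure signatures listed above
LOG=/var/log/cassandra/system.log
declare -A patterns=(
  ["Port in use"]="Address already in use"
  ["Insufficient disk"]="No space left on device"
  ["Permission denied"]="Permission denied"
  ["Corrupt commit log"]="CommitLog.*corrupt|corrupt.*CommitLog"
  ["Heap allocation failure"]="Could not reserve enough space"
)
for cause in "${!patterns[@]}"; do
  tail -500 "$LOG" | grep -qiE "${patterns[$cause]}" && echo "POSSIBLE CAUSE: $cause"
done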
Commit log corruption recovery:
# WARNING: May cause data loss for unflushed writes
# Identify the corrupt segment (the file name appears in the startup error)
ls -la /var/lib/cassandra/commitlog/
# Move the corrupt segment out of the way (the wildcard below moves every
# segment; prefer moving only the file named in the error)
mkdir -p /tmp/corrupt_commitlog
mv /var/lib/cassandra/commitlog/CommitLog-*.log /tmp/corrupt_commitlog/
# Restart Cassandra
sudo systemctl start cassandra
Node Unresponsive (Hung)¶
Symptoms: Node running but not responding to queries or nodetool
Diagnostic steps:
# Check if process is running
ps aux | grep cassandra
# Check GC activity
tail -50 /var/log/cassandra/gc.log
# Generate thread dump
kill -3 $(pgrep -f CassandraDaemon)
# Or: jstack $(pgrep -f CassandraDaemon) > /tmp/thread_dump.txt
# Check system resources
top -p $(pgrep -f CassandraDaemon)
iostat -x 1 5
Common causes:
| Cause | Indicators | Solution |
|---|---|---|
| GC storm | Long GC pauses in gc.log | Reduce heap, tune GC, reduce load |
| Thread starvation | Blocked threads in dump | Identify contention, increase pool |
| Disk I/O saturation | High iowait in top | Reduce load, faster storage |
| Network partition | Gossip timeouts in log | Check network connectivity |
Node Marked Down (DN)¶
Symptoms: Node shows DN in nodetool status from other nodes' perspective
Diagnostic steps:
# On "down" node - check if it thinks it's up
nodetool status
# Check gossip state
nodetool gossipinfo
# Check network from other nodes
nc -zv <down_node_ip> 7000
nc -zv <down_node_ip> 9042
# Check firewall
iptables -L -n
# Check for gossip issues in log
grep -i gossip /var/log/cassandra/system.log | tail -50
Common causes:
| Cause | Solution |
|---|---|
| Network partition | Resolve network issue |
| Firewall blocking | Open ports 7000, 7001, 9042 |
| GC pauses causing timeouts | Tune GC |
| phi_convict_threshold too low | Increase threshold (rare) |
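If GC pauses or a noisy network are suspected, check the failure detector setting before changing anything; the config path below assumes a package install:
# phi_convict_threshold defaults to 8; 10-12 is sometimes used on flaky
# networks, but raising it only masks the underlying problem
grep -n "phi_convict_threshold" /etc/cassandra/cassandra.yaml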
Performance Issues¶
High Read Latency¶
Symptoms: P99 read latency exceeds SLA, slow queries
Diagnostic steps:
# Check latency by table
nodetool tablestats <keyspace> | grep -A 15 "Table:"
# Check tombstone counts
nodetool tablestats <keyspace>.<table> | grep -i tombstone
# Check SSTable count
nodetool tablestats <keyspace>.<table> | grep "SSTable count"
# Check partition sizes
nodetool tablehistograms <keyspace> <table>
# Check bloom filter effectiveness
nodetool tablestats <keyspace>.<table> | grep -i bloom
Common causes and solutions:
| Cause | Indicator | Solution |
|---|---|---|
| Too many SSTables | SSTable count >20 (STCS) | Force compaction, adjust strategy |
| Excessive tombstones | Tombstones/read >1000 | Reduce deletes, force compaction |
| Large partitions | Partition size >100MB | Redesign data model |
| Poor bloom filter | False positive ratio >10% | Lower bloom_filter_fp_chance (uses more memory) |
| Cold cache | Low key cache hits | Warm cache, increase size |
| GC pauses | P99 latency spikes | Tune GC |
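To see where the time goes for real queries (SSTables touched, tombstones scanned), enable probabilistic tracing briefly; a sketch that traces 0.1% of requests:
# Trace a small sample of live requests
nodetool settraceprobability 0.001
# After some traffic, inspect the collected traces
cqlsh -e "SELECT session_id, duration, request FROM system_traces.sessions LIMIT 20;"
# Disable tracing when finished (it adds overhead)
nodetool settraceprobability 0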
High Write Latency¶
Symptoms: P99 write latency exceeds SLA, timeouts
Diagnostic steps:
# Check commit log disk
df -h /var/lib/cassandra/commitlog
iostat -x 1 5
# Check memtable flush rate
nodetool tpstats | grep -i memtable
# Check mutation stage
nodetool tpstats | grep -i mutation
# Check hints accumulation
ls -la /var/lib/cassandra/hints/
Common causes and solutions:
| Cause | Indicator | Solution |
|---|---|---|
| Commit log disk slow | High iowait | Use faster disk, separate disk |
| Memtable flush backlog | Pending memtable flushes | Increase flush writers |
| Compaction backlog | Pending compactions >100 | Increase compaction throughput |
| Hints accumulation | Large hints directory | Investigate down nodes |
| Batch too large | Batch size warnings | Reduce batch size |
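For a compaction backlog specifically, the throughput cap can be raised at runtime without a restart; a sketch (64 MB/s is illustrative, raise gradually so compaction does not starve reads):
# Raise the compaction throughput cap (MB/s; 0 means unthrottled)
nodetool setcompactionthroughput 64
# Confirm the new setting and watch pending compactions fall
nodetool getcompactionthroughput
nodetool compactionstats | head -5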
Dropped Messages¶
Symptoms: Non-zero dropped messages in nodetool tpstats
# Check which message types are dropped (the counts follow the "Message type" header)
nodetool tpstats | grep -A 20 "Message type"
# Message types:
# MUTATION - Write requests dropped
# READ - Read requests dropped
# RANGE_SLICE - Range query requests dropped
# REQUEST_RESPONSE - Response to coordinator dropped
Interpretation:
| Message Type | Meaning | Common Cause |
|---|---|---|
| MUTATION | Writes not processed in time | Overload, disk slow |
| READ | Reads not processed in time | Overload, compaction |
| RANGE_SLICE | Range queries dropped | Large scans, overload |
| REQUEST_RESPONSE | Responses dropped | Network, overload |
Solutions:
# Increase timeouts (cassandra.yaml)
read_request_timeout_in_ms: 10000
write_request_timeout_in_ms: 5000
# Or reduce load on cluster
# Or add capacity
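Dropped-message counters are cumulative since the node started, so sample them twice to confirm whether drops are still occurring; a minimal sketch:
# Compare the dropped-message section a minute apart
nodetool tpstats | grep -A 20 "Message type" > /tmp/dropped_before
sleep 60
nodetool tpstats | grep -A 20 "Message type" > /tmp/dropped_after
diff /tmp/dropped_before /tmp/dropped_after && echo "No new drops in the last 60s"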
Cluster Issues¶
Schema Disagreement¶
Symptoms: Multiple schema versions in nodetool describecluster
Diagnostic steps:
# Check schema versions
nodetool describecluster
# Identify which nodes have which version
nodetool describecluster | grep -A 100 "Schema versions"
# Check for pending schema changes
grep -i schema /var/log/cassandra/system.log | tail -50
Solutions:
# Usually resolves automatically - wait 30 seconds
# If persistent, restart problematic nodes
# (one at a time)
# Force schema refresh
nodetool resetlocalschema  # drops the local schema and resyncs it from other nodes
# Last resort - rolling restart of cluster
Gossip Issues¶
Symptoms: Nodes don't see each other, split cluster
Diagnostic steps:
# Check gossip state
nodetool gossipinfo
# Check for gossip errors
grep -i gossip /var/log/cassandra/system.log | grep -i "error\|fail" | tail -20
# Verify seed connectivity
nc -zv <seed_ip> 7000
Common causes:
| Cause | Solution |
|---|---|
| Network partition | Resolve network issue |
| Seed nodes down | Ensure seeds are up |
| Firewall | Open port 7000 between nodes |
| Different cluster names | Fix cassandra.yaml |
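The cluster-name mismatch case is easy to confirm: the name in cassandra.yaml must be identical on every node and must match what the node reports (config path assumes a package install):
# Compare the configured and reported cluster name on each node
grep "^cluster_name" /etc/cassandra/cassandra.yaml
nodetool describecluster | grep "Name:"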
Inconsistent Data (Read Repair Failures)¶
Symptoms: Different data returned for same query, repair errors
Diagnostic steps:
# Check repair history
nodetool netstats | grep -i repair
# Run consistency check
nodetool verify <keyspace> <table>
# Check for read repair errors
grep -i "read repair" /var/log/cassandra/system.log | tail -20
Solutions:
# Run full repair
nodetool repair -full <keyspace>
# For specific table
nodetool repair <keyspace> <table>
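For a single suspect partition, a read at CONSISTENCY ALL makes the coordinator compare and reconcile every replica for that key (blocking read repair), which is far cheaper than a full repair; the keyspace, table, and key below are illustrative:
# Force reconciliation of one partition across all replicas
cqlsh -e "CONSISTENCY ALL; SELECT * FROM my_keyspace.my_table WHERE id = 1234;"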
Storage Issues¶
Disk Space Exhaustion¶
Symptoms: Writes failing, "No space left on device" errors
Immediate actions:
# Check disk usage
df -h /var/lib/cassandra
# Find large files
du -sh /var/lib/cassandra/*
du -sh /var/lib/cassandra/data/*/*
# Clear snapshots (Cassandra 4.0+ requires --all; older versions accept no arguments)
nodetool clearsnapshot --all
# Clear old commit logs (if already flushed)
ls -la /var/lib/cassandra/commitlog/
Longer-term solutions:
# Run cleanup to remove data no longer owned
nodetool cleanup
# Drop unused snapshots
nodetool listsnapshots
nodetool clearsnapshot -t <snapshot_name>
# Compact to reclaim tombstone space
nodetool compact <keyspace> <table>
# Add storage or nodes
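Snapshots are a common source of "missing" space because they hard-link SSTables that compaction has since replaced; summarise their usage before clearing anything (paths assume the default data directory):
# Report snapshot disk usage per table, largest first
nodetool listsnapshots
du -sh /var/lib/cassandra/data/*/*/snapshots 2>/dev/null | sort -rh | head -20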
SSTable Corruption¶
Symptoms: Read errors, "CorruptSSTableException" in logs
Diagnostic steps:
# Identify corrupt SSTable from error
grep -i corrupt /var/log/cassandra/system.log
# Verify SSTable
sstableverify <keyspace> <table>
# Check specific SSTable
tools/bin/sstablemetadata <sstable_path>
Solutions:
# If other replicas exist (RF > 1):
# stop the node, remove the corrupt SSTable, restart, then repair
sudo systemctl stop cassandra
rm <corrupt_sstable>* # Remove all component files (Data, Index, etc.)
sudo systemctl start cassandra
nodetool repair <keyspace> <table>
# If no other replicas
# Try scrubbing (may lose some data)
nodetool scrub <keyspace> <table>
Commit Log Issues¶
Symptoms: Startup failures mentioning commit log
Solutions:
# If commit log corrupt and data can be rebuilt from replicas
# Move corrupt segments
mkdir /tmp/corrupt_commitlog
mv /var/lib/cassandra/commitlog/CommitLog-7-*.log /tmp/corrupt_commitlog/
# Restart and repair
sudo systemctl start cassandra
nodetool repair
Data Loss Risk
Removing commit log segments loses any writes not yet flushed to SSTables. Only do this if data can be recovered from other replicas.
Repair Issues¶
Repair Stuck or Slow¶
Symptoms: Repair running for excessive time, no progress
Diagnostic steps:
# Check repair status
nodetool netstats | grep -i repair
# Check for streaming
nodetool netstats | grep -i stream
# Check thread pools
nodetool tpstats | grep -i repair
Solutions:
# Stop stuck repair by cancelling its validation compactions
nodetool stop VALIDATION
# Check for network issues between nodes
nc -zv <other_node> 7000
# Reduce repair scope
# Instead of full keyspace repair:
nodetool repair <keyspace> <single_table>
# Use parallel repair for faster completion
nodetool repair -par <keyspace>
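If even a single-table repair cannot finish, subrange repair limits each session to an explicit token range so validation and streaming stay small; the tokens below are placeholders, take real ranges from nodetool ring:
# Repair one token subrange at a time
nodetool repair -st <start_token> -et <end_token> <keyspace> <table>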
Repair Failures¶
Symptoms: "Repair session failed" errors
Common causes:
| Error | Cause | Solution |
|---|---|---|
| "Connection refused" | Node down | Ensure all nodes up |
| "Timeout" | Network or load | Increase timeout, reduce load |
| "Out of memory" | Too many ranges | Reduce scope, increase heap |
| "Anti-compaction" | Incremental repair issue | Use full repair |
Memory Issues¶
OutOfMemoryError¶
Symptoms: OOM errors in logs, node crashes
Diagnostic steps:
# Check heap usage before OOM
grep -i "heap\|memory" /var/log/cassandra/system.log | tail -50
# Check for large allocations
grep -i "allocat" /var/log/cassandra/debug.log | tail -50
# Analyze heap dump if generated
jmap -histo $(pgrep -f CassandraDaemon) | head -30
Common causes and solutions:
| Cause | Indicator | Solution |
|---|---|---|
| Heap too small | Frequent full GCs | Increase heap |
| Large partitions | Query timeouts before OOM | Redesign data model |
| Too many tombstones | Tombstone warnings | Reduce deletes, compact |
| Memory leak | Gradual heap growth | Upgrade Cassandra |
| Off-heap exhaustion | Native memory errors | Reduce off-heap usage |
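Post-mortem analysis is much easier if the JVM writes a heap dump when it runs out of memory; check whether the running process already has the flags, and add them to jvm-server.options if not (the dump path is an assumption, pick a disk with free space):
# Check whether heap-dump-on-OOM is already enabled on the running process
ps -o command= -p "$(pgrep -f CassandraDaemon)" | tr ' ' '\n' | grep -i heapdump
# If missing, add to jvm-server.options and restart in a maintenance window:
#   -XX:+HeapDumpOnOutOfMemoryError
#   -XX:HeapDumpPath=/var/lib/cassandra/heapdump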
GC Pressure¶
Symptoms: Long GC pauses, high GC CPU usage
Diagnostic steps:
# Check GC statistics
nodetool gcstats
# Analyze GC log (pause lines are capitalised differently across JVM versions)
grep -i "pause" /var/log/cassandra/gc.log | tail -20
# Check for pauses of one second or more (unified JVM logging reports times in ms)
grep -iE "pause.*[0-9]{4,}(\.[0-9]+)?ms" /var/log/cassandra/gc.log
Solutions:
# Reduce heap if >32GB
# jvm-server.options
-Xms24G
-Xmx24G
# Tune G1GC
-XX:MaxGCPauseMillis=300
-XX:InitiatingHeapOccupancyPercent=70
# Reduce memory pressure
# - Smaller partitions
# - Fewer tombstones
# - Lower concurrent queries
Network Issues¶
Inter-Node Connectivity Problems¶
Symptoms: Nodes marking each other down, streaming failures
Diagnostic steps:
# Test connectivity
nc -zv <other_node> 7000 # Storage
nc -zv <other_node> 7001 # SSL storage
nc -zv <other_node> 7199 # JMX
# Check for TCP issues
netstat -s | grep -i "retransmit\|timeout"
# Check MTU issues
ping -M do -s 1472 <other_node>
Solutions:
| Issue | Solution |
|---|---|
| Firewall blocking | Open ports 7000, 7001, 9042 |
| MTU mismatch | Set consistent MTU across network |
| Network saturation | Increase bandwidth, throttle streaming |
| TCP timeouts | Tune kernel TCP settings |
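The "throttle streaming" remedy from the table above can be applied at runtime; note that the unit changed across versions (megabits per second historically, mebibytes per second in newer releases), so check your version's help output:
# Throttle streaming to reduce pressure on a saturated network
nodetool setstreamthroughput 100
nodetool getstreamthroughput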
Client Connection Issues¶
Symptoms: Client can't connect, connection timeouts
Diagnostic steps:
# Check native transport is running
nodetool statusbinary
# Test from client machine
nc -zv <cassandra_node> 9042
# Check connection limits
nodetool clientstats
# Check for connection errors
grep -i "native\|connect" /var/log/cassandra/system.log | grep -i error
Solutions:
# cassandra.yaml - ensure native transport enabled
start_native_transport: true
# Check rpc_address (should be reachable IP)
rpc_address: 0.0.0.0
broadcast_rpc_address: <public_ip>
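When the port is reachable but clients still fail, test an actual CQL connection from the application host; authentication and protocol errors surface here even though nc succeeds:
# End-to-end CQL connectivity test from the client machine
cqlsh <cassandra_node> 9042 -e "SELECT release_version FROM system.local;"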
AxonOps Troubleshooting Support¶
Diagnosing Cassandra issues requires correlating metrics, logs, and cluster state across multiple nodes. AxonOps provides integrated troubleshooting capabilities.
Centralized Diagnostics¶
AxonOps provides:
- Unified log view: Search and correlate logs across all nodes
- Metric correlation: Overlay metrics with events and errors
- Historical analysis: Compare current state to past baselines
- Alert context: See related metrics when alerts fire
Automated Issue Detection¶
- Anomaly detection: ML-based identification of unusual patterns
- Root cause suggestions: Guidance based on symptom patterns
- Impact analysis: Understand which services are affected
- Runbook integration: Link alerts to resolution procedures
Troubleshooting Workflows¶
- Guided diagnostics: Step-by-step investigation workflows
- Comparison tools: Compare nodes to identify outliers
- Timeline reconstruction: Build sequence of events leading to issues
- Knowledge base: Access to common issues and solutions
See the AxonOps documentation for troubleshooting features.
Emergency Procedures¶
Node Recovery Priority¶
- Assess scope: How many nodes affected?
- Maintain quorum: Keep at least RF/2 + 1 replicas (rounding the division down) available in each DC so QUORUM/LOCAL_QUORUM operations continue
- Stop the bleeding: Prevent additional failures
- Communicate: Alert stakeholders
- Recover: Bring nodes back or replace
Emergency Contacts¶
Document these for your organization:
- On-call DBA/SRE contact
- Infrastructure team contact
- Application team contacts
- Management escalation path
Data Loss Scenarios¶
| Scenario | Risk | Recovery |
|---|---|---|
| Single node loss | Low (RF>1) | Replace node, repair |
| Rack loss | Low-Medium | Replace nodes, repair |
| DC loss | Medium | Failover to other DC |
| All replicas lost | High | Restore from backup |
Best Practices¶
Proactive Troubleshooting¶
- Monitor continuously: Detect issues before users report
- Establish baselines: Know what "normal" looks like
- Document incidents: Build knowledge base
- Practice recovery: Regular DR drills
During Incidents¶
- Don't panic: Methodical troubleshooting is faster
- Gather data first: Collect logs and metrics before changing things
- One change at a time: Avoid confusing multiple changes
- Document actions: Track what was tried and results
Post-Incident¶
- Root cause analysis: Understand why it happened
- Prevent recurrence: Implement fixes
- Update runbooks: Capture new knowledge
- Share learnings: Team awareness
Related Documentation¶
- Monitoring - Metrics for diagnostics
- Performance Tuning - Resolving performance issues
- Cluster Management - Node operations
- Repair Operations - Repair troubleshooting
- Backup & Restore - Recovery procedures