Troubleshooting Playbooks¶

Step-by-step guides for diagnosing and resolving specific Cassandra issues.

Each playbook follows the SDRR Framework:

Symptoms - Observable indicators of the problem
Diagnosis - Commands and checks to identify root cause
Resolution - Step-by-step fix procedures
Recovery - Verification and prevention

Performance Issues¶

Playbook	Symptoms	Severity
High CPU Usage	CPU consistently > 80%, slow responses	Medium
High Memory Usage	OOM errors, frequent GC, heap exhaustion	High
Slow Queries	High latency, timeouts on specific queries	Medium
GC Pause Issues	Long GC pauses, application stalls	High
Large Partition Issues	Slow reads, OOM during compaction	High
Tombstone Accumulation	TombstoneOverwhelmingException, slow reads	High
Compaction Issues	Growing SSTable count, degrading reads	Medium

Cluster Issues¶

Playbook	Symptoms	Severity
Schema Disagreement	Schema versions differ across nodes	High
Gossip Failures	Nodes not seeing each other	Critical
Repair Failures	Repairs failing or not completing	Medium

Node Operations¶

Playbook	Symptoms	Severity
Replace Dead Node	Node permanently failed	High
Decommission Node	Removing node from cluster	Medium
Add Node	Expanding cluster capacity	Low
Recover from OOM	Node killed by OOM	High
Handle Full Disk	Disk space exhausted	Critical

Quick Reference¶

Emergency Response Priority¶

Severity	Response Time	Examples
Critical	Immediate	Disk full, gossip failure, cluster partition
High	Within 1 hour	OOM, schema disagreement, node down
Medium	Within 4 hours	High CPU, compaction backlog, repair failures
Low	Scheduled	Capacity planning, node additions
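
For alerting or ticketing scripts, the table above can be encoded as a small lookup. This is a minimal sketch: the function name `response_time_for` is hypothetical, and the values are copied from the table rather than taken from any Cassandra tooling.

```shell
# Map a severity level to its target response time (values from the table above).
# response_time_for is a hypothetical helper name, not part of any Cassandra tool.
response_time_for() {
  # Lowercase the argument so "Critical" and "critical" both match.
  case "$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')" in
    critical) echo "Immediate" ;;
    high)     echo "Within 1 hour" ;;
    medium)   echo "Within 4 hours" ;;
    low)      echo "Scheduled" ;;
    *)        echo "Unknown severity: $1" >&2; return 1 ;;
  esac
}
```

For example, `response_time_for High` prints `Within 1 hour`, and an unrecognized level returns a non-zero status.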

First Response Commands¶

# Quick cluster health check
nodetool status
nodetool tpstats | head -20
nodetool compactionstats

# Check for immediate issues
df -h /var/lib/cassandra                                    # Disk space
free -h                                                     # Memory
tail -n 50 /var/log/cassandra/system.log | grep -i error    # Recent errors
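
The cluster health check above can be reduced to a single number for a quick pass/fail signal. The sketch below assumes the usual `nodetool status` layout, in which each node line begins with a two-letter code whose first letter is U (up) or D (down). `count_down_nodes` is a hypothetical helper name.

```shell
# Count nodes reported as down in `nodetool status` output read from stdin.
# Assumes node lines start with a two-letter code such as UN or DN,
# where the first letter is the status (U/D) and the second is the state.
count_down_nodes() {
  awk '$1 ~ /^D[NLJM]$/ { n++ } END { print n + 0 }'
}
```

Typical usage would be `nodetool status | count_down_nodes`. Any result other than 0 suggests starting with the Gossip Failures or Replace Dead Node playbooks.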

Using These Playbooks¶

Before Starting¶

Read the entire playbook before executing commands
Understand the impact of each step
Have a rollback plan ready
Notify stakeholders for production changes

During Execution¶

Follow steps in order - sequence matters
Verify each step before proceeding
Document what was done for post-incident review
Monitor impact on cluster and applications
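
One way to document what was done is to run each playbook command through a thin wrapper that records it. This is a sketch only: `run_logged` and the `PLAYBOOK_LOG` variable are hypothetical names, and the default log path is an assumption.

```shell
# Hypothetical log location; export PLAYBOOK_LOG beforehand to override it.
PLAYBOOK_LOG="${PLAYBOOK_LOG:-./playbook-$(date +%Y%m%d).log}"

# Run a command, recording a UTC timestamp, the command line, and its exit code.
run_logged() {
  printf '%s CMD: %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" >> "$PLAYBOOK_LOG"
  rc=0
  "$@" || rc=$?
  printf '%s EXIT: %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$rc" >> "$PLAYBOOK_LOG"
  return "$rc"
}
```

For instance, `run_logged nodetool compactionstats` runs the command as usual and appends two lines to the log, which gives a timeline for the post-incident review.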

After Resolution¶

Verify the fix using the recovery section
Document root cause and timeline
Implement prevention measures
Update runbooks if needed

Related Documentation¶

Common Errors - Error reference guide
Diagnosis Guide - Systematic diagnosis
Log Analysis - Understanding logs
Monitoring - Proactive monitoring