Troubleshooting Playbooks
Step-by-step guides for diagnosing and resolving specific Cassandra issues.
Each playbook follows the SDRR Framework:
- Symptoms - Observable indicators of the problem
- Diagnosis - Commands and checks to identify root cause
- Resolution - Step-by-step fix procedures
- Recovery - Verification and prevention
Cluster Issues
Node Operations
Quick Reference
Emergency Response Priority
| Severity | Response Time | Examples |
|----------|---------------|----------|
| Critical | Immediate | Disk full, gossip failure, cluster partition |
| High | Within 1 hour | OOM, schema disagreement, node down |
| Medium | Within 4 hours | High CPU, compaction backlog, repair failures |
| Low | Scheduled | Capacity planning, node additions |
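For alert routing or on-call tooling, the priority table can be encoded as a small lookup. The following is a minimal sketch; the function name `response_time` is hypothetical and not part of any Cassandra tooling, and the mapping simply mirrors the table above.

```shell
# Hypothetical helper: map a severity label to the target response time
# listed in the Emergency Response Priority table. Case-insensitive.
response_time() {
  case "$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')" in
    critical) echo "Immediate" ;;
    high)     echo "Within 1 hour" ;;
    medium)   echo "Within 4 hours" ;;
    low)      echo "Scheduled" ;;
    *)        echo "unknown severity: $1" >&2; return 1 ;;
  esac
}
```

For example, `response_time High` prints the target for a node-down or OOM incident, and an unrecognized label returns a non-zero exit code so a calling script can fail fast.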
First Response Commands
# Quick cluster health check
nodetool status
nodetool tpstats | head -n 20
nodetool compactionstats
# Check for immediate issues
df -h /var/lib/cassandra # Disk space
free -h # Memory
tail -n 50 /var/log/cassandra/system.log | grep -i error
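When scanning `nodetool status` on a large cluster, it can help to filter the output down to nodes that are not in the Up/Normal (`UN`) state. The sketch below assumes the usual layout in which each node row begins with a two-character status/state code followed by the node address; the function name `down_nodes` is hypothetical, and it reads from stdin so it can be tried against saved output.

```shell
# Hypothetical helper: print the state code and address of every node
# that is not UN (Up/Normal). Reads `nodetool status` output from stdin.
down_nodes() {
  # Node rows start with a status character (U, D, or ?) followed by a
  # state character (N, L, J, M). Header and separator lines do not
  # match this pattern and are skipped.
  awk '$1 ~ /^[UD?][NLJM]$/ && $1 != "UN" { print $1, $2 }'
}

# Usage:
#   nodetool status | down_nodes
```

Empty output means every listed node reported `UN`. Any line printed is a candidate for the Node Operations playbooks.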
Using These Playbooks
Before Starting
- Read the entire playbook before executing commands
- Understand the impact of each step
- Have a rollback plan ready
- Notify stakeholders for production changes
During Execution
- Follow steps in order - sequence matters
- Verify each step before proceeding
- Document what was done for post-incident review
- Monitor impact on cluster and applications
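One lightweight way to document each step during execution is to wrap commands in a logger that records a timestamp, the exit code, and the command line. This is a sketch under stated assumptions: the `run_logged` function and the `INCIDENT_LOG` variable are hypothetical names, and the log path defaults to the current directory.

```shell
# Hypothetical helper: run a command and append a one-line record
# (UTC start time, exit code, command line) to an incident log.
INCIDENT_LOG="${INCIDENT_LOG:-./incident.log}"

run_logged() {
  _rl_start=$(date -u +%Y-%m-%dT%H:%M:%SZ)
  "$@"
  _rl_rc=$?
  printf '%s rc=%d cmd=%s\n' "$_rl_start" "$_rl_rc" "$*" >> "$INCIDENT_LOG"
  # Preserve the wrapped command's exit code so callers can still
  # verify each step before proceeding.
  return "$_rl_rc"
}

# Usage:
#   run_logged nodetool status
#   run_logged nodetool compactionstats
```

Because the wrapper returns the original exit code, it can be combined with `&&` to stop a sequence at the first failing step, and the resulting log gives a timeline for the post-incident review.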
After Resolution
- Verify the fix using the recovery section
- Document root cause and timeline
- Implement prevention measures
- Update runbooks if needed