Cassandra Operations Guide¶
Cassandra does not need scheduled downtime. Nodes can be added, removed, upgraded, and reconfigured while the cluster serves traffic. But this flexibility comes with responsibility—Cassandra does not maintain itself.
Three things will cause problems if neglected: repair (synchronizes replicas, prevents deleted data from resurrecting), compaction (merges SSTables, keeps reads fast), and monitoring (catches issues before users notice). Most Cassandra incidents trace back to skipping one of these.
This guide covers day-to-day operations, maintenance procedures, and emergency response.
Operations Philosophy¶
Cassandra is designed for continuous operation. The operational model prioritizes availability—the database should never go down for maintenance. This influences every procedure in this guide.
Cassandra Operational Principles
| Principle | Implication |
|---|---|
| No single point of failure | Any node can handle any request |
| Continuous availability | Rolling operations, no downtime windows |
| Eventual consistency | Background processes maintain correctness |
| Self-healing | Repair and anti-entropy fix inconsistencies |
| Horizontal scaling | Add capacity by adding nodes |
The Three Critical Operations¶
These three operations must be performed regularly. Neglecting any of them leads to production failures:
| Operation | Why It Is Critical | What Happens If Neglected |
|---|---|---|
| Repair | Fixes inconsistencies between replicas | Data divergence, zombie data after deletes |
| Backup | Enables recovery from disasters | Permanent data loss |
| Monitoring | Detects problems before failures | Surprise outages, cascading failures |
Cluster Health Assessment¶
Before performing any operation, assess cluster health:
# Essential health check commands
nodetool status # Node states and token distribution
nodetool describecluster # Schema agreement, cluster name
nodetool tpstats # Thread pool status (blocked = problem)
nodetool compactionstats # Pending compactions
nodetool gossipinfo # Inter-node communication
Virtual Tables Alternative (Cassandra 4.0+)
Many nodetool commands have CQL equivalents via virtual tables:
SELECT * FROM system_views.gossip_info; -- gossipinfo
SELECT * FROM system_views.thread_pools; -- tpstats
SELECT * FROM system_views.sstable_tasks; -- compactionstats
SELECT * FROM system_views.clients; -- clientstats
Understanding Node States¶
nodetool status output:
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID
UN 10.0.0.1 125.4 GB 256 25.2% abc123...
UN 10.0.0.2 118.9 GB 256 24.8% def456...
UN 10.0.0.3 122.1 GB 256 25.0% ghi789...
DN 10.0.0.4 0 bytes 256 25.0% jkl012... ← Problem!
Status letters:
U = Up (node is online)
D = Down (node is offline)
State letters:
N = Normal (healthy, serving requests)
L = Leaving (decommissioning)
J = Joining (bootstrapping)
M = Moving (rebalancing tokens)
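For scripting, the two-letter status/state code at the start of each row is easy to key on; for example, listing the address of every node that is not Up:
# Print the address of every node whose status letter is D (Down)
nodetool status | awk '/^D[NLJM]/ {print $2}'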
Red Flags to Watch For¶
Critical — Take Immediate Action
- Any node showing DN (Down Normal) state
- Schema disagreement in describecluster
- Blocked thread pools in tpstats
- Dropped messages > 0 in tpstats
- Pending compactions > 100 (or growing continuously)
- Disk usage > 80% on any node
- Heap usage consistently > 85%
Warning — Investigate Within 24 Hours
- Uneven load distribution (> 20% variance between nodes)
- Pending compactions > 20
- Read latency p99 > 100ms
- Hints growing on any node
- GC pause times > 500ms
Daily Health Check Script¶
Run this script daily (or automate it):
#!/bin/bash
# daily_health_check.sh
LOG_FILE="/var/log/cassandra/health_$(date +%Y%m%d).log"
echo "=== Cassandra Health Check: $(date) ===" | tee -a ${LOG_FILE}
# 1. Cluster Status
echo -e "\n--- Cluster Status ---" | tee -a ${LOG_FILE}
STATUS=$(nodetool status 2>&1)
echo "${STATUS}" | tee -a ${LOG_FILE}
# Check for down nodes
if echo "${STATUS}" | grep -q "^DN"; then
echo "CRITICAL: Down nodes detected!" | tee -a ${LOG_FILE}
fi
# 2. Schema Agreement
echo -e "\n--- Schema Agreement ---" | tee -a ${LOG_FILE}
SCHEMA=$(nodetool describecluster 2>&1 | grep -A5 "Schema versions")
echo "${SCHEMA}" | tee -a ${LOG_FILE}
SCHEMA_COUNT=$(echo "${SCHEMA}" | grep -c "\[")
if [ ${SCHEMA_COUNT} -gt 1 ]; then
echo "WARNING: Schema disagreement detected!" | tee -a ${LOG_FILE}
fi
# 3. Thread Pools
echo -e "\n--- Thread Pool Status ---" | tee -a ${LOG_FILE}
TPSTATS=$(nodetool tpstats 2>&1)
echo "${TPSTATS}" | head -20 | tee -a ${LOG_FILE}
# Check for blocked pools (Blocked is the 5th column of the pool table)
BLOCKED=$(echo "${TPSTATS}" | awk '/^Message type/{exit} $5 ~ /^[1-9][0-9]*$/ {found=1} END {print found+0}')
if [ "${BLOCKED}" -gt 0 ]; then
echo "CRITICAL: Blocked thread pools!" | tee -a ${LOG_FILE}
fi
# Check for dropped messages (Dropped column of the "Message type" section)
DROPPED=$(echo "${TPSTATS}" | awk '/^Message type/{in_dropped=1; next} in_dropped && $2 ~ /^[0-9]+$/ {sum += $2} END {print sum+0}')
if [ "${DROPPED}" -gt 0 ]; then
echo "WARNING: ${DROPPED} dropped messages detected!" | tee -a ${LOG_FILE}
fi
# 4. Compaction
echo -e "\n--- Compaction Status ---" | tee -a ${LOG_FILE}
COMPACTION=$(nodetool compactionstats 2>&1)
echo "${COMPACTION}" | head -5 | tee -a ${LOG_FILE}
PENDING=$(echo "${COMPACTION}" | grep "pending tasks" | awk '{print $3}')
if [ "${PENDING}" -gt 50 ]; then
echo "WARNING: High pending compactions (${PENDING})" | tee -a ${LOG_FILE}
fi
# 5. Disk Usage
echo -e "\n--- Disk Usage ---" | tee -a ${LOG_FILE}
DISK=$(df -h /var/lib/cassandra 2>&1)
echo "${DISK}" | tee -a ${LOG_FILE}
DISK_PERCENT=$(df /var/lib/cassandra | tail -1 | awk '{print $5}' | tr -d '%')
if [ ${DISK_PERCENT} -gt 80 ]; then
echo "CRITICAL: Disk usage at ${DISK_PERCENT}%!" | tee -a ${LOG_FILE}
fi
# 6. Recent Errors
echo -e "\n--- Recent Errors (last 100 lines) ---" | tee -a ${LOG_FILE}
ERROR_COUNT=$(tail -100 /var/log/cassandra/system.log | grep -c -i -E "error|exception|warn")
echo "Errors/Warnings in recent logs: ${ERROR_COUNT}" | tee -a ${LOG_FILE}
# 7. Summary
echo -e "\n=== Health Check Complete ===" | tee -a ${LOG_FILE}
Operations Quick Reference¶
Cluster Management¶
# Add a new node
# 1. Install Cassandra on new node
# 2. Configure cassandra.yaml (same cluster_name, seeds pointing to existing nodes)
# 3. Start Cassandra - it will automatically bootstrap
sudo systemctl start cassandra
# Monitor bootstrap progress
nodetool netstats
# After bootstrap, run cleanup on existing nodes
# This removes data they no longer own
for node in node1 node2 node3; do   # replace with your existing node hostnames
    ssh $node "nodetool cleanup"
done
# Remove a node (graceful - node is running)
nodetool decommission
# Remove a node (dead - node is not running)
nodetool removenode <host_id>
# Replace a dead node
# Add to JVM options on new node:
# -Dcassandra.replace_address_first_boot=<dead_node_ip>
# Then start Cassandra
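The replace-address flag must be in place before the replacement node's first start, and removed once bootstrap completes. A minimal sketch, assuming the JVM options file is /etc/cassandra/jvm-server.options (the file name and location vary by Cassandra version and packaging) and reusing 10.0.0.4 from the status example above as the dead node's address:
# Add the flag before the FIRST start of the replacement node
echo "-Dcassandra.replace_address_first_boot=10.0.0.4" | sudo tee -a /etc/cassandra/jvm-server.options

sudo systemctl start cassandra

# Monitor the bootstrap, then remove the flag so it has no effect on later restarts
nodetool netstats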
Backup Operations¶
# Take a snapshot (point-in-time backup)
nodetool flush # Flush memtables first
nodetool snapshot -t backup_$(date +%Y%m%d) # Create snapshot
# List snapshots
nodetool listsnapshots
# Clear old snapshots
nodetool clearsnapshot -t old_backup_name
# Snapshot location
# /var/lib/cassandra/data/<keyspace>/<table>/snapshots/<snapshot_name>/
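Snapshots are hard links on the same disk, so they protect against mistakes but not against losing the node itself. A minimal sketch of shipping a snapshot to another machine, assuming SSH/rsync access; the backup host and destination path are placeholders:
#!/bin/bash
# ship_snapshot.sh - copy a named snapshot off the node (illustrative sketch)
SNAPSHOT_NAME="backup_$(date +%Y%m%d)"
DATA_DIR=/var/lib/cassandra/data
BACKUP_HOST=backup.example.com                  # placeholder backup server
DEST=/backups/$(hostname)/${SNAPSHOT_NAME}      # placeholder destination path

nodetool flush
nodetool snapshot -t "${SNAPSHOT_NAME}"

# Snapshot directories live at <data>/<keyspace>/<table>/snapshots/<name>/
find "${DATA_DIR}" -type d -path "*/snapshots/${SNAPSHOT_NAME}" | while read -r dir; do
    rel="${dir#${DATA_DIR}/}"                   # keep keyspace/table structure
    ssh "${BACKUP_HOST}" "mkdir -p ${DEST}/${rel}"
    rsync -a "${dir}/" "${BACKUP_HOST}:${DEST}/${rel}/"
done

# Reclaim local disk space once the copy has been verified
nodetool clearsnapshot -t "${SNAPSHOT_NAME}"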
Repair Operations¶
# Primary range repair (recommended for routine maintenance)
nodetool repair -pr my_keyspace
# Repair specific table
nodetool repair -pr my_keyspace my_table
# Full repair (after disaster recovery)
nodetool repair -full my_keyspace
# Check repair progress
nodetool netstats | grep -i repair
# Cancel stuck repair
nodetool repair_admin list
nodetool repair_admin cancel <repair_id>
Maintenance Operations¶
# Rolling restart (on each node)
nodetool drain # Stop accepting writes, flush
sudo systemctl stop cassandra # Stop the service
sudo systemctl start cassandra # Start the service
# Cleanup (after topology changes)
nodetool cleanup my_keyspace
# Force compaction (use sparingly)
nodetool compact my_keyspace my_table
# Refresh SSTables (after manually copying files)
nodetool refresh my_keyspace my_table
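Across a cluster, the same drain/stop/start sequence becomes a loop that restarts one node at a time and waits for it to come back Up/Normal before moving on. A minimal sketch, assuming passwordless SSH and sudo on each node; the node list is a placeholder:
#!/bin/bash
# rolling_restart.sh - restart one node at a time (illustrative sketch)
NODES="10.0.0.1 10.0.0.2 10.0.0.3"              # placeholder node list

for node in ${NODES}; do
    echo "Restarting ${node}..."
    ssh "${node}" "nodetool drain && sudo systemctl restart cassandra"

    # Wait until the node reports UN (Up/Normal) again before continuing
    until nodetool status | grep -q "^UN.*${node}"; do
        sleep 10
    done
    echo "${node} is back Up/Normal"
done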
Repair: Preventing Data Inconsistency¶
Repair is the most important maintenance operation. Without regular repair, clusters develop inconsistencies and eventually lose data.
Why Repair Is Essential¶
Scenario: Write to node that goes down before replication completes
─────────────────────────────────────────────────────────────────────────────
Time T0: Client writes row with RF=3
Coordinator → Node A (success)
→ Node B (success)
→ Node C (fails - network issue)
Result: Row exists on A, B, but NOT on C
Time T1 to T7: No repair runs, reads happen to hit A or B
Data looks consistent to clients
Time T8: Node A disk fails, replaced
New A bootstraps from B and C
Row missing on C, so new A does not receive it
Time T9: Node B fails
Row is now LOST - only existed on original A and B
Without repair: Data was silently lost
With repair: Inconsistency would have been detected and fixed at T1
Repair and Tombstones (Zombie Data)¶
The gc_grace_seconds Problem:
─────────────────────────────────────────────────────────────────────────────
gc_grace_seconds default: 864000 (10 days)
This setting controls when tombstones (delete markers) can be purged.
If repair does not run within gc_grace_seconds:
Time T0: DELETE row WHERE id = 123
Tombstone created on Nodes A, B
Node C was down, didn't get tombstone
Time T5: Node C comes back online
Still has the original row (no tombstone)
Time T11: gc_grace_seconds expires
Nodes A, B purge tombstone during compaction
Time T12: Read repair or repair runs
Node C's "live" row is propagated to A, B
DELETED DATA HAS RESURRECTED!
Rule: Complete repair on all nodes within gc_grace_seconds
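The deadline is per table, so it helps to know what gc_grace_seconds is actually set to; it can be read from the schema tables via cqlsh (my_keyspace and my_table are placeholders):
cqlsh -e "SELECT gc_grace_seconds FROM system_schema.tables
          WHERE keyspace_name = 'my_keyspace' AND table_name = 'my_table';"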
Repair Scheduling¶
| Cluster Size | Repair Frequency | Strategy |
|---|---|---|
| 3-6 nodes | Weekly | Sequential on each node |
| 6-20 nodes | Every 3-4 days | Parallel/sequential mix |
| 20-50 nodes | Daily subrange | Break into token ranges |
| 50+ nodes | Continuous | Use automated tools |
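For a small cluster, the weekly schedule can be a simple sequential loop run from cron, repairing each node's primary ranges in turn so the whole ring is covered exactly once per pass. A minimal sketch; the node list and keyspace are placeholders, and larger clusters are better served by dedicated repair tooling:
#!/bin/bash
# weekly_repair.sh - sequential primary-range repair across the cluster (sketch)
NODES="10.0.0.1 10.0.0.2 10.0.0.3"   # placeholder node list
KEYSPACE="my_keyspace"               # placeholder keyspace

for node in ${NODES}; do
    echo "$(date) starting repair -pr on ${node}"
    # -pr repairs only that node's primary ranges, so running it once on
    # every node covers the whole ring without duplicate work
    ssh "${node}" "nodetool repair -pr ${KEYSPACE}"
done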
Backup: Protecting Against Data Loss¶
Backup Strategy Matrix¶
| Strategy | RPO | Complexity | Storage Cost |
|---|---|---|---|
| Daily snapshots | 24 hours | Low | High |
| Snapshot + incremental | 1-4 hours | Medium | Medium |
| Snapshot + commit log | Minutes | High | Medium |
| Continuous replication | Seconds | High | High (2x) |
RPO = Recovery Point Objective (the maximum acceptable data loss)
What to Back Up¶
Essential:
─────────────────────────────────────────────────────────────────────────────
✓ SSTables (data files) → /var/lib/cassandra/data/
✓ Schema → cqlsh -e "DESC SCHEMA"
✓ Cassandra configuration → /etc/cassandra/
Recommended:
─────────────────────────────────────────────────────────────────────────────
✓ JVM options → /etc/cassandra/jvm*.options
✓ System keyspace (if needed) → system_schema, system_auth
NOT needed in backup:
─────────────────────────────────────────────────────────────────────────────
✗ Commit logs → Unless doing PITR
✗ Saved caches → Rebuilt automatically
✗ Hints → Transient
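The schema and configuration pieces are small and quick to capture alongside each snapshot; a minimal sketch (the output directory is a placeholder):
#!/bin/bash
# backup_metadata.sh - capture schema and config next to a snapshot (sketch)
OUT=/backups/metadata/$(hostname)/$(date +%Y%m%d)   # placeholder destination
mkdir -p "${OUT}"

# Full schema as CQL, replayable later with cqlsh -f
cqlsh -e "DESC SCHEMA" > "${OUT}/schema.cql"

# Cassandra configuration and JVM options
tar czf "${OUT}/cassandra_conf.tar.gz" /etc/cassandra/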
Backup Verification Checklist¶
# Monthly: Verify backup restores correctly
1. [ ] Restore backup to staging cluster
2. [ ] Verify schema was restored
3. [ ] Run application queries against restored data
4. [ ] Compare row counts between production and restored
5. [ ] Test point-in-time recovery (if using commit log archiving)
6. [ ] Document restore time (for RTO planning)
7. [ ] Update restore runbook if procedures changed
Capacity Planning¶
Disk Space Planning¶
Disk Space Formula:
─────────────────────────────────────────────────────────────────────────────
Required = (Raw Data × RF × Compaction Overhead × Growth Buffer) / Nodes
Where:
- Raw Data = Your actual data size
- RF = Replication Factor (typically 3)
- Compaction Overhead = 1.5 (STCS) or 1.1 (LCS)
- Growth Buffer = 1.5 (50% headroom for compaction + growth)
Example:
- Raw Data: 500 GB
- RF: 3
- Compaction: STCS (1.5x)
- Buffer: 1.5x
- Nodes: 6
Required per node = (500 × 3 × 1.5 × 1.5) / 6 = 562.5 GB
Recommendation: 1 TB per node
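The same arithmetic as a small shell helper, handy for trying different node counts; the values below reproduce the worked example:
#!/bin/bash
# disk_sizing.sh - per-node disk estimate from the formula above (sketch)
RAW_GB=500          # raw data size in GB
RF=3                # replication factor
COMPACTION=1.5      # 1.5 for STCS, 1.1 for LCS
BUFFER=1.5          # 50% headroom for compaction + growth
NODES=6

awk -v raw=${RAW_GB} -v rf=${RF} -v comp=${COMPACTION} -v buf=${BUFFER} -v n=${NODES} \
    'BEGIN { printf "Required per node: %.1f GB\n", raw * rf * comp * buf / n }'
# -> Required per node: 562.5 GB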
When to Add Nodes¶
Add capacity when any of these thresholds approach:
| Metric | Warning | Critical | Action |
|---|---|---|---|
| Disk usage | 60% | 75% | Add nodes or disk |
| CPU usage | 70% sustained | 85% sustained | Add nodes |
| Read latency p99 | 50ms | 100ms | Investigate, possibly add nodes |
| Write latency p99 | 20ms | 50ms | Investigate, possibly add nodes |
| Pending compactions | 20 sustained | 50 sustained | Add disk throughput or nodes |
Emergency Procedures¶
Node Won't Start¶
# 1. Check logs
tail -500 /var/log/cassandra/system.log | grep -i "error\|exception"
# 2. Common causes and fixes
# OOM killer?
dmesg | grep -i "killed process"
→ Fix: Increase heap or add RAM
# Commit log corruption?
grep -i "corrupt" /var/log/cassandra/system.log
→ Fix: Move corrupted commit logs aside (unflushed writes in them are LOST!)
→ mkdir -p /var/lib/cassandra/commitlog_bad/
→ mv /var/lib/cassandra/commitlog/* /var/lib/cassandra/commitlog_bad/
# SSTable corruption?
grep -i "sstable" /var/log/cassandra/system.log | grep -i "error"
→ Fix: nodetool scrub (after node starts) or sstablescrub
# Disk full?
df -h /var/lib/cassandra
→ Fix: Clear snapshots, delete old logs, add disk
# Permissions?
ls -la /var/lib/cassandra
→ Fix: chown -R cassandra:cassandra /var/lib/cassandra
Cluster Split Brain¶
# Symptoms: Different nodes see different cluster membership
# 1. Check gossip on each node
nodetool gossipinfo
# 2. Check for network partition
# From each node, test connectivity to others
for node in 10.0.0.1 10.0.0.2 10.0.0.3; do
nc -zv $node 7000 # Inter-node
nc -zv $node 9042 # CQL
done
# 3. Resolution
# - Fix network issues
# - If necessary, restart nodes one at a time
# - Run repair after cluster stabilizes
Schema Disagreement¶
# Check schema versions
nodetool describecluster
# If multiple versions:
# 1. Identify which node(s) have wrong schema
# 2. Try restarting the affected node
# 3. If persists, reset local schema (LAST RESORT):
nodetool resetlocalschema
# Prevention: Ensure DDL runs on single node, wait for propagation
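A deployment script can "wait for propagation" by polling describecluster until only one schema version is reported; a minimal sketch (the retry count and sleep interval are arbitrary):
#!/bin/bash
# wait_for_schema_agreement.sh - poll until one schema version remains (sketch)
for i in $(seq 1 30); do
    VERSIONS=$(nodetool describecluster | grep -A5 "Schema versions" | grep -c "\[")
    if [ "${VERSIONS}" -eq 1 ]; then
        echo "Schema agreement reached"
        exit 0
    fi
    echo "Still ${VERSIONS} schema versions, waiting..."
    sleep 10
done
echo "Timed out waiting for schema agreement" >&2
exit 1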
Documentation Sections¶
Cluster Management¶
- Adding nodes to expand capacity
- Removing nodes (decommission)
- Replacing failed nodes
- Topology changes (rack/DC)
Backup & Restore¶
- Snapshot procedures
- Incremental backups
- Point-in-time recovery
- Restore procedures
Repair¶
- Repair types and strategies
- Scheduling repair
- Troubleshooting repair
- Automation tools
Compaction Management¶
- Compaction configuration
- Strategy selection and tuning
- Troubleshooting compaction issues
Maintenance¶
- Rolling restarts
- Schema management
- Cleanup operations
- Upgrade procedures
Monitoring¶
- Critical metrics to monitor
- JMX metrics reference
- Alert configuration
- Dashboard design
Virtual Tables¶
- Query internal state via CQL
- Metrics, caches, and thread pools
- Repair tracking
- Client connections and cluster state
Performance¶
- Read/write optimization
- JVM and GC tuning
- Compaction tuning
- Hardware sizing
Troubleshooting¶
- Node issues
- Performance issues
- Cluster issues
- Emergency procedures
Key Metrics to Monitor¶
| Metric | Warning | Critical | Source |
|---|---|---|---|
| Node state | Any DN | Multiple DN | nodetool status, system_views.gossip_info |
| Heap usage | >75% | >85% | JMX/metrics |
| Disk usage | >60% | >80% | df -h, system_views.disk_usage |
| Pending compactions | >20 | >50 | nodetool compactionstats, system_views.sstable_tasks |
| Dropped messages | Any | Growing | nodetool tpstats, system_views.thread_pools |
| Read latency p99 | >50ms | >100ms | JMX/metrics, system_views.coordinator_read_latency |
| Write latency p99 | >20ms | >50ms | JMX/metrics, system_views.coordinator_write_latency |
| GC pause time | >500ms | >1s | GC logs |
| Tombstones per read | >100 p99 | >1000 p99 | system_views.tombstones_per_read |
| Cache hit ratio | <80% | <60% | system_views.caches |
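On Cassandra 4.0+, several of these metrics can be spot-checked over CQL without a JMX client, using the virtual tables listed in the Source column, for example:
# Quick spot check of thread pools and caches via virtual tables
cqlsh -e "SELECT * FROM system_views.thread_pools;"
cqlsh -e "SELECT * FROM system_views.caches;"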
AxonOps Operations Platform¶
Operating Cassandra at scale requires coordinating multiple tools, scripts, and processes across nodes. AxonOps provides an integrated operations platform designed specifically for Cassandra.
Unified Operations Console¶
AxonOps provides:
- Single-pane-of-glass: All cluster metrics, logs, and operations in one interface
- Multi-cluster management: Manage multiple Cassandra clusters from a single console
- Role-based access control: Define who can view, operate, or administer clusters
- Audit logging: Complete history of all operational actions
Automated Maintenance¶
- Scheduled repairs: Intelligent repair scheduling that minimizes impact
- Automated backups: Policy-driven backup with multiple storage backends
- Rolling operations: Coordinated rolling restarts and upgrades across nodes
- Cleanup automation: Post-topology-change cleanup orchestration
Proactive Monitoring¶
- Pre-configured alerts: Out-of-the-box alerts for common issues
- Trend analysis: Identify gradual degradation before failures
- Capacity forecasting: Predict when resources will be exhausted
- Anomaly detection: ML-based detection of unusual patterns
Operational Workflows¶
- Guided procedures: Step-by-step wizards for complex operations
- Pre-flight checks: Automatic validation before operations
- Rollback support: Quick recovery from failed operations
- Runbook integration: Link alerts to resolution procedures
See the AxonOps documentation for setup and configuration.
Next Steps¶
- Cluster Management - Node lifecycle operations
- Backup & Restore - Data protection procedures
- Repair - Maintain consistency
- Maintenance - Day-to-day upkeep
- Monitoring - Observability setup
- Performance - Optimization procedures
- Troubleshooting - Problem resolution