Handle Full Disk¶

A full disk is a critical emergency that can cause data loss, node failures, and cluster instability. Immediate action is required.

Symptoms¶

Write failures with "No space left on device"
Cassandra process crashes or refuses to start
Compaction failures
Commit log segment allocation failures
Node becomes unavailable

Immediate Response¶

Step 1: Assess Situation¶

# Check disk usage
df -h /var/lib/cassandra

# Check what's consuming space
du -sh /var/lib/cassandra/*

Step 2: Stop Writes (If Possible)¶

# Disable binary protocol to stop client writes
nodetool disablebinary

# Disable gossip so peers mark this node down and stop routing writes to it
nodetool disablegossip
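
Before moving on, it can help to confirm that both services are actually off. A minimal sketch, assuming `nodetool statusbinary` and `nodetool statusgossip` print `running` or `not running`; the function name `writes_stopped` is illustrative and not part of Cassandra:

```shell
#!/bin/bash
# writes_stopped: succeed only if both the native transport (binary)
# and gossip report "not running" on this node.
writes_stopped() {
  [ "$(nodetool statusbinary)" = "not running" ] &&
    [ "$(nodetool statusgossip)" = "not running" ]
}

# Example:
#   writes_stopped && echo "client and gossip traffic stopped"
```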

Step 3: Quick Space Recovery¶

Option A: Clear snapshots (fastest, usually safe)

# List snapshots
nodetool listsnapshots

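# Clear all snapshots (--all is the Cassandra 4.0+ form; older releases
# clear everything when clearsnapshot is run without arguments)
nodetool clearsnapshot --all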

# Check recovered space
df -h /var/lib/cassandra

Option B: Clear old hints (if hints are large)

# Check hints size
du -sh /var/lib/cassandra/hints

# Truncate hints stored on this node (replicas that were down will miss
# those writes until the next repair)
nodetool truncatehints

Option C: Clear saved caches

rm -rf /var/lib/cassandra/saved_caches/*

Step 4: Verify Recovery¶

df -h /var/lib/cassandra
# Should show < 90% usage before re-enabling; aim for < 70% for sustained operation
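
This check can be scripted so that the next step runs only once usage is under the limit. A minimal sketch, assuming POSIX `df -P` output (usage percentage in the fifth column); the function name `below_threshold` is illustrative:

```shell
#!/bin/bash
# below_threshold LIMIT: read `df -P` output on stdin and succeed (exit 0)
# only if the Capacity column of the first filesystem line is below LIMIT.
below_threshold() {
  local limit="$1" pct
  pct=$(awk 'NR == 2 { gsub(/%/, "", $5); print $5 }')
  [ -n "$pct" ] && [ "$pct" -lt "$limit" ]
}

# Example (90 is the limit used in this step):
#   if df -P /var/lib/cassandra | below_threshold 90; then
#     echo "OK to re-enable gossip and binary"
#   else
#     echo "Still at or above 90%; keep recovering space" >&2
#   fi
```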

Step 5: Re-enable Operations¶

nodetool enablegossip
nodetool enablebinary

Diagnosis¶

What's Using Space?¶

# Detailed breakdown
du -h /var/lib/cassandra/data/* | sort -h | tail -20

# Snapshots
du -sh /var/lib/cassandra/data/*/*/snapshots/* 2>/dev/null | sort -h | tail -10

# Commitlog
du -sh /var/lib/cassandra/commitlog

# Hints
du -sh /var/lib/cassandra/hints

Identify Large Tables¶

# Size per table (bytes, largest last)
nodetool tablestats 2>/dev/null | awk '/Table:/ {t=$2} /Space used \(total\)/ {print $NF, t}' | sort -n | tail -20

Check for Snapshot Accumulation¶

nodetool listsnapshots

Old snapshots from backups, repairs, or schema changes accumulate over time.
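
To see which snapshot names account for the most space, `du` output can be grouped by the final path component. A rough sketch, assuming the `data/<keyspace>/<table>/snapshots/<name>` layout used elsewhere in this playbook; `snapshot_totals` is an illustrative name:

```shell
#!/bin/bash
# snapshot_totals: read "SIZE_KB<TAB>PATH" lines (as printed by `du -sk`)
# and print the total KB per snapshot name, largest last.
snapshot_totals() {
  awk -F'\t' '
    { n = split($2, parts, "/"); total[parts[n]] += $1 }
    END { for (name in total) printf "%d\t%s\n", total[name], name }
  ' | sort -n
}

# Example:
#   du -sk /var/lib/cassandra/data/*/*/snapshots/* 2>/dev/null | snapshot_totals
```

Snapshot files are hard links, so the totals are approximate: `du` counts a file shared by several snapshots only once per invocation.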

Resolution by Cause¶

Cause 1: Snapshot Accumulation¶

Clear specific snapshots:

# Clear snapshot by name
nodetool clearsnapshot -t snapshot_name

# Clear all snapshots
nodetool clearsnapshot --all

Clear snapshots for specific keyspace:

nodetool clearsnapshot -t snapshot_name -- my_keyspace

Cause 2: Failed Compaction¶

Compaction needs temporary space to write new SSTables before the old ones are removed. If the disk filled mid-compaction:

# Clear snapshots first
nodetool clearsnapshot --all

# Reduce compaction parallelism
nodetool setconcurrentcompactors 1

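# Throttle compaction throughput (value in MB/s; check the current
# setting with `nodetool getcompactionthroughput` and pick a lower one)
nodetool setcompactionthroughput 32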

Cause 3: Large Table Growth¶

# Identify growing tables
nodetool tablestats my_keyspace | grep -E "Table:|Space used"

# Consider:
# 1. Add nodes to distribute data
# 2. Implement TTLs
# 3. Archive old data

Cause 4: Commitlog Growth¶

# Check commitlog
du -sh /var/lib/cassandra/commitlog/*

# Force flush to reduce commitlog
nodetool flush

# If commitlog is blocking startup, may need to clear
# WARNING: DATA LOSS - unflushed data will be lost
# sudo rm /var/lib/cassandra/commitlog/*

Cause 5: Hints Accumulation¶

Hints accumulate on this node while other replicas are down or unreachable:

# Check hints
du -sh /var/lib/cassandra/hints

# Truncate hints (loses hints data)
nodetool truncatehints

# Fix underlying node issues
nodetool status  # All should be UN

Emergency Procedures¶

Cannot Start Cassandra Due to Full Disk¶

# 1. Clear snapshots manually
rm -rf /var/lib/cassandra/data/*/*/snapshots/*

# 2. Clear saved caches
rm -rf /var/lib/cassandra/saved_caches/*

# 3. If still full, remove commit log segments
# WARNING: DATA LOSS - writes not yet flushed to SSTables are lost; run repair afterwards
rm /var/lib/cassandra/commitlog/*

# 4. Try starting
sudo systemctl start cassandra
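
Before running the deletions above, it may be worth estimating how much space they could free. A rough sketch, assuming the directory layout used in this playbook; `reclaimable_kb` is an illustrative name and nothing is deleted:

```shell
#!/bin/bash
# reclaimable_kb BASE: print the KB held by snapshot directories and
# saved caches under BASE (the grand total from `du -c`).
reclaimable_kb() {
  local base="$1"
  du -sck "$base"/data/*/*/snapshots "$base"/saved_caches 2>/dev/null |
    awk 'END { print $1 + 0 }'
}

# Example:
#   reclaimable_kb /var/lib/cassandra
```

Treat the result as an upper bound: a snapshot file that is still hard-linked to a live SSTable frees no space when the snapshot copy is removed.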

Multiple Nodes Full¶

Several full nodes at once indicate a cluster-wide capacity issue rather than a single-node problem:

Add temporary disk capacity if possible
Clear snapshots on all nodes
Plan capacity expansion urgently
Consider emergency node additions

Prevention¶

Monitoring¶

Set up alerts:

Metric	Warning	Critical
Disk usage	> 70%	> 85%
Disk growth rate	Unusual spike	-
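
The thresholds in the table can be expressed as a small helper for a cron job or monitoring hook. A minimal sketch; the function name `disk_alert_level` and the idea of passing the percentage as an integer argument are assumptions for illustration:

```shell
#!/bin/bash
# disk_alert_level PCT [WARN] [CRIT]: print OK, WARNING, or CRITICAL for
# an integer usage percentage. Defaults mirror the table (70 / 85).
disk_alert_level() {
  local pct="$1" warn="${2:-70}" crit="${3:-85}"
  if [ "$pct" -gt "$crit" ]; then
    echo CRITICAL
  elif [ "$pct" -gt "$warn" ]; then
    echo WARNING
  else
    echo OK
  fi
}

# Example:
#   pct=$(df -P /var/lib/cassandra | awk 'NR == 2 { gsub(/%/, "", $5); print $5 }')
#   disk_alert_level "$pct"
```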

Automated Cleanup¶

#!/bin/bash
# cleanup_snapshots.sh - Run periodically

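# Remove snapshot directories created more than 7 days ago.
# Match on the directory mtime: snapshot files are hard links and keep
# the mtime of the original SSTable, so per-file -mtime is unreliable.
find /var/lib/cassandra/data -mindepth 4 -maxdepth 4 -type d \
  -path '*/snapshots/*' -mtime +7 -exec rm -rf {} +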

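# Report disk usage (REPORT_EMAIL is a placeholder; set it to the recipient address)
df -h /var/lib/cassandra | mail -s "Cassandra disk report" "${REPORT_EMAIL:?set REPORT_EMAIL to the recipient address}"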

Configuration¶

# cassandra.yaml

# Auto-snapshot before DROP/TRUNCATE
auto_snapshot: true

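# Limit hints file size and delivery load
# (parameter names as used before Cassandra 4.1; 4.1+ renames the
# size and period settings, so check the cassandra.yaml for your release)
max_hints_file_size_in_mb: 128
hints_flush_period_in_ms: 10000
max_hints_delivery_threads: 2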

Capacity Planning¶

Data Growth	Action
< 5% per month	Monitor
5-10% per month	Plan expansion
> 10% per month	Expand immediately

Rule of thumb: Keep disk usage below 50% to allow for:

- Compaction temporary space
- Growth headroom
- Emergency buffer
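
The growth table above can be applied to two size measurements taken a month apart. A minimal sketch, assuming sizes in GB supplied by the operator; `growth_action` is an illustrative name:

```shell
#!/bin/bash
# growth_action PREV_GB CURR_GB: print the monthly growth percentage and
# the matching action from the capacity planning table.
growth_action() {
  awk -v prev="$1" -v curr="$2" 'BEGIN {
    if (prev <= 0) { print "previous size must be > 0"; exit 1 }
    pct = (curr - prev) * 100 / prev
    if (pct > 10)      action = "Expand immediately"
    else if (pct >= 5) action = "Plan expansion"
    else               action = "Monitor"
    printf "%.1f%% per month: %s\n", pct, action
  }'
}

# Example:
#   growth_action 400 430
```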

Recovery Verification¶

# Verify disk space
df -h /var/lib/cassandra

# Verify node health
nodetool status
nodetool info

# Verify compaction can run
nodetool compactionstats

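# Verify writes work using a scratch table rather than system tables
# (my_keyspace.write_test is a placeholder with a text primary key "id")
cqlsh -e "INSERT INTO my_keyspace.write_test (id) VALUES ('test_write');"
cqlsh -e "DELETE FROM my_keyspace.write_test WHERE id = 'test_write';"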

Space Requirements¶

Minimum Free Space¶

Component	Requirement
Compaction	50% of largest SSTable
Repair	Variable, can be significant
Normal operations	20% free recommended
Safe operating range	< 70% used

Estimation¶

# Current usage
df -h /var/lib/cassandra

# Data size (tablestats reports bytes; the value is the last field on the line)
nodetool tablestats 2>/dev/null | grep "Space used (total)" | awk '{sum+=$NF} END {print sum/1024/1024/1024 " GB"}'

# Snapshot size (du -sk reports KB, so the values can be summed safely)
du -sk /var/lib/cassandra/data/*/*/snapshots/* 2>/dev/null | awk '{sum+=$1} END {printf "%.1f GB total in snapshots\n", sum/1024/1024}'

Problem	Playbook
Compaction failing	Compaction Issues
Node down	Replace Dead Node
OOM related to disk	Recover from OOM

Command	Purpose
`nodetool clearsnapshot`	Remove snapshots
`nodetool listsnapshots`	List snapshots
`nodetool truncatehints`	Clear hints
`nodetool flush`	Flush memtables
`nodetool disablebinary`	Stop client connections