High Memory Usage

High memory usage can lead to OOM kills, long GC pauses, and degraded performance. This playbook covers diagnosis and resolution of memory-related issues.


Symptoms

  • OOM (OutOfMemoryError) in logs
  • Process killed by Linux OOM killer (dmesg | grep -i killed)
  • Long GC pauses (see GC Pause Issues)
  • Swap usage increasing
  • Cassandra process consuming more memory than expected
  • Slow queries during high memory periods
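
A quick way to sweep for several of these symptoms at once is a small shell check like the sketch below (the log path is an assumption; adjust for your install):

# Quick symptom sweep (sketch; assumes the default log path)
LOG=/var/log/cassandra/system.log

echo "== Kernel OOM events =="
dmesg | grep -i "killed process\|oom" | tail -5

echo "== JVM OOM / heap errors =="
grep -i "outofmemory\|heap space" "$LOG" | tail -5

echo "== Heap usage =="
nodetool info | grep -i heap

echo "== Swap =="
free -h | grep -i swap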

Diagnosis

Step 1: Check Current Memory Usage

# Heap usage
nodetool info | grep -i heap

# Process memory
ps aux | grep cassandra

# System memory
free -h

Step 2: Check for OOM Events

# Linux OOM killer
dmesg | grep -i "killed process\|oom"

# Cassandra OOM errors
grep -i "outofmemory\|heap space" /var/log/cassandra/system.log

Step 3: Analyze Memory Breakdown

# GC stats
nodetool gcstats

# Total off-heap usage
nodetool info | grep -i "off heap"

# Per-table off-heap breakdown (bloom filters, compression metadata, memtables)
nodetool tablestats my_keyspace | grep -i "off heap\|bloom filter\|compression metadata"

Step 4: Check for Memory-Intensive Operations

# Large partitions being read
grep -i "large partition" /var/log/cassandra/system.log | tail -20

# Compaction activity
nodetool compactionstats

# Streaming activity
nodetool netstats

Step 5: Analyze Heap Dump (if available)

# Generate heap dump (run as the same OS user as the Cassandra process)
jmap -dump:format=b,file=/tmp/heap.hprof $(pgrep -f CassandraDaemon)

# Analyze with tools such as Eclipse MAT or VisualVM (jhat was removed in JDK 9)
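
If a full heap-dump analysis isn't practical, a live class histogram gives a rough view of what currently dominates the heap (note that :live forces a full GC, so expect a pause):

# Top heap consumers by class (run as the Cassandra process user)
jmap -histo:live $(pgrep -f CassandraDaemon) | head -30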

Resolution

Immediate: Reduce Memory Pressure

# Clear caches
nodetool invalidatekeycache
nodetool invalidaterowcache

# Reduce concurrent operations
nodetool setconcurrency read 16
nodetool setconcurrency write 16

# Check GC activity (gcstats only reports statistics; it does not trigger a collection)
nodetool gcstats
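
If you genuinely need to force a collection to reclaim heap right away, the JVM can be asked directly; use this sparingly, since an explicit full GC is itself a stop-the-world pause:

# Request an explicit GC via jcmd (run as the Cassandra process user)
jcmd $(pgrep -f CassandraDaemon) GC.run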

Short-term: Adjust Memory Settings

Right-size heap:

# In jvm.options
# 8 GB is a good default for most workloads; going above ~16 GB rarely helps and lengthens GC pauses
-Xms8G
-Xmx8G
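
To confirm what the running process actually picked up (rather than what jvm.options says), the effective flags can be read back from the JVM; a sketch:

# Effective heap settings of the running Cassandra JVM
jcmd $(pgrep -f CassandraDaemon) VM.flags | tr ' ' '\n' | grep -Ei "InitialHeapSize|MaxHeapSize"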

Tune GC:

# For G1GC
-XX:+UseG1GC
-XX:MaxGCPauseMillis=300
-XX:G1HeapRegionSize=16m

Medium-term: Address Root Causes

Cause 1: Large partitions

See Large Partition Issues.

# Find large partitions
grep "large partition" /var/log/cassandra/system.log
nodetool tablestats my_keyspace | grep -i partition
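
Partition-size percentiles are also available per table directly from nodetool, which avoids grepping logs (keyspace and table names are examples):

# Partition size distribution in bytes (see the "Partition Size" column)
nodetool tablehistograms my_keyspace my_table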

Cause 2: Too many SSTables

# Check SSTable counts
nodetool tablestats my_keyspace | grep -E "Table:|SSTable count"

# Run a compaction if needed (note: with STCS a major compaction merges everything into one large SSTable)
nodetool compact my_keyspace my_table

Cause 3: Row cache enabled

# Check row cache
nodetool info | grep -i "row cache"

# Disable if causing issues
ALTER TABLE my_table WITH caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'};
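
To check which tables actually have row caching enabled before altering anything, the schema can be queried (keyspace and table names are examples):

# Current caching options for a table
cqlsh -e "SELECT keyspace_name, table_name, caching FROM system_schema.tables WHERE keyspace_name='my_keyspace' AND table_name='my_table';"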

Cause 4: Bloom filter memory

# Check bloom filter size
nodetool tablestats my_keyspace | grep -i "bloom"

# Adjust bloom filter FP chance (higher = less memory, more false positives)
ALTER TABLE my_table WITH bloom_filter_fp_chance = 0.1;
# Takes effect as SSTables are rewritten (compaction or nodetool upgradesstables -a)

Cause 5: Concurrent repairs/streaming

# Check active streams
nodetool netstats

# Cancel in-flight repair sessions if they are driving memory pressure (Cassandra 4.0+)
nodetool repair_admin cancel --force

Long-term: Capacity Planning

Calculate required memory:

Total memory needed =
  JVM Heap (8-16GB)
  + Off-heap structures (~1-4GB depending on data size)
  + OS page cache (remaining available RAM)
  + OS overhead (~1GB)
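
As a concrete sanity check, the same arithmetic for a 64 GB node might look like this (component sizes are illustrative assumptions, not measurements):

# Rough memory budget for a 64 GB node (illustrative numbers)
TOTAL_GB=64
HEAP_GB=8        # JVM heap
OFFHEAP_GB=4     # memtables, bloom filters, compression metadata
OS_GB=1          # OS overhead
echo "Left for page cache: $((TOTAL_GB - HEAP_GB - OFFHEAP_GB - OS_GB)) GB"   # 51 GB here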

Right-size the node:

Data per node      Recommended RAM
< 500 GB           16 GB
500 GB - 1 TB      32 GB
1 TB - 2 TB        64 GB
> 2 TB             64 GB + add nodes

Recovery

After OOM

# Check if node is running
systemctl status cassandra

# If down, start it
sudo systemctl start cassandra

# Monitor startup
tail -f /var/log/cassandra/system.log

# Verify node rejoined cluster
nodetool status
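
While the node is back up, it is worth confirming that the next OOM will leave a heap dump behind for analysis (the config path below is an assumption; the flags may also be set via cassandra-env.sh):

# Check that heap dumps on OOM are enabled
grep -E "HeapDumpOnOutOfMemoryError|HeapDumpPath" /etc/cassandra/jvm.options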

Verify Memory Stability

# Monitor heap usage
watch -n 10 'nodetool info | grep -i heap'

# Watch for GC issues
watch -n 30 'nodetool gcstats'
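
For longer-term trending, and to back the 75% alert threshold mentioned under Prevention, a small loop that logs heap utilization can be left running; a sketch that parses the "Heap Memory (MB)" line from nodetool info:

# Log heap utilization every 60s and flag readings above 75%
while true; do
  PCT=$(nodetool info | awk -F: '/^Heap Memory/ {split($2, a, "/"); printf "%.0f", 100 * a[1] / a[2]}')
  echo "$(date '+%F %T') heap used: ${PCT}%"
  [ "${PCT:-0}" -gt 75 ] && echo "WARN: heap above 75% on $(hostname)"
  sleep 60
done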

Memory Configuration Reference

JVM Heap Settings

# jvm.options
-Xms8G                    # Initial heap
-Xmx8G                    # Maximum heap (should equal -Xms)
-XX:+AlwaysPreTouch       # Pre-touch heap pages

Off-Heap Settings

# cassandra.yaml
# Memtable space
memtable_heap_space_in_mb: 2048
memtable_offheap_space_in_mb: 2048

# Native transport
native_transport_max_concurrent_connections: 128

Memory Guidelines

Component              Typical Size     Notes
Heap                   8 GB             Rarely benefits from > 16 GB
Memtables              2-4 GB           Configured in cassandra.yaml
Bloom filters          Varies           ~1.25 bytes per key
Compression metadata   Varies           ~60 bytes per 64 KB chunk
Page cache             Remaining RAM    OS managed
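
As a worked example of the two "Varies" rows: for a node holding roughly 1 billion keys and 1 TB of data (illustrative figures, not measurements):

# Rough off-heap estimate for ~1e9 keys and 1 TB of data
awk 'BEGIN {
  keys      = 1e9
  data      = 1 * 1024^4                # 1 TB in bytes
  bloom     = keys * 1.25               # ~1.25 bytes per key
  comp_meta = data / (64 * 1024) * 60   # ~60 bytes per 64 KB chunk
  printf "Bloom filters:        ~%.2f GB\n", bloom / 1024^3
  printf "Compression metadata: ~%.2f GB\n", comp_meta / 1024^3
}'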

Prevention

  1. Monitor heap usage - Alert at 75% utilization
  2. Set heap limits - Don't let JVM grow unbounded
  3. Avoid large partitions - Design for bounded partition sizes
  4. Disable row cache - Unless specific use case requires it
  5. Regular compaction - Reduce SSTable overhead
  6. Capacity planning - Add nodes before memory becomes critical

Command                       Purpose
nodetool info                 Memory usage overview
nodetool gcstats              GC statistics
nodetool tablestats           Per-table memory usage
nodetool invalidatekeycache   Clear key cache
nodetool invalidaterowcache   Clear row cache