
High CPU Usage Playbook

Symptoms

  • CPU utilization consistently above 70%
  • Slow query response times
  • Node performance degradation
  • Increased latency across the cluster

Impact

  • Severity: Medium to High
  • Services Affected: All queries to the affected node
  • User Impact: Increased latency, potential timeouts

Immediate Assessment

Step 1: Confirm High CPU

# Check overall CPU usage
top -b -n 1 | head -20

# Check Cassandra process
ps aux | grep -i cassandra

# Check CPU usage over time
sar -u 1 10

# Example output indicating high CPU:
# %user   %nice %system %iowait  %steal   %idle
# 75.50    0.00    8.20    2.30    0.00   14.00
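
For the Cassandra process's share specifically, a quick check (assuming the process command line matches "CassandraDaemon"; adjust the pattern for your packaging):

# CPU share and uptime of the Cassandra JVM
ps -o %cpu,etime,pid,cmd -p "$(pgrep -f CassandraDaemon)"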

Step 2: Identify CPU Consumer

# Cassandra JVM CPU breakdown
top -H -p $(pgrep -f CassandraDaemon)

# Look for:
# - GC threads (high = memory pressure)
# - CompactionExecutor threads
# - ReadStage/MutationStage threads

# Check thread distribution
nodetool tpstats
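
To name the hottest thread from top -H, convert its decimal thread ID to the hex "nid" that jstack prints. A sketch, assuming jstack is on the PATH and run as the Cassandra user (the TID below is a hypothetical placeholder):

# Map a hot native thread (from top -H) to its JVM thread name
PID=$(pgrep -f CassandraDaemon)
TID=12345   # hypothetical: replace with the hot thread ID from top -H
jstack "$PID" | grep "nid=$(printf '0x%x' "$TID")"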

Step 3: Check for Common Causes

# 1. Check compaction activity
nodetool compactionstats

# 2. Check for GC pressure
tail -100 /var/log/cassandra/gc.log | grep "pause"

# 3. Check thread pools
nodetool tpstats | grep -E "Pending|Blocked"

# 4. Check for expensive queries
grep -i "slow" /var/log/cassandra/system.log | tail -20
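
To tell a momentary spike from a growing backlog, sample the key pools a few times (the stage names below are the common ones; exact columns vary by version):

# Sample key thread pools three times, 10 seconds apart
for i in 1 2 3; do
  date
  nodetool tpstats | grep -E "Pool Name|ReadStage|MutationStage|CompactionExecutor"
  sleep 10
done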

Common Causes and Solutions

Cause 1: Heavy Compaction

Diagnosis:

nodetool compactionstats

# High CPU during compaction:
# Active compactions: 4
# pending tasks: 45

Resolution:

# Reduce compaction throughput
nodetool setcompactionthroughput 32  # MB/s

# Reduce concurrent compactors
# Edit cassandra.yaml:
# concurrent_compactors: 1

# For immediate relief (temporary)
nodetool stop COMPACTION
# WARNING: Only temporary, compaction backlog will grow
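
After throttling, confirm the backlog is actually draining, then restore throughput (64 MB/s below is only an example; restore whatever your cassandra.yaml configures):

# Watch pending compactions drain (Ctrl-C to stop)
watch -n 10 'nodetool compactionstats | head -5'

# Restore throughput once the backlog clears
nodetool setcompactionthroughput 64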

Cause 2: Garbage Collection

Diagnosis:

# Check GC activity
grep -E "GC pause|Total time" /var/log/cassandra/gc.log | tail -50

# Check heap usage
nodetool info | grep "Heap Memory"

# High GC means heap pressure
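
To pull the longest recent pauses from a JDK 11 unified gc.log (assumes the standard format where pause lines end in a millisecond figure such as "15.234ms"):

# Five longest pauses among the most recent matches
grep -Eo 'Pause [A-Za-z]+.*[0-9]+\.[0-9]+ms' /var/log/cassandra/gc.log |
  tail -500 | grep -Eo '[0-9]+\.[0-9]+ms$' | sort -n | tail -5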

Resolution:

# Tune GC settings in jvm11-server.options
# -XX:MaxGCPauseMillis=500
# -XX:InitiatingHeapOccupancyPercent=70

# If heap is too small, increase (max 31GB)
# -Xmx16G -Xms16G

# After changes, rolling restart required

Cause 3: High Query Volume

Diagnosis:

# Check coordinator read/write latencies (proxyhistograms reports percentiles, not rates)
nodetool proxyhistograms

# Check thread pool saturation
nodetool tpstats | grep -E "ReadStage|MutationStage"

# Sustained Pending > 0 means the stage cannot keep up with load
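
For a rough per-node read rate, diff the ReadStage completed count over an interval (column positions in tpstats output vary by version; verify that $4 is "Completed" on your build):

# Approximate reads/sec from the ReadStage completed-count delta
before=$(nodetool tpstats | awk '/^ReadStage/ {print $4}')
sleep 10
after=$(nodetool tpstats | awk '/^ReadStage/ {print $4}')
echo "reads/sec ~ $(( (after - before) / 10 ))"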

Resolution:

# Scale cluster (add nodes)
# Or redirect traffic (load balancer)

# Temporary: Increase thread pool size
# Edit cassandra.yaml:
# concurrent_reads: 64
# concurrent_writes: 64
# Requires restart

Cause 4: Expensive Queries

Diagnosis:

# Check for ALLOW FILTERING or large scans
grep -i "allow filtering\|full scan" /var/log/cassandra/system.log

# Enable slow query logging
# cassandra.yaml: slow_query_log_timeout_in_ms: 500

# Check table statistics for problem tables
nodetool tablestats | grep -A20 "Table: "

Resolution:

-- Fix query patterns
-- Instead of ALLOW FILTERING, create proper table

-- Check for missing indexes or bad data model
DESCRIBE TABLE problem_table;

-- Review and optimize queries
TRACING ON;
SELECT * FROM problem_table WHERE ...;
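
A minimal sketch of the usual fix: add a query-specific table keyed by the column you were filtering on (all names here are illustrative, not from this cluster):

# Hypothetical: replaces "SELECT ... WHERE email = ... ALLOW FILTERING"
cqlsh -e "
  CREATE TABLE IF NOT EXISTS app.users_by_email (
    email   text PRIMARY KEY,
    user_id uuid,
    name    text
  );"
# The application writes to both tables; lookups by email hit the new one directly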

Cause 5: Large Partitions

Diagnosis:

# Check partition sizes (read the "Partition Size" column, in bytes)
nodetool tablehistograms keyspace table

# Large partitions drive CPU during reads;
# a 99th percentile above 100 MB indicates a problem

Resolution:

-- Implement time bucketing
-- Split large partitions
-- See data modeling best practices
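
A minimal time-bucketing sketch (hypothetical schema; size the bucket to your write rate so partitions stay well under 100 MB):

# One partition per sensor per day caps partition growth
cqlsh -e "
  CREATE TABLE IF NOT EXISTS metrics.readings_by_day (
    sensor_id uuid,
    day       date,
    ts        timestamp,
    value     double,
    PRIMARY KEY ((sensor_id, day), ts)
  ) WITH CLUSTERING ORDER BY (ts DESC);"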

Cause 6: Tombstone Scanning

Diagnosis:

# Check tombstone counts
nodetool tablestats keyspace | grep -i tombstone

# Large tombstone counts cause CPU on reads

Resolution:

# Run a major compaction to purge tombstones
# (only tombstones older than gc_grace_seconds are dropped)
nodetool compact keyspace table

# Review delete patterns
# Use TTL instead of explicit deletes
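
A hedged example of the TTL approach (keyspace/table names are hypothetical). TTL'd cells still expire into tombstones, but pairing TTLs with TimeWindowCompactionStrategy lets whole expired SSTables be dropped cheaply:

# Let rows expire instead of issuing explicit DELETEs
cqlsh -e "INSERT INTO app.sessions (session_id, data)
          VALUES (uuid(), 'payload') USING TTL 86400;"   # gone after 24h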

Detailed Investigation

Thread Dump Analysis

# Take thread dump
jstack $(pgrep -f CassandraDaemon) > /tmp/threaddump_$(date +%s).txt

# Look for:
# - Many threads in RUNNABLE state
# - Threads blocked on locks
# - GC threads consuming CPU

# Analyze hot methods
grep -A 10 "RUNNABLE" /tmp/threaddump_*.txt | head -100
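
To summarize a dump quickly, count threads by state (the "java.lang.Thread.State:" lines are standard jstack output):

# Count threads by state across the captured dumps
grep -ho 'java\.lang\.Thread\.State: [A-Z_]*' /tmp/threaddump_*.txt |
  sort | uniq -c | sort -rn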

Profile with async-profiler

# Download and unpack async-profiler
wget https://github.com/jvm-profiling-tools/async-profiler/releases/download/v2.9/async-profiler-2.9-linux-x64.tar.gz
tar xzf async-profiler-2.9-linux-x64.tar.gz
cd async-profiler-2.9-linux-x64

# Profile CPU for 60 seconds
./profiler.sh -d 60 -f /tmp/profile.html $(pgrep -f CassandraDaemon)

# Open /tmp/profile.html in a browser to view the flame graph

Check for Hot Partitions

# Enable tracing and probe a suspect partition (substitute a real key value)
cqlsh -e "TRACING ON; SELECT * FROM keyspace.table WHERE partition_key = '<value>';"

# Look for:
# - High number of rows read
# - Long read time
# - Many SSTables accessed
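
If your version ships nodetool toppartitions (Cassandra 3.0+), it samples live traffic directly and is usually quicker than guessing keys to trace:

# Sample the table for 10 seconds and report the hottest partitions
nodetool toppartitions keyspace table 10000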

Resolution Steps

Immediate Actions

# 1. Reduce compaction impact
nodetool setcompactionthroughput 16

# 2. Check if single node is hot (route traffic away)
nodetool status  # Check load distribution

# 3. If GC-related, restart node (clears heap)
nodetool drain
sudo systemctl restart cassandra
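
After the restart, confirm the node is back to Up/Normal before moving on (a sketch; "hostname -i" assumes the listen address is the host's primary IP):

# Wait until this node reports UN (Up/Normal) in the ring
until nodetool status 2>/dev/null | grep -q "^UN  *$(hostname -i)"; do
  sleep 5
done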

Short-term Fixes

# 1. Add capacity (if needed)
# Scale horizontally

# 2. Tune thread pools
# cassandra.yaml adjustments

# 3. Optimize queries
# Fix ALLOW FILTERING
# Add appropriate indexes

Long-term Solutions

# 1. Review and fix data model
# Implement proper partitioning
# Time bucketing for time-series

# 2. Upgrade hardware
# More CPU cores
# Faster storage (NVMe)

# 3. Scale cluster appropriately
# Add nodes before hitting limits

Prevention

Monitoring

Alert on:

  • CPU > 60% for 10 minutes (warning)
  • CPU > 80% for 5 minutes (critical)
  • Pending compactions > 20
  • GC pause > 500 ms

Capacity Planning

Monitor trends:

  • CPU utilization over time
  • Request rate growth
  • Data growth rate

Plan scaling:

  • Scale before hitting 70% sustained CPU
  • Add nodes in pairs for balance
  • Test new capacity in staging

Query Review

# Regularly review slow queries (enable slow query logging; see Cause 4)

# Audit ALLOW FILTERING usage in application code
grep -r "ALLOW FILTERING" /app/queries/

# Review the data model quarterly

Verification

After resolution, verify:

# CPU returned to normal
top -b -n 1 | grep Cpu

# Latency improved
nodetool proxyhistograms

# Thread pools healthy
nodetool tpstats | grep -E "Pending|Blocked"

# Compaction backlog managed
nodetool compactionstats

Next Steps