Memory Management¶
Cassandra uses multiple memory regions: JVM heap, off-heap native memory, and OS page cache. Understanding how Cassandra allocates memory across these regions is essential for capacity planning and performance tuning.
For JVM configuration and garbage collection tuning, see JVM.
Memory Architecture¶
Heap Components¶
The JVM heap holds Cassandra's managed data structures. These are subject to garbage collection.
| Component | Description | Configuration |
|---|---|---|
| Memtables | In-memory write buffer per table | memtable_heap_space_in_mb |
| Key Cache | Maps partition keys to SSTable offsets | key_cache_size_in_mb |
| Row Cache | Caches entire rows (use sparingly) | row_cache_size_in_mb |
| Partition Summary | Sampled partition index (BIG SSTable format only; not used by trie-indexed SSTables) | min_index_interval |
Memtables¶
Memtables buffer writes before flushing to SSTables. Each table has one active memtable.
# cassandra.yaml
# Total heap space for memtables across all tables
memtable_heap_space_in_mb: 4096
# Flush threshold (fraction of memtable space)
memtable_cleanup_threshold: 0.11
Memtable Sizing
Larger memtables reduce flush frequency but increase memory pressure. For write-heavy workloads, consider off-heap memtables.
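To make the flush threshold concrete, here is a minimal sketch of the arithmetic. It assumes the threshold is applied as a simple fraction of the configured memtable space, as the comment in the excerpt above describes; the class and method names are illustrative and not part of Cassandra.

```java
// Sketch: estimate how much memtable data accumulates before a flush is triggered.
// Assumption: trigger point = memtable_cleanup_threshold * configured memtable space.
class MemtableFlushEstimate {
    /** Returns the approximate flush trigger point in MB. */
    static double flushTriggerMb(double cleanupThreshold, long memtableSpaceMb) {
        return cleanupThreshold * memtableSpaceMb;
    }

    public static void main(String[] args) {
        // Values taken from the cassandra.yaml excerpt above.
        double triggerMb = flushTriggerMb(0.11, 4096);
        System.out.printf("Flush triggers at roughly %.0f MB of memtable data%n", triggerMb);
    }
}
```

Raising either value lets more data accumulate between flushes, which is the trade-off between flush frequency and memory pressure described above.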
Key Cache¶
The key cache stores partition key to SSTable offset mappings, eliminating partition index lookups for frequently accessed partitions.
# cassandra.yaml
# Key cache size (default: min of 5% heap or 100MB)
key_cache_size_in_mb: 100
# Save interval (0 disables saving)
key_cache_save_period: 14400
# Keys to save (empty = all)
key_cache_keys_to_save:
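The default sizing rule in the comment above (the smaller of 5% of the heap or 100MB) can be sketched as follows. This is an approximation for capacity planning, not Cassandra's actual code; `defaultKeyCacheMb` is a hypothetical helper name.

```java
// Sketch: approximate the default key cache size for a given heap size.
// Assumption: default = min(5% of heap, 100 MB), using integer arithmetic.
class KeyCacheDefault {
    static long defaultKeyCacheMb(long heapMb) {
        return Math.min(heapMb * 5 / 100, 100);
    }

    public static void main(String[] args) {
        for (long heapMb : new long[] {1024, 8192, 24576}) {
            System.out.println(heapMb + " MB heap -> " + defaultKeyCacheMb(heapMb) + " MB key cache");
        }
    }
}
```

Under this rule, any heap of about 2GB or more hits the 100MB cap, so a larger key cache has to be configured explicitly.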
Row Cache¶
The row cache stores entire rows. It can dramatically improve read performance for frequently accessed rows but consumes significant heap space and can cause GC pressure.
# cassandra.yaml
# Disabled by default (0)
row_cache_size_in_mb: 0
# Save interval
row_cache_save_period: 0
Row Cache Caution
Row cache is disabled by default for good reason. Enable only for specific tables with read-heavy, rarely-updated data. Consider the key cache or OS page cache first.
Off-Heap Memory¶
Off-heap memory is native memory allocated outside the JVM heap. It is not subject to garbage collection, eliminating GC pauses for these structures.
Why Off-Heap Memory Matters for Databases¶
Garbage collection is the primary source of latency variability in Java applications. For databases like Cassandra, GC pauses directly translate to:
- Query latency spikes: A 500ms GC pause means 500ms added to every in-flight query
- Coordinator timeouts: Other nodes may mark a pausing node as unresponsive
- Cluster instability: Gossip failures during long pauses can trigger unnecessary node replacements
The fundamental problem is that GC pause duration scales with heap size and object count. A database handling millions of partitions with gigabytes of cached data will have longer GC pauses than a simple web application.
Off-heap memory solves this by removing large, long-lived data structures from GC's responsibility entirely. The GC only sees a small pointer to the off-heap region, not the gigabytes of data stored there.
What is Off-Heap Memory?¶
The JVM manages two distinct memory regions:
- Heap memory: Managed by the garbage collector. Objects are allocated here by default. When objects are no longer referenced, the GC reclaims the memory during collection cycles, which can cause application pauses.
- Off-heap (native) memory: Allocated directly from the operating system, bypassing the JVM's garbage collector. The application is responsible for explicitly allocating and freeing this memory. No GC pauses occur for off-heap allocations.
How Java Accesses Off-Heap Memory¶
Java provides several mechanisms for allocating and accessing memory outside the heap:
1. Direct ByteBuffers (ByteBuffer.allocateDirect())
The standard Java API for off-heap memory allocation. Direct buffers allocate memory outside the heap and are commonly used for I/O operations.
// Allocate 1MB of off-heap memory
ByteBuffer buffer = ByteBuffer.allocateDirect(1024 * 1024);
// Write data
buffer.putInt(42);
buffer.putLong(System.currentTimeMillis());
// Read data
buffer.flip();
int value = buffer.getInt();
- Memory is allocated via malloc() in native code
- Buffer object on heap tracks the native memory address
- Memory is freed when the ByteBuffer is garbage collected (via a Cleaner)
- Can also be explicitly freed in Java 9+ via sun.misc.Unsafe or the Foreign Memory API
2. sun.misc.Unsafe (Internal API)
Provides low-level, direct memory access with no bounds checking. Used by high-performance libraries including Cassandra.
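// Cassandra uses Unsafe for direct memory operations.
// getUnsafe() is a placeholder for a helper that obtains the Unsafe instance,
// typically via reflection on the private "theUnsafe" field.
Unsafe unsafe = getUnsafe();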
// Allocate native memory
long address = unsafe.allocateMemory(1024);
// Write directly to memory address
unsafe.putInt(address, 42);
unsafe.putLong(address + 4, System.currentTimeMillis());
// Read from memory address
int value = unsafe.getInt(address);
// Must explicitly free
unsafe.freeMemory(address);
- Fastest possible memory access (no bounds checking)
- Application must manage allocation and deallocation
- Memory leaks occur if freeMemory() is not called
- Being replaced by the Foreign Function & Memory API in newer Java versions
3. Memory-Mapped Files (MappedByteBuffer)
Maps a file directly into the process address space. The OS handles paging data between disk and RAM.
// Map a file into memory
FileChannel channel = FileChannel.open(path, StandardOpenOption.READ, StandardOpenOption.WRITE);
MappedByteBuffer mapped = channel.map(FileChannel.MapMode.READ_WRITE, 0, fileSize);
// Access file contents as memory
int value = mapped.getInt(offset);
- OS manages which portions are in RAM (page cache)
- Efficient for large files that don't fit in memory
- Used by Cassandra for SSTable access in some configurations
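For readers who want to try memory mapping end to end, the following is a self-contained sketch that writes a value through one mapping and reads it back through another. The temporary file and the `MappedFileDemo` class are purely illustrative; this is not how Cassandra maps SSTables.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class MappedFileDemo {
    /** Writes an int through a read-write mapping, then reads it back through a read-only mapping. */
    static int roundTrip(int value) {
        try {
            Path path = Files.createTempFile("mapped-demo", ".bin");
            path.toFile().deleteOnExit();
            try (FileChannel channel = FileChannel.open(path,
                    StandardOpenOption.READ, StandardOpenOption.WRITE)) {
                // Mapping past the end of the file in READ_WRITE mode extends the file.
                MappedByteBuffer writable =
                        channel.map(FileChannel.MapMode.READ_WRITE, 0, Integer.BYTES);
                writable.putInt(0, value);
                writable.force(); // ask the OS to write the dirty page back to the file

                MappedByteBuffer readable =
                        channel.map(FileChannel.MapMode.READ_ONLY, 0, Integer.BYTES);
                return readable.getInt(0);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Read back: " + roundTrip(42));
    }
}
```

Both mappings are views of the same file pages, which is why the second mapping sees the value without any explicit read call.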
How Cassandra Uses Off-Heap Memory¶
Cassandra strategically places different data structures on or off the heap based on their characteristics:
| Component | Why Off-Heap? |
|---|---|
| Bloom filters | Bloom filters are large (10 bits per partition key × number of SSTables), long-lived (exist for SSTable lifetime), and accessed on every read. On-heap, they would consume gigabytes and be scanned by every GC cycle. |
| Partition index (trie) | The trie-based partition index (5.0+, BTI SSTable format) grows with partition count. A table with 100 million partitions could have a multi-gigabyte index. Off-heap placement prevents this from bloating GC pause times. |
| Compression metadata | Stores byte offsets for each compressed chunk in an SSTable. Grows proportionally to data size and SSTable count. Rarely changes once written. |
| Off-heap memtables | Memtables have high object churn—data is constantly written and then flushed. This churn creates GC pressure. Off-heap memtables keep write-path allocations out of the heap. |
| Chunk cache | Caches decompressed SSTable blocks. Can grow to multiple gigabytes for read-heavy workloads. Off-heap prevents cache growth from impacting GC. |
| Networking buffers | Direct ByteBuffers enable zero-copy I/O between the network stack and Cassandra. Data can be sent/received without copying through the heap. |
Why Not Put Everything Off-Heap?
Off-heap memory requires manual lifecycle management. Cassandra keeps some structures on-heap because:
- Short-lived objects: Request-scoped objects that are quickly discarded benefit from generational GC, which efficiently handles short-lived allocations
- Complex object graphs: Data structures with many internal references are difficult to serialize to flat memory regions
- Debugging: Heap dumps capture on-heap objects; off-heap memory is invisible to standard Java profiling tools
Cassandra's Memory Allocators
Cassandra implements custom memory allocators to manage off-heap memory efficiently:
- BufferPool: Manages pools of direct ByteBuffers for networking and chunk cache, avoiding allocation overhead
- NativeAllocator: Uses Unsafe for bloom filters and index structures with explicit lifecycle management
- Slab allocation: Reduces fragmentation by allocating fixed-size chunks rather than variable-sized blocks
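The slab idea can be illustrated with a small sketch: reserve one large direct buffer up front and hand out fixed-size views into it. This is a simplified illustration under that assumption, not Cassandra's allocator; `SlabAllocatorSketch` is a hypothetical class, and it omits freeing, recycling, and thread safety.

```java
import java.nio.ByteBuffer;

class SlabAllocatorSketch {
    private final ByteBuffer slab;  // one large off-heap region, allocated once
    private final int chunkSize;    // every allocation has this fixed size
    private int nextOffset = 0;     // start of the next unused chunk

    SlabAllocatorSketch(int slabSize, int chunkSize) {
        this.slab = ByteBuffer.allocateDirect(slabSize);
        this.chunkSize = chunkSize;
    }

    /** Returns a fixed-size view into the slab, or null when the slab is exhausted. */
    ByteBuffer allocate() {
        if (nextOffset + chunkSize > slab.capacity()) {
            return null;
        }
        ByteBuffer view = slab.duplicate();   // shares memory, independent position/limit
        view.limit(nextOffset + chunkSize);
        view.position(nextOffset);
        nextOffset += chunkSize;
        return view.slice();                  // view covering exactly one chunk
    }

    public static void main(String[] args) {
        SlabAllocatorSketch allocator = new SlabAllocatorSketch(1024, 256);
        ByteBuffer chunk = allocator.allocate();
        chunk.putInt(0, 42);
        System.out.println("chunk capacity=" + chunk.capacity() + ", direct=" + chunk.isDirect());
    }
}
```

Because every chunk has the same size, a released chunk can always satisfy the next request, which is the property that limits fragmentation.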
Off-Heap Trade-offs¶
| Advantage | Disadvantage |
|---|---|
| No GC pauses | Manual memory management required |
| Can exceed heap size limits | Memory leaks if not freed properly |
| Better cache locality for large structures | Slightly slower allocation than heap |
| Reduced heap pressure | Harder to debug (not visible in heap dumps) |
| Enables larger working sets | Must account for in capacity planning |
Monitoring Off-Heap Usage¶
# Total off-heap memory used by Cassandra
nodetool info | grep "Off Heap"
# Native memory tracking (JVM flag required)
# Add to jvm-server.options: -XX:NativeMemoryTracking=summary
jcmd <pid> VM.native_memory summary
Off-Heap Components in Detail¶
| Component | Description | Memory Scaling |
|---|---|---|
| Bloom filters | Probabilistic existence checks | ~10 bits per partition key per SSTable |
| Compression metadata | Chunk offset mappings | Proportional to data size |
| Partition index | Trie-based index (5.0+, BTI SSTable format) | Proportional to partition count |
| Memtables | Write buffer (if configured) | memtable_offheap_space_in_mb |
| Chunk cache | Decompressed SSTable chunks | file_cache_size_in_mb |
Off-Heap Memtables¶
Moving memtables off-heap reduces GC pressure significantly for write-heavy workloads.
# cassandra.yaml
# Off-heap memtable allocation (choose based on workload)
# memtable_allocation_type: offheap_objects # Write-heavy: lowest GC
# memtable_allocation_type: offheap_buffers # Read-heavy: minimal read impact
memtable_allocation_type: offheap_objects
| Type | Description | Best For |
|---|---|---|
| heap_buffers | All memtable data on heap (default) | Low-memory environments, simple deployments |
| offheap_buffers | Cell names/values in DirectBuffers, metadata on heap | Read-heavy workloads, large cell values (blobs, long strings) |
| offheap_objects | Entire cells off-heap, only pointers on heap | Write-heavy workloads, small cell values (ints, UUIDs), lowest GC pressure |
Choosing a Memtable Allocation Type
- offheap_objects: Recommended for write-heavy workloads. Provides lowest GC pressure but adds slight read overhead (data copied back to heap when read). Requires JNA library.
- offheap_buffers: Recommended for read-heavy workloads with large values. Minimal read impact but less GC reduction than offheap_objects.
- heap_buffers: Default. Use when off-heap complexity is not justified or JNA is unavailable.
Chunk Cache¶
The chunk cache stores decompressed SSTable chunks, reducing CPU overhead for repeated reads.
# cassandra.yaml
# When unset, defaults to the smaller of 1/4 of the heap or 512MB
# file_cache_size_in_mb: 512
# Manually set if needed
file_cache_size_in_mb: 2048
OS Page Cache¶
The operating system automatically caches recently accessed file data in unused RAM. This is Cassandra's primary read cache for SSTable data.
How It Works¶
- SSTable data cached after first read
- No Cassandra configuration required
- Automatically sized to available RAM
- Shared across all processes
- Evicted under memory pressure (LRU)
Sizing¶
Page Cache = Total RAM - JVM Heap - Off-Heap - OS Overhead
Example (64GB server):
- JVM Heap: 24GB
- Off-heap: 4-6GB
- OS overhead: 4GB
- Page cache: 30-32GB available
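The sizing formula above is plain subtraction, and a small helper makes it easy to rerun for other server sizes. This is a sketch with hypothetical names; the inputs are the figures from the 64GB example.

```java
// Sketch: Page Cache = Total RAM - JVM Heap - Off-Heap - OS Overhead (all in GB).
class PageCacheBudget {
    static int pageCacheGb(int totalRamGb, int heapGb, int offHeapGb, int osOverheadGb) {
        return totalRamGb - heapGb - offHeapGb - osOverheadGb;
    }

    public static void main(String[] args) {
        // Off-heap usage is given as a 4-6GB range, so compute both ends.
        int low = pageCacheGb(64, 24, 6, 4);
        int high = pageCacheGb(64, 24, 4, 4);
        System.out.println("Page cache available: " + low + "-" + high + " GB");
    }
}
```

A result near zero or below it means the heap or off-heap settings leave no room for the page cache and should be revisited.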
Maximizing Page Cache Effectiveness¶
- Size heap appropriately (not too large)
- Leave sufficient free RAM
- Avoid memory-hungry co-located processes
- Use SSDs so that cache misses are served faster
Memory Sizing Example¶
64GB Server Configuration¶
Configuration¶
# cassandra.yaml
# Memtables
memtable_heap_space_in_mb: 4096
memtable_allocation_type: offheap_buffers
# Key cache
key_cache_size_in_mb: 100
# Row cache (disabled)
row_cache_size_in_mb: 0
# jvm-server.options
-Xms24G
-Xmx24G
Monitoring Memory¶
Heap and Off-Heap Usage¶
# Overall memory status
nodetool info
# Heap memory
nodetool info | grep "Heap Memory"
# Off-heap memory
nodetool info | grep "Off Heap Memory"
# GC statistics
nodetool gcstats
Per-Table Memory¶
# Table statistics including bloom filter size
nodetool tablestats keyspace.table
# Bloom filter memory
nodetool tablestats | grep -i bloom
JMX Metrics¶
# Heap
java.lang:type=Memory/HeapMemoryUsage
# Memtables
org.apache.cassandra.metrics:type=Table,name=MemtableOnHeapSize
org.apache.cassandra.metrics:type=Table,name=MemtableOffHeapSize
# Caches
org.apache.cassandra.metrics:type=Cache,scope=KeyCache,name=Size
org.apache.cassandra.metrics:type=Cache,scope=RowCache,name=Size
# Bloom filters
org.apache.cassandra.metrics:type=Table,name=BloomFilterOffHeapMemoryUsed
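As a sketch of how such an attribute is read programmatically, the example below queries `java.lang:type=Memory` / `HeapMemoryUsage` on the local JVM's platform MBean server. Reading the same attribute from a running Cassandra node would need a remote JMX connection (Cassandra's default JMX port is 7199), which is not shown here; `HeapUsageProbe` is a hypothetical class name.

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;
import javax.management.openmbean.CompositeData;

class HeapUsageProbe {
    /** Reads the "used" field of HeapMemoryUsage from the local JVM, in bytes. */
    static long heapUsedBytes() {
        try {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            ObjectName memory = new ObjectName("java.lang:type=Memory");
            // HeapMemoryUsage is exposed as composite data with init/used/committed/max fields.
            CompositeData usage = (CompositeData) server.getAttribute(memory, "HeapMemoryUsage");
            return (Long) usage.get("used");
        } catch (Exception e) {
            throw new IllegalStateException("Could not read HeapMemoryUsage", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Heap used: " + heapUsedBytes() / (1024 * 1024) + " MB");
    }
}
```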
Troubleshooting¶
High Heap Usage¶
Symptoms:
- Long GC pauses
- Heap usage consistently >70%
- OutOfMemoryError
Solutions:
- Move memtables off-heap (memtable_allocation_type: offheap_buffers)
- Reduce key cache size
- Disable row cache if enabled
- Reduce number of tables (see below)
Memory Pressure from Many Tables¶
Each table requires memory for:
- One memtable
- Bloom filters per SSTable
- Index structures per SSTable
Table Count Guideline
Avoid more than 200 tables per node. Each table consumes memory regardless of data volume.
Bloom Filter Memory¶
Bloom filter memory scales with partition count and SSTable count:
Memory ≈ partitions × SSTables × bits_per_key
Example:
- 100 million partitions
- 20 SSTables average
- 10 bits per key
- ≈ 2.5GB bloom filter memory
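The estimate above can be reproduced with a few lines of arithmetic. The sketch below follows the formula as written, which treats every partition as present in every SSTable, so the result is best read as an upper bound; the class name is illustrative.

```java
// Sketch: Memory ≈ partitions × SSTables × bits_per_key, converted from bits to bytes.
class BloomFilterEstimate {
    static long bloomFilterBytes(long partitions, long sstables, long bitsPerKey) {
        return partitions * sstables * bitsPerKey / 8;
    }

    public static void main(String[] args) {
        long bytes = bloomFilterBytes(100_000_000L, 20, 10);
        System.out.printf("Estimated bloom filter memory: %.1f GB%n", bytes / 1e9);
    }
}
```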
Reduce bloom filter memory by:
- Increasing bloom_filter_fp_chance (allows more false positives)
- Reducing SSTable count through better compaction
- Using fewer, larger partitions
-- Increase false positive rate to reduce memory
ALTER TABLE my_table WITH bloom_filter_fp_chance = 0.1;
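To see why a higher false positive rate saves memory, the textbook formula for an optimally sized bloom filter gives bits per key as -ln(p) / (ln 2)^2. The sketch below applies that formula; it is the general result for bloom filters, and Cassandra's own sizing may round or cap differently, so treat the numbers as approximate.

```java
// Sketch: bits per key for an optimally configured bloom filter with false positive rate p.
// Assumption: textbook formula m/n = -ln(p) / (ln 2)^2, not Cassandra's exact sizing logic.
class BloomFilterBits {
    static double bitsPerKey(double falsePositiveChance) {
        double ln2 = Math.log(2);
        return -Math.log(falsePositiveChance) / (ln2 * ln2);
    }

    public static void main(String[] args) {
        for (double p : new double[] {0.01, 0.1}) {
            System.out.printf("fp_chance=%.2f -> %.2f bits per key%n", p, bitsPerKey(p));
        }
    }
}
```

Under this formula, moving from a 1% to a 10% false positive rate roughly halves the bits needed per key, at the cost of more unnecessary SSTable reads.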
Configuration Reference¶
Workload-Specific Settings¶
Write-Heavy:
memtable_heap_space_in_mb: 4096
memtable_allocation_type: offheap_objects # Lowest GC pressure for high write rates
memtable_flush_writers: 4
Read-Heavy:
memtable_allocation_type: offheap_buffers # Minimal read overhead
key_cache_size_in_mb: 200
# Ensure sufficient page cache for working set
Mixed:
memtable_heap_space_in_mb: 2048
memtable_allocation_type: offheap_objects
key_cache_size_in_mb: 100
Related Documentation¶
- JVM - JVM configuration and garbage collection
- Linux - Kernel settings, swap, THP, and NUMA
- Storage Engine Overview - Architecture overview
- Write Path - Memtable flush process
- Read Path - Cache behavior during reads