Storage Engine¶
This section describes the internal operations of a single Cassandra node. While Cassandra is a distributed database, each node operates an independent storage engine that manages local data persistence and retrieval. The distributed aspects—replication, consistency coordination, and cross-node communication—are covered in Distributed Data.
The storage engine is the core database component responsible for persisting data to disk and retrieving it efficiently. Cassandra's storage engine follows the Log-Structured Merge-tree (LSM-tree) approach, adapting the storage design described in Google's Bigtable paper (Chang et al., 2006, "Bigtable: A Distributed Storage System for Structured Data") to Cassandra's requirements.
Unlike traditional relational databases that update data in place, Cassandra's storage engine treats all writes as sequential appends. This design choice optimizes for write throughput and keeps write performance consistent regardless of dataset size.
Node-Local Focus
All operations described in this section—commit log writes, memtable management, SSTable creation, compaction, and indexing—occur independently on each node. A write operation arriving at a node is processed entirely by that node's storage engine, with no coordination with other nodes during the local persistence phase.
LSM-Tree Design¶
The LSM-tree architecture originated from the need to handle write-intensive workloads efficiently. Rather than performing random I/O to update B-tree pages, LSM-trees buffer writes in memory and periodically flush sorted data to immutable files on disk.
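The buffer-then-flush cycle can be made concrete with a short sketch. The following is illustrative only, not Cassandra's actual implementation; the class name, flush threshold, and file format are hypothetical. Writes land in an in-memory sorted map, and when the map grows large enough its contents are written out sequentially as a new immutable file.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.*;
import java.util.concurrent.ConcurrentSkipListMap;

// Minimal LSM-style writer: buffer sorted writes in memory, flush them to an immutable file.
// Illustrative only; names, format, and thresholds are hypothetical, not Cassandra internals.
class TinyLsmWriter {
    private final ConcurrentSkipListMap<String, String> memtable = new ConcurrentSkipListMap<>();
    private final Path dataDir;
    private int generation = 0;

    TinyLsmWriter(Path dataDir) { this.dataDir = dataDir; }

    // Writes are absorbed in memory; no random disk I/O on the write path.
    void put(String key, String value) throws IOException {
        memtable.put(key, value);
        if (memtable.size() >= 10_000) {   // arbitrary flush threshold for the sketch
            flush();
        }
    }

    // Flush the sorted contents sequentially into a new immutable file, then reset the buffer.
    void flush() throws IOException {
        Path sstable = dataDir.resolve("sstable-" + (generation++) + ".txt");
        try (BufferedWriter out = Files.newBufferedWriter(sstable, StandardOpenOption.CREATE_NEW)) {
            for (var entry : memtable.entrySet()) {
                out.write(entry.getKey() + "\t" + entry.getValue());
                out.newLine();
            }
        }
        memtable.clear();   // the file is never modified again; later writes go to a fresh memtable
    }
}
```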
B-tree vs LSM-tree¶
Traditional databases use B-tree storage with random writes to update pages in place. LSM-tree databases use sequential writes only, appending data to immutable files.
| Characteristic | B-tree (RDBMS) | LSM-tree (Cassandra) |
|---|---|---|
| Write pattern | Random I/O | Sequential I/O |
| Write performance | Degrades with size | Consistent |
| Read performance | Single seek | Multiple file checks |
| Space efficiency | High | Requires compaction |
| Write amplification | Page splits | Compaction rewrites |
Design Trade-offs¶
LSM-tree advantages:
- Write latency remains consistent regardless of data size
- Sequential writes maximize disk throughput
- Horizontal scaling without central index coordination
- Effective on both HDD and SSD storage
LSM-tree costs:
- Reads may check multiple files (see the sketch after this list)
- Background compaction required
- Space amplification during compaction
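To make the read-side cost concrete, the sketch below checks the in-memory map first and then scans immutable files from newest to oldest, returning the first match. It is a simplification with hypothetical names and a toy file format; a real engine narrows the candidate files with bloom filters and indexes, as covered in Read Path.

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;

// Illustrative read path for an LSM-style store: check memory first, then newest-to-oldest files.
class TinyLsmReader {
    static Optional<String> get(NavigableMap<String, String> memtable,
                                List<Path> sstablesNewestFirst,
                                String key) throws IOException {
        String inMemory = memtable.get(key);
        if (inMemory != null) {
            return Optional.of(inMemory);      // the newest data always lives in the memtable
        }
        for (Path sstable : sstablesNewestFirst) {
            // Each file is assumed to be a sorted "key<TAB>value" dump produced by a flush.
            for (String line : Files.readAllLines(sstable)) {
                int tab = line.indexOf('\t');
                if (line.substring(0, tab).equals(key)) {
                    return Optional.of(line.substring(tab + 1));
                }
            }
        }
        return Optional.empty();               // key not present in memory or in any file
    }
}
```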
Storage Architecture¶
Component Overview¶
Commit Log¶
Write-ahead log providing durability. All writes append to the commit log before updating the memtable. Used only for crash recovery.
- Sequential append-only writes
- Segmented into fixed-size files (default 32MB)
- Recycled after referenced memtables flush
See Write Path for configuration details.
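The ordering guarantee, append to the log before touching the memtable, can be sketched as follows. This is not the actual Cassandra commit log code; the class name, record format, and per-write sync are hypothetical simplifications (real engines batch or periodically sync).

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.concurrent.ConcurrentSkipListMap;

// Illustrative write-ahead ordering: append to the log segment, then apply to the memtable.
class TinyCommitLog {
    private final FileChannel log;
    private final ConcurrentSkipListMap<String, String> memtable = new ConcurrentSkipListMap<>();

    TinyCommitLog(Path segment) throws IOException {
        this.log = FileChannel.open(segment, StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    synchronized void write(String key, String value) throws IOException {
        // 1. Durability first: sequentially append the mutation to the log segment.
        byte[] record = (key + "\t" + value + "\n").getBytes(StandardCharsets.UTF_8);
        log.write(ByteBuffer.wrap(record));
        log.force(false);                 // fsync; shown per write here only for clarity
        // 2. Only after the append succeeds is the in-memory structure updated.
        memtable.put(key, value);
    }
}
```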
Memtable¶
In-memory sorted data structure holding recent writes. One memtable exists per table per node.
- ConcurrentSkipListMap implementation
- Sorted by partition key token, then clustering columns
- Flushed to SSTable when size threshold reached
See Write Path for memory configuration.
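As a rough sketch of that ordering (the key class below is hypothetical, not Cassandra's internal partition or clustering types), a concurrent skip-list map keyed by a comparable (token, clustering) pair keeps entries sorted so a flush can write them out sequentially:

```java
import java.util.concurrent.ConcurrentSkipListMap;

// Illustrative memtable key: entries sort by partition key token first, then by clustering value.
record MemtableKey(long token, String clustering) implements Comparable<MemtableKey> {
    public int compareTo(MemtableKey other) {
        int byToken = Long.compare(token, other.token);
        return byToken != 0 ? byToken : clustering.compareTo(other.clustering);
    }
}

class MemtableSketch {
    // A concurrent skip list keeps keys sorted without locking the whole structure on writes.
    final ConcurrentSkipListMap<MemtableKey, String> rows = new ConcurrentSkipListMap<>();

    void put(long token, String clustering, String value) {
        rows.put(new MemtableKey(token, clustering), value);
    }
}
```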
SSTable¶
Sorted String Table - immutable files on disk containing partition data. Each SSTable consists of multiple component files.
- Immutable after creation
- Contains data, indexes, bloom filter, metadata
- Merged during compaction
See SSTable Reference for file format details.
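For orientation, the sketch below lists component files typically seen for one SSTable in the common "big" format. The exact names and full set vary by Cassandra version and SSTable format; treat this as illustrative and see SSTable Reference for the authoritative list.

```java
// Illustrative only: typical component files making up one SSTable in the "big" format.
enum SSTableComponent {
    DATA("Data.db"),              // the partition data itself
    PRIMARY_INDEX("Index.db"),    // partition index into the data file
    FILTER("Filter.db"),          // bloom filter over partition keys
    SUMMARY("Summary.db"),        // sampled index summary held in memory
    STATISTICS("Statistics.db"),  // metadata such as min/max timestamps
    TOC("TOC.txt");               // list of components present in this SSTable

    final String suffix;
    SSTableComponent(String suffix) { this.suffix = suffix; }
}
```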
Compaction¶
Background process merging SSTables to reclaim space and improve read performance.
- Combines multiple SSTables into fewer, larger files
- Removes obsolete data and expired tombstones
- Multiple strategies available (STCS, LCS, TWCS)
See Compaction for strategy details.
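The essence of the merge can be sketched as follows, using hypothetical types rather than Cassandra's internals: entries from the input SSTables are combined per key, the version with the newest timestamp wins, and tombstones older than the grace cutoff are dropped rather than copied forward.

```java
import java.util.*;

// Illustrative compaction merge: newest write wins per key; sufficiently old tombstones are purged.
class CompactionSketch {
    // Stand-in for one cell: tombstone == true marks a deletion rather than a live value.
    record Cell(String value, long timestamp, boolean tombstone) {}

    static NavigableMap<String, Cell> merge(List<NavigableMap<String, Cell>> inputs,
                                            long gcBeforeTimestamp) {
        NavigableMap<String, Cell> merged = new TreeMap<>();
        for (NavigableMap<String, Cell> sstable : inputs) {
            for (var e : sstable.entrySet()) {
                merged.merge(e.getKey(), e.getValue(),
                             (a, b) -> a.timestamp() >= b.timestamp() ? a : b);  // newest wins
            }
        }
        // Purge tombstones older than the grace cutoff instead of rewriting them
        // (real compaction also accounts for overlapping SSTables before purging).
        merged.values().removeIf(c -> c.tombstone() && c.timestamp() < gcBeforeTimestamp);
        return merged;
    }
}
```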
Documentation Structure¶
| Section | Description |
|---|---|
| Write Path | Commit log, memtable, flush process |
| Read Path | Bloom filters, indexes, caching |
| SSTable Reference | File components and format |
| Tombstones | Deletion markers and gc_grace |
| Compaction | SSTable merge strategies and operations |
| Indexes | Secondary indexes, SASI, and SAI |
| Materialized Views | Automatic denormalization |
Related Documentation¶
- Compaction - SSTable merge strategies
- Indexes - Secondary indexes, SASI, and SAI
- Materialized Views - Automatic denormalization
- Replication - Data distribution
- Consistency - Read and write consistency levels