SSTable Reference¶

SSTables (Sorted String Tables) are Cassandra's persistent storage files. All data ultimately resides in SSTables on disk; they are the database files. When a memtable flushes, it creates an SSTable. When compaction runs, it reads SSTables and writes new ones. When a node restarts, it opens the existing SSTables and replays the commit log to recover writes that had not yet been flushed.

Each SSTable is immutable once written. This immutability simplifies concurrency (no locks needed for reads), enables efficient sequential writes, and allows safe snapshots via hard links. However, it also means that updates and deletes create new data rather than modifying existing files, requiring background compaction to reclaim space and merge versions.

An SSTable is not a single file but a set of component files: data, indexes, bloom filter, compression metadata, and statistics. Understanding these components is essential for troubleshooting, capacity planning, and performance analysis.

SSTable File Location¶

data_directory/keyspace_name/table_name-table_uuid/
├── na-1-big-Data.db
├── na-1-big-Index.db
├── na-1-big-Filter.db
├── na-1-big-Statistics.db
├── na-1-big-Summary.db
├── na-1-big-CompressionInfo.db
├── na-1-big-Digest.crc32
└── na-1-big-TOC.txt

File Naming Convention¶

<version>-<generation>-<format>-<component>.<extension>

Example: na-1-big-Data.db

na      - SSTable format version
1       - Generation number (increments for every new SSTable: flush, compaction, or streaming)
big     - Format type
Data    - Component type
db      - File extension
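
The naming convention above can be taken apart mechanically. The following is a minimal sketch, assuming the five-part layout shown here; the `SSTableName` type and `parse_sstable_name` function are hypothetical helpers for illustration, not part of any Cassandra tooling.

```python
from typing import NamedTuple


class SSTableName(NamedTuple):
    version: str     # e.g. "na"
    generation: str  # kept as text so non-numeric identifiers also fit
    fmt: str         # e.g. "big"
    component: str   # e.g. "Data"
    extension: str   # e.g. "db"


def parse_sstable_name(filename: str) -> SSTableName:
    """Split '<version>-<generation>-<format>-<component>.<extension>'."""
    stem, dot, extension = filename.rpartition(".")
    if not dot:
        raise ValueError(f"no extension in {filename!r}")
    parts = stem.split("-")
    if len(parts) != 4:
        raise ValueError(f"expected 4 dash-separated fields in {filename!r}")
    version, generation, fmt, component = parts
    return SSTableName(version, generation, fmt, component, extension)


print(parse_sstable_name("na-1-big-Data.db"))
```

The generation is deliberately left as a string, so the same function also accepts identifiers that are not plain integers.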

Version Identifiers¶

Format Version	Cassandra Version	Notes
`la`	2.2	Legacy format
`lb`	2.2	Legacy format
`ma`	3.0	Introduced new storage format
`mb`	3.0	Storage format revision
`mc`	3.0	Storage format revision
`md`	3.11	Storage format revision
`na`	4.0	Storage format revision for 4.0
`nb`	4.0+	Format revision
`nc`	4.1+	Format revision
`oa`	5.0	Latest big format
`da`	5.0	BTI format

SSTable Format Types¶

The third dash-separated field in the filename (e.g., big or bti) indicates the SSTable format type. Cassandra 5.0 introduces the BTI format as an alternative to the legacy "big" format.

Format	Name	Introduced	Description
`big`	Big Table Format	Original	Legacy format with separate index and summary files
`bti`	Big Trie Index	5.0	New format with block-based trie indexes

SSTable Formats: Big vs BTI¶

Cassandra 5.0 introduced the BTI (Big Trie Index) format (CASSANDRA-17056), a significant redesign of SSTable on-disk structure. The BTI format uses block-based trie indexes for both partition and row lookups, replacing the legacy index structures.

Format Comparison¶

Aspect	Big Format	BTI Format
Partition index	Index.db + Summary.db	Partitions.db (trie)
Row index	Embedded in Index.db	Rows.db (trie)
Memory usage	Higher (index summary held in memory)	Lower (off-heap, memory-mapped)
Index size	Larger	50-80% smaller (prefix compression)
Lookup complexity	O(log n)	O(key length)
Write amplification	Standard	Slightly higher during flush

BTI Format File Components¶

The BTI format introduces new file extensions:

Component	Big Format	BTI Format	Description
Partition Index	`-Index.db`	`-Partitions.db`	Maps partition keys to data offsets
Row Index	(in Index.db)	`-Rows.db`	Maps clustering keys within partitions
Summary	`-Summary.db`	(eliminated)	Not needed with trie index

BTI SSTable file listing:

data_directory/keyspace_name/table_name-table_uuid/
├── da-1-bti-Data.db           # Row data (same as big format)
├── da-1-bti-Partitions.db     # Trie-based partition index (replaces Index.db)
├── da-1-bti-Rows.db           # Trie-based row index (new)
├── da-1-bti-Filter.db         # Bloom filter (same as big format)
├── da-1-bti-Statistics.db     # SSTable metadata
├── da-1-bti-CompressionInfo.db
├── da-1-bti-Digest.crc32
└── da-1-bti-TOC.txt

How BTI Trie Indexes Work¶

The BTI format uses byte-ordered trie data structures for both partition and row indexes. This approach provides several advantages over the legacy format:

1. Prefix Compression

Partition keys with common prefixes share storage in the trie structure:

Keys: user:1001, user:1002, user:1003, user:2001

Legacy Index.db:
  user:1001 → offset 1000
  user:1002 → offset 2000
  user:1003 → offset 3000
  user:2001 → offset 4000
  (each key stored in full)

BTI Partitions.db (trie):
  root → "user:" → "100" → "1" → offset 1000
                         → "2" → offset 2000
                         → "3" → offset 3000
               → "2001" → offset 4000
  (common prefixes stored once)
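
The storage saving from shared prefixes can be checked with a toy trie. This is only an illustrative sketch: it works on characters in nested dictionaries, whereas the BTI files use a packed, block-based byte-ordered layout that is not modeled here. The function names are made up for the example.

```python
def build_trie(offsets_by_key):
    """Build a nested-dict trie; the "" entry marks the end of a stored key."""
    root = {}
    for key, offset in offsets_by_key.items():
        node = root
        for ch in key:
            node = node.setdefault(ch, {})
        node[""] = offset  # terminal marker holds the data offset
    return root


def count_edges(node):
    """Count character edges, i.e. how many key characters the trie stores."""
    return sum(1 + count_edges(child) for ch, child in node.items() if ch != "")


def lookup(root, key):
    """Follow one edge per character; cost depends on key length, not key count."""
    node = root
    for ch in key:
        node = node.get(ch)
        if node is None:
            return None
    return node.get("")


offsets = {"user:1001": 1000, "user:1002": 2000, "user:1003": 3000, "user:2001": 4000}
trie = build_trie(offsets)
flat_chars = sum(len(k) for k in offsets)  # characters in a flat list of full keys
trie_chars = count_edges(trie)             # characters stored once in the trie
print(flat_chars, trie_chars, lookup(trie, "user:1002"))
```

For these four keys the flat list holds 36 characters while the trie holds 15, because "user:" and "100" are stored once.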

2. Block-Based Organization

The trie is organized into fixed-size blocks that can be:

Memory-mapped for efficient access
Loaded on-demand (not all in memory)
Cached at the OS page cache level

3. Efficient Range Queries

The trie structure naturally supports efficient iteration for range queries, as entries are stored in sorted order.

Configuration¶

The SSTable format is selected per node in cassandra.yaml and should be set consistently across the cluster:

# cassandra.yaml (Cassandra 5.0+)

# Format used when writing new SSTables
# Options: big, bti
# Default: big (for compatibility), bti recommended for 5.0+
sstable:
  selected_format: bti

Per-table SSTable format configuration is not yet available. See CASSANDRA-18534 for status.

Migration Considerations¶

Consideration	Details
Compatibility	Big and BTI SSTables can coexist in the same table
Conversion	Existing SSTables remain in their format until rewritten
Upgrade path	Set `selected_format: bti` under `sstable` in cassandra.yaml, restart, then run `nodetool upgradesstables -a`
Downgrade	BTI SSTables cannot be read by Cassandra < 5.0
Tools	All SSTable tools (sstablemetadata, sstabledump, etc.) support BTI

Converting existing tables to BTI:

# 1. Update cassandra.yaml to use BTI format
# 2. Rewrite all SSTables to the new format
nodetool upgradesstables -a keyspace table

# -a flag rewrites all SSTables, even if already at current version
# Without -a, only SSTables from older Cassandra versions are rewritten
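
Because both formats can coexist in one table directory, conversion progress can be followed by counting Data components per format. The sketch below is a hypothetical helper, not a Cassandra tool; it assumes the `<version>-<generation>-<format>-<component>.<extension>` naming described earlier.

```python
from collections import Counter


def count_formats(filenames):
    """Count SSTables per format type, using one Data component per SSTable."""
    counts = Counter()
    for name in filenames:
        stem, _, _ = name.rpartition(".")
        parts = stem.split("-")
        if len(parts) == 4 and parts[3] == "Data":
            counts[parts[2]] += 1
    return counts


# Typical use (the path is a placeholder):
#   import os
#   print(count_formats(os.listdir("/var/lib/cassandra/data/<keyspace>/<table-dir>")))
```

A table is fully converted once the count for `big` drops to zero.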

When to Use BTI Format¶

Recommended for:

New Cassandra 5.0+ clusters
Tables with many partitions (index size savings)
Tables with long partition keys (prefix compression benefits)
Memory-constrained environments (lower heap usage)

Consider staying with Big format if:

Running mixed-version clusters during upgrade
Need to maintain downgrade capability to < 5.0
Existing tooling depends on legacy file structure

SSTable Identifiers (Cassandra 4.1+)¶

Cassandra 4.1 introduced an alternative SSTable naming scheme using globally unique identifiers instead of sequential generation numbers. The feature is opt-in: it is disabled in the default cassandra.yaml, and Cassandra 5.0 enables it only in the optional cassandra_latest.yaml profile.

Traditional (sequential):

na-1-big-Data.db
na-2-big-Data.db
na-3-big-Data.db

ULID-based identifiers alongside sequential ones (Cassandra 4.1+):

nb-1-big-Data.db                                  (sequential)
nb-3fw2_0zer_0000wjnhm8y18d0000-big-Data.db       (ULID-based)

What is ULID?¶

ULID (Universally Unique Lexicographically Sortable Identifier) is a 128-bit identifier designed as an alternative to UUID that maintains chronological ordering when sorted as strings.

Why ULID instead of UUID:

Characteristic	UUID v1/v4	ULID
Sortability	Not sortable (random bits)	Lexicographically sortable by time
Time component	UUID v1: present but not prefix	First 48 bits are timestamp
String encoding	Hex with dashes (36 chars)	Base32/Base36 (26-28 chars)
Natural ordering	None	Creation time order
Filesystem friendliness	Contains dashes	No special characters

Standard UUIDs (even time-based UUID v1) do not sort lexicographically by creation time because the timestamp bits are not positioned at the start. ULID places the timestamp in the most significant bits, ensuring that lexicographic string comparison produces chronological ordering.

Identifier Structure:

Cassandra's ULID implementation uses a 28-character string: 26 Base36 digits (0-9a-z) and two underscore separators:

3fw2_0zer_0000wjnhm8y18d0000
├──┘ ├──┘ ├───┘├───────────┘
│    │    │    │
│    │    │    └── Random part (13 chars) - unique per Cassandra process
│    │    └─────── Nano part (5 chars) - nanosecond precision
│    └──────────── Second part (4 chars) - seconds within day
└───────────────── Day part (4 chars) - days since epoch

Format regex: ([0-9a-z]{4})_([0-9a-z]{4})_([0-9a-z]{5})([0-9a-z]{13})
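
The regex above can be used to tell the two identifier styles apart and to split an identifier into its parts. This is a minimal sketch: the group names are labels chosen for the example, and the sample identifiers are invented values that merely satisfy the pattern.

```python
import re

# Pattern quoted from the text above; group names are labels for this sketch.
SSTABLE_ID = re.compile(
    r"(?P<day>[0-9a-z]{4})_(?P<second>[0-9a-z]{4})_"
    r"(?P<nano>[0-9a-z]{5})(?P<random>[0-9a-z]{13})"
)


def split_identifier(identifier: str) -> dict:
    """Return the four parts of a ULID-style identifier, or raise ValueError."""
    match = SSTABLE_ID.fullmatch(identifier)
    if match is None:
        raise ValueError(f"not a ULID-style SSTable identifier: {identifier!r}")
    return match.groupdict()


def is_sequential(identifier: str) -> bool:
    """Sequential generations are plain decimal numbers."""
    return identifier.isdigit()


sample = "3fw2_0zer_0000wjnhm8y18d0000"  # invented example value
print(split_identifier(sample))

# Plain string sorting orders identifiers by their leading time fields.
print(sorted(["3fw3_1abc_0000wabc123def0000", sample]))
```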

Benefits of ULID for SSTables:

Lexicographically sortable - SSTable files sort naturally by creation time in directory listings
Globally unique - No collisions across the entire cluster, even after truncate and restart
Self-describing - Creation time encoded directly in the identifier without metadata lookup
Monotonic - Within the same millisecond, identifiers increment to preserve ordering
Compact - 28 characters vs 36 for standard UUID string representation

Configuration:

# cassandra.yaml

# Enable ULID-based SSTable identifiers
# Default: false; set to true in the optional cassandra_latest.yaml profile (5.0+)
# WARNING: Cannot be disabled once SSTables are created with ULIDs
uuid_sstable_identifiers_enabled: true

Note: The configuration parameter retains the name uuid_sstable_identifiers_enabled for historical reasons, though the implementation uses ULID.

Comparison:

Aspect	Sequential	ULID-based
Uniqueness	Per-table only	Cluster-wide
Streaming conflicts	Possible	None
Sorting	Numeric order	Lexicographic (time-ordered)
Creation time	Requires metadata lookup	Encoded in identifier
Downgrade	Always supported	Not supported once enabled

Problem: Generation Counter Reset After Truncate

Sequential generation numbers reset after truncating a table and restarting the node. This causes SSTable identifier collisions during backup restore operations.

Scenario: Backup and restore after truncate

Step 1: Table has data, take a snapshot backup
└── keyspace/table-abc123/
    ├── nb-1-big-Data.db
    ├── nb-2-big-Data.db
    └── nb-3-big-Data.db

    → nodetool snapshot keyspace table (backup saved)

Step 2: Truncate the table
    → TRUNCATE keyspace.table;
    → All SSTables removed, generation counter state cleared

Step 3: Restart the node
    → Generation counter resets to 1

Step 4: New data written to table
└── keyspace/table-abc123/
    ├── nb-1-big-Data.db    ← NEW data, same filename as backup!
    ├── nb-2-big-Data.db    ← NEW data, same filename as backup!
    └── nb-3-big-Data.db    ← NEW data, same filename as backup!

Step 5: Attempt to restore backup
    → CONFLICT: Backup files (nb-1, nb-2, nb-3) collide with current files
    → Cannot restore without overwriting current data or manually renaming

Remote backup storage corruption:

The problem is worse with remote backup destinations (S3, GCS, Azure Blob). Incremental backups upload SSTables by filename:

Remote storage (S3 bucket):
└── backups/cluster1/node1/keyspace/table/
    ├── nb-1-big-Data.db    ← From initial backup (important data)
    ├── nb-2-big-Data.db
    └── nb-3-big-Data.db

After truncate + restart + new writes:
└── New SSTable files: nb-1, nb-2, nb-3

Next incremental backup runs:
└── backups/cluster1/node1/keyspace/table/
    ├── nb-1-big-Data.db    ← OVERWRITTEN with new data!
    ├── nb-2-big-Data.db    ← OVERWRITTEN - original backup lost!
    └── nb-3-big-Data.db    ← OVERWRITTEN

The original backup data is permanently lost. Backup tools cannot distinguish between "same file updated" and "different file with same name."
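
A backup tool can guard against this by comparing content as well as names before uploading. The sketch below is a hypothetical pre-upload check, not taken from any existing backup tool; it assumes the caller has already built filename-to-digest maps for the remote and local sides.

```python
def classify_uploads(remote, local):
    """Sort local files into new, unchanged, and conflicting by name and digest."""
    new, unchanged, conflicting = [], [], []
    for name, digest in sorted(local.items()):
        if name not in remote:
            new.append(name)
        elif remote[name] == digest:
            unchanged.append(name)
        else:
            conflicting.append(name)  # same name, different content
    return new, unchanged, conflicting


remote = {"nb-1-big-Data.db": "digest-old-1", "nb-2-big-Data.db": "digest-old-2"}
local = {
    "nb-1-big-Data.db": "digest-new-1",   # rewritten after truncate and restart
    "nb-2-big-Data.db": "digest-old-2",
    "nb-3-big-Data.db": "digest-new-3",
}
print(classify_uploads(remote, local))
```

Files in the conflicting group should be uploaded under a different key or rejected rather than allowed to overwrite the earlier backup.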

Same scenario with ULID identifiers:

Step 1: Table has data, take a snapshot backup
└── keyspace/table-abc123/
    ├── nb-3fw2_0zer_0000wjnhm8y18d0000-big-Data.db
    ├── nb-3fw2_0zer_0001xkpl9z28e11100-big-Data.db
    └── nb-3fw2_0zer_0002ymqm0a39f22200-big-Data.db

    → nodetool snapshot keyspace table (backup saved)

Step 2: Truncate and restart
    → TRUNCATE keyspace.table;
    → Restart node (ULID generator continues with new random component)

Step 3: New data written to table
└── keyspace/table-abc123/
    ├── nb-3fw3_1abc_0000wabc123def0000-big-Data.db   ← Different identifier
    ├── nb-3fw3_1abc_0001xdef456ghi1100-big-Data.db
    └── nb-3fw3_1abc_0002yghi789jkl2200-big-Data.db

Step 4: Restore backup - no conflicts
└── keyspace/table-abc123/
    ├── nb-3fw2_0zer_0000wjnhm8y18d0000-big-Data.db   ← Restored from backup
    ├── nb-3fw2_0zer_0001xkpl9z28e11100-big-Data.db   ← Restored from backup
    ├── nb-3fw2_0zer_0002ymqm0a39f22200-big-Data.db   ← Restored from backup
    ├── nb-3fw3_1abc_0000wabc123def0000-big-Data.db   ← Current data preserved
    ├── nb-3fw3_1abc_0001xdef456ghi1100-big-Data.db
    └── nb-3fw3_1abc_0002yghi789jkl2200-big-Data.db

ULID identifiers incorporate a random component unique to each Cassandra process, so identifiers never repeat even after truncate and restart.

Scenarios where ULID identifiers prevent collisions:

Operation	Sequential Problem	ULID Solution
Restore after truncate	Generation resets, filenames collide	Unique identifiers always
Incremental backup to S3/GCS	New files overwrite old backups	Each backup file unique
Multiple backup restore	Cannot merge backups from different times	Safe to combine
Repair streaming	Incoming SSTable may match local name	No conflicts possible
Node rebuild	Streamed files may collide	Safe parallel streaming

Source: Apache Cassandra 4.1: New SSTable Identifiers

SSTable Component Files¶

Data File (Data.db)¶

Contains the actual row data for all partitions in the SSTable.

Attribute	Description
Purpose	Store partition and row data
Contents	Serialized partitions with rows and cells
Compression	Compressed in chunks (configurable)
Size	Largest component, varies with data volume

Structure:

┌─────────────────────────────────────────────────────────┐
│ Partition 1                                             │
│ ├── Partition Key (serialized)                          │
│ ├── Partition Header (deletion info, flags)             │
│ ├── Row 1 (clustering key + cells)                      │
│ ├── Row 2 (clustering key + cells)                      │
│ └── ...                                                 │
├─────────────────────────────────────────────────────────┤
│ Partition 2                                             │
│ └── ...                                                 │
├─────────────────────────────────────────────────────────┤
│ ...                                                     │
└─────────────────────────────────────────────────────────┘

Partition Index¶

Maps partition keys to byte offsets in the Data file. The implementation differs between SSTable formats.

Big Format: Index.db¶

Attribute	Description
File	`-Index.db`
Purpose	Map partition keys to data file offsets
Structure	Sorted list of (partition key, offset) pairs
Memory	Read through the OS page cache; entry points found via the in-memory Summary.db sample

Big format lookup flow:

Partition Key → Summary.db (sampled) → Index.db (scan) → Data.db offset

This flow applies to every Cassandra version that writes the big format, including 4.0 and later. A trie-based partition index is available only with the BTI format.
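
The sampled lookup through Summary.db can be sketched as a binary search over the sample followed by a short scan of the index. This is a simplified model with made-up function names: it uses plain string keys in sorted order and a fixed sampling interval, whereas Cassandra orders partitions by token and stores both structures in binary form.

```python
import bisect


def build_summary(index_keys, interval):
    """Sample every interval-th key of the sorted index with its position."""
    return [(key, i) for i, key in enumerate(index_keys) if i % interval == 0]


def lookup(key, summary, index):
    """index is a sorted list of (partition_key, data_offset) pairs."""
    sampled_keys = [k for k, _ in summary]
    slot = bisect.bisect_right(sampled_keys, key) - 1
    if slot < 0:
        return None  # key sorts before the first indexed key
    start = summary[slot][1]
    end = summary[slot + 1][1] if slot + 1 < len(summary) else len(index)
    for candidate, offset in index[start:end]:  # scan at most one interval
        if candidate == key:
            return offset
    return None


index = [(f"k{i:03d}", i * 100) for i in range(10)]
summary = build_summary([k for k, _ in index], interval=4)
print(lookup("k005", summary, index))
```

Only the sample has to stay in memory; the scan touches at most one interval's worth of index entries.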

BTI Format: Partitions.db (Cassandra 5.0+)¶

Attribute	Description
File	`-Partitions.db`
Purpose	Block-based trie partition index
Structure	Byte-ordered trie with fixed-size blocks
Memory	Memory-mapped, fully off-heap
Benefits	50-80% smaller than Index.db, O(key length) lookups

BTI lookup flow:

Partition Key → Partitions.db (trie traversal) → Data.db offset

The BTI format's block-based organization allows efficient memory mapping and on-demand loading—only accessed blocks are read from disk.

Row Index¶

Maps clustering keys to positions within large partitions, enabling efficient lookups without scanning entire partitions.

Big Format: Embedded in Index.db¶

Attribute	Description
Location	Stored within Index.db
Purpose	Locate rows within partitions exceeding column_index_size
Contents	Clustering key boundaries at configurable intervals
Threshold	Created when partition exceeds `column_index_size_in_kb` (default: 64KB)

BTI Format: Rows.db (Cassandra 5.0+)¶

Attribute	Description
File	`-Rows.db`
Purpose	Separate trie-based row index
Structure	Byte-ordered trie of clustering keys
Memory	Memory-mapped, off-heap
Benefits	Faster row lookups in wide partitions, separate from partition index

The BTI format separates row indexing into its own file, improving cache efficiency and allowing independent optimization of partition and row lookups.

Bloom Filter (Filter.db)¶

Probabilistic data structure for quick partition key lookups.

Attribute	Description
Purpose	Quickly eliminate SSTables from read path
Contents	Bit array with hashed partition keys
False Positives	Possible (configurable rate)
False Negatives	Impossible
Memory	Loaded into off-heap memory

Configuration:

ALTER TABLE my_table WITH bloom_filter_fp_chance = 0.01;
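
The memory cost of a given false-positive target can be estimated with the textbook Bloom filter formulas. This is a rough sizing sketch only; Cassandra's own sizing logic may round or cap these values differently, and `bloom_parameters` is a name invented for the example.

```python
import math


def bloom_parameters(fp_chance, keys):
    """Textbook optimum: bits per key, hash count, and total filter size in bytes."""
    bits_per_key = -math.log(fp_chance) / (math.log(2) ** 2)
    hash_count = max(1, round(bits_per_key * math.log(2)))
    total_bytes = math.ceil(bits_per_key * keys / 8)
    return bits_per_key, hash_count, total_bytes


bits, hashes, size = bloom_parameters(fp_chance=0.01, keys=1_000_000)
print(f"{bits:.2f} bits/key, {hashes} hashes, {size / 1024 / 1024:.2f} MiB")
```

Lowering `bloom_filter_fp_chance` by a factor of ten adds roughly 4.8 bits per partition key under this model.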

Summary (Summary.db) - Big Format Only¶

Sampled index for efficient partition lookup. Written for every big-format SSTable, including those created by Cassandra 4.0 and later; not present in the BTI format.

Attribute	Description
File	`-Summary.db`
Purpose	In-memory sample of partition index
Contents	Every Nth partition key from Index.db
Memory	Loaded into off-heap memory
Status	Used by the big format; replaced by the trie in Partitions.db for BTI

The summary file is necessary because Index.db contains a flat, sorted list of all partition keys. Scanning the entire index for each lookup would be too slow, so the summary provides jump points into the index.

The BTI format replaces this with trie-based indexes that provide O(key length) lookups directly, eliminating the need for sampling.

Configuration (big format, per table):

ALTER TABLE my_table WITH min_index_interval = 128    -- Minimum sampling interval
    AND max_index_interval = 2048;                    -- Maximum sampling interval

Format	Summary.db Present?
Big (all versions)	Yes
BTI (5.0+)	No

Compression Info (CompressionInfo.db)¶

Metadata for compressed data chunks.

Attribute	Description
Purpose	Map uncompressed offsets to compressed chunks
Contents	Chunk boundaries and compressed sizes
Required For	Random access within compressed data

Structure:

Data.db is compressed in fixed-size chunks:

Uncompressed: [Chunk 1: 64KB][Chunk 2: 64KB][Chunk 3: 64KB]
                   ↓              ↓              ↓
Compressed:   [28KB]         [30KB]         [25KB]

CompressionInfo.db stores:
- Chunk 1 starts at offset 0
- Chunk 2 starts at offset 28672
- Chunk 3 starts at offset 59392
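
The offsets in this example follow directly from the compressed chunk sizes. The sketch below reproduces that arithmetic and shows how an uncompressed position maps to a chunk; it ignores any per-chunk checksum bytes a real file may contain, and the function names are invented for illustration.

```python
CHUNK_LENGTH = 64 * 1024  # uncompressed chunk size used in the example above


def chunk_offsets(compressed_sizes):
    """Return the starting offset of each compressed chunk in Data.db."""
    offsets, position = [], 0
    for size in compressed_sizes:
        offsets.append(position)
        position += size
    return offsets


def locate(uncompressed_position, offsets, chunk_length=CHUNK_LENGTH):
    """Map an uncompressed position to (chunk index, chunk offset, position in chunk)."""
    index = uncompressed_position // chunk_length
    return index, offsets[index], uncompressed_position % chunk_length


offsets = chunk_offsets([28 * 1024, 30 * 1024, 25 * 1024])
print(offsets)
print(locate(70_000, offsets))
```

Reading uncompressed position 70,000 therefore means seeking to the second chunk, decompressing it in full, and skipping 4,464 bytes into the result.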

Statistics (Statistics.db)¶

Metadata about the SSTable contents.

Attribute	Description
Purpose	Store SSTable metadata for query optimization
Contents	Min/max values, tombstone counts, timestamps
Used By	Query planner, compaction, repair

Contents include:

Statistic	Description
Partition count	Number of partitions in SSTable
Row count	Total rows across all partitions
Min/max timestamp	Timestamp range of data
Min/max clustering	Clustering key range
Min/max partition key	Partition key range (token)
Tombstone count	Number of tombstones
Droppable tombstone count	Tombstones eligible for removal
SSTable level	Compaction level (for LCS)
Compression ratio	Achieved compression ratio

Digest (Digest.crc32 / Digest.adler32 / Digest.sha1)¶

Checksum for data integrity verification.

Attribute	Description
Purpose	Detect data corruption
Contents	Checksum of Data.db contents
Verification	Checked by `nodetool verify` and the offline sstableverify tool; regular reads of compressed data rely on per-chunk checksums instead
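
A whole-file checksum of this kind can be recomputed offline and compared with the stored value. The sketch below assumes the Digest.crc32 file holds the CRC32 of Data.db as decimal text; treat that layout as an assumption and confirm it against your Cassandra version before relying on the check.

```python
import zlib


def crc32_of_file(path, block_size=1 << 20):
    """Compute the CRC32 of a file incrementally, one block at a time."""
    crc = 0
    with open(path, "rb") as handle:
        while block := handle.read(block_size):
            crc = zlib.crc32(block, crc)
    return crc & 0xFFFFFFFF


def digest_matches(data_path, digest_path):
    """Compare the recomputed CRC32 with the decimal value in the digest file."""
    with open(digest_path, "r", encoding="ascii") as handle:
        expected = int(handle.read().strip())
    return crc32_of_file(data_path) == expected
```

Run such a check only against a snapshot or a stopped node, so the files cannot be removed by compaction mid-read.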

Table of Contents (TOC.txt)¶

Lists all component files for the SSTable.

Attribute	Description
Purpose	Enumerate SSTable components
Contents	List of component file names
Format	Plain text, one file per line

Example contents (Big format):

TOC.txt
Data.db
Index.db
Summary.db
Filter.db
Statistics.db
CompressionInfo.db
Digest.crc32

Example contents (BTI format):

TOC.txt
Data.db
Partitions.db
Rows.db
Filter.db
Statistics.db
CompressionInfo.db
Digest.crc32
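
Since TOC.txt enumerates the components, it can be used to check that an SSTable is complete, for example after copying files from a backup. This is a hypothetical helper; it assumes each TOC line is a component name that is appended to the SSTable's `<version>-<generation>-<format>` prefix.

```python
def missing_components(toc_lines, directory_files, prefix):
    """Return component files named in TOC.txt that are absent from the directory."""
    expected = {f"{prefix}-{line.strip()}" for line in toc_lines if line.strip()}
    return sorted(expected - set(directory_files))


toc = ["TOC.txt", "Data.db", "Index.db", "Filter.db", "Statistics.db"]
present = ["na-1-big-TOC.txt", "na-1-big-Data.db", "na-1-big-Index.db", "na-1-big-Statistics.db"]
print(missing_components(toc, present, "na-1-big"))
```

An empty result means every component listed in the TOC was found.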

Component File Reference Summary¶

By SSTable Format¶

Component	Big Format	BTI Format	Purpose
Data	`-Data.db`	`-Data.db`	Row data
Partition Index	`-Index.db`	`-Partitions.db`	Key → offset mapping
Row Index	(in Index.db)	`-Rows.db`	Clustering key → offset
Bloom Filter	`-Filter.db`	`-Filter.db`	Partition existence check
Summary	`-Summary.db`	(none)	Sampled index (big format only)
Compression Info	`-CompressionInfo.db`	`-CompressionInfo.db`	Chunk offsets
Statistics	`-Statistics.db`	`-Statistics.db`	SSTable metadata
Digest	`-Digest.*`	`-Digest.*`	Data checksum
TOC	`-TOC.txt`	`-TOC.txt`	Component file list

Memory Location by Component¶

Component	Big Format	BTI Format
Data	Page cache	Page cache
Partition Index	Page cache (located via Summary)	Off-heap (mmap)
Row Index	Page cache (within Index.db)	Off-heap (mmap)
Bloom Filter	Off-heap	Off-heap
Summary	Off-heap	(none)
Compression Info	Off-heap	Off-heap

Compression¶

SSTable data is compressed in chunks for efficient random access.

Compression Configuration¶

-- View current compression
SELECT compression FROM system_schema.tables
WHERE keyspace_name = 'ks' AND table_name = 'table';

-- Configure compression
ALTER TABLE my_table WITH compression = {
    'class': 'LZ4Compressor',
    'chunk_length_in_kb': 64
};

-- Disable compression
ALTER TABLE my_table WITH compression = {'enabled': 'false'};

Compressor Comparison¶

Compressor	Speed	Ratio	CPU	Use Case
LZ4Compressor	Fastest	~2.5x	Lowest	Default, most workloads
SnappyCompressor	Fast	~2.5x	Low	Alternative to LZ4
ZstdCompressor	Medium	~3-4x	Medium	Better ratio (4.0+)
DeflateCompressor	Slow	~3-4x	High	Maximum compression
NoopCompressor	N/A	1x	None	Pre-compressed data (4.0+)

Chunk Size¶

Chunk Size	Read Pattern	Compression Ratio
16KB (default since 4.0)	Random reads	Lower
64KB (default before 4.0)	Mixed	Balanced
256KB	Sequential reads	Higher

-- Smaller chunks for random reads
ALTER TABLE random_access WITH compression = {
    'class': 'LZ4Compressor',
    'chunk_length_in_kb': 16
};

-- Larger chunks for sequential reads
ALTER TABLE sequential_access WITH compression = {
    'class': 'LZ4Compressor',
    'chunk_length_in_kb': 256
};
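
The trade-off behind these settings can be put into numbers: a read must decompress the whole chunk that contains the requested bytes, while smaller chunks mean more chunk offsets to keep in memory. The sketch below is back-of-the-envelope arithmetic; the 8 bytes per offset entry is an assumption made for the estimate, and `chunk_tradeoff` is an invented name.

```python
def chunk_tradeoff(data_bytes, chunk_kb, read_bytes, offset_entry_bytes=8):
    """Estimate chunk count, offset metadata size, and per-read amplification."""
    chunk_bytes = chunk_kb * 1024
    chunks = -(-data_bytes // chunk_bytes)  # ceiling division
    return {
        "chunks": chunks,
        # assumed cost of one offset entry per chunk
        "offset_metadata_bytes": chunks * offset_entry_bytes,
        # uncompressed bytes decompressed per byte requested (read within one chunk)
        "read_amplification": chunk_bytes / read_bytes,
    }


for chunk_kb in (16, 64, 256):
    print(chunk_kb, chunk_tradeoff(100 * 2**30, chunk_kb, read_bytes=1024))
```

Under these assumptions, 100 GiB of uncompressed data needs about 50 MiB of offsets at 16KB chunks and about 3 MiB at 256KB, while a 1 KB point read decompresses 16 KB versus 256 KB.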

SSTable Tools¶

sstablemetadata¶

Display SSTable metadata:

tools/bin/sstablemetadata /path/to/na-1-big-Data.db

Output includes:

Partition count
Row count
Timestamp range
Tombstone statistics
Compression ratio

sstableutil¶

List SSTable files:

tools/bin/sstableutil keyspace table

sstabledump¶

Dump SSTable contents as JSON:

tools/bin/sstabledump /path/to/na-1-big-Data.db

sstablescrub¶

Rebuild SSTable, removing corrupt data:

tools/bin/sstablescrub keyspace table

sstableexpiredblockers¶

Find SSTables blocking tombstone removal:

tools/bin/sstableexpiredblockers keyspace table

Monitoring SSTables¶

# SSTable count per table
nodetool tablestats keyspace.table | grep "SSTable count"

# Total disk usage
nodetool tablestats keyspace.table | grep "Space used"

# List SSTables
ls -la /var/lib/cassandra/data/keyspace/table-*/

# SSTable sizes
du -sh /var/lib/cassandra/data/keyspace/table-*/*.db

JMX Metrics¶

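org.apache.cassandra.metrics:type=Table,keyspace=<keyspace>,scope=<table>,name=LiveSSTableCount
org.apache.cassandra.metrics:type=Table,keyspace=<keyspace>,scope=<table>,name=SSTablesPerReadHistogram
org.apache.cassandra.metrics:type=Table,keyspace=<keyspace>,scope=<table>,name=TotalDiskSpaceUsed
org.apache.cassandra.metrics:type=Table,keyspace=<keyspace>,scope=<table>,name=CompressionRatio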

Storage Engine Overview - Architecture overview
Write Path - How SSTables are created
Read Path - How SSTables are read
Compaction - How SSTables are merged