SSTable Reference

SSTables (Sorted String Tables) are Cassandra's persistent storage files. All data ultimately resides in SSTables on disk—they are the database files. When a memtable flushes, it creates an SSTable. When compaction runs, it reads SSTables and writes new ones. When a node restarts, it reads SSTables to rebuild its state.

Each SSTable is immutable once written. This immutability simplifies concurrency (no locks needed for reads), enables efficient sequential writes, and allows safe snapshots via hard links. However, it also means that updates and deletes create new data rather than modifying existing files, requiring background compaction to reclaim space and merge versions.

An SSTable is not a single file but a set of component files: data, indexes, bloom filter, compression metadata, and statistics. Understanding these components is essential for troubleshooting, capacity planning, and performance analysis.


SSTable File Location

data_directory/keyspace_name/table_name-table_uuid/
├── na-1-big-Data.db
├── na-1-big-Index.db
├── na-1-big-Filter.db
├── na-1-big-Statistics.db
├── na-1-big-Summary.db
├── na-1-big-CompressionInfo.db
├── na-1-big-Digest.crc32
└── na-1-big-TOC.txt

File Naming Convention

<version>-<generation>-<format>-<component>.<extension>

Example: na-1-big-Data.db

na      - SSTable format version
1       - Generation number (increments with compaction)
big     - Format type
Data    - Component type
db      - File extension
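
A quick way to internalize the convention is to split a filename into its parts. The sketch below is a minimal illustration of the naming scheme described above, not a Cassandra API; the parse_sstable_filename helper and its output shape are hypothetical.

# Minimal sketch: split an SSTable component filename into the parts described
# above. Illustrative only; not part of any Cassandra tooling.
import re

SSTABLE_NAME = re.compile(
    r"^(?P<version>[a-z]{2})-"     # format version, e.g. na, nb, nc
    r"(?P<generation>[^-]+)-"      # generation number or ULID-style identifier
    r"(?P<format>big|bti)-"        # format type
    r"(?P<component>[^.]+)\."      # component, e.g. Data, Index, Filter
    r"(?P<extension>.+)$"          # extension, e.g. db, crc32, txt
)

def parse_sstable_filename(name):
    match = SSTABLE_NAME.match(name)
    return match.groupdict() if match else None

print(parse_sstable_filename("na-1-big-Data.db"))
# {'version': 'na', 'generation': '1', 'format': 'big',
#  'component': 'Data', 'extension': 'db'}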

Version Identifiers

Version Cassandra Version Notes
la 2.1 Legacy format
lb 2.1 Legacy format
ma 3.0 Introduced new storage format
mb 3.0 Storage format revision
mc 3.0 Storage format revision
md 3.11 Storage format revision
na 4.0 Storage format revision for 4.0
nb 4.0+ Format revision
nc 5.0 Latest format

SSTable Format Types

The second-to-last component in the filename (e.g., big or bti) indicates the SSTable format type. Cassandra 5.0 introduces the BTI format as an alternative to the legacy "big" format.

Format Name Introduced Description
big Big Table Format Original Legacy format with separate index and summary files
bti Big Trie Index 5.0 New format with block-based trie indexes

SSTable Formats: Big vs BTI

Cassandra 5.0 introduced the BTI (Big Trie Index) format (CASSANDRA-17056), a significant redesign of SSTable on-disk structure. The BTI format uses block-based trie indexes for both partition and row lookups, replacing the legacy index structures.

Format Comparison

Aspect Big Format BTI Format
Partition index Index.db + Summary.db Partitions.db (trie)
Row index Embedded in Index.db Rows.db (trie)
Memory usage Higher (index summary held in memory) Lower (off-heap, memory-mapped)
Index size Larger 50-80% smaller (prefix compression)
Lookup complexity O(log n) O(key length)
Write amplification Standard Slightly higher during flush

BTI Format File Components

The BTI format introduces new file extensions:

Component Big Format BTI Format Description
Partition Index -Index.db -Partitions.db Maps partition keys to data offsets
Row Index (in Index.db) -Rows.db Maps clustering keys within partitions
Summary -Summary.db (eliminated) Not needed with trie index

BTI SSTable file listing:

data_directory/keyspace_name/table_name-table_uuid/
├── nc-1-bti-Data.db           # Row data (same as big format)
├── nc-1-bti-Partitions.db     # Trie-based partition index (replaces Index.db)
├── nc-1-bti-Rows.db           # Trie-based row index (new)
├── nc-1-bti-Filter.db         # Bloom filter (same as big format)
├── nc-1-bti-Statistics.db     # SSTable metadata
├── nc-1-bti-CompressionInfo.db
├── nc-1-bti-Digest.crc32
└── nc-1-bti-TOC.txt

How BTI Trie Indexes Work

The BTI format uses byte-ordered trie data structures for both partition and row indexes. This approach provides several advantages over the legacy format:

1. Prefix Compression

Partition keys with common prefixes share storage in the trie structure:

Keys: user:1001, user:1002, user:1003, user:2001

Legacy Index.db:
  user:1001 → offset 1000
  user:1002 → offset 2000
  user:1003 → offset 3000
  user:2001 → offset 4000
  (each key stored in full)

BTI Partitions.db (trie):
  root → "user:" → "100" → "1" → offset 1000
                         → "2" → offset 2000
                         → "3" → offset 3000
               → "2001" → offset 4000
  (common prefixes stored once)

2. Block-Based Organization

The trie is organized into fixed-size blocks that can be:

  • Memory-mapped for efficient access
  • Loaded on-demand (not all in memory)
  • Cached at the OS page cache level

3. Efficient Range Queries

The trie structure naturally supports efficient iteration for range queries, as entries are stored in sorted order.
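
The sketch below models these three properties with a toy byte-ordered trie: shared prefixes are stored once, a lookup walks one node per key byte (O(key length) rather than O(log n)), and in-order traversal returns keys in sorted order, which is what range scans need. It is a conceptual model only and says nothing about BTI's actual block layout or on-disk encoding.

# Conceptual sketch of a byte-ordered trie: shared prefixes are stored once,
# a lookup visits one node per key byte, and in-order traversal returns keys
# in sorted order. This models the idea, not the BTI on-disk block format.

class TrieNode:
    def __init__(self):
        self.children = {}   # byte value -> TrieNode
        self.offset = None   # Data.db offset if a key terminates here

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, key: bytes, offset: int):
        node = self.root
        for b in key:
            node = node.children.setdefault(b, TrieNode())
        node.offset = offset

    def lookup(self, key: bytes):
        node = self.root
        for b in key:                      # O(key length), not O(log n)
            node = node.children.get(b)
            if node is None:
                return None
        return node.offset

    def items(self, node=None, prefix=b""):
        node = node or self.root
        if node.offset is not None:
            yield prefix, node.offset
        for b in sorted(node.children):    # sorted order -> range queries
            yield from self.items(node.children[b], prefix + bytes([b]))

trie = Trie()
for key, off in [(b"user:1001", 1000), (b"user:1002", 2000),
                 (b"user:1003", 3000), (b"user:2001", 4000)]:
    trie.insert(key, off)

print(trie.lookup(b"user:1002"))           # 2000
print(list(trie.items()))                  # keys come back in sorted order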

Configuration

SSTable format is configured cluster-wide in cassandra.yaml:

# cassandra.yaml

# Default SSTable format for new SSTables
# Options: big, bti
# Default: big (for compatibility), bti recommended for 5.0+
sstable_format: bti

Per-table SSTable format configuration is not yet available. See CASSANDRA-18534 for status.

Migration Considerations

Consideration Details
Compatibility Big and BTI SSTables can coexist in the same table
Conversion Existing SSTables remain in their format until rewritten
Upgrade path Set sstable_format: bti, then run nodetool upgradesstables -a
Downgrade BTI SSTables cannot be read by Cassandra < 5.0
Tools All SSTable tools (sstablemetadata, sstabledump, etc.) support BTI

Converting existing tables to BTI:

# 1. Update cassandra.yaml to use BTI format
# 2. Rewrite all SSTables to the new format
nodetool upgradesstables -a keyspace table

# -a flag rewrites all SSTables, even if already at current version
# Without -a, only SSTables from older Cassandra versions are rewritten

When to Use BTI Format

Recommended for:

  • New Cassandra 5.0+ clusters
  • Tables with many partitions (index size savings)
  • Tables with long partition keys (prefix compression benefits)
  • Memory-constrained environments (lower heap usage)

Consider staying with Big format if:

  • Running mixed-version clusters during upgrade
  • Need to maintain downgrade capability to < 5.0
  • Existing tooling depends on legacy file structure

SSTable Identifiers (Cassandra 4.1+)

Cassandra 4.1 introduced an alternative SSTable naming scheme using globally unique identifiers instead of sequential generation numbers. This feature is enabled by default in Cassandra 5.0.

Traditional (sequential):

na-1-big-Data.db
na-2-big-Data.db
na-3-big-Data.db

ULID-based (Cassandra 4.1+):

nb-1-big-Data.db                              (sequential)
nb-3fw2_0zer_0000wjnhm8y18d-big-Data.db       (ULID-based)

What is ULID?

ULID (Universally Unique Lexicographically Sortable Identifier) is a 128-bit identifier designed as an alternative to UUID that maintains chronological ordering when sorted as strings.

Why ULID instead of UUID:

Characteristic UUID v1/v4 ULID
Sortability Not sortable (random bits) Lexicographically sortable by time
Time component Present in UUID v1, but not in the leading bits First 48 bits are the timestamp
String encoding Hex with dashes (36 chars) Base32/Base36 (26-28 chars)
Natural ordering None Creation time order
Filesystem friendliness Contains dashes No special characters

Standard UUIDs (even time-based UUID v1) do not sort lexicographically by creation time because the timestamp bits are not positioned at the start. ULID places the timestamp in the most significant bits, ensuring that lexicographic string comparison produces chronological ordering.

Identifier Structure:

Cassandra's ULID implementation uses 28 characters in Base36 encoding (0-9a-z):

3fw2_0zer_0000wjnhm8y18d000
├──┘ ├──┘ ├───┘├──────────┘
│    │    │    │
│    │    │    └── Random part (13 chars) - unique per Cassandra process
│    │    └─────── Nano part (5 chars) - nanosecond precision
│    └──────────── Second part (4 chars) - seconds within day
└───────────────── Day part (4 chars) - days since epoch

Format regex: ([0-9a-z]{4})_([0-9a-z]{4})_([0-9a-z]{5})([0-9a-z]{13})
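
As a sketch of how the pieces fit together, the snippet below splits an identifier with the regex above and decodes the two time fields as base36 integers. The field interpretation follows the layout shown here; treating the parts as plain base36 numbers (and anything about the epoch or the scaling of the nano part) is an assumption, so this is illustrative rather than a faithful decoder.

# Sketch: split a ULID-style identifier using the regex above and decode the
# day/second fields as base36 integers. The interpretation of the parts is an
# assumption based on the layout described here, not Cassandra's decoder.
import re

ULID_ID = re.compile(r"([0-9a-z]{4})_([0-9a-z]{4})_([0-9a-z]{5})([0-9a-z]{13})")

def decode_sstable_id(identifier):
    match = ULID_ID.fullmatch(identifier)
    if not match:
        return None
    day_part, second_part, nano_part, random_part = match.groups()
    return {
        "days": int(day_part, 36),        # day part: days since epoch
        "seconds": int(second_part, 36),  # second part: seconds within the day
        "nano_part": nano_part,           # sub-second ordering component
        "random_part": random_part,       # unique per Cassandra process
    }

# Hypothetical, well-formed identifier (28 characters):
print(decode_sstable_id("3fw2_0zer_0000wjnhm8y18d0abc"))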

Benefits of ULID for SSTables:

  • Lexicographically sortable - SSTable files sort naturally by creation time in directory listings
  • Globally unique - No collisions across the entire cluster, even after truncate and restart
  • Self-describing - Creation time encoded directly in the identifier without metadata lookup
  • Monotonic - Within the same millisecond, identifiers increment to preserve ordering
  • Compact - 28 characters vs 36 for standard UUID string representation

Configuration:

# cassandra.yaml

# Enable ULID-based SSTable identifiers
# Default: false (4.1), true (5.0+)
# WARNING: Cannot be disabled once SSTables are created with ULIDs
uuid_sstable_identifiers_enabled: true

Note: The configuration parameter retains the name uuid_sstable_identifiers_enabled for historical reasons, though the implementation uses ULID.

Comparison:

Aspect Sequential ULID-based
Uniqueness Per-table only Cluster-wide
Streaming conflicts Possible None
Sorting Numeric order Lexicographic (time-ordered)
Creation time Requires metadata lookup Encoded in identifier
Downgrade Always supported Not supported once enabled

Problem: Generation Counter Reset After Truncate

Sequential generation numbers reset after truncating a table and restarting the node. This causes SSTable identifier collisions during backup restore operations.

Scenario: Backup and restore after truncate

Step 1: Table has data, take a snapshot backup
└── keyspace/table-abc123/
    ├── nb-1-big-Data.db
    ├── nb-2-big-Data.db
    └── nb-3-big-Data.db

    → nodetool snapshot keyspace table (backup saved)

Step 2: Truncate the table
    → TRUNCATE keyspace.table;
    → All SSTables removed, generation counter state cleared

Step 3: Restart the node
    → Generation counter resets to 1

Step 4: New data written to table
└── keyspace/table-abc123/
    ├── nb-1-big-Data.db    ← NEW data, same filename as backup!
    ├── nb-2-big-Data.db    ← NEW data, same filename as backup!
    └── nb-3-big-Data.db    ← NEW data, same filename as backup!

Step 5: Attempt to restore backup
    → CONFLICT: Backup files (nb-1, nb-2, nb-3) collide with current files
    → Cannot restore without overwriting current data or manually renaming

Remote backup storage corruption:

The problem is worse with remote backup destinations (S3, GCS, Azure Blob). Incremental backups upload SSTables by filename:

Remote storage (S3 bucket):
└── backups/cluster1/node1/keyspace/table/
    ├── nb-1-big-Data.db    ← From initial backup (important data)
    ├── nb-2-big-Data.db
    └── nb-3-big-Data.db

After truncate + restart + new writes:
└── New SSTable files: nb-1, nb-2, nb-3

Next incremental backup runs:
└── backups/cluster1/node1/keyspace/table/
    ├── nb-1-big-Data.db    ← OVERWRITTEN with new data!
    ├── nb-2-big-Data.db    ← OVERWRITTEN - original backup lost!
    └── nb-3-big-Data.db    ← OVERWRITTEN

The original backup data is permanently lost. Backup tools cannot distinguish between "same file updated" and "different file with same name."

Same scenario with ULID identifiers:

Step 1: Table has data, take a snapshot backup
└── keyspace/table-abc123/
    ├── nb-3fw2_0zer_0000wjnhm8y18d000-big-Data.db
    ├── nb-3fw2_0zer_0001xkpl9z28e111-big-Data.db
    └── nb-3fw2_0zer_0002ymqm0a39f222-big-Data.db

    → nodetool snapshot keyspace table (backup saved)

Step 2: Truncate and restart
    → TRUNCATE keyspace.table;
    → Restart node (ULID generator continues with new random component)

Step 3: New data written to table
└── keyspace/table-abc123/
    ├── nb-3fw3_1abc_0000wabc123def00-big-Data.db   ← Different identifier
    ├── nb-3fw3_1abc_0001xdef456ghi11-big-Data.db
    └── nb-3fw3_1abc_0002yghi789jkl22-big-Data.db

Step 4: Restore backup - no conflicts
└── keyspace/table-abc123/
    ├── nb-3fw2_0zer_0000wjnhm8y18d000-big-Data.db   ← Restored from backup
    ├── nb-3fw2_0zer_0001xkpl9z28e111-big-Data.db   ← Restored from backup
    ├── nb-3fw2_0zer_0002ymqm0a39f222-big-Data.db   ← Restored from backup
    ├── nb-3fw3_1abc_0000wabc123def00-big-Data.db   ← Current data preserved
    ├── nb-3fw3_1abc_0001xdef456ghi11-big-Data.db
    └── nb-3fw3_1abc_0002yghi789jkl22-big-Data.db

ULID identifiers incorporate a random component unique to each Cassandra process, so identifiers never repeat even after truncate and restart.

Scenarios where ULID identifiers prevent collisions:

Operation Sequential Problem ULID Solution
Restore after truncate Generation resets, filenames collide Unique identifiers always
Incremental backup to S3/GCS New files overwrite old backups Each backup file unique
Multiple backup restore Cannot merge backups from different times Safe to combine
Repair streaming Incoming SSTable may match local name No conflicts possible
Node rebuild Streamed files may collide Safe parallel streaming

Source: Apache Cassandra 4.1: New SSTable Identifiers


SSTable Component Files

Data File (Data.db)

Contains the actual row data for all partitions in the SSTable.

Attribute Description
Purpose Store partition and row data
Contents Serialized partitions with rows and cells
Compression Compressed in chunks (configurable)
Size Largest component, varies with data volume

Structure:

┌─────────────────────────────────────────────────────────┐
│ Partition 1                                             │
│ ├── Partition Key (serialized)                          │
│ ├── Partition Header (deletion info, flags)             │
│ ├── Row 1 (clustering key + cells)                      │
│ ├── Row 2 (clustering key + cells)                      │
│ └── ...                                                 │
├─────────────────────────────────────────────────────────┤
│ Partition 2                                             │
│ └── ...                                                 │
├─────────────────────────────────────────────────────────┤
│ ...                                                     │
└─────────────────────────────────────────────────────────┘

Partition Index

Maps partition keys to byte offsets in the Data file. The implementation differs between SSTable formats.

Big Format: Index.db

Attribute Description
File -Index.db
Purpose Map partition keys to data file offsets
Structure Sorted list of (partition key, offset) entries, with row index entries embedded for large partitions
Memory Index.db is read from disk (page cache); a sampled summary of it (Summary.db) is kept in memory

Lookup flow:

Partition Key → Summary.db (sampled, in memory) → Index.db (scan from the nearest sample) → Data.db offset
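
As a rough mental model of this two-step lookup (not Cassandra internals), assume the summary keeps every Nth index entry: binary-search the in-memory sample, then scan the matching slice of the full index for the exact key.

# Rough model of the big-format lookup: binary-search the sampled summary
# (in memory), then scan the matching slice of the full index (on disk).
# The data structures and names are illustrative, not Cassandra internals.
import bisect

INTERVAL = 2  # sample every Nth key (real clusters default to min_index_interval = 128)

# Full partition index: sorted (partition key, Data.db offset) pairs (Index.db).
index = [("apple", 0), ("banana", 4096), ("cherry", 9182),
         ("date", 13000), ("elderberry", 17500), ("fig", 21000)]

# Summary: every INTERVAL-th key plus its position in the full index (Summary.db).
summary_keys = [index[i][0] for i in range(0, len(index), INTERVAL)]
summary_positions = list(range(0, len(index), INTERVAL))

def lookup(partition_key):
    # 1. Binary search the summary for the nearest preceding sample.
    i = bisect.bisect_right(summary_keys, partition_key) - 1
    if i < 0:
        return None
    # 2. Scan the full index from that sample until the key is found or passed.
    for key, offset in index[summary_positions[i]:summary_positions[i] + INTERVAL]:
        if key == partition_key:
            return offset          # 3. Seek to this offset in Data.db.
        if key > partition_key:
            break
    return None

print(lookup("date"))   # 13000
print(lookup("grape"))  # None (not in this SSTable)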

BTI Format: Partitions.db (Cassandra 5.0+)

Attribute Description
File -Partitions.db
Purpose Block-based trie partition index
Structure Byte-ordered trie with fixed-size blocks
Memory Memory-mapped, fully off-heap
Benefits 50-80% smaller than Index.db, O(key length) lookups

BTI lookup flow:

Partition Key → Partitions.db (trie traversal) → Data.db offset

The BTI format's block-based organization allows efficient memory mapping and on-demand loading—only accessed blocks are read from disk.


Row Index

Maps clustering keys to positions within large partitions, enabling efficient lookups without scanning entire partitions.

Big Format: Embedded in Index.db

Attribute Description
Location Stored within Index.db
Purpose Locate rows within partitions exceeding column_index_size
Contents Clustering key boundaries at configurable intervals
Threshold Created when a partition exceeds column_index_size (formerly column_index_size_in_kb; default: 64KB)

BTI Format: Rows.db (Cassandra 5.0+)

Attribute Description
File -Rows.db
Purpose Separate trie-based row index
Structure Byte-ordered trie of clustering keys
Memory Memory-mapped, off-heap
Benefits Faster row lookups in wide partitions, separate from partition index

The BTI format separates row indexing into its own file, improving cache efficiency and allowing independent optimization of partition and row lookups.


Bloom Filter (Filter.db)

Probabilistic data structure for quick partition key lookups.

Attribute Description
Purpose Quickly eliminate SSTables from read path
Contents Bit array with hashed partition keys
False Positives Possible (configurable rate)
False Negatives Impossible
Memory Loaded into off-heap memory

Configuration:

ALTER TABLE my_table WITH bloom_filter_fp_chance = 0.01;
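
The sketch below is a toy bloom filter that shows why the table above holds: adding a key sets a few hashed bit positions, and a query only returns true if all of its positions are set, so a miss is definitive (no false negatives) while a hit may occasionally be spurious. It is not Cassandra's filter implementation or hash scheme.

# Minimal bloom filter model: adding a key sets k hashed bits; at query time,
# any unset bit means "definitely not in this SSTable" (no false negatives),
# while all-set bits mean "maybe" (false positives possible). This is a toy,
# not Cassandra's actual filter or hash functions.
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)

    def _positions(self, key: bytes):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: bytes):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: bytes) -> bool:
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter()
bf.add(b"user:1001")

print(bf.might_contain(b"user:1001"))   # True  (key is present)
print(bf.might_contain(b"user:9999"))   # Almost certainly False -> skip this SSTable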

Summary (Summary.db) - Big Format Only

Sampled index for efficient partition lookup. Present only with the Big format; the BTI format does not use a summary.

Attribute Description
File -Summary.db
Purpose In-memory sample of the partition index
Contents Every Nth partition key from Index.db
Memory Held in memory for the lifetime of the SSTable
Status Used by the Big format; replaced by the trie index in the BTI format (5.0+)

The summary file exists because Index.db contains a flat, sorted list of all partition keys. Scanning the entire index for each lookup would be too slow, so the summary provides jump points into the index.

The BTI format replaces this design with trie-based indexes that provide O(key length) lookups directly, eliminating the need for sampling.

Configuration (per-table schema options):

ALTER TABLE my_table
WITH min_index_interval = 128      -- minimum sampling interval
AND max_index_interval = 2048;     -- maximum sampling interval

Format Summary.db Present?
Big Yes
BTI (5.0+) No

Compression Info (CompressionInfo.db)

Metadata for compressed data chunks.

Attribute Description
Purpose Map uncompressed offsets to compressed chunks
Contents Chunk boundaries and compressed sizes
Required For Random access within compressed data

Structure:

Data.db is compressed in fixed-size chunks:

Uncompressed: [Chunk 1: 64KB][Chunk 2: 64KB][Chunk 3: 64KB]
                   ↓              ↓              ↓
Compressed:   [28KB]         [30KB]         [25KB]

CompressionInfo.db stores:
- Chunk 1 starts at offset 0
- Chunk 2 starts at offset 28672
- Chunk 3 starts at offset 59392
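
The arithmetic for random access follows directly from this layout: to read uncompressed byte X, pick chunk X // chunk_length, seek to that chunk's compressed start offset, and decompress only that chunk. A small sketch using the example offsets above (illustrative, not Cassandra code):

# Sketch of random access into chunked compression, using the example above:
# to read uncompressed byte X, pick chunk X // chunk_length, seek to that
# chunk's compressed offset, and decompress only that one chunk.
CHUNK_LENGTH = 64 * 1024                     # uncompressed chunk size (64KB)
chunk_offsets = [0, 28672, 59392]            # from CompressionInfo.db (example)

def locate(uncompressed_offset):
    chunk_index = uncompressed_offset // CHUNK_LENGTH
    return {
        "chunk_index": chunk_index,
        "compressed_offset": chunk_offsets[chunk_index],
        "offset_within_chunk": uncompressed_offset % CHUNK_LENGTH,
    }

# A read at uncompressed offset 100,000 lands in chunk 1, which starts at
# compressed offset 28672; only that chunk needs to be decompressed.
print(locate(100_000))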

Statistics (Statistics.db)

Metadata about the SSTable contents.

Attribute Description
Purpose Store SSTable metadata for query optimization
Contents Min/max values, tombstone counts, timestamps
Used By Query planner, compaction, repair

Contents include:

Statistic Description
Partition count Number of partitions in SSTable
Row count Total rows across all partitions
Min/max timestamp Timestamp range of data
Min/max clustering Clustering key range
Min/max partition key Partition key range (token)
Tombstone count Number of tombstones
Droppable tombstone count Tombstones eligible for removal
SSTable level Compaction level (for LCS)
Compression ratio Achieved compression ratio

Digest (Digest.crc32 / Digest.adler32 / Digest.sha1)

Checksum for data integrity verification.

Attribute Description
Purpose Detect data corruption
Contents Checksum of Data.db contents
Verification Checked during reads and streaming
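
A hedged sketch of an external integrity check follows. It assumes the digest file stores the checksum of the Data.db file as a plain decimal string; that detail varies by Cassandra version and is an assumption here, so verify against your own files before relying on it.

# Hedged sketch: compare a computed CRC32 of Data.db against the value stored
# in the Digest.crc32 component. This assumes the digest file contains the
# checksum of the Data.db file as a plain decimal string, which may differ
# between versions; check your Cassandra version before relying on it.
import zlib

def verify_digest(data_path, digest_path):
    crc = 0
    with open(data_path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            crc = zlib.crc32(block, crc)
    with open(digest_path) as f:
        expected = int(f.read().strip())
    return crc == expected

# Example (paths are illustrative):
# verify_digest("na-1-big-Data.db", "na-1-big-Digest.crc32")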

Table of Contents (TOC.txt)

Lists all component files for the SSTable.

Attribute Description
Purpose Enumerate SSTable components
Contents List of component file names
Format Plain text, one file per line

Example contents (Big format):

TOC.txt
Data.db
Index.db
Summary.db
Filter.db
Statistics.db
CompressionInfo.db
Digest.crc32

Example contents (BTI format):

TOC.txt
Data.db
Partitions.db
Rows.db
Filter.db
Statistics.db
CompressionInfo.db
Digest.crc32

Component File Reference Summary

By SSTable Format

Component Big Format BTI Format Purpose
Data -Data.db -Data.db Row data
Partition Index -Index.db -Partitions.db Key → offset mapping
Row Index (in Index.db) -Rows.db Clustering key → offset
Bloom Filter -Filter.db -Filter.db Partition existence check
Summary -Summary.db (not present) Sampled index of Index.db
Compression Info -CompressionInfo.db -CompressionInfo.db Chunk offsets
Statistics -Statistics.db -Statistics.db SSTable metadata
Digest -Digest.* -Digest.* Data checksum
TOC -TOC.txt -TOC.txt Component file list

Memory Location by Component

Component Big Format BTI Format
Data Page cache Page cache
Partition Index Summary in memory; Index.db via page cache Off-heap (mmap)
Row Index In Index.db, via page cache Off-heap (mmap)
Bloom Filter Off-heap Off-heap
Summary In memory (not present)
Compression Info Off-heap Off-heap

Compression

SSTable data is compressed in chunks for efficient random access.

Compression Configuration

-- View current compression
SELECT compression FROM system_schema.tables
WHERE keyspace_name = 'ks' AND table_name = 'table';

-- Configure compression
ALTER TABLE my_table WITH compression = {
    'class': 'LZ4Compressor',
    'chunk_length_in_kb': 64
};

-- Disable compression
ALTER TABLE my_table WITH compression = {'enabled': 'false'};

Compressor Comparison

Compressor Speed Ratio CPU Use Case
LZ4Compressor Fastest ~2.5x Lowest Default, most workloads
SnappyCompressor Fast ~2.5x Low Alternative to LZ4
ZstdCompressor Medium ~3-4x Medium Better ratio (4.0+)
DeflateCompressor Slow ~3-4x High Maximum compression
NoCompressor N/A 1x None Pre-compressed data

Chunk Size

Chunk Size Read Pattern Compression Ratio
16KB Random reads Lower
64KB (default) Mixed Balanced
256KB Sequential reads Higher

-- Smaller chunks for random reads
ALTER TABLE random_access WITH compression = {
    'class': 'LZ4Compressor',
    'chunk_length_in_kb': 16
};

-- Larger chunks for sequential reads
ALTER TABLE sequential_access WITH compression = {
    'class': 'LZ4Compressor',
    'chunk_length_in_kb': 256
};

SSTable Tools

sstablemetadata

Display SSTable metadata:

tools/bin/sstablemetadata /path/to/na-1-big-Data.db

Output includes:

  • Partition count
  • Row count
  • Timestamp range
  • Tombstone statistics
  • Compression ratio

sstableutil

List SSTable files:

tools/bin/sstableutil keyspace table

sstabledump

Dump SSTable contents as JSON:

tools/bin/sstabledump /path/to/na-1-big-Data.db

sstablescrub

Rebuild SSTable, removing corrupt data:

tools/bin/sstablescrub keyspace table

sstableexpiredblockers

Find SSTables blocking tombstone removal:

tools/bin/sstableexpiredblockers keyspace table

Monitoring SSTables

# SSTable count per table
nodetool tablestats keyspace.table | grep "SSTable count"

# Total disk usage
nodetool tablestats keyspace.table | grep "Space used"

# List SSTables
ls -la /var/lib/cassandra/data/keyspace/table-*/

# SSTable sizes
du -sh /var/lib/cassandra/data/keyspace/table-*/*.db

JMX Metrics

org.apache.cassandra.metrics:type=Table,name=LiveSSTableCount
org.apache.cassandra.metrics:type=Table,name=SSTablesPerReadHistogram
org.apache.cassandra.metrics:type=Table,name=TotalDiskSpaceUsed
org.apache.cassandra.metrics:type=Table,name=CompressionRatio