
Size-Tiered Compaction Strategy (STCS)

Cassandra 5.0+

Starting with Cassandra 5.0, the Unified Compaction Strategy (UCS) is the recommended compaction strategy for most workloads. UCS provides adaptive behavior that can emulate STCS characteristics when appropriate. STCS remains fully supported and is still the default compaction strategy in versions prior to 5.0.

STCS is Cassandra's original and default compaction strategy. It groups SSTables of similar size and compacts them together, optimizing for write throughput at the cost of read amplification.


Background and History

Origins

Size-Tiered Compaction Strategy is Cassandra's original compaction implementation, present since the earliest versions. It derives from classical LSM-tree (Log-Structured Merge-tree) compaction as described in the 1996 paper by O'Neil et al. The strategy was designed to optimize write throughput—a primary design goal for Cassandra's original use case as a high-volume write store.

STCS remained Cassandra's only compaction strategy until version 1.0 (October 2011), when Leveled Compaction Strategy was introduced to address STCS's read amplification problems.

Design Philosophy

STCS follows a simple principle: minimize write amplification by only compacting SSTables of similar size together. This approach ensures that:

  1. Small SSTables are compacted frequently (low cost per compaction)
  2. Large SSTables are compacted infrequently (high cost but rare)
  3. Each byte of data is rewritten approximately log(N) times over its lifetime

The trade-off is that partition keys are scattered across many SSTables, requiring reads to check multiple files.


How STCS Works in Theory

Core Concept

STCS organizes compaction around size similarity rather than key ranges or levels:

  1. Grouping: SSTables are grouped into "buckets" based on size
  2. Threshold: When a bucket contains min_threshold SSTables (default: 4), compaction is triggered
  3. Merging: All SSTables in the bucket are merged into a single larger SSTable
  4. Growth: The output SSTable joins a larger size bucket, and the process repeats


Size Buckets

SSTables are assigned to buckets based on size similarity:

\[\bar{s} \times b_{\text{low}} \leq s \leq \bar{s} \times b_{\text{high}}\]

Where:

  • \(s\) = SSTable size
  • \(\bar{s}\) = average size of SSTables in the bucket
  • \(b_{\text{low}}\) = bucket_low (default: 0.5)
  • \(b_{\text{high}}\) = bucket_high (default: 1.5)

Example: Bucket with \(\bar{s} = 10\text{MB}\):

\[5\text{MB} \leq s \leq 15\text{MB}\]

SSTables outside this range form separate buckets. The min_sstable_size parameter (default 50MB) groups all smaller SSTables together, preventing proliferation of tiny SSTable buckets.
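
As a quick illustration, here is a minimal sketch of the membership test in Python, using the defaults above (function and parameter names are illustrative, not Cassandra's API):

def fits_bucket(sstable_size: int, bucket_avg: float,
                bucket_low: float = 0.5, bucket_high: float = 1.5,
                min_sstable_size: int = 50 * 1024 * 1024) -> bool:
    """Return True if an SSTable of the given size belongs in a bucket
    whose current average size is bucket_avg."""
    # All "tiny" SSTables share one bucket regardless of the ratio test.
    if sstable_size < min_sstable_size and bucket_avg < min_sstable_size:
        return True
    return bucket_avg * bucket_low <= sstable_size <= bucket_avg * bucket_high

# Example (ratio test only): a 12 MB SSTable fits a bucket averaging 10 MB,
# since 5 MB <= 12 MB <= 15 MB.
MB = 1024 ** 2
print(fits_bucket(12 * MB, 10 * MB, min_sstable_size=0))  # True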

Compaction Trigger and Bucket Selection

Compaction occurs when:

  1. A bucket reaches min_threshold SSTables (default: 4)
  2. The bucket is selected by the compaction scheduler based on "hotness"
  3. Up to max_threshold SSTables (default: 32) are included

Hotness-Based Bucket Prioritization

When multiple buckets are eligible for compaction, STCS selects the "hottest" bucket using a read-intensity metric:

\[\text{hotness} = \frac{\text{readMeter.twoHourRate()}}{\text{estimatedKeys()}}\]

This formula measures recent read activity (two-hour rate) divided by the number of keys in the SSTable. The bucket with the highest aggregate hotness is selected first, ensuring that frequently-read data is compacted and consolidated sooner.

Tie-breaking: When buckets have equal hotness, STCS prioritizes buckets with smaller average file size to reduce compaction I/O.
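
A minimal sketch of this selection logic, assuming each SSTable exposes its size, estimated key count, and two-hour read rate (the SSTable class below is illustrative, not Cassandra's API):

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class SSTable:
    size_bytes: int
    estimated_keys: int
    two_hour_read_rate: Optional[float]  # None for SSTables with no read meter

    @property
    def hotness(self) -> float:
        # Recent reads per key; SSTables without a read meter count as cold.
        if self.two_hour_read_rate is None or self.estimated_keys == 0:
            return 0.0
        return self.two_hour_read_rate / self.estimated_keys

def select_bucket(buckets: List[List[SSTable]],
                  min_threshold: int = 4) -> Optional[List[SSTable]]:
    """Pick the eligible bucket with the highest aggregate hotness,
    breaking ties in favor of the smaller average SSTable size."""
    eligible = [b for b in buckets if len(b) >= min_threshold]
    if not eligible:
        return None
    return max(eligible,
               key=lambda b: (sum(t.hotness for t in b),
                              -sum(t.size_bytes for t in b) / len(b)))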

Tombstone Compaction Fallback

If no standard compaction candidates exist (no bucket meets min_threshold), STCS falls back to single-SSTable tombstone compaction:

  1. Scans SSTables for droppable tombstone ratio exceeding tombstone_threshold (default: 0.2)
  2. Selects the largest eligible SSTable
  3. Rewrites it to purge tombstones

This ensures tombstones are eventually reclaimed even when SSTables lack compaction partners.
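
A rough sketch of this fallback, assuming the droppable-tombstone ratio has already been estimated per SSTable (the SSTableInfo class is illustrative):

from dataclasses import dataclass
from typing import Iterable, Optional

@dataclass
class SSTableInfo:
    size_bytes: int
    droppable_tombstone_ratio: float  # estimated fraction of expired tombstones

def tombstone_compaction_candidate(sstables: Iterable[SSTableInfo],
                                   tombstone_threshold: float = 0.2) -> Optional[SSTableInfo]:
    """Pick the largest SSTable whose droppable-tombstone ratio exceeds
    the threshold, or None if no SSTable qualifies."""
    eligible = [t for t in sstables if t.droppable_tombstone_ratio > tombstone_threshold]
    return max(eligible, key=lambda t: t.size_bytes, default=None)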

Write Amplification Calculation

STCS achieves logarithmic write amplification:

Data progression with \(t = 4\) (min_threshold):

| Step | Input | Output |
|------|-------|--------|
| 1 | \(4 \times 1\text{MB}\) | 4 MB |
| 2 | \(4 \times 4\text{MB}\) | 16 MB |
| 3 | \(4 \times 16\text{MB}\) | 64 MB |
| 4 | \(4 \times 64\text{MB}\) | 256 MB |

Write amplification formula:

\[W \approx \log_t \left( \frac{N}{s_f} \right)\]

Where:

  • \(W\) = write amplification
  • \(t\) = min_threshold (default: 4)
  • \(N\) = total data size
  • \(s_f\) = flush size

Example: 1GB dataset with 1MB flushes:

\[W = \log_4(1024) \approx 5\times\]

This is significantly lower than LCS's \(10\times\) per level amplification.
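
The same estimate as a small helper (an approximation only; actual write amplification also depends on flush sizes and key overlap):

import math

def estimated_write_amplification(total_size: float, flush_size: float,
                                  min_threshold: int = 4) -> float:
    """Approximate STCS write amplification: roughly one rewrite per tier,
    with the number of tiers logarithmic in dataset size."""
    return math.log(total_size / flush_size, min_threshold)

# Example from the text: 1 GB of data flushed as 1 MB SSTables, t = 4.
print(estimated_write_amplification(1024, 1))  # 5.0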

Read Amplification Problem

The fundamental weakness of STCS is that any partition key may exist in any SSTable:

After extended operation, SSTables accumulate:
  [1MB] [4MB] [16MB] [64MB] [256MB] [1GB] [4GB]

Read for partition key K:
  1. Check bloom filter on each SSTable
  2. For positive results, read index and data
  3. Merge all fragments found

Worst case: Every SSTable contains data for K
  → 7+ disk reads for a single partition

Large SSTables compound this problem because they rarely accumulate enough similarly sized partners to compact, leaving the node with many SSTables for long periods.
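
To illustrate step 3, here is a highly simplified model of merging partition fragments from several SSTables (last-write-wins per column; real Cassandra reconciliation also handles tombstones, clustering, and TTLs):

def merge_partition(fragments):
    """Merge row fragments for one partition read from several SSTables:
    for each column, the value with the newest write timestamp wins."""
    merged = {}
    for fragment in fragments:                       # one dict per SSTable hit
        for column, (value, timestamp) in fragment.items():
            if column not in merged or timestamp > merged[column][1]:
                merged[column] = (value, timestamp)
    return {column: value for column, (value, _) in merged.items()}

# Two SSTables hold different versions of the same partition:
old = {"name": ("alice", 10), "city": ("paris", 10)}
new = {"city": ("tokyo", 20)}
print(merge_partition([old, new]))  # {'name': 'alice', 'city': 'tokyo'}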


Benefits

Low Write Amplification

STCS minimizes how often data is rewritten:

  • Logarithmic growth: ~5-10× total amplification for typical datasets
  • SSD-friendly: Less wear compared to LCS
  • Sustained write throughput: Compaction I/O remains bounded

Simple and Predictable

The bucketing algorithm is straightforward:

  • Easy to understand and debug
  • Predictable compaction sizes
  • No complex level management

Efficient for Sequential Writes

Append-only and time-series patterns benefit from STCS:

  • New data stays in small, recent SSTables
  • Old data migrates to large SSTables
  • Natural temporal locality

Handles Variable Write Rates

STCS adapts to changing workloads:

  • Burst writes: Small SSTables accumulate, compact later
  • Steady writes: Regular compaction cadence
  • Idle periods: Compaction catches up

Drawbacks

High Read Amplification

The primary cost of STCS is read performance:

  • Point queries may touch many SSTables
  • No upper bound on SSTable count
  • P99 latency degrades as data ages

Large SSTable Accumulation

Over time, large SSTables accumulate without compacting:

"The big SSTable problem":

To compact 4GB SSTables, need 4 of them = 16GB similar size
To compact 16GB SSTables, need 4 of them = 64GB similar size

These large SSTables may exist for months without partners,
degrading read performance throughout.

High Space Amplification During Compaction

STCS requires temporary space during compaction:

Compacting 4 × 1GB SSTables:
  Before: 4GB used
  During: 4GB old + 4GB new = 8GB peak
  After:  4GB used

Requires 50%+ free space headroom
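
A trivial helper for reasoning about this headroom (sizes in bytes; assumes the merged output is roughly the sum of the inputs, i.e. little overlap or expired data):

def compaction_peak_usage(input_sizes):
    """Peak disk usage while compacting: the inputs stay on disk until the
    merged output is fully written, so usage roughly doubles."""
    return 2 * sum(input_sizes)

GB = 1024 ** 3
print(compaction_peak_usage([1 * GB] * 4) / GB)  # 8.0 -> 8 GB peak for 4 x 1 GB inputs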

Unpredictable Read Latency

SSTable count varies widely:

  • After compaction: Few SSTables, fast reads
  • Before compaction: Many SSTables, slow reads
  • P99 latency fluctuates with compaction state

Tombstone Accumulation

Deleted data persists until the containing SSTable compacts:

  • Large SSTables hold tombstones for extended periods
  • Space is not reclaimed promptly
  • May cause "zombie data" issues if tombstones expire before compaction

When to Use STCS

Ideal Use Cases

| Workload Pattern | Why STCS Works |
|------------------|----------------|
| Write-heavy (>90% writes) | Low write amplification maximizes throughput |
| Append-only logs | Data rarely read, write cost dominates |
| Time-series ingestion | Natural size tiering as data ages |
| Batch ETL | Bulk writes followed by bulk reads |
| HDD storage | Sequential I/O patterns suit spinning disks |

Avoid STCS When

| Workload Pattern | Why STCS Is Wrong |
|------------------|-------------------|
| Read-heavy (<30% writes) | Read amplification dominates performance |
| Point query latency matters | Unpredictable SSTable count |
| Frequently updated rows | Multiple versions scatter across SSTables |
| Limited disk space | Requires 50%+ headroom for compaction |
| Consistent latency required | P99 varies with compaction state |

Bucketing Logic

SSTables are grouped into "buckets" based on size similarity:

Bucket boundaries determined by bucket_high and bucket_low:

Average size of bucket: X
- Include SSTables from X × bucket_low to X × bucket_high
- Default: 0.5X to 1.5X

Example with 10MB average:
- Include: 5MB to 15MB SSTables
- Exclude: 4MB (too small), 20MB (too large)

When a bucket reaches min_threshold SSTables, compact them.

Configuration

CREATE TABLE my_table (
    id uuid PRIMARY KEY,
    data text
) WITH compaction = {
    'class': 'SizeTieredCompactionStrategy',

    -- Minimum SSTables to trigger compaction
    -- Lower = more frequent compaction, fewer SSTables
    -- Higher = less compaction, more SSTables
    'min_threshold': 4,  -- Default: 4

    -- Maximum SSTables per compaction
    -- Limits peak I/O and memory usage
    'max_threshold': 32,  -- Default: 32

    -- Size ratio for grouping into buckets
    -- SSTables within bucket_low to bucket_high ratio are grouped
    'bucket_high': 1.5,  -- Default: 1.5
    'bucket_low': 0.5,   -- Default: 0.5

    -- Size threshold (in bytes) below which SSTables are all grouped
    -- into a single bucket regardless of the size ratio
    'min_sstable_size': 52428800  -- Default: 52428800 bytes (50 MiB)
};

Configuration Parameters

STCS-Specific Options

| Parameter | Default | Description |
|-----------|---------|-------------|
| min_threshold | 4 | Minimum SSTables in a bucket to trigger compaction. Lower values reduce SSTable count but increase compaction frequency. |
| max_threshold | 32 | Maximum SSTables to compact at once. Limits peak I/O and memory usage during compaction. |
| bucket_high | 1.5 | Upper bound multiplier for bucket membership. An SSTable joins a bucket if its size ≤ average × bucket_high. Must be > bucket_low. |
| bucket_low | 0.5 | Lower bound multiplier for bucket membership. An SSTable joins a bucket if its size ≥ average × bucket_low. |
| min_sstable_size | 50 MiB | SSTables below this size (specified in bytes) are grouped together regardless of the bucket_low/bucket_high ratio. Prevents proliferation of tiny-SSTable buckets. |

Common Compaction Options

These options apply to all compaction strategies:

| Parameter | Default | Description |
|-----------|---------|-------------|
| enabled | true | Enables background compaction. When false, automatic compaction is disabled but configuration is retained. |
| tombstone_threshold | 0.2 | Ratio of droppable tombstones that triggers single-SSTable compaction. Value of 0.2 means 20% tombstones. |
| tombstone_compaction_interval | 86400 | Minimum seconds between tombstone compaction attempts for the same SSTable. |
| unchecked_tombstone_compaction | false | Bypasses pre-checking for tombstone compaction eligibility. Tombstones are still only dropped when safe. |
| only_purge_repaired_tombstones | false | Only purge tombstones from repaired SSTables. Prevents data resurrection in clusters with inconsistent repair. |
| log_all | false | Enables detailed compaction logging to a separate log file. |

Validation Constraints

The following constraints are enforced during configuration:

  • bucket_high must be strictly greater than bucket_low
  • min_sstable_size must be non-negative
  • min_threshold must be ≥ 2
  • max_threshold must be ≥ min_threshold
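
A sketch of these checks in Python, mirroring the constraints listed above (this is not the actual Cassandra validation code):

def validate_stcs_options(min_threshold: int = 4, max_threshold: int = 32,
                          bucket_low: float = 0.5, bucket_high: float = 1.5,
                          min_sstable_size: int = 50 * 1024 * 1024) -> None:
    """Raise ValueError if the STCS options violate the documented constraints."""
    if bucket_high <= bucket_low:
        raise ValueError("bucket_high must be strictly greater than bucket_low")
    if min_sstable_size < 0:
        raise ValueError("min_sstable_size must be non-negative")
    if min_threshold < 2:
        raise ValueError("min_threshold must be >= 2")
    if max_threshold < min_threshold:
        raise ValueError("max_threshold must be >= min_threshold")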

Write Amplification Analysis

Tiered growth with \(t = 4\) (min_threshold):

| Tier | Input | Output |
|------|-------|--------|
| 0 | \(4 \times 1\text{MB}\) | 4 MB |
| 1 | \(4 \times 4\text{MB}\) | 16 MB |
| 2 | \(4 \times 16\text{MB}\) | 64 MB |
| 3 | \(4 \times 64\text{MB}\) | 256 MB |

For 1GB of data to reach final state:

| Step | Operation | Cumulative Writes |
|------|-----------|-------------------|
| 1 | Write to memtable | \(1\times\) |
| 2 | Flush to 1MB SSTables | \(1\times\) |
| 3 | Compact to 4MB | \(2\times\) |
| 4 | Compact to 16MB | \(3\times\) |
| 5 | Compact to 64MB | \(4\times\) |
| 6 | Compact to 256MB | \(5\times\) |
| 7 | Compact to 1GB | \(6\times\) |

Total: \(W \approx 6\times\) (logarithmic in data size)

This logarithmic write amplification is significantly lower than LCS, making STCS suitable for write-heavy workloads.


Read Amplification Problem

The primary weakness of STCS is read amplification from large SSTable accumulation:

The "big SSTable problem":

After extended operation, SSTables of various sizes accumulate:

\[\{1\text{MB}, 4\text{MB}, 16\text{MB}, 64\text{MB}, 256\text{MB}, 1\text{GB}, 4\text{GB}, 16\text{GB}\}\]

Every read must check ALL these SSTables.

Why large SSTables do not compact:

  • \(4 \times 16\text{GB}\) SSTables needed to trigger compaction
  • That requires 64GB of SSTables at similar size
  • Until then, they remain, degrading read performance

Read Path Impact

Single partition read with 8 SSTables:

| Operation | Calculation | Time |
|-----------|-------------|------|
| Bloom filter checks | \(8 \times 0.1\text{ms}\) | 0.8 ms |
| Index lookups (4 positive) | \(4 \times 0.5\text{ms}\) | 2.0 ms |
| Data reads | \(4 \times 1\text{ms}\) | 4.0 ms |
| Merge results | — | — |
| Total | | ~7 ms |

By contrast, LCS bounds a read to roughly \(\sim 9\) SSTables in the worst case, giving more predictable latency.
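
The arithmetic above as a toy model (the per-operation timings are the illustrative figures from the table, not measurements):

def point_read_cost_ms(sstable_count: int, positive_hits: int,
                       bloom_check_ms: float = 0.1,
                       index_seek_ms: float = 0.5,
                       data_read_ms: float = 1.0) -> float:
    """Every SSTable pays a bloom-filter check; SSTables that match (truly
    or falsely) also pay an index seek and a data read."""
    return sstable_count * bloom_check_ms + positive_hits * (index_seek_ms + data_read_ms)

print(point_read_cost_ms(8, 4))  # 0.8 + 6.0 = 6.8 -> ~7 ms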


Production Issues

Issue 1: Large SSTable Accumulation

Symptoms:

  • Read latency increasing over months
  • Many large SSTables visible in tablestats
  • Bloom filter false positive rate increasing

Diagnosis:

# Check SSTable sizes and counts
nodetool tablestats keyspace.table

# Look for large SSTables with no compaction partners
ls -lhS /var/lib/cassandra/data/keyspace/table-*/

Solutions:

  1. Run major compaction during maintenance window:

    nodetool compact keyspace table
    
  2. Lower min_threshold to trigger compaction sooner:

    ALTER TABLE keyspace.table WITH compaction = {
        'class': 'SizeTieredCompactionStrategy',
        'min_threshold': 2
    };
    
  3. Consider switching to LCS if reads are important

Issue 2: Temporary Space During Compaction

STCS can require up to 2x disk space temporarily:

Before: [1GB] [1GB] [1GB] [1GB]  = 4GB
During: [1GB] [1GB] [1GB] [1GB] + [4GB being written] = 8GB
After:  [4GB] = 4GB

Guideline: Maintain at least 50% free disk space with STCS.

Mitigation:

# Check available space before major operations
df -h /var/lib/cassandra/data

# Monitor during compaction
watch 'df -h /var/lib/cassandra/data && nodetool compactionstats'

Issue 3: Uneven SSTable Sizes

Symptoms:

  • Some tables have many small SSTables that never compact
  • Others have few large SSTables

Causes:

  • Varying write patterns across tables
  • min_sstable_size preventing small SSTable compaction
  • Bucket boundaries excluding certain sizes

Solutions:

-- Adjust bucket boundaries for more inclusive grouping
ALTER TABLE keyspace.table WITH compaction = {
    'class': 'SizeTieredCompactionStrategy',
    'bucket_high': 2.0,
    'bucket_low': 0.33
};

-- Lower the small-SSTable grouping threshold (value in bytes)
ALTER TABLE keyspace.table WITH compaction = {
    'class': 'SizeTieredCompactionStrategy',
    'min_sstable_size': 10485760  -- 10 MiB
};

Tuning Recommendations

High Write Throughput

ALTER TABLE keyspace.table WITH compaction = {
    'class': 'SizeTieredCompactionStrategy',
    'min_threshold': 4,
    'max_threshold': 64  -- Allow larger compactions
};

Reduce SSTable Count

ALTER TABLE keyspace.table WITH compaction = {
    'class': 'SizeTieredCompactionStrategy',
    'min_threshold': 2,  -- Compact sooner
    'bucket_high': 2.0,  -- Wider buckets
    'bucket_low': 0.33
};

Large Dataset (Minimize Compaction I/O)

ALTER TABLE keyspace.table WITH compaction = {
    'class': 'SizeTieredCompactionStrategy',
    'min_threshold': 8,   -- Fewer, larger compactions
    'max_threshold': 16
};

Monitoring STCS

Key Indicators

| Metric | Healthy | Investigate |
|--------|---------|-------------|
| SSTable count | <20 | >30 |
| Largest SSTable | <10% of total | >25% of total |
| Pending compactions | <10 | >50 |
| Read latency P99 | Stable | Increasing over time |

Commands

# SSTable count and sizes
nodetool tablestats keyspace.table | grep -E "SSTable count|Space used"

# Check for compaction activity
nodetool compactionstats

# Analyze SSTable size distribution
ls -lh /var/lib/cassandra/data/keyspace/table-*/*-Data.db | \
    awk '{print $5}' | sort | uniq -c

Implementation Internals

This section documents implementation details from the Cassandra source code.

Bucket Formation Algorithm

The bucket formation algorithm processes SSTables in a specific order:

  1. Sort SSTables by on-disk size in ascending order (for deterministic results)
  2. For each SSTable, attempt to match it to an existing bucket. It matches if:
     • bucket_avg × bucket_low ≤ sstable_size ≤ bucket_avg × bucket_high, or
     • both the SSTable size and the bucket average are below min_sstable_size
  3. Recalculate the bucket average when adding an SSTable
  4. Create a new bucket for each unmatched SSTable
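
A minimal Python sketch of this grouping, operating on raw sizes (illustrative only; the real implementation works on SSTable readers and their hotness values):

from typing import List

def make_buckets(sizes: List[int],
                 bucket_low: float = 0.5, bucket_high: float = 1.5,
                 min_sstable_size: int = 50 * 1024 * 1024) -> List[List[int]]:
    """Group SSTable sizes into STCS-style buckets."""
    buckets: List[List[int]] = []   # each bucket is a list of SSTable sizes
    averages: List[float] = []      # running average size per bucket
    for size in sorted(sizes):      # ascending order for deterministic grouping
        for i, avg in enumerate(averages):
            in_ratio = avg * bucket_low <= size <= avg * bucket_high
            both_small = size < min_sstable_size and avg < min_sstable_size
            if in_ratio or both_small:
                buckets[i].append(size)
                averages[i] = sum(buckets[i]) / len(buckets[i])
                break
        else:
            buckets.append([size])
            averages.append(float(size))
    return buckets

MB = 1024 ** 2
print(make_buckets([5 * MB, 6 * MB, 100 * MB, 110 * MB, 400 * MB]))
# [[5 MB, 6 MB], [100 MB, 110 MB], [400 MB]] (values printed in bytes)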

Task Estimation

The estimated number of pending compaction tasks is calculated as:

For each bucket meeting min_threshold:
    tasks += ceil(bucket.size() / max_threshold)
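
The same estimate as a runnable snippet (bucket_sizes is simply the number of SSTables in each bucket):

import math

def estimated_pending_tasks(bucket_sizes, min_threshold: int = 4,
                            max_threshold: int = 32) -> int:
    """Each eligible bucket contributes one task per max_threshold SSTables."""
    return sum(math.ceil(n / max_threshold)
               for n in bucket_sizes if n >= min_threshold)

# Buckets with 3, 6, and 40 SSTables -> 0 + 1 + 2 = 3 pending tasks.
print(estimated_pending_tasks([3, 6, 40]))  # 3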

Hotness Calculation Details

SSTable hotness is computed using read meter statistics:

hotness = (readMeter != null)
    ? readMeter.twoHourRate() / estimatedKeys()
    : 0.0

SSTables without read meters (e.g., newly flushed) have zero hotness, causing them to be deprioritized unless they form the only eligible bucket.

Constants Reference

| Constant | Value | Description |
|----------|-------|-------------|
| Default min_threshold | 4 | Minimum SSTables per bucket |
| Default max_threshold | 32 | Maximum SSTables per compaction |
| Default bucket_low | 0.5 | Lower size ratio bound |
| Default bucket_high | 1.5 | Upper size ratio bound |
| Default min_sstable_size | 50 MiB | Small SSTable grouping threshold |