Size-Tiered Compaction Strategy (STCS)¶
Cassandra 5.0+
Starting with Cassandra 5.0, Unified Compaction Strategy (UCS) is the recommended compaction strategy for most workloads. UCS provides adaptive behavior that can emulate STCS characteristics when appropriate. STCS remains fully supported and is still the default for tables in earlier versions.
STCS is Cassandra's original and default compaction strategy. It groups SSTables of similar size and compacts them together, optimizing for write throughput at the cost of read amplification.
Background and History¶
Origins¶
Size-Tiered Compaction Strategy is Cassandra's original compaction implementation, present since the earliest versions. It derives from classical LSM-tree (Log-Structured Merge-tree) compaction as described in the 1996 paper by O'Neil et al. The strategy was designed to optimize write throughput—a primary design goal for Cassandra's original use case as a high-volume write store.
STCS remained Cassandra's only compaction strategy until version 1.0 (October 2011), when Leveled Compaction Strategy was introduced to address STCS's read amplification problems.
Design Philosophy¶
STCS follows a simple principle: minimize write amplification by only compacting SSTables of similar size together. This approach ensures that:
- Small SSTables are compacted frequently (low cost per compaction)
- Large SSTables are compacted infrequently (high cost but rare)
- Each byte of data is rewritten approximately log(N) times over its lifetime
The trade-off is that partition keys are scattered across many SSTables, requiring reads to check multiple files.
How STCS Works in Theory¶
Core Concept¶
STCS organizes compaction around size similarity rather than key ranges or levels:
- Grouping: SSTables are grouped into "buckets" based on size
- Threshold: When a bucket contains min_threshold SSTables (default: 4), compaction is triggered
- Merging: All SSTables in the bucket are merged into a single larger SSTable
- Growth: The output SSTable joins a larger size bucket, and the process repeats
Size Buckets¶
SSTables are assigned to buckets based on size similarity. An SSTable of size \(s\) joins a bucket when:

\[
\bar{s} \cdot b_{\text{low}} \;\le\; s \;\le\; \bar{s} \cdot b_{\text{high}}
\]

Where:
- \(s\) = SSTable size
- \(\bar{s}\) = average size of SSTables in the bucket
- \(b_{\text{low}}\) = bucket_low (default: 0.5)
- \(b_{\text{high}}\) = bucket_high (default: 1.5)
Example: For a bucket with \(\bar{s} = 10\text{MB}\), SSTables from 5 MB (\(10 \times 0.5\)) up to 15 MB (\(10 \times 1.5\)) join the bucket; a 4 MB or a 20 MB SSTable does not.
SSTables outside this range form separate buckets. The min_sstable_size parameter (default 50MB) groups all smaller SSTables together, preventing proliferation of tiny SSTable buckets.
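To make the grouping rule concrete, here is a minimal Java sketch of the membership check; the class and method names (BucketMembership, fits) are hypothetical, and this is an illustration of the rule above rather than Cassandra's actual code:

/** Sketch of the STCS bucket-membership rule (illustrative, not Cassandra's code). */
final class BucketMembership {
    final double bucketLow;        // default 0.5
    final double bucketHigh;       // default 1.5
    final long minSSTableSize;     // default 50 MiB, in bytes

    BucketMembership(double bucketLow, double bucketHigh, long minSSTableSize) {
        this.bucketLow = bucketLow;
        this.bucketHigh = bucketHigh;
        this.minSSTableSize = minSSTableSize;
    }

    /** True if an SSTable of the given size belongs to a bucket with the given average size. */
    boolean fits(long sstableSize, double bucketAverage) {
        // Within the size band around the bucket average ...
        boolean withinBand = sstableSize >= bucketAverage * bucketLow
                          && sstableSize <= bucketAverage * bucketHigh;
        // ... or both are "tiny", so they are grouped together regardless of ratio.
        boolean bothTiny = sstableSize < minSSTableSize && bucketAverage < minSSTableSize;
        return withinBand || bothTiny;
    }
}

With the defaults, a bucket averaging 200 MiB accepts SSTables from 100 MiB to 300 MiB, while SSTables under min_sstable_size all fall into one shared bucket regardless of ratio.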
Compaction Trigger and Bucket Selection¶
Compaction occurs when:
- A bucket reaches min_threshold SSTables (default: 4)
- The bucket is selected by the compaction scheduler based on "hotness"
- Up to max_threshold SSTables (default: 32) are included
Hotness-Based Bucket Prioritization¶
When multiple buckets are eligible for compaction, STCS selects the "hottest" bucket using a read-intensity metric:

\[
\text{hotness} = \frac{\text{reads/sec (two-hour rate)}}{\text{estimated keys in the SSTable}}
\]

This formula measures recent read activity (two-hour rate) divided by the number of keys in the SSTable. The bucket with the highest aggregate hotness is selected first, ensuring that frequently-read data is compacted and consolidated sooner.
Tie-breaking: When buckets have equal hotness, STCS prioritizes buckets with smaller average file size to reduce compaction I/O.
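A hedged Java sketch of the selection step follows; SSTableInfo, BucketSelection, and the list-of-lists representation are assumptions for illustration (the real logic lives in SizeTieredCompactionStrategy). Eligible buckets are ranked by aggregate hotness, ties go to the smaller average size, and the winner is trimmed to max_threshold:

import java.util.Comparator;
import java.util.List;

/** Illustrative model of STCS bucket prioritization (not the actual Cassandra classes). */
record SSTableInfo(long sizeBytes, double hotness) {}   // hotness = 2h read rate / estimated keys

final class BucketSelection {
    /** Pick the hottest bucket with at least minThreshold SSTables, trimmed to maxThreshold. */
    static List<SSTableInfo> pick(List<List<SSTableInfo>> buckets, int minThreshold, int maxThreshold) {
        return buckets.stream()
            .filter(b -> b.size() >= minThreshold)
            // Highest aggregate hotness first; ties broken by smaller average file size.
            .sorted(Comparator
                .comparingDouble((List<SSTableInfo> b) -> totalHotness(b)).reversed()
                .thenComparingDouble(BucketSelection::averageSize))
            .findFirst()
            // Compact at most maxThreshold SSTables from the chosen bucket.
            .map(b -> b.subList(0, Math.min(b.size(), maxThreshold)))
            .orElse(List.of());
    }

    static double totalHotness(List<SSTableInfo> bucket) {
        return bucket.stream().mapToDouble(SSTableInfo::hotness).sum();
    }

    static double averageSize(List<SSTableInfo> bucket) {
        return bucket.stream().mapToLong(SSTableInfo::sizeBytes).average().orElse(0);
    }
}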
Tombstone Compaction Fallback¶
If no standard compaction candidates exist (no bucket meets min_threshold), STCS falls back to single-SSTable tombstone compaction:
- Scans SSTables for a droppable tombstone ratio exceeding tombstone_threshold (default: 0.2)
- Selects the largest eligible SSTable
- Rewrites it to purge tombstones
This ensures tombstones are eventually reclaimed even when SSTables lack compaction partners.
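A minimal sketch of that fallback in Java, assuming a hypothetical droppableTombstoneRatio accessor on each SSTable descriptor (illustrative only, not the actual Cassandra code):

import java.util.Comparator;
import java.util.List;
import java.util.Optional;

/** Illustrative tombstone-compaction fallback selection. */
final class TombstoneFallback {
    record SSTable(String name, long sizeBytes, double droppableTombstoneRatio) {}

    /** Choose a single SSTable to rewrite when no size-tiered bucket is eligible. */
    static Optional<SSTable> pick(List<SSTable> sstables, double tombstoneThreshold) {
        return sstables.stream()
            // Only SSTables whose droppable tombstone ratio exceeds the threshold (default 0.2).
            .filter(s -> s.droppableTombstoneRatio() > tombstoneThreshold)
            // Prefer the largest eligible SSTable.
            .max(Comparator.comparingLong(SSTable::sizeBytes));
    }
}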
Write Amplification Calculation¶
STCS achieves logarithmic write amplification:
Data progression with \(t = 4\) (min_threshold):
| Step | Input | Output |
|---|---|---|
| 1 | \(4 \times 1\text{MB}\) | 4 MB |
| 2 | \(4 \times 4\text{MB}\) | 16 MB |
| 3 | \(4 \times 16\text{MB}\) | 64 MB |
| 4 | \(4 \times 64\text{MB}\) | 256 MB |
Write amplification formula:

\[
W \approx \log_t\!\left(\frac{N}{s_f}\right)
\]

Where:
- \(W\) = write amplification (compaction rewrites per byte)
- \(t\) = min_threshold (default: 4)
- \(N\) = total data size
- \(s_f\) = flush size

Example: 1GB dataset with 1MB flushes: \(W \approx \log_4(1024) = 5\) compaction rewrites, or about \(6\times\) in total including the initial flush.
This is significantly lower than LCS's \(10\times\) per level amplification.
Read Amplification Problem¶
The fundamental weakness of STCS is that any partition key may exist in any SSTable:
After extended operation, SSTables accumulate:
[1MB] [4MB] [16MB] [64MB] [256MB] [1GB] [4GB]
Read for partition key K:
1. Check bloom filter on each SSTable
2. For positive results, read index and data
3. Merge all fragments found
Worst case: Every SSTable contains data for K
→ 7+ disk reads for a single partition
Large SSTables compound this problem because they rarely find enough similarly sized partners to compact with, leading to long periods with many SSTables on disk.
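To see where the extra latency comes from, here is an illustrative Java sketch of a point read fanning out over all SSTables; SSTableReader, mightContain, and Row are placeholder names, and the merge is simplified to newest-timestamp-wins:

import java.util.ArrayList;
import java.util.List;

/** Illustrative STCS point-read path; types are placeholders, not Cassandra's internals. */
final class PointRead {
    interface SSTableReader {
        boolean mightContain(byte[] partitionKey);   // bloom filter check (may false-positive)
        Row read(byte[] partitionKey);               // index lookup + data read, null if absent
    }
    record Row(long timestamp, byte[] value) {}

    /** Reads and merges fragments of one partition from every SSTable that might contain it. */
    static Row read(byte[] key, List<SSTableReader> sstables) {
        List<Row> fragments = new ArrayList<>();
        for (SSTableReader sstable : sstables) {          // with STCS, this list is unbounded
            if (sstable.mightContain(key)) {              // bloom filter: cheap but per-SSTable
                Row fragment = sstable.read(key);         // disk I/O for each positive result
                if (fragment != null)
                    fragments.add(fragment);
            }
        }
        // Merge by timestamp: the newest fragment wins (real Cassandra merges at cell level).
        return fragments.stream()
                .max((a, b) -> Long.compare(a.timestamp(), b.timestamp()))
                .orElse(null);
    }
}

Each additional SSTable adds a bloom-filter check, and each positive result adds index and data I/O, which is why read latency tracks SSTable count.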
Benefits¶
Low Write Amplification¶
STCS minimizes how often data is rewritten:
- Logarithmic growth: ~5-10× total amplification for typical datasets
- SSD-friendly: Less wear compared to LCS
- Sustained write throughput: Compaction I/O remains bounded
Simple and Predictable¶
The bucketing algorithm is straightforward:
- Easy to understand and debug
- Predictable compaction sizes
- No complex level management
Efficient for Sequential Writes¶
Append-only and time-series patterns benefit from STCS:
- New data stays in small, recent SSTables
- Old data migrates to large SSTables
- Natural temporal locality
Handles Variable Write Rates¶
STCS adapts to changing workloads:
- Burst writes: Small SSTables accumulate, compact later
- Steady writes: Regular compaction cadence
- Idle periods: Compaction catches up
Drawbacks¶
High Read Amplification¶
The primary cost of STCS is read performance:
- Point queries may touch many SSTables
- No upper bound on SSTable count
- P99 latency degrades as data ages
Large SSTable Accumulation¶
Over time, large SSTables accumulate without compacting:
"The big SSTable problem":
To compact 4GB SSTables, need 4 of them = 16GB similar size
To compact 16GB SSTables, need 4 of them = 64GB similar size
These large SSTables may exist for months without partners,
degrading read performance throughout.
High Space Amplification During Compaction¶
STCS requires temporary space during compaction:
Compacting 4 × 1GB SSTables:
Before: 4GB used
During: 4GB old + 4GB new = 8GB peak
After: 4GB used
Requires 50%+ free space headroom
Unpredictable Read Latency¶
SSTable count varies widely:
- After compaction: Few SSTables, fast reads
- Before compaction: Many SSTables, slow reads
- P99 latency fluctuates with compaction state
Tombstone Accumulation¶
Deleted data persists until the containing SSTable compacts:
- Large SSTables hold tombstones for extended periods
- Space is not reclaimed promptly
- May cause "zombie data" issues if tombstones expire before compaction
When to Use STCS¶
Ideal Use Cases¶
| Workload Pattern | Why STCS Works |
|---|---|
| Write-heavy (>90% writes) | Low write amplification maximizes throughput |
| Append-only logs | Data rarely read, write cost dominates |
| Time-series ingestion | Natural size tiering as data ages |
| Batch ETL | Bulk writes followed by bulk reads |
| HDD storage | Sequential I/O patterns suit spinning disks |
Avoid STCS When¶
| Workload Pattern | Why STCS Is Wrong |
|---|---|
| Read-heavy (<30% writes) | Read amplification dominates performance |
| Point query latency matters | Unpredictable SSTable count |
| Frequently updated rows | Multiple versions scatter across SSTables |
| Limited disk space | Requires 50%+ headroom for compaction |
| Consistent latency required | P99 varies with compaction state |
Bucketing Logic¶
SSTables are grouped into "buckets" based on size similarity:
Bucket boundaries determined by bucket_high and bucket_low:
Average size of bucket: X
- Include SSTables from X × bucket_low to X × bucket_high
- Default: 0.5X to 1.5X
Example with 10MB average:
- Include: 5MB to 15MB SSTables
- Exclude: 4MB (too small), 20MB (too large)
When a bucket reaches min_threshold SSTables, compact them.
Configuration¶
CREATE TABLE my_table (
id uuid PRIMARY KEY,
data text
) WITH compaction = {
'class': 'SizeTieredCompactionStrategy',
-- Minimum SSTables to trigger compaction
-- Lower = more frequent compaction, fewer SSTables
-- Higher = less compaction, more SSTables
'min_threshold': 4, -- Default: 4
-- Maximum SSTables per compaction
-- Limits peak I/O and memory usage
'max_threshold': 32, -- Default: 32
-- Size ratio for grouping into buckets
-- SSTables within bucket_low to bucket_high ratio are grouped
'bucket_high': 1.5, -- Default: 1.5
'bucket_low': 0.5, -- Default: 0.5
-- Minimum SSTable size below which SSTables are bucketed together
-- regardless of size ratio (option value is specified in bytes)
'min_sstable_size': 52428800 -- Default: 52428800 bytes (50 MiB)
};
Configuration Parameters¶
STCS-Specific Options¶
| Parameter | Default | Description |
|---|---|---|
| min_threshold | 4 | Minimum SSTables in a bucket to trigger compaction. Lower values reduce SSTable count but increase compaction frequency. |
| max_threshold | 32 | Maximum SSTables to compact at once. Limits peak I/O and memory usage during compaction. |
| bucket_high | 1.5 | Upper bound multiplier for bucket membership. An SSTable joins a bucket if its size ≤ average × bucket_high. Must be > bucket_low. |
| bucket_low | 0.5 | Lower bound multiplier for bucket membership. An SSTable joins a bucket if its size ≥ average × bucket_low. |
| min_sstable_size | 50 MB | SSTables below this size are grouped together regardless of the bucket_low/bucket_high ratio. Prevents proliferation of tiny-SSTable buckets. |
Common Compaction Options¶
These options apply to all compaction strategies:
| Parameter | Default | Description |
|---|---|---|
| enabled | true | Enables background compaction. When false, automatic compaction is disabled but configuration is retained. |
| tombstone_threshold | 0.2 | Ratio of droppable tombstones that triggers single-SSTable compaction. Value of 0.2 means 20% tombstones. |
| tombstone_compaction_interval | 86400 | Minimum seconds between tombstone compaction attempts for the same SSTable. |
| unchecked_tombstone_compaction | false | Bypasses pre-checking for tombstone compaction eligibility. Tombstones are still only dropped when safe. |
| only_purge_repaired_tombstones | false | Only purge tombstones from repaired SSTables. Prevents data resurrection in clusters with inconsistent repair. |
| log_all | false | Enables detailed compaction logging to a separate log file. |
Validation Constraints¶
The following constraints are enforced during configuration:
- bucket_high must be strictly greater than bucket_low
- min_sstable_size must be non-negative
- min_threshold must be ≥ 2
- max_threshold must be ≥ min_threshold
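A minimal Java sketch of these checks, assuming a hypothetical StcsOptionsValidator class (the real checks live in the strategy's option-validation code):

/** Illustrative validation of STCS options, mirroring the constraints listed above. */
final class StcsOptionsValidator {
    static void validate(double bucketLow, double bucketHigh,
                         long minSSTableSizeBytes, int minThreshold, int maxThreshold) {
        if (bucketHigh <= bucketLow)
            throw new IllegalArgumentException("bucket_high must be greater than bucket_low");
        if (minSSTableSizeBytes < 0)
            throw new IllegalArgumentException("min_sstable_size must be non-negative");
        if (minThreshold < 2)
            throw new IllegalArgumentException("min_threshold must be at least 2");
        if (maxThreshold < minThreshold)
            throw new IllegalArgumentException("max_threshold must be >= min_threshold");
    }
}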
Write Amplification Analysis¶
Tiered growth with \(t = 4\) (min_threshold):
| Tier | Input | Output |
|---|---|---|
| 0 | \(4 \times 1\text{MB}\) | 4 MB |
| 1 | \(4 \times 4\text{MB}\) | 16 MB |
| 2 | \(4 \times 16\text{MB}\) | 64 MB |
| 3 | \(4 \times 64\text{MB}\) | 256 MB |
For 1GB of data to reach final state:
| Step | Operation | Cumulative Writes |
|---|---|---|
| 1 | Write to memtable | \(1\times\) |
| 2 | Flush to 1MB SSTables | \(1\times\) |
| 3 | Compact to 4MB | \(2\times\) |
| 4 | Compact to 16MB | \(3\times\) |
| 5 | Compact to 64MB | \(4\times\) |
| 6 | Compact to 256MB | \(5\times\) |
| 7 | Compact to 1GB | \(6\times\) |
Total: \(W \approx 6\times\) (logarithmic in data size)
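The step table above collapses into the same closed form used earlier, with \(t\) = min_threshold, \(N\) = total data size, and \(s_f\) = flush size:

\[
W \;\approx\; 1 + \log_t\!\left(\frac{N}{s_f}\right)
\qquad\Longrightarrow\qquad
W \approx 1 + \log_4\!\left(\frac{1024\ \text{MB}}{1\ \text{MB}}\right) = 1 + 5 = 6
\]

The leading \(1\) accounts for the initial flush; each of the \(\log_t(N/s_f)\) tiers rewrites every byte once.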
This logarithmic write amplification is significantly lower than LCS, making STCS suitable for write-heavy workloads.
Read Amplification Problem¶
The primary weakness of STCS is read amplification from large SSTable accumulation:
The "big SSTable problem": after extended operation, SSTables of widely varying sizes accumulate (from a few megabytes up to tens of gigabytes), and every read must check ALL of these SSTables.
Why large SSTables do not compact:
- \(4 \times 16\text{GB}\) SSTables needed to trigger compaction
- That requires 64GB of SSTables at similar size
- Until then, they remain, degrading read performance
Read Path Impact¶
Single partition read with 8 SSTables:
| Operation | Calculation | Time |
|---|---|---|
| Bloom filter checks | \(8 \times 0.1\text{ms}\) | 0.8 ms |
| Index lookups (4 positive) | \(4 \times 0.5\text{ms}\) | 2.0 ms |
| Data reads | \(4 \times 1\text{ms}\) | 4.0 ms |
| Merge results | — | — |
| Total | — | ~7 ms |
Compare this to LCS with its \(\sim 9\)-SSTable maximum: latency is far more predictable.
Production Issues¶
Issue 1: Large SSTable Accumulation¶
Symptoms:
- Read latency increasing over months
- Many large SSTables visible in tablestats
- Bloom filter false positive rate increasing
Diagnosis:
# Check SSTable sizes and counts
nodetool tablestats keyspace.table
# Look for large SSTables with no compaction partners
ls -lhS /var/lib/cassandra/data/keyspace/table-*/
Solutions:
- Run a major compaction during a maintenance window:

  nodetool compact keyspace table

- Lower min_threshold to trigger compaction sooner:

  ALTER TABLE keyspace.table WITH compaction = {
    'class': 'SizeTieredCompactionStrategy',
    'min_threshold': 2
  };

- Consider switching to LCS if reads are important
Issue 2: Temporary Space During Compaction¶
STCS can require up to 2x disk space temporarily:
Before: [1GB] [1GB] [1GB] [1GB] = 4GB
During: [1GB] [1GB] [1GB] [1GB] + [4GB being written] = 8GB
After: [4GB] = 4GB
Guideline: Maintain at least 50% free disk space with STCS.
Mitigation:
# Check available space before major operations
df -h /var/lib/cassandra/data
# Monitor during compaction
watch 'df -h /var/lib/cassandra/data && nodetool compactionstats'
Issue 3: Uneven SSTable Sizes¶
Symptoms:
- Some tables have many small SSTables that never compact
- Others have few large SSTables
Causes:
- Varying write patterns across tables
- min_sstable_size preventing small SSTable compaction
- Bucket boundaries excluding certain sizes
Solutions:
-- Adjust bucket boundaries for more inclusive grouping
ALTER TABLE keyspace.table WITH compaction = {
'class': 'SizeTieredCompactionStrategy',
'bucket_high': 2.0,
'bucket_low': 0.33
};
-- Lower the minimum size threshold (value is specified in bytes)
ALTER TABLE keyspace.table WITH compaction = {
'class': 'SizeTieredCompactionStrategy',
'min_sstable_size': 10485760 -- 10 MiB
};
Tuning Recommendations¶
High Write Throughput¶
ALTER TABLE keyspace.table WITH compaction = {
'class': 'SizeTieredCompactionStrategy',
'min_threshold': 4,
'max_threshold': 64 -- Allow larger compactions
};
Reduce SSTable Count¶
ALTER TABLE keyspace.table WITH compaction = {
'class': 'SizeTieredCompactionStrategy',
'min_threshold': 2, -- Compact sooner
'bucket_high': 2.0, -- Wider buckets
'bucket_low': 0.33
};
Large Dataset (Minimize Compaction I/O)¶
ALTER TABLE keyspace.table WITH compaction = {
'class': 'SizeTieredCompactionStrategy',
'min_threshold': 8, -- Fewer, larger compactions
'max_threshold': 16
};
Monitoring STCS¶
Key Indicators¶
| Metric | Healthy | Investigate |
|---|---|---|
| SSTable count | <20 | >30 |
| Largest SSTable | <10% of total | >25% of total |
| Pending compactions | <10 | >50 |
| Read latency P99 | Stable | Increasing over time |
Commands¶
# SSTable count and sizes
nodetool tablestats keyspace.table | grep -E "SSTable count|Space used"
# Check for compaction activity
nodetool compactionstats
# Analyze SSTable size distribution
ls -lh /var/lib/cassandra/data/keyspace/table-*/*-Data.db | \
awk '{print $5}' | sort | uniq -c
Implementation Internals¶
This section documents implementation details from the Cassandra source code.
Bucket Formation Algorithm¶
The bucket formation algorithm processes SSTables in a specific order:
- Sort SSTables by on-disk size in ascending order (for deterministic results)
- For each SSTable, attempt to match it to an existing bucket:
  - Match if bucket_avg × bucket_low ≤ sstable_size ≤ bucket_avg × bucket_high
  - OR if both the SSTable size and the bucket average are below min_sstable_size
  - Recalculate the bucket average when adding a new SSTable
- Create a new bucket for unmatched SSTables
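A condensed Java sketch of that loop (illustrative; names such as BucketFormation and Bucket are hypothetical, and the real SizeTieredCompactionStrategy.getBuckets tracks bucket averages in a map rather than a list):

import java.util.ArrayList;
import java.util.List;

/** Illustrative size-tiered bucket formation; not a copy of Cassandra's getBuckets. */
final class BucketFormation {
    record Bucket(List<Long> sizes, double average) {}

    static List<Bucket> group(List<Long> sstableSizes,
                              double bucketLow, double bucketHigh, long minSSTableSize) {
        List<Bucket> buckets = new ArrayList<>();
        // Process SSTables smallest-first so results are deterministic.
        List<Long> sorted = sstableSizes.stream().sorted().toList();
        outer:
        for (long size : sorted) {
            for (int i = 0; i < buckets.size(); i++) {
                Bucket b = buckets.get(i);
                boolean withinBand = size >= b.average() * bucketLow && size <= b.average() * bucketHigh;
                boolean bothTiny   = size < minSSTableSize && b.average() < minSSTableSize;
                if (withinBand || bothTiny) {
                    // Add to the bucket and recalculate its average size.
                    List<Long> sizes = new ArrayList<>(b.sizes());
                    sizes.add(size);
                    double avg = sizes.stream().mapToLong(Long::longValue).average().orElse(size);
                    buckets.set(i, new Bucket(sizes, avg));
                    continue outer;
                }
            }
            // No match: start a new bucket containing just this SSTable.
            buckets.add(new Bucket(new ArrayList<>(List.of(size)), size));
        }
        return buckets;
    }
}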
Task Estimation¶
The estimated number of pending compaction tasks is calculated as:
For each bucket meeting min_threshold:
tasks += ceil(bucket.size() / max_threshold)
Hotness Calculation Details¶
SSTable hotness is computed using read meter statistics:
hotness = (readMeter != null)
? readMeter.twoHourRate() / estimatedKeys()
: 0.0
SSTables without read meters (e.g., newly flushed) have zero hotness, causing them to be deprioritized unless they form the only eligible bucket.
Constants Reference¶
| Constant | Value | Description |
|---|---|---|
| Default min_threshold | 4 | Minimum SSTables per bucket |
| Default max_threshold | 32 | Maximum SSTables per compaction |
| Default bucket_low | 0.5 | Lower size ratio bound |
| Default bucket_high | 1.5 | Upper size ratio bound |
| Default min_sstable_size | 50 MiB | Small SSTable grouping threshold |
Related Documentation¶
- Compaction Overview - Concepts and strategy selection
- Leveled Compaction (LCS) - Alternative for read-heavy workloads
- Time-Window Compaction (TWCS) - Optimized for time-series data with TTL
- Unified Compaction (UCS) - Recommended strategy for Cassandra 5.0+
- Compaction Management - Tuning and troubleshooting