Tombstones

Tombstones are deletion markers in Cassandra. Because SSTables are immutable, deleted data cannot be removed in place. Instead, a tombstone is written to mark the data as deleted; the tombstone suppresses that data during reads and is purged during compaction.

Tombstone Types

Cell Tombstone

Deletes a single column value.

DELETE email FROM users WHERE user_id = 123;

Creates a tombstone for the email column in the specified row.

Row Tombstone

Deletes an entire row (all columns for a single clustering key within a partition).

DELETE FROM events WHERE user_id = 123 AND event_id = 456;

Creates a tombstone covering the entire row identified by the partition key (user_id) and clustering key (event_id).

Range Tombstone

Deletes a range of rows within a partition.

DELETE FROM messages
WHERE conversation_id = 'abc'
  AND sent_at >= '2024-01-01'
  AND sent_at < '2024-02-01';

Creates a single tombstone covering all rows in the specified clustering key range.

Partition Tombstone

Deletes an entire partition.

DELETE FROM users WHERE user_id = 123;

If user_id is the partition key, this creates a partition tombstone affecting all rows in the partition.

TTL Tombstone

Created automatically when a TTL expires.

INSERT INTO sessions (id, data) VALUES ('xyz', '...') USING TTL 3600;

After 3600 seconds, each cell expires and is treated as a tombstone.
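You can check how much lifetime a cell has left with CQL's built-in TTL() function:

-- Returns the remaining seconds, or null if the cell has no TTL
SELECT TTL(data) FROM sessions WHERE id = 'xyz';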


Tombstone Lifecycle

(Diagram: tombstone lifecycle. A DELETE writes a tombstone, the tombstone shadows data during reads, and after gc_grace_seconds it becomes eligible for removal during compaction.)


GC Grace Period

The gc_grace_seconds setting determines how long tombstones are preserved before removal.

Default Value

-- Default: 10 days (864000 seconds)
SELECT gc_grace_seconds FROM system_schema.tables
WHERE keyspace_name = 'ks' AND table_name = 'table';

Why GC Grace Matters

Scenario: Node C is down during a DELETE (RF=3)

Before DELETE:
  Node A: user_id=123 → "Alice"
  Node B: user_id=123 → "Alice"
  Node C: user_id=123 → "Alice"  [OFFLINE]

DELETE FROM users WHERE user_id = 123 (CL=QUORUM)

After DELETE:
  Node A: user_id=123 → TOMBSTONE
  Node B: user_id=123 → TOMBSTONE
  Node C: user_id=123 → "Alice"  [Still has old data]

If Node C returns AFTER gc_grace_seconds:
  - Tombstones on A and B may have been compacted away
  - Node C still has "Alice"
  - Read repair sees Node C has data, A and B do not
  - "Alice" gets RESURRECTED (zombie data)

Configuration Guidelines

Scenario                 gc_grace_seconds    Repair Frequency
Default                  864000 (10 days)    Weekly
Frequent repair          172800 (2 days)     Daily
Time-series with TTL     86400 (1 day)       Daily
High churn (careful)     3600 (1 hour)       Hourly

-- Adjust per table
ALTER TABLE my_table WITH gc_grace_seconds = 172800;

Rule: gc_grace_seconds must exceed maximum expected node downtime plus repair interval.
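In practice, that means repairing every node more often than gc_grace_seconds elapses. A minimal sketch (keyspace name assumed):

# Repair this node's primary ranges; schedule on every node
# at an interval comfortably shorter than gc_grace_seconds
nodetool repair -pr my_keyspace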


Tombstone Configuration

Warning and Failure Thresholds

# cassandra.yaml

# Warn when query scans this many tombstones
tombstone_warn_threshold: 1000

# Fail query when this many tombstones scanned
tombstone_failure_threshold: 100000

When exceeded:

WARN: Read X live rows and Y tombstone cells for query...
ERROR: Scanned over 100000 tombstones; query aborted

Tombstone Problems

Problem 1: Tombstone Accumulation

Symptoms:

  • Read latency increasing over time
  • "Read X live rows and Y tombstone cells" warnings
  • Query timeouts on specific partitions

Causes:

  • Deleting many rows without compaction
  • Wide partitions with frequent deletes
  • Range deletes creating overlapping tombstones

Investigation:

# Check tombstone warnings
grep "tombstone" /var/log/cassandra/system.log

# Table statistics
nodetool tablestats keyspace.table | grep -i tombstone

# Per-SSTable tombstone analysis
tools/bin/sstablemetadata /path/to/na-*-Data.db | grep -i tombstone

Problem 2: Query Failures

Error:

Scanned over 100000 tombstones; query aborted

Solutions (in order of preference):

  1. Fix data model to avoid tombstone accumulation
  2. Force compaction to remove eligible tombstones
  3. Add time-based partitioning to limit partition size
  4. Increase threshold (last resort—hides the problem)

Problem 3: Partition Tombstones with Wide Partitions

Bad pattern:

Partition: user_id=123
├── Row: event_1 → data
├── Row: event_2 → data
├── ... 100,000 rows ...
└── Row: event_100000 → data

DELETE FROM events WHERE user_id = 123;

Creates ONE partition tombstone, but until compaction removes the
shadowed rows, reads must reconcile the tombstone against all rows
in every SSTable containing the partition = massive read amplification

Better pattern:

-- Partition by user_id AND date
-- Delete smaller partitions
DELETE FROM events
WHERE user_id = 123 AND event_date = '2024-01-15';
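A sketch of the bucketed schema this implies (column types assumed):

-- Hypothetical schema: one partition per user per day, so a
-- partition delete covers only a single day's events
CREATE TABLE events (
    user_id    int,
    event_date date,
    event_id   int,
    data       text,
    PRIMARY KEY ((user_id, event_date), event_id)
);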

Monitoring Tombstones

nodetool Commands

# Table statistics including tombstone info
nodetool tablestats keyspace.table

# Tombstones per read histogram
nodetool tablehistograms keyspace.table

JMX Metrics

org.apache.cassandra.metrics:type=Table,name=TombstoneScannedHistogram
org.apache.cassandra.metrics:type=Table,name=LiveScannedHistogram

SSTable Analysis

# Check tombstone counts per SSTable
for f in /var/lib/cassandra/data/ks/table-*/*-Data.db; do
    echo "=== $f ==="
    tools/bin/sstablemetadata "$f" | grep -i tombstone
done

Reducing Tombstones

Data Model Changes

  1. Avoid wide partitions with deletes
     • Add time bucketing to the partition key
     • Limit partition size
  2. Use TTL instead of explicit deletes (see the sketch after this list)
     • TTL tombstones are more predictable
     • Easier to reason about cleanup
  3. Avoid range deletes on large ranges
     • Delete smaller ranges
     • Use time-based partitioning
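For item 2, a table-level default TTL expires every row automatically, with no explicit DELETE statements (table name assumed):

-- Every row written to this table expires after 24 hours
ALTER TABLE sessions WITH default_time_to_live = 86400;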

Compaction Strategies

TWCS (Time-Window Compaction Strategy):

Best for time-series data with TTL. Entire SSTables drop when all data expires.

ALTER TABLE metrics WITH compaction = {
    'class': 'TimeWindowCompactionStrategy',
    'compaction_window_size': 1,
    'compaction_window_unit': 'DAYS'
};

LCS (Leveled Compaction Strategy):

Limits how many SSTables can contain a given row, so tombstones and the data they shadow meet in compaction sooner.

ALTER TABLE events WITH compaction = {
    'class': 'LeveledCompactionStrategy'
};
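Compaction strategies also accept tombstone-specific sub-options that trigger single-SSTable compactions when the ratio of droppable tombstones gets high; a sketch (the values shown are assumptions, tune per workload):

ALTER TABLE events WITH compaction = {
    'class': 'LeveledCompactionStrategy',
    'tombstone_threshold': '0.2',
    'unchecked_tombstone_compaction': 'true'
};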

Manual Compaction

Force compaction to remove eligible tombstones:

# Major compaction of one table (use sparingly)
nodetool compact keyspace table

# User-defined compaction of specific SSTables
nodetool compact --user-defined /path/to/sstables

Tombstone Best Practices

Design

  • Partition by time for time-series data
  • Keep partitions bounded in size
  • Prefer TTL over explicit deletes when possible

Operations

  • Run repair within gc_grace_seconds
  • Monitor tombstone counts per read
  • Investigate tables with high tombstone warnings

Configuration

  • Set gc_grace_seconds based on repair frequency
  • Set tombstone thresholds appropriately
  • Use TWCS for TTL-heavy workloads