DELETE¶
The DELETE statement removes rows or column values from Cassandra tables. Unlike traditional databases that immediately remove data, Cassandra writes tombstones—markers indicating data has been deleted. Understanding tombstones is essential for maintaining cluster performance.
Behavioral Guarantees¶
What DELETE Guarantees¶
- DELETE creates a tombstone marker for deleted data
- DELETE with higher timestamp supersedes earlier writes
- Tombstones are replicated like regular data
- Tombstones are not removed before gc_grace_seconds has elapsed
- Range DELETE creates a single range tombstone (Cassandra 3.0+)
What DELETE Does NOT Guarantee¶
Undefined Behavior
The following behaviors are undefined and must not be relied upon:
- Immediate space reclamation: Disk space is only reclaimed after compaction removes tombstones
- Tombstone-free reads: Reads may encounter tombstones even after gc_grace_seconds if compaction hasn't run
- Delete confirmation: DELETE returning success does not guarantee read invisibility at lower consistency levels
- Resurrection prevention without repair: If repair doesn't run within gc_grace_seconds, deleted data may reappear
Tombstone Persistence Contract¶
Critical: gc_grace_seconds Contract
Tombstones must persist for at least gc_grace_seconds to prevent data resurrection:
- If a node is down longer than gc_grace_seconds, and compaction runs on other nodes, the tombstone may be removed
- When the node returns, its old data will propagate as "new" data, causing resurrection
- Repair must run more frequently than gc_grace_seconds on all nodes
Failure Semantics¶
| Failure Mode | Outcome | Client Action |
|---|---|---|
| WriteTimeoutException | Undefined - tombstone may or may not have been written | Query to verify, retry if needed |
| UnavailableException | Not applied | Safe to retry |
| WriteFailureException | Partially applied - some replicas may have the tombstone, others may not | Safe to retry (a non-conditional DELETE is idempotent) |
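For the timeout and failure cases, a verification read followed by a retry is the usual recovery. A minimal sketch reusing the users table from the examples below (column names and key values are illustrative):
-- After a WriteTimeoutException or WriteFailureException on this DELETE:
DELETE FROM users WHERE user_id = 123;
-- 1. Re-read the row, ideally at a strong consistency level set in the driver or cqlsh
SELECT user_id FROM users WHERE user_id = 123;
-- 2. If the row is still returned, the tombstone may not have reached enough replicas;
--    a non-conditional DELETE is idempotent, so reissuing it is safe
DELETE FROM users WHERE user_id = 123;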
Version-Specific Behavior¶
| Version | Behavior |
|---|---|
| 2.0+ | IF EXISTS and IF condition for DELETE (CASSANDRA-5062) |
| 3.0+ | Range tombstones for efficient bulk deletes (CASSANDRA-6237) |
| 4.0+ | Improved tombstone handling in reads (CASSANDRA-14227) |
Overview¶
Why Cassandra Uses Tombstones¶
In a distributed system without central coordination, immediate deletion is impossible:
The Distributed Delete Problem:
| Phase | Node A | Node B (offline) | Node C |
|---|---|---|---|
| DELETE id=1 | Deletes id=1 | Unreachable | Deletes id=1 |
| Node B returns | - | Still has id=1 | - |
| Without tombstone | id=1 resurrects! | Repair spreads data | id=1 resurrects! |
| With tombstone | Tombstone propagates | Receives tombstone | Data stays deleted |
Tombstones solve:
- Resurrection prevention: Deleted data doesn't reappear when offline nodes return
- Consistency during repair: Tombstones propagate like regular data
- Timestamp-based resolution: Tombstone with higher timestamp wins
Tombstone Lifecycle¶
A tombstone passes through the following stages:
- A DELETE (or an expiring TTL) writes a tombstone carrying the operation's timestamp
- The tombstone is replicated to the other replicas and propagated by repair, like any other write
- The tombstone persists for at least gc_grace_seconds
- After gc_grace_seconds, compaction may drop the tombstone along with the data it shadows, reclaiming disk space
Synopsis¶
DELETE [ *column_name* [ , *column_name* ... ] ]
FROM [ *keyspace_name*. ] *table_name*
[ USING TIMESTAMP *microseconds* ]
WHERE *where_clause*
[ IF *condition* [ AND *condition* ... ] | IF EXISTS ]
column_name (optional):
*column_name*
| *column_name* [ *list_index* ]
| *column_name* [ *map_key* ]
Parameters¶
Column Selection¶
Delete Entire Row¶
When no columns are specified, the entire row is deleted:
DELETE FROM users WHERE user_id = 123;
Delete Specific Columns¶
Target individual columns (creates cell tombstones):
DELETE email, phone FROM users WHERE user_id = 123;
Delete List Element¶
-- Delete by index
DELETE phone_numbers[0] FROM users WHERE user_id = 123;
DELETE phone_numbers[2] FROM users WHERE user_id = 123;
List Index Deletion
Deleting by index requires reading the list first. This is a read-before-write operation with potential race conditions.
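A hypothetical interleaving showing the race (values and ordering are illustrative):
-- Client A and Client B both intend to remove "the first" phone number
-- Client A executes first:
DELETE phone_numbers[0] FROM users WHERE user_id = 123;
-- Client B then executes the same statement:
DELETE phone_numbers[0] FROM users WHERE user_id = 123;
-- After A's delete, index 0 refers to a different element, so B removes
-- a value it never intended to touch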
Delete Map Entry¶
-- Delete by key
DELETE preferences['deprecated_setting'] FROM users WHERE user_id = 123;
DELETE preferences['old_key'] FROM users WHERE user_id = 123;
USING TIMESTAMP¶
Specifies the deletion timestamp:
DELETE FROM users
USING TIMESTAMP 1705315800000000
WHERE user_id = 123;
Timestamp effects:
- Only data with timestamps ≤ specified timestamp is deleted
- Data written after this timestamp remains visible
- Useful for replaying deletions from external systems
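A sketch of how the deletion timestamp interacts with surrounding writes (values and timestamps are illustrative):
-- Written before the deletion timestamp, so the tombstone covers it:
INSERT INTO users (user_id, email) VALUES (123, 'old@example.com')
USING TIMESTAMP 1705315700000000;

DELETE FROM users USING TIMESTAMP 1705315800000000 WHERE user_id = 123;

-- Written with a later timestamp, so it remains visible:
INSERT INTO users (user_id, email) VALUES (123, 'new@example.com')
USING TIMESTAMP 1705315900000000;

SELECT email FROM users WHERE user_id = 123;  -- returns 'new@example.com'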
WHERE Clause¶
Single Row Delete¶
DELETE FROM users WHERE user_id = 123;
Range Delete (within partition)¶
DELETE FROM events
WHERE tenant_id = 'acme'
AND event_time >= '2024-01-01'
AND event_time < '2024-02-01';
Multi-Partition Delete¶
DELETE FROM users WHERE user_id IN (123, 456, 789);
Partition Delete (all rows)¶
-- Deletes all rows with this partition key
DELETE FROM user_events WHERE user_id = 123;
IF EXISTS / IF Condition¶
Conditional deletes using lightweight transactions:
-- Delete only if row exists
DELETE FROM sessions WHERE session_id = ? IF EXISTS;
-- Delete with condition
DELETE FROM users
WHERE user_id = ?
IF status = 'inactive' AND last_login < '2023-01-01';
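A conditional DELETE returns an [applied] column indicating whether it took effect; a rough cqlsh sketch (the session_id value is arbitrary):
DELETE FROM sessions WHERE session_id = 42 IF EXISTS;
--  [applied]
-- -----------
--       True
When an IF condition is not satisfied, [applied] is False and the current values of the tested columns are returned alongside it.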
Tombstone Types¶
Different DELETE operations create different tombstone types:
| Tombstone Type | Example | Effect |
|---|---|---|
| Cell | DELETE email FROM users WHERE id = 1 | Marks single cell as deleted |
| Row | DELETE FROM users WHERE id = 1 | Marks entire row as deleted |
| Range | DELETE FROM events WHERE pk = 'x' AND ck >= 1 AND ck < 100 | Single marker covers entire range |
| Partition | DELETE FROM user_events WHERE user_id = 123 | Marks entire partition as deleted |
Cell Tombstone¶
DELETE email FROM users WHERE user_id = 123;
- Marks single column value as deleted
- Smallest tombstone type
- Created per-cell
Row Tombstone¶
DELETE FROM users WHERE user_id = 123 AND account_type = 'premium';
- Marks entire row (all columns) as deleted
- Single tombstone regardless of column count
- Created when full primary key specified
Range Tombstone¶
DELETE FROM events
WHERE partition_key = 'x'
AND clustering_col >= 100
AND clustering_col < 200;
- Single marker covers entire range
- Efficient for bulk deletes within partition
- Introduced in Cassandra 3.0+
Partition Tombstone¶
DELETE FROM user_events WHERE user_id = 123;
- Deletes all rows in partition
- Single tombstone regardless of row count
- Most efficient for clearing partition
TTL Tombstone¶
Created automatically when TTL expires:
INSERT INTO sessions (id, token) VALUES (1, 'abc') USING TTL 3600;
-- After 3600 seconds, tombstone created automatically
gc_grace_seconds¶
The gc_grace_seconds table property controls how long tombstones persist:
CREATE TABLE events (
...
) WITH gc_grace_seconds = 864000; -- 10 days (default)
ALTER TABLE events WITH gc_grace_seconds = 86400; -- 1 day
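On Cassandra 3.0+ the current value can be checked in the system schema (the keyspace name here is hypothetical):
SELECT gc_grace_seconds
FROM system_schema.tables
WHERE keyspace_name = 'my_keyspace' AND table_name = 'events';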
Why gc_grace_seconds Matters¶
Scenario: gc_grace = 10 days
| Timeline | Safe Scenario | Dangerous Scenario |
|---|---|---|
| Day 0 | Node B goes down | Node B goes down |
| Day 7 | Node B returns | - |
| Day 11 | - | Tombstone removed by compaction |
| Day 15 | - | Node B returns |
| Result | Tombstone propagates, data stays deleted ✓ | Old data resurrects! Inconsistency ✗ |
Configuring gc_grace_seconds¶
| Scenario | Recommended Value | Requirements |
|---|---|---|
| Default | 864000 (10 days) | Repair within 10 days |
| Fast cleanup | 86400 (1 day) | Repair runs daily |
| Single DC, RF=1 | 0 | No other replicas, so deleted data cannot resurrect |
| High availability | 864000+ | Conservative approach |
Reducing gc_grace_seconds
Before reducing:
- Ensure repairs run more frequently than the new value
- No nodes stay down longer than gc_grace_seconds
- Monitor node health proactively
- Have alerting for extended node outages
Tombstone Performance Impact¶
Read Performance¶
Tombstones must be read and filtered during queries:
Query: SELECT * FROM table WHERE pk = 1
| Data Type | Count | Status |
|---|---|---|
| Live Data | 100 rows | Returned |
| Tombstones | 10,000 | Must be read and filtered |
| Total Read | 10,100 entries | To return 100 rows |
Performance implications:
- All tombstones in query range must be read
- Memory consumed holding tombstones
- Can trigger TombstoneOverwhelmingException
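One way to observe this per query is cqlsh tracing; a sketch (the table and the exact trace wording are illustrative):
TRACING ON;
SELECT * FROM events WHERE tenant_id = 'acme';
-- The trace includes a line similar to:
--   Read 100 live rows and 10000 tombstone cells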
Tombstone Thresholds¶
Cassandra warns or fails when too many tombstones are encountered:
# cassandra.yaml
tombstone_warn_threshold: 1000
tombstone_failure_threshold: 100000
Example log messages when the thresholds are exceeded:
WARN - Read 5000 live rows and 50000 tombstone cells for query...
ERROR - Scanned over 100000 tombstones; query aborted
Monitoring Tombstones¶
# Table-level tombstone stats (e.g., average tombstones per slice)
nodetool tablestats keyspace.table
# Same output via the older cfstats alias
nodetool cfstats keyspace.table
Tombstone-Heavy Patterns to Avoid¶
Pattern 1: Queue-Like Tables¶
-- Anti-pattern: Using Cassandra as a queue
CREATE TABLE job_queue (
queue_id TEXT,
job_id TIMEUUID,
payload TEXT,
PRIMARY KEY (queue_id, job_id)
);
-- Process job, then delete
DELETE FROM job_queue WHERE queue_id = 'main' AND job_id = ?;
Problem: Accumulates tombstones at a high rate; reads for the next job must scan past them
Solution: Use TTL instead (see the sketch below), or a dedicated queue system
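A minimal sketch of the TTL approach, assuming a one-day retention is acceptable (table name and TTL value are illustrative):
CREATE TABLE job_queue_ttl (
    queue_id TEXT,
    job_id   TIMEUUID,
    payload  TEXT,
    PRIMARY KEY (queue_id, job_id)
) WITH default_time_to_live = 86400;  -- rows expire one day after insert, no explicit DELETE
Expired rows still become tombstones (see TTL Tombstone above), but no per-job DELETE traffic is needed and tombstone creation is spread out over time.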
Pattern 2: Sparse Time-Series¶
-- Anti-pattern: Deleting old time-series data
DELETE FROM metrics
WHERE sensor_id = 'temp-1'
AND timestamp < '2024-01-01';
Problem: Creates a large range tombstone that reads of the partition must still process
Solution: Use separate tables per time window (e.g., metrics_2024_01), as sketched below
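A sketch of the per-window layout (the schema is illustrative):
CREATE TABLE metrics_2024_01 (
    sensor_id TEXT,
    ts        TIMESTAMP,
    value     DOUBLE,
    PRIMARY KEY (sensor_id, ts)
);

-- Retiring January later removes the data without writing any tombstones:
DROP TABLE metrics_2024_01;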
Pattern 3: Frequent Column Updates to Null¶
-- Anti-pattern: Setting columns to null repeatedly
UPDATE users SET temp_flag = null WHERE user_id = ?;
Problem: Each null write creates a cell tombstone
Solution: Use a separate table for transient flags, or leave the value unset in prepared statements instead of binding null
Best Practices¶
Use TTL Instead of DELETE¶
Let data expire naturally:
-- Good: Data auto-expires
INSERT INTO sessions (id, token) VALUES (?, ?)
USING TTL 86400;
-- Avoid: Manual cleanup creates tombstones
DELETE FROM sessions WHERE created_at < ?;
Use Range Tombstones¶
Single range tombstone is more efficient than many row tombstones:
-- Good: Single range tombstone
DELETE FROM events
WHERE tenant_id = 'acme'
AND event_date = '2024-01-15';
-- Avoid: Many individual tombstones
-- for row in rows_to_delete:
-- DELETE FROM events WHERE tenant_id = 'acme' AND event_id = row.id;
Partition Per Time Window¶
Design tables to drop entire partitions:
-- Table design: partition by day
CREATE TABLE events_by_day (
tenant_id TEXT,
event_date DATE,
event_id TIMEUUID,
data TEXT,
PRIMARY KEY ((tenant_id, event_date), event_id)
);
-- Clean deletion: entire partition
DELETE FROM events_by_day
WHERE tenant_id = 'acme' AND event_date = '2024-01-15';
Monitor and Repair¶
# Regular repair ensures tombstone propagation
nodetool repair keyspace table
# Check for tombstone warnings
grep -i tombstone /var/log/cassandra/system.log
Restrictions¶
Restrictions
WHERE Clause:
- Partition key required
- Cannot use ALLOW FILTERING
- Range deletes only on clustering columns
Columns:
- Cannot delete primary key columns
- Deleting counter columns is allowed, but re-updating a deleted counter has undefined behavior (see Counter Tables below)
- List index deletion requires read-before-write
Conditional Deletes:
- IF cannot be used with USING TIMESTAMP
- IF conditions add Paxos overhead
Counter Tables:
- Deleting counter columns or rows is supported, but re-updating a deleted counter leads to undefined behavior
- Use TRUNCATE to reliably reset a counter table
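A sketch of the WHERE-clause and conditional restrictions; these statements reuse the tables from the examples above and are all rejected:
-- Rejected: primary key columns cannot be deleted individually
DELETE user_id FROM users WHERE user_id = 123;

-- Rejected: the partition key must be restricted
DELETE FROM events WHERE event_time < '2024-01-01';

-- Rejected: IF cannot be combined with USING TIMESTAMP
DELETE FROM users USING TIMESTAMP 1705315800000000
WHERE user_id = 123 IF EXISTS;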
Examples¶
Delete User Account¶
-- Delete user and all related data (multiple tables)
BEGIN BATCH
DELETE FROM users WHERE user_id = ?;
DELETE FROM user_emails WHERE user_id = ?;
DELETE FROM user_sessions WHERE user_id = ?;
APPLY BATCH;
Archive and Delete Old Events¶
-- First: Copy to archive (application code)
-- Then: Delete from active table
DELETE FROM active_events
WHERE tenant_id = 'acme'
AND event_time < '2024-01-01';
Conditional Session Cleanup¶
DELETE FROM sessions
WHERE session_id = ?
IF user_id = ?; -- Verify ownership before delete
Remove Map Entry¶
DELETE preferences['deprecated_feature'] FROM user_settings
WHERE user_id = ?;
Clean Expired Data Partition¶
-- If table is partitioned by month
DELETE FROM metrics_2024_01
WHERE sensor_id = 'temp-001';
Related Documentation¶
- INSERT - TTL as alternative to DELETE
- UPDATE - Null assignments create tombstones
- BATCH - Atomic multi-table deletes
- Table Options - gc_grace_seconds configuration