Data Manipulation Language (DML)¶
Data Manipulation Language (DML) commands retrieve and modify data in Cassandra tables. Unlike traditional SQL databases, Cassandra's DML operations are designed for distributed, eventually consistent storage with specific semantics around timestamps, tombstones, and partition-aware queries.
Overview¶
DML in Cassandra vs SQL¶
Cassandra's DML differs fundamentally from SQL databases:
| Aspect | SQL Databases | Cassandra |
|---|---|---|
| INSERT semantics | Fails if row exists | Upsert (insert or update) |
| UPDATE semantics | Fails if row doesn't exist | Upsert (insert or update) |
| DELETE behavior | Immediate removal | Tombstone marker |
| Query flexibility | Any column in WHERE | Partition key required |
| Consistency | ACID transactions | Tunable consistency |
| Conflict resolution | Locks | Last-write-wins (timestamp) |
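The upsert behavior in particular trips up newcomers. A minimal cqlsh sketch (the `users` table here is hypothetical, matching the examples later in this page) showing that both INSERT and UPDATE succeed regardless of whether the row exists:

```sql
-- Hypothetical table for illustration
CREATE TABLE IF NOT EXISTS users (
    id int PRIMARY KEY,
    name text
);

-- First INSERT creates the row
INSERT INTO users (id, name) VALUES (1, 'Alice');

-- A second INSERT with the same key does not fail; it overwrites
INSERT INTO users (id, name) VALUES (1, 'Alicia');

-- UPDATE on a nonexistent key does not fail either; it creates the row
UPDATE users SET name = 'Bob' WHERE id = 2;
```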
Historical Context¶
CQL (Cassandra Query Language) was introduced in Cassandra 0.8 (2011) as a SQL-like interface to replace the Thrift API. The language evolved significantly:
| Version | CQL Version | Key DML Features |
|---|---|---|
| 0.8 | CQL 1.0 | Basic SELECT, INSERT, UPDATE, DELETE |
| 1.2 | CQL 3.0 | Compound primary keys, collections, prepared statements |
| 2.0 | CQL 3.1 | Lightweight transactions (IF NOT EXISTS/IF condition) |
| 2.1 | CQL 3.2 | User-defined types, tuples, indexes on collections |
| 2.2 | CQL 3.3 | User-defined functions, user-defined aggregates, JSON support |
| 3.0 | CQL 3.4 | Materialized views, SASI indexes |
| 4.0 | CQL 3.4.5 | Virtual tables, audit logging |
| 5.0 | CQL 3.4.7 | SAI indexes, vector search |
Write Path Architecture¶
Understanding how writes flow through Cassandra is essential for effective DML usage.
Write Path Steps¶
1. Client sends a request to any node, which acts as the coordinator
2. Coordinator parses the CQL statement
3. Partition key is hashed to determine the token and replica nodes
4. Request is forwarded to replica nodes based on the replication factor
5. Each replica writes to the commit log (durability), then the memtable (speed)
6. Coordinator waits for acknowledgments based on the consistency level
7. Response is returned when sufficient replicas acknowledge
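These steps can be observed on a live cluster: cqlsh can trace an individual statement. A sketch (the exact trace output varies by version):

```sql
-- In cqlsh, enable tracing to observe the write path for a statement
TRACING ON;

INSERT INTO users (id, name) VALUES (1, 'Alice');
-- cqlsh prints a trace session showing the coordinator parsing the
-- statement, replicas appending to the commit log and memtable, and
-- the acknowledgments the coordinator waited for

TRACING OFF;
```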
Write Timestamps¶
Every write in Cassandra carries a timestamp (microseconds since Unix epoch):
```sql
-- Cassandra assigns the current time if not specified
INSERT INTO users (id, name) VALUES (1, 'Alice');

-- Explicit timestamp (microseconds since Unix epoch)
INSERT INTO users (id, name) VALUES (1, 'Alice')
USING TIMESTAMP 1705315800000000;
```
Timestamps serve critical functions:
- Conflict resolution: Higher timestamp wins (last-write-wins)
- Tombstone expiration: Determines when deleted data can be purged
- Read repair: Identifies which value is most recent
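The stored timestamp of any non-primary-key column can be inspected with the built-in WRITETIME function; the output row below is illustrative:

```sql
-- WRITETIME returns the write timestamp of a regular column
SELECT name, WRITETIME(name) FROM users WHERE id = 1;

--  name  | writetime(name)
-- -------+------------------
--  Alice | 1705315800000000
```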
Clock Synchronization
Cassandra relies on synchronized clocks across nodes. Use NTP to keep clocks synchronized within milliseconds. Clock skew can cause:
- Unexpected conflict resolution results
- Data appearing to "come back" after deletion
- Inconsistent reads
Read Path Architecture¶
Reads in Cassandra follow a different path optimized for distributed data retrieval.
Read Path Steps¶
1. Client sends the query to the coordinator
2. Coordinator extracts the partition key and calculates the token
3. Request is sent to replicas based on the consistency level
4. Each replica checks, in order:
    - Bloom filter (might this SSTable contain the partition?)
    - Partition key cache
    - Memtable (in-memory recent writes)
    - SSTables (on disk, newest to oldest)
5. Coordinator merges results, resolving conflicts by timestamp
6. Optional read repair if inconsistencies are detected
Partition Key Requirement¶
Cassandra requires the partition key in most queries because it determines which nodes hold the data:
```sql
-- Efficient: partition key specified
SELECT * FROM users WHERE user_id = 123;

-- Inefficient: scans every partition in the cluster
SELECT * FROM users WHERE name = 'Alice' ALLOW FILTERING;
```
ALLOW FILTERING
The ALLOW FILTERING clause forces Cassandra to scan all partitions. This operation:
- Contacts every node in the cluster
- Does not scale with cluster size
- Can cause timeouts on large tables
- Should never be used in production application queries
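The usual remedy is to maintain a second table keyed by the column you need to query, so the lookup becomes a single-partition read. A sketch using a hypothetical `users_by_name` table:

```sql
-- Denormalized lookup table keyed by the queried column
CREATE TABLE users_by_name (
    name text,
    user_id int,
    PRIMARY KEY (name, user_id)
);

-- Hits only the replicas that own the 'Alice' partition
SELECT * FROM users_by_name WHERE name = 'Alice';
```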
Consistency Levels¶
Consistency levels control how many replicas must respond before a query succeeds.
Write Consistency Levels¶
| Level | Replicas Required | Use Case |
|---|---|---|
| ANY | 1 (including hints) | Maximum availability, risk of data loss |
| ONE | 1 | Low latency, eventual consistency |
| TWO | 2 | Slightly stronger consistency |
| THREE | 3 | Stronger consistency |
| QUORUM | (RF/2) + 1 | Balance of consistency and availability |
| LOCAL_QUORUM | Quorum in local DC | Multi-DC with local consistency |
| EACH_QUORUM | Quorum in each DC | Strong multi-DC consistency |
| ALL | All replicas | Maximum consistency, lowest availability |
Read Consistency Levels¶
| Level | Replicas Contacted | Use Case |
|---|---|---|
| ONE | 1 | Low latency reads |
| TWO | 2 | Slightly stronger consistency |
| THREE | 3 | Stronger consistency |
| QUORUM | (RF/2) + 1 | Strong consistency |
| LOCAL_QUORUM | Quorum in local DC | Multi-DC with local reads |
| EACH_QUORUM | Quorum in each DC | Not typically used for reads |
| ALL | All replicas | Maximum consistency |
| LOCAL_ONE | 1 in local DC | Low latency local reads |
| SERIAL | Paxos quorum | Linearizable reads (LWT) |
| LOCAL_SERIAL | Local Paxos quorum | Local linearizable reads |
Strong Consistency Formula¶
To achieve strong consistency (read-your-writes), the number of replicas a read contacts plus the number a write waits for must exceed the replication factor, so the read set and write set always overlap in at least one replica:

`READ_CL + WRITE_CL > REPLICATION_FACTOR`
Common patterns:
| Write CL | Read CL | Strong? (RF=3) | Arithmetic |
|---|---|---|---|
| QUORUM | QUORUM | ✓ | 2 + 2 > 3 |
| ONE | ALL | ✓ | 1 + 3 > 3 |
| ALL | ONE | ✓ | 3 + 1 > 3 |
| ONE | ONE | ✗ | 1 + 1 ≤ 3 |
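In cqlsh, the level applies to subsequent statements via the CONSISTENCY command; a sketch of the QUORUM/QUORUM pattern (drivers expose the same setting per statement):

```sql
-- Set the consistency level for subsequent cqlsh statements
CONSISTENCY QUORUM;

-- With RF=3, this write succeeds once 2 replicas acknowledge
INSERT INTO users (id, name) VALUES (1, 'Alice');

-- A QUORUM read overlaps the QUORUM write (2 + 2 > 3),
-- so it is guaranteed to observe the value above
SELECT name FROM users WHERE id = 1;
```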
Tombstones and Deletion¶
Cassandra does not immediately delete data. Instead, it writes tombstones—markers indicating data should be considered deleted.
Why Tombstones Exist¶
In a distributed system without central coordination:
- No global delete propagation: DELETE cannot contact offline nodes
- Prevents resurrection: Tombstones override older data when replicas sync
- Eventually consistent: All replicas eventually see the deletion
Tombstone Types¶
| Type | Created By | Scope |
|---|---|---|
| Cell tombstone | `DELETE col FROM table WHERE pk = x AND ck = y` | Single column value |
| Row tombstone | `DELETE FROM table WHERE pk = x AND ck = y` (full primary key) | Entire row |
| Range tombstone | `DELETE FROM table WHERE pk = x AND ck > y` | Range of clustering keys |
| Partition tombstone | `DELETE FROM table WHERE pk = x` (partition key only) | Entire partition |
| TTL tombstone | Automatic when TTL expires | Column or row |
gc_grace_seconds¶
The gc_grace_seconds table property (default: 864000 = 10 days) determines how long tombstones persist before compaction can remove them:
```sql
CREATE TABLE events (
    ...
) WITH gc_grace_seconds = 86400;  -- 1 day
```
Reducing gc_grace_seconds
Before reducing gc_grace_seconds:
- Ensure repairs run more frequently than the new value
- Verify no nodes stay down longer than the new value
- Understand that nodes down longer than gc_grace_seconds may resurrect deleted data
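Once those conditions hold, the property can be changed in place on an existing table; a minimal sketch, reusing the `events` table above:

```sql
-- Lower gc_grace_seconds only after verifying the repair cadence
ALTER TABLE events WITH gc_grace_seconds = 86400;  -- 1 day
```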
DML Statement Categories¶
| Statement | Purpose | Documentation |
|---|---|---|
| SELECT | Retrieve rows and columns | Query syntax, filtering, paging |
| INSERT | Add or replace rows | Upsert semantics, TTL, JSON |
| UPDATE | Modify column values | Collection operations, counters |
| DELETE | Remove rows or columns | Tombstones, range deletes |
| BATCH | Atomic multi-statement operations | Logged vs unlogged, anti-patterns |
| Secondary Index Queries | Query non-primary-key columns | Index types, performance, anti-patterns |
| Vector Search | Similarity search on embeddings | ANN queries, similarity functions |
| Lightweight Transactions | Compare-and-set operations | Paxos, serial consistency |
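As a taste of the compare-and-set semantics in the last row, a sketch against the `users` table from earlier (each statement returns an `[applied]` column reporting whether the condition held):

```sql
-- Insert only if the row does not already exist (runs a Paxos round)
INSERT INTO users (id, name) VALUES (1, 'Alice') IF NOT EXISTS;

-- Conditional update: applies only if the current value matches
UPDATE users SET name = 'Alicia' WHERE id = 1 IF name = 'Alice';
```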
Best Practices¶
Write Optimization¶
- Batch by partition: Group writes to the same partition for efficiency (see the sketch after this list)
- Avoid large batches: Keep batches under 5 KB (the default warning threshold)
- Use UNSET for nulls: Binding null in a prepared statement creates a tombstone; bind UNSET instead
- Prepare statements: Reuse prepared statements for repeated queries
- Set TTL appropriately: Apply TTL at write time, not retroactively
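A sketch of a single-partition batch with a write-time TTL, using a hypothetical `user_events` table (a batch confined to one partition is applied as a single mutation, avoiding batchlog overhead):

```sql
-- Hypothetical time-series table
CREATE TABLE user_events (
    user_id int,
    event_id int,
    type text,
    PRIMARY KEY (user_id, event_id)
);

-- All statements target partition user_id = 42
BEGIN BATCH
    INSERT INTO user_events (user_id, event_id, type)
        VALUES (42, 1, 'login') USING TTL 604800;  -- 7 days
    INSERT INTO user_events (user_id, event_id, type)
        VALUES (42, 2, 'click') USING TTL 604800;
APPLY BATCH;
```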
Read Optimization¶
- Include partition key: Every query should specify the partition key
- Avoid ALLOW FILTERING: Design the data model around query patterns instead
- Use token-aware drivers: Route queries directly to replica nodes
- Limit result sets: Use LIMIT and paging for large results (see the sketch after this list)
- Consider LOCAL_* consistency levels for multi-DC deployments
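A sketch of bounding a read against the hypothetical `user_events` table above (the PAGING command is cqlsh-specific; drivers expose an equivalent fetch-size setting):

```sql
-- cqlsh: fetch results in pages of 100 rows
PAGING 100;

-- Partition-restricted query with a bounded result set
SELECT * FROM user_events WHERE user_id = 42 LIMIT 1000;
```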
Delete Considerations¶
- Minimize deletes: Design the data model to avoid frequent deletes
- Use TTL instead: Let data expire naturally when possible (see the sketch after this list)
- Prefer range deletes: One range delete is cheaper than many individual row deletes
- Monitor tombstones: Use `nodetool tablestats` to track tombstone counts
- Run regular repairs: Ensure tombstones propagate to all replicas before gc_grace_seconds expires
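A sketch of expiry-by-TTL on the hypothetical `user_events` table, which avoids explicit deletes entirely:

```sql
-- Data expires automatically after 30 days; no DELETE is ever issued
INSERT INTO user_events (user_id, event_id, type)
VALUES (42, 3, 'logout') USING TTL 2592000;

-- Inspect the remaining TTL (in seconds) on a column
SELECT TTL(type) FROM user_events WHERE user_id = 42 AND event_id = 3;
```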
Related Documentation¶
- DDL Commands - Schema management
- Data Types - CQL type reference
- Functions - Built-in and user-defined functions
- Data Modeling - Query-driven design patterns