Prepared Statements Architecture¶
Prepared statements optimize repeated query execution by separating query parsing from execution. The server parses and validates the query once, returning an identifier that clients use for subsequent executions with different parameter values.
Preparation Model¶
Why Prepare Statements?¶
Without preparation, every query execution requires:
- Parse CQL syntax
- Validate against schema
- Plan execution
- Execute query
With prepared statements:
- Prepare (once): Parse, validate, plan → return ID
- Execute (many): Bind values, execute using ID
Benefits¶
| Benefit | Impact |
|---|---|
| Reduced parsing | Lower server CPU |
| Smaller messages | Less network bandwidth |
| Token-aware routing | Better coordinator selection |
| Type safety | Compile-time validation in drivers |
| Security | No CQL injection possible |
Always Use Prepared Statements
Prepared statements should be used for all production queries. They provide security against injection attacks, enable token-aware routing, and reduce message sizes.
Protocol Flow¶
Preparation Phase¶
Execution Phase¶
Statement Identification¶
Statement ID¶
The prepared statement ID is a hash that uniquely identifies:
- Query string
- Keyspace context
- Protocol-relevant settings
ID Computation:
ID = MD5(query_string + keyspace + settings)
↓
16-byte identifier
Result Metadata ID (Protocol v5+)¶
Protocol v5 adds a result metadata ID for optimization:
Result Metadata ID = MD5(column_specifications)
Purpose:
- Client caches result metadata
- Server can skip sending metadata if unchanged
- Reduces response size for repeated queries
Server-Side Architecture¶
Prepared Statement Cache¶
Cassandra maintains a cache of prepared statements:
# cassandra.yaml
prepared_statements_cache_size_mb: 100 # Maximum cache size
Cache characteristics: - Per-node cache (not distributed) - LRU eviction when full - Survives connection close - Lost on node restart
Cache Structure¶
Cache Entry:
Key: Statement ID (16 bytes)
Value:
- Parsed query tree
- Bound variable metadata
- Result metadata
- Partition key indices
- Keyspace reference
Eviction Behavior¶
When the cache is full:
- LRU statement selected for eviction
- Statement removed from cache
- Future executions receive UNPREPARED error
- Client must re-prepare
Client-Side Architecture¶
Driver Cache¶
Drivers maintain their own prepared statement cache:
# Conceptual driver cache
class PreparedStatementCache:
def __init__(self):
self.by_query = {} # query_string → PreparedStatement
self.by_id = {} # statement_id → PreparedStatement
def get_or_prepare(self, session, query):
if query in self.by_query:
return self.by_query[query]
# Prepare on server
prepared = session.prepare(query)
self.by_query[query] = prepared
self.by_id[prepared.id] = prepared
return prepared
Metadata Caching¶
Prepared statements include metadata that drivers cache:
PreparedStatement:
- Statement ID
- Query string
- Bound variables:
- Name
- Type
- Position
- Result columns:
- Keyspace
- Table
- Name
- Type
- Partition key indices (for routing)
Handling Schema Changes¶
The Reprepare Problem¶
When schema changes, prepared statements may become invalid:
Automatic Reprepare¶
Drivers handle UNPREPARED errors automatically:
Schema Change Events¶
Drivers listen for schema changes to proactively update:
SCHEMA_CHANGE event received:
1. Check affected keyspace/table
2. Invalidate relevant prepared statements
3. Reprepare on next use (lazy) or immediately (eager)
Token-Aware Routing¶
Partition Key Detection¶
Prepared statements enable precise token-aware routing:
SELECT * FROM users WHERE id = ? AND name = ?
↑ Partition key
Preparation response includes:
pk_indices: [0] // First bound variable is partition key
Execution:
1. Driver extracts value at index 0
2. Computes partition token
3. Routes to owning replica
Composite Partition Keys¶
For composite keys:
CREATE TABLE events (
year INT,
month INT,
day INT,
event_id UUID,
PRIMARY KEY ((year, month), day, event_id)
);
SELECT * FROM events WHERE year = ? AND month = ? AND day = ?
↑ ↑ Partition key components
pk_indices: [0, 1] // First two bound variables
Routing Optimization¶
With partition key information:
Without preparation:
Driver parses query → May not determine routing
Falls back to round-robin selection
With preparation:
Driver knows pk_indices
Directly calculates token from bound values
Routes to replica node
Batch Statements¶
Batches with Prepared Statements¶
Batches can mix prepared and unprepared statements:
BATCH {
type: LOGGED
statements: [
{kind: 1, id: <prepared_id>, values: [...]}, // Prepared
{kind: 0, query: "INSERT ...", values: [...]}, // Unprepared
{kind: 1, id: <prepared_id>, values: [...]} // Prepared
]
}
Batch Routing¶
Batches should target a single partition:
Single-partition batch (efficient):
All statements affect same partition
Atomic execution guaranteed
Token-aware routing possible
Multi-partition batch (inefficient):
Coordinator logs batch
Contacts multiple replicas
Higher latency and overhead
Performance Characteristics¶
Preparation Overhead¶
| Operation | Typical Time |
|---|---|
| Parse simple query | 50-200 μs |
| Parse complex query | 200-1000 μs |
| Schema validation | 10-50 μs |
| Plan generation | 10-100 μs |
| Total preparation | 100-1500 μs |
Execution Efficiency¶
| Metric | Unprepared | Prepared |
|---|---|---|
| Server CPU | Higher | Lower |
| Request size | Larger | Smaller |
| Response handling | No metadata caching | Metadata cached |
| Routing | May be suboptimal | Token-aware |
Message Size Comparison¶
Unprepared QUERY:
Query string: "SELECT * FROM users WHERE id = 12345" (38 bytes)
Total: ~50 bytes
Prepared EXECUTE:
Statement ID: 16 bytes
Value count: 2 bytes
Value: 4 bytes (int)
Total: ~22 bytes
Savings: 56% smaller
Best Practices¶
When to Use Prepared Statements¶
Always use for: - Repeated queries (even just twice) - Queries with parameters - Performance-critical paths - Production code
May skip for: - One-time administrative queries - Dynamic schema exploration - Quick debugging
Preparation Strategies¶
Eager preparation:
# Prepare at startup
def initialize():
statements = {
'get_user': session.prepare("SELECT * FROM users WHERE id = ?"),
'insert_user': session.prepare("INSERT INTO users ..."),
}
return statements
Lazy preparation:
# Prepare on first use
@lru_cache
def get_prepared(query):
return session.prepare(query)
# First call prepares, subsequent calls use cache
result = session.execute(get_prepared("SELECT ..."), values)
Avoid Over-Preparation¶
Don't prepare every unique query:
# Bad: Prepares a new statement for each ID
for user_id in user_ids:
stmt = session.prepare(f"SELECT * FROM users WHERE id = {user_id}")
session.execute(stmt)
# Good: Prepare once, execute many
stmt = session.prepare("SELECT * FROM users WHERE id = ?")
for user_id in user_ids:
session.execute(stmt, [user_id])
Troubleshooting¶
Common Issues¶
UNPREPARED errors:
- Statement evicted from server cache
- Schema changed
- Connected to new node
- Solution: Driver should auto-reprepare
Automatic Reprepare
All modern drivers handle UNPREPARED errors transparently by re-preparing the statement and retrying. Applications typically do not need special handling for this scenario.
Statement not found on all nodes: - Prepared on one node only - Solution: Prepare on each connection
Cache exhaustion: - Too many unique queries - Solution: Increase cache size or reduce unique queries
Monitoring¶
| Metric | Healthy Range |
|---|---|
| Preparation rate | Low, stable |
| Reprepare rate | Near zero |
| Cache size | Below limit |
| Cache hit rate | >99% |
Related Documentation¶
- CQL Protocol - PREPARE and EXECUTE opcodes
- Load Balancing - Token-aware routing
- Failure Handling - Handling preparation failures