Retry Policy
The retry policy determines whether to retry a failed request and on which node. This policy is critical for application reliability but must be configured carefully to avoid unintended consequences.
Error Classification
Not all errors are retryable. The retry policy classifies each error and decides on the appropriate action:
| Error Type | Typical Cause | Retryable? |
|---|---|---|
| ReadTimeoutException | Replica(s) did not respond in time | Depends on received/required |
| WriteTimeoutException | Replica(s) did not acknowledge the write in time | Only if the statement is idempotent |
| UnavailableException | Not enough replicas alive | Retry on different node |
| OverloadedException | Coordinator is overloaded | Retry on different node |
| ServerError | Unexpected server error | Usually no |
| QueryValidationException | Syntax or schema error | No (not transient) |
Read Timeout Details
A read timeout includes information about how many replicas responded:
ReadTimeoutException Analysis:
Required replicas (based on CL): 2
Received responses: 1
Data received: false
Interpretation:
- Only 1 of 2 required replicas responded
- No replica sent actual data (only digest)
Retry decision:
- If 0 received: the replicas might be overloaded, retry elsewhere
- If received >= required but no data: enough replicas answered with digests, but the replica asked for the full data did not; a retry is likely to succeed
- If received < required: a retry may or may not help
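The three rules above can be sketched as a small, driver-independent function. `ReadTimeoutAdvice` and `ReadTimeoutRules` are hypothetical names used only for illustration; they are not part of any driver API, and a real policy would return the driver's own decision type.

```java
/** Hypothetical hint labels for illustration; not a driver API. */
enum ReadTimeoutAdvice { RETRY_ELSEWHERE, RETRY, UNCERTAIN }

final class ReadTimeoutRules {
    /**
     * Maps the fields carried by a read timeout to a coarse retry hint.
     *
     * @param required      replicas needed to satisfy the consistency level
     * @param received      replicas that answered before the timeout
     * @param dataRetrieved whether any replica returned the actual data
     */
    static ReadTimeoutAdvice advise(int required, int received, boolean dataRetrieved) {
        if (received == 0) {
            // Nobody answered: the replicas may be overloaded, try another node.
            return ReadTimeoutAdvice.RETRY_ELSEWHERE;
        }
        if (received >= required && !dataRetrieved) {
            // Enough responses arrived, but none carried the data itself.
            return ReadTimeoutAdvice.RETRY;
        }
        // Too few responses (or any other combination): a retry may or may not help.
        return ReadTimeoutAdvice.UNCERTAIN;
    }
}
```

For the example shown earlier (required 2, received 1, no data), `advise(2, 1, false)` falls through to the last branch.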
Write Timeout Details
Write timeouts are more complex because writes may have partially succeeded:
WriteTimeoutException Analysis:
Write type: SIMPLE
Required acknowledgments: 2
Received acknowledgments: 1
Interpretation:
- Write reached coordinator
- At least 1 replica acknowledged
- Possibly 2+ replicas received it (timeout ≠ failure)
Danger:
- Retrying may cause duplicate write
- For non-idempotent operations (counters), this corrupts data
Idempotency Consideration
The most important factor in retry decisions is whether the operation is idempotent:
| Operation Type | Idempotent? | Safe to Retry? |
|---|---|---|
| SELECT | Yes | Yes |
| INSERT (no LWT) | Yes* | Yes* |
| UPDATE SET column = value | Yes | Yes |
| UPDATE SET counter = counter + 1 | No | No |
| DELETE | Yes | Yes |
| INSERT with TTL | Depends | Depends on timing sensitivity |
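*INSERT is idempotent if the same row and values are written. A retry may be applied with a later write timestamp than the first attempt (for example, when timestamps are assigned server-side), so it can overwrite a write from another client that landed in between; set an explicit timestamp where that matters.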
Idempotent vs Non-Idempotent:
Idempotent (safe to retry):
UPDATE users SET name = 'Alice' WHERE id = 123
First attempt: Sets name to 'Alice'
Retry: Sets name to 'Alice' (same result)
Non-Idempotent (dangerous to retry):
UPDATE stats SET views = views + 1 WHERE page_id = 456
First attempt: Increments views (maybe succeeded, maybe not)
Retry: Increments again (double-counted if first succeeded)
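The difference can be reproduced without a cluster. The following is a minimal sketch that models the two tables as in-memory maps (an assumption made purely for illustration; `IdempotencyDemo` and its methods are hypothetical) and applies each operation twice, as a retry would if the first attempt had actually succeeded.

```java
import java.util.HashMap;
import java.util.Map;

final class IdempotencyDemo {
    /** Models SET name = ?: repeating the assignment leaves the same state. */
    static void setName(Map<Integer, String> users, int id, String name) {
        users.put(id, name);
    }

    /** Models SET views = views + 1: every repetition adds one more. */
    static void incrementViews(Map<Integer, Long> views, int pageId) {
        views.merge(pageId, 1L, Long::sum);
    }

    public static void main(String[] args) {
        Map<Integer, String> users = new HashMap<>();
        setName(users, 123, "Alice");      // first attempt (succeeded, but timed out)
        setName(users, 123, "Alice");      // retry
        System.out.println(users.get(123)); // prints Alice

        Map<Integer, Long> views = new HashMap<>();
        incrementViews(views, 456);        // first attempt (succeeded, but timed out)
        incrementViews(views, 456);        // retry
        System.out.println(views.get(456)); // prints 2: one page view counted twice
    }
}
```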
Retry Policy Decisions
For each failed request, the retry policy returns one of:
| Decision | Meaning |
|---|---|
| RETRY_SAME | Retry on the same node |
| RETRY_NEXT | Retry on the next node in query plan |
| IGNORE | Return empty/default result to application |
| RETHROW | Propagate error to application |
Decision Flow
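One way to picture the flow is as a single function from error category to decision. The sketch below is driver-independent and only approximates the tables on this page: `ErrorKind`, `DecisionFlow`, and `MAX_RETRIES` are hypothetical names, and the one-retry limit is an assumption. It never returns `RETRY_SAME` or `IGNORE`, which a real policy may choose in some cases.

```java
/** The four decisions from the table above. */
enum Decision { RETRY_SAME, RETRY_NEXT, IGNORE, RETHROW }

/** Hypothetical error categories, named after the exceptions on this page. */
enum ErrorKind { READ_TIMEOUT, WRITE_TIMEOUT, UNAVAILABLE, OVERLOADED, SERVER_ERROR, QUERY_VALIDATION }

final class DecisionFlow {
    /** Assumption: allow a single retry, as typical defaults do. */
    static final int MAX_RETRIES = 1;

    static Decision decide(ErrorKind kind, boolean idempotent, int retryCount) {
        if (retryCount >= MAX_RETRIES) {
            return Decision.RETHROW; // retry budget exhausted
        }
        switch (kind) {
            case READ_TIMEOUT:
                return Decision.RETRY_NEXT; // reads are safe to repeat
            case WRITE_TIMEOUT:
                // The write may already have been applied; repeat only if idempotent.
                return idempotent ? Decision.RETRY_NEXT : Decision.RETHROW;
            case UNAVAILABLE:
            case OVERLOADED:
                return Decision.RETRY_NEXT; // another coordinator may do better
            default:
                // Server errors and validation errors are not treated as transient.
                return Decision.RETHROW;
        }
    }
}
```

Checking the retry count first keeps the policy bounded regardless of the error type, which matters for the cascading-failure scenario described later on this page.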
Common Retry Policies
Default Retry Policy
Most drivers include a default policy that:
- Retries read timeouts if enough replicas responded
- Does not retry write timeouts (assumes non-idempotent)
- Retries unavailable exceptions on next node
- Limits retry attempts (typically to 1)
This is a conservative policy suitable for mixed workloads.
Fallthrough (No Retry)
Never retries any error; always propagates it to the application:
// Java driver
.withRetryPolicy(FallthroughRetryPolicy.INSTANCE)
Use when:
- Application handles retries itself
- All operations are non-idempotent
- Debugging (to see all errors)
Aggressive Retry
Retries most errors multiple times.
Warning: Aggressive retry policies can cause cascading failures by amplifying load on an already struggling cluster.
Cascading Failure Scenario:
1. Node3 becomes slow (GC, disk issue)
2. Requests to Node3 time out
3. Aggressive retry policy retries each request 3×
4. Node3 now receives 3× the requests
5. Node3 becomes slower
6. Timeouts increase, more retries triggered
7. Node3 is overwhelmed and is marked DOWN
8. Load shifts to Node1, Node2
9. If they were near capacity, they may also degrade
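The amplification in this scenario can be estimated with simple arithmetic. As a rough model, assume every attempt fails independently with the same probability p and that each request may be retried up to r times; the expected number of attempts per request is then 1 + p + p² + ... + pʳ. The independence assumption and the `RetryAmplification` name are illustrative, not taken from any driver.

```java
final class RetryAmplification {
    /**
     * Expected attempts per request when each attempt fails independently
     * with probability failureProbability and at most maxRetries retries
     * follow the first attempt. Computes 1 + p + p^2 + ... + p^maxRetries.
     */
    static double expectedAttempts(double failureProbability, int maxRetries) {
        double attempts = 0.0;
        double reachProbability = 1.0; // chance that attempt k is made at all
        for (int k = 0; k <= maxRetries; k++) {
            attempts += reachProbability;
            reachProbability *= failureProbability;
        }
        return attempts;
    }
}
```

With a healthy node (p near 0) the result stays close to 1, so retries cost almost nothing. As p approaches 1 the result approaches `maxRetries + 1`, which is exactly when the node can least afford the extra load.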
Custom Retry Policies
For fine-grained control, implement a custom retry policy:
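// Java driver 3.x - custom retry policy
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Statement;
import com.datastax.driver.core.WriteType;
import com.datastax.driver.core.exceptions.DriverException;
import com.datastax.driver.core.policies.RetryPolicy;

public class IdempotentOnlyRetryPolicy implements RetryPolicy {
    @Override
    public RetryDecision onReadTimeout(
            Statement statement,
            ConsistencyLevel cl,
            int required,
            int received,
            boolean dataRetrieved,
            int retryCount) {
        // Give up after one retry
        if (retryCount > 0) {
            return RetryDecision.rethrow();
        }
        // Retry reads on a different node
        return RetryDecision.tryNextHost(cl);
    }

    @Override
    public RetryDecision onWriteTimeout(
            Statement statement,
            ConsistencyLevel cl,
            WriteType writeType,
            int required,
            int received,
            int retryCount) {
        // Only retry if the statement is explicitly marked idempotent.
        // isIdempotent() returns null when the flag was never set.
        if (Boolean.TRUE.equals(statement.isIdempotent()) && retryCount == 0) {
            return RetryDecision.tryNextHost(cl);
        }
        return RetryDecision.rethrow();
    }

    @Override
    public RetryDecision onUnavailable(
            Statement statement,
            ConsistencyLevel cl,
            int required,
            int alive,
            int retryCount) {
        // Retry unavailable on the next node (it may see a different replica set)
        if (retryCount == 0) {
            return RetryDecision.tryNextHost(cl);
        }
        return RetryDecision.rethrow();
    }

    @Override
    public RetryDecision onRequestError(
            Statement statement,
            ConsistencyLevel cl,
            DriverException e,
            int retryCount) {
        // Any other error (e.g. overloaded coordinator): propagate to the caller
        return RetryDecision.rethrow();
    }

    @Override
    public void init(Cluster cluster) {}

    @Override
    public void close() {}
}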
Marking Statements Idempotent
// Java driver 3.x: mark a specific statement as idempotent
Statement statement = new SimpleStatement(
        "UPDATE users SET name = ? WHERE id = ?", "Alice", userId)
    .setIdempotent(true); // Safe to retry
Per-Statement Policy Override
Override retry policy for specific queries:
// Java driver 3.x: no retries for counter updates
Statement counterUpdate = new SimpleStatement(
        "UPDATE page_stats SET views = views + 1 WHERE page_id = ?", pageId)
    .setRetryPolicy(FallthroughRetryPolicy.INSTANCE); // Never retry

// Allow retries for an idempotent read
Statement userQuery = new SimpleStatement(
        "SELECT * FROM users WHERE id = ?", userId)
    .setRetryPolicy(DefaultRetryPolicy.INSTANCE); // Standard retry
Retry Metrics
Monitor retry behavior in production:
| Metric | Description | Warning Sign |
|---|---|---|
| Retry rate | Retries per second | Sustained high rate indicates cluster issues |
| Retry success rate | Percentage of retries that succeed | Low success rate means retries are wasteful |
| Retry exhaustion | Requests that failed after all retries | Any occurrence needs investigation |
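If the driver in use does not already expose these numbers, they can be tracked on the application side. The following is a minimal sketch using plain counters; `RetryMetrics` and its method names are hypothetical, and a production setup would more likely feed an existing metrics library.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical application-side counters for the three metrics above. */
final class RetryMetrics {
    private final AtomicLong retries = new AtomicLong();
    private final AtomicLong retrySuccesses = new AtomicLong();
    private final AtomicLong exhausted = new AtomicLong();

    /** Call once per retry attempt, with its outcome. */
    void recordRetry(boolean succeeded) {
        retries.incrementAndGet();
        if (succeeded) {
            retrySuccesses.incrementAndGet();
        }
    }

    /** Call when a request fails after all retries were used. */
    void recordExhausted() {
        exhausted.incrementAndGet();
    }

    /** Retry rate over an observation window of the given length. */
    double retriesPerSecond(double windowSeconds) {
        return retries.get() / windowSeconds;
    }

    /** Fraction of retries that succeeded; 0.0 when no retry has happened. */
    double retrySuccessRate() {
        long total = retries.get();
        return total == 0 ? 0.0 : (double) retrySuccesses.get() / total;
    }

    long exhaustedCount() {
        return exhausted.get();
    }
}
```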
Best Practices
| Practice | Rationale |
|---|---|
| Default to conservative | Better to fail fast than corrupt data |
| Mark idempotent operations explicitly | Enables safe retry for those operations |
| Monitor retry rates | High retry rates indicate underlying issues |
| Don't rely on retries for availability | Fix the root cause instead |
| Consider circuit breakers | Prevent retry storms during outages |
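To illustrate the circuit-breaker practice, the following is a deliberately minimal sketch: after a number of consecutive failures it blocks further requests (and therefore retries) for a cool-down period. `SimpleCircuitBreaker` is a hypothetical class, not a driver feature; it omits the half-open probing state that full implementations usually have, and it takes the current time as a parameter so the behaviour is easy to test.

```java
/** Minimal count-based circuit breaker; hypothetical, for illustration only. */
final class SimpleCircuitBreaker {
    private final int failureThreshold; // consecutive failures before opening
    private final long openMillis;      // how long to block once open
    private int consecutiveFailures = 0;
    private long openedAt = -1;         // -1 means the breaker is closed

    SimpleCircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    /** Returns true if a request (or retry) may be sent at nowMillis. */
    synchronized boolean allowRequest(long nowMillis) {
        if (openedAt < 0) {
            return true;
        }
        if (nowMillis - openedAt >= openMillis) {
            // Cool-down elapsed: close the breaker and let traffic through again.
            openedAt = -1;
            consecutiveFailures = 0;
            return true;
        }
        return false;
    }

    synchronized void recordSuccess() {
        consecutiveFailures = 0;
    }

    synchronized void recordFailure(long nowMillis) {
        consecutiveFailures++;
        if (consecutiveFailures >= failureThreshold) {
            openedAt = nowMillis; // open: stop sending until the cool-down passes
        }
    }
}
```

An application would call `allowRequest` before executing a statement and fail fast when it returns false, instead of adding more retries to a cluster that is already struggling.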
Related Documentation
- Load Balancing Policy: determines which node receives the retry
- Speculative Execution: an alternative to retry for latency reduction