Failure Handling Policies¶
Cassandra drivers implement sophisticated failure handling to maintain availability despite node failures, network issues, and transient errors. This includes retry policies, speculative execution, and idempotency awareness.
Driver Intelligence: RDBMS vs Cassandra¶
Traditional RDBMS Error Handling¶
Traditional database drivers (JDBC, ODBC, database-specific libraries) provide minimal failure handling intelligence. Error recovery is almost entirely the application's responsibility:
What RDBMS Drivers Typically Provide:
| Feature | RDBMS Driver | Application Must Handle |
|---|---|---|
| Connection pooling | Basic pool | Validation, sizing, recovery |
| Error reporting | Raw exceptions | Classification, retry decisions |
| Failover | None | Manual primary/replica switching |
| Retry logic | None | Exponential backoff, limits |
| Timeout handling | Basic | Appropriate values, recovery |
| Load balancing | None (or round-robin) | Intelligent routing |
| Health monitoring | None | Heartbeats, connection testing |
Common Application-Level Retry Pattern (RDBMS):
import time

# Application must implement retry logic itself.
# validate_connection, create_new_connection, and is_transient_error are
# further helpers the application has to write and maintain.
def execute_with_retry(connection_pool, query, max_retries=3):
    last_exception = None
    for attempt in range(max_retries):
        conn = None
        try:
            conn = connection_pool.get_connection()
            # Must validate the connection isn't stale
            if not validate_connection(conn):
                conn = create_new_connection()
            return conn.execute(query)
        except ConnectionError as e:
            last_exception = e
            # Application decides what to do with the broken connection
            if conn is not None:
                connection_pool.invalidate(conn)
            time.sleep(2 ** attempt)  # Manual exponential backoff
        except DatabaseError as e:
            if is_transient_error(e):  # Application classifies errors
                last_exception = e
                time.sleep(2 ** attempt)
            else:
                raise  # Non-recoverable
    raise last_exception
Cassandra Driver Intelligence¶
Cassandra drivers embed sophisticated failure handling that would require thousands of lines of custom application code with traditional databases:
Built-in Cassandra Driver Capabilities:
| Feature | Cassandra Driver Provides |
|---|---|
| Retry Policies | Configurable policies with error-type awareness |
| Speculative Execution | Parallel requests for tail latency reduction |
| Load Balancing | Token-aware, DC-aware, latency-aware routing |
| Connection Management | Per-node pools with automatic scaling |
| Health Monitoring | Continuous heartbeats, state tracking |
| Reconnection | Exponential backoff with configurable limits |
| Circuit Breakers | Node-level failure isolation |
| Idempotency Awareness | Safe retry decisions based on operation type |
| Topology Awareness | Automatic discovery, rack/DC awareness |
| Metadata Sync | Schema and token ring synchronization |
Comparison Summary¶
| Aspect | RDBMS | Cassandra |
|---|---|---|
| Error handling code | 500-2000 lines | Configuration only |
| Failover implementation | Manual/custom | Automatic |
| Retry logic | Application responsibility | Driver policy |
| Node health tracking | External monitoring | Built-in |
| Load balancing | External (HAProxy, etc.) | Built-in |
| Connection recovery | Manual validation | Automatic reconnection |
| Timeout handling | Per-query code | Policy-based |
| Speculative execution | Not available | Built-in option |
Driver Configuration Over Custom Code
With Cassandra drivers, failure handling is configured rather than coded. Instead of implementing retry loops, connection validation, and failover logic, applications configure policies that the driver executes automatically. This reduces application complexity and ensures consistent, tested behavior.
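As an illustration, here is how several of these policies can be wired up through the DataStax Python driver's execution profiles. This is a sketch: the contact point, data center name, and timeout values are placeholders, while the class and parameter names are the driver's own.

from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
from cassandra.policies import DCAwareRoundRobinPolicy, RetryPolicy, TokenAwarePolicy

# Failure handling expressed as configuration: the driver applies these
# policies on every request; no retry loops in application code.
profile = ExecutionProfile(
    load_balancing_policy=TokenAwarePolicy(DCAwareRoundRobinPolicy(local_dc="dc1")),
    retry_policy=RetryPolicy(),  # the driver's default, conservative policy
    request_timeout=10,          # seconds
)
cluster = Cluster(["127.0.0.1"], execution_profiles={EXEC_PROFILE_DEFAULT: profile})
session = cluster.connect()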
Failure Handling Overview¶
Failure Types¶
| Failure Type | Scope | Recovery Strategy |
|---|---|---|
| Connection failure | Single connection | Reconnect, try other node |
| Request timeout | Single request | Retry based on policy |
| Node down | Single node | Route to other nodes |
| Coordinator error | Request processing | Retry on same/different node |
| Consistency failure | Cluster-wide | May not be recoverable |
Handling Architecture¶
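Conceptually, the driver's request path combines the pieces described in this section: the load balancing policy produces an ordered query plan, circuit breakers filter out isolated nodes, and the retry policy decides what happens on each error. The sketch below is illustrative only; the names are not a real driver API.

# Conceptual request path inside the driver (illustrative names only)
def execute(statement, policies):
    plan = policies.load_balancer.query_plan(statement)  # ordered candidate nodes
    retry_number = 0
    for node in plan:
        if not policies.circuit_breaker(node).should_try():
            continue  # node is isolated; skip it
        try:
            return node.send(statement)  # speculative copies may start in parallel
        except DriverError as error:
            decision = policies.retry_policy.decide(statement, error, retry_number)
            if decision is RetryDecision.RETHROW:
                raise
            retry_number += 1  # RETRY_NEXT_HOST: fall through to the next node
    raise NoHostAvailable("query plan exhausted")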
Retry Policies¶
Policy Interface¶
# Conceptual retry policy interface
class RetryPolicy:
    def on_read_timeout(self, statement, consistency, required, received,
                        data_retrieved, retry_number):
        """Called when a read times out"""
        return RetryDecision.RETHROW  # or RETRY_SAME_HOST, RETRY_NEXT_HOST, IGNORE

    def on_write_timeout(self, statement, consistency, write_type,
                         required, received, retry_number):
        """Called when a write times out"""
        return RetryDecision.RETHROW

    def on_unavailable(self, statement, consistency,
                       required, alive, retry_number):
        """Called when not enough replicas are available"""
        return RetryDecision.RETHROW

    def on_request_error(self, statement, consistency, exception, retry_number):
        """Called on other request errors"""
        return RetryDecision.RETHROW
Retry Decisions¶
| Decision | Behavior |
|---|---|
| RETHROW | Propagate error to application |
| RETRY_SAME_HOST | Retry on same coordinator |
| RETRY_NEXT_HOST | Retry on next node in query plan |
| IGNORE | Return empty result (for reads) |
Built-in Policies¶
Default Retry Policy¶
Conservative policy that retries only when safe:
| Error | Retry? | Rationale |
|---|---|---|
| Read timeout (data received) | Yes, same host | Coordinator has data |
| Read timeout (no data) | No | Might not be available |
| Write timeout (BATCH_LOG) | Yes, same host | Safe to retry |
| Write timeout (other) | No | Risk of duplicate writes |
| Unavailable | No | Won't succeed |
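Expressed against the conceptual interface above, the table translates to roughly the following sketch (not the actual driver source):

class DefaultRetryPolicy(RetryPolicy):
    def on_read_timeout(self, statement, consistency, required, received,
                        data_retrieved, retry_number):
        if retry_number == 0 and data_retrieved:
            return RetryDecision.RETRY_SAME_HOST  # coordinator already has the data
        return RetryDecision.RETHROW

    def on_write_timeout(self, statement, consistency, write_type,
                         required, received, retry_number):
        if retry_number == 0 and write_type == "BATCH_LOG":
            return RetryDecision.RETRY_SAME_HOST  # batch log write is safe to retry
        return RetryDecision.RETHROW

    def on_unavailable(self, statement, consistency, required, alive, retry_number):
        return RetryDecision.RETHROW  # retrying won't bring replicas back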
Fallthrough Policy¶
Never retry (application handles everything):
class FallthroughRetryPolicy:
    def on_read_timeout(self, *args):
        return RetryDecision.RETHROW

    def on_write_timeout(self, *args):
        return RetryDecision.RETHROW

    def on_unavailable(self, *args):
        return RetryDecision.RETHROW
Downgrading Consistency Policy¶
Retries at lower consistency when needed:
class DowngradingConsistencyPolicy:
    def on_unavailable(self, statement, consistency, required, alive, retry_num):
        if retry_num > 0:
            return RetryDecision.RETHROW
        # Downgrade to match the replicas that are actually alive
        if alive >= 1 and consistency in [QUORUM, LOCAL_QUORUM]:
            return RetryDecision.RETRY_SAME_HOST_WITH_CONSISTENCY(ONE)
        return RetryDecision.RETHROW
Consistency Violation Risk
Downgrading consistency can result in stale reads or lost writes. This policy should only be used when availability is prioritized over consistency, and the application can tolerate eventual consistency.
Error Categories¶
Read Timeout¶
Coordinator didn't receive enough responses in time:
ReadTimeoutException:
    consistency: QUORUM
    required: 2
    received: 1
    data_retrieved: false
Interpretation:
- 2 responses needed for QUORUM
- Only 1 replica responded
- No data was retrieved
Retry strategy:
- If data_retrieved=true: safe to retry on the same host
- If data_retrieved=false: retry may not help
Write Timeout¶
Coordinator didn't receive enough acknowledgments:
WriteTimeoutException:
    consistency: QUORUM
    required: 2
    received: 1
    write_type: SIMPLE
Write types:
- SIMPLE - Single-partition write
- BATCH - Atomic batch
- BATCH_LOG - Batch log write
- UNLOGGED_BATCH - Non-atomic batch
- COUNTER - Counter update
- CAS - Compare-and-set (LWT)
Retry strategy by write_type:
| Write Type | Safe to Retry? | Reason |
|---|---|---|
| BATCH_LOG | Yes | Batch logged, will complete |
| SIMPLE | Maybe* | Depends on idempotency |
| BATCH | Maybe* | Depends on idempotency |
| UNLOGGED_BATCH | Maybe* | Some writes may have succeeded |
| COUNTER | No | Non-idempotent |
| CAS | No | May have succeeded |
*Idempotent operations only
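The table condenses to a small helper; a sketch using the write_type strings listed above:

def safe_to_retry(write_type, is_idempotent):
    """Decide whether a timed-out write can be retried safely."""
    if write_type == "BATCH_LOG":
        return True  # the batch log guarantees eventual completion
    if write_type in ("SIMPLE", "BATCH", "UNLOGGED_BATCH"):
        return is_idempotent  # safe only if the writes are idempotent
    return False  # COUNTER and CAS may already have been applied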
Unavailable Exception¶
Not enough replicas alive to satisfy consistency:
UnavailableException:
    consistency: QUORUM
    required: 2
    alive: 1
Interpretation:
- Cluster knows only 1 replica is up
- Can't attempt the operation
- Request never sent to replicas
Retry strategy:
- Retry won't help unless the topology changes
- Consider downgrading consistency
- May indicate a larger cluster issue
Request Error¶
Connection-level or protocol-level failures:
| Error | Retry Appropriate? |
|---|---|
| Connection closed | Yes, different node |
| Protocol error | No (bug) |
| Server error | Maybe, different node |
| Overloaded | Yes, with backoff |
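A retry policy can encode these rules via on_request_error. The sketch below uses the conceptual interface from earlier; ConnectionClosed and Overloaded are hypothetical error classes standing in for the driver's real exception types.

import time

class RequestErrorRetryPolicy(RetryPolicy):
    def on_request_error(self, statement, consistency, exception, retry_number):
        if retry_number >= 2:
            return RetryDecision.RETHROW
        if isinstance(exception, ConnectionClosed):  # hypothetical error class
            return RetryDecision.RETRY_NEXT_HOST
        if isinstance(exception, Overloaded):        # hypothetical error class
            # Simple backoff; a real driver would schedule the retry
            # asynchronously instead of blocking here.
            time.sleep(0.1 * 2 ** retry_number)
            return RetryDecision.RETRY_NEXT_HOST
        return RetryDecision.RETHROW  # protocol/server bugs: don't retry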
Speculative Execution¶
Concept¶
Speculative execution sends redundant requests to reduce tail latency. The driver sends a request to one node; if no response arrives within a threshold, it sends the same request to the next node in the query plan while the original remains pending. Whichever response arrives first is used, and the other is discarded.
Benefits¶
Speculative execution helps when:
- Occasional nodes respond slowly
- Network hiccups delay responses
- GC pauses stall individual nodes
- Load is unevenly distributed
Non-Idempotent Operations
Speculative execution should only be used for idempotent operations. Non-idempotent writes may be executed multiple times, causing data inconsistency.
Configuration¶
# Conceptual speculative execution policy
class SpeculativeExecutionPolicy:
    def new_plan(self, keyspace, statement):
        """Return when to start speculative requests"""
        return SpeculativePlan(
            delay_ms=50,        # Wait 50 ms before speculating
            max_speculative=2,  # At most 2 speculative requests
        )
Speculative Policies¶
Constant Delay:
- Start speculative requests after a fixed delay
- Example: 50 ms or 100 ms thresholds

Percentile-Based:
- Start speculative requests at the observed p99 latency
- Adapts to observed performance
- Requires latency tracking

No Speculation:
- Never send speculative requests
- Simplest, safest option
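In the DataStax Python driver, the constant-delay variant is built in and is configured per execution profile; the driver only speculates on statements marked idempotent. A sketch (keyspace, table, and values are placeholders):

from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
from cassandra.policies import ConstantSpeculativeExecutionPolicy
from cassandra.query import SimpleStatement

profile = ExecutionProfile(
    speculative_execution_policy=ConstantSpeculativeExecutionPolicy(
        delay=0.05,      # start a speculative request after 50 ms
        max_attempts=2,  # at most 2 speculative requests
    ),
)
cluster = Cluster(execution_profiles={EXEC_PROFILE_DEFAULT: profile})
session = cluster.connect("my_keyspace")

# Only idempotent statements are eligible for speculation
query = SimpleStatement("SELECT * FROM users WHERE id = %s", is_idempotent=True)
rows = session.execute(query, [42])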
Interaction with Retries¶
Speculative execution and retries are different:
| Aspect | Retry | Speculative |
|---|---|---|
| Trigger | Error received | Timeout threshold |
| Original request | Abandoned | Still pending |
| Goal | Error recovery | Latency reduction |
| Request count | Serial | Parallel |
Idempotency¶
Why Idempotency Matters¶
Non-idempotent operations may cause problems when retried:
Non-idempotent:
- counter += 1 → retrying doubles the increment
- INSERT ... IF NOT EXISTS → a retry may fail unexpectedly

Idempotent:
- SET value = 5 → retry is safe
- DELETE WHERE ... → retry is safe
Design for Idempotency
Design write operations to be idempotent whenever possible. Use absolute values (SET x = 5) rather than increments (SET x = x + 1) to enable safe retries.
Driver Idempotency Tracking¶
Drivers can track idempotency:
# Mark a statement as idempotent
# (simple statements use %s placeholders in the Python driver; prepared use ?)
from cassandra.query import SimpleStatement

statement = SimpleStatement(
    "UPDATE users SET name = %s WHERE id = %s",
    is_idempotent=True,
)

# Prepared statements carry the flag as a default for all executions
prepared = session.prepare("UPDATE users SET name = ? WHERE id = ?")
prepared.is_idempotent = True
Idempotency-Aware Retry¶
class IdempotentAwareRetryPolicy:
    def on_write_timeout(self, statement, consistency, write_type,
                         required, received, retry_number):
        if statement.is_idempotent:
            return RetryDecision.RETRY_NEXT_HOST
        return RetryDecision.RETHROW
Making Operations Idempotent¶
| Operation | Idempotent? | Make Idempotent |
|---|---|---|
| INSERT | Yes* | Use fixed values |
| UPDATE SET x = 5 | Yes | N/A |
| UPDATE SET x = x + 1 | No | Use LWT or external tracking |
| DELETE | Yes | N/A |
| Counter update | No | Not easily possible |
| LWT (IF...) | No | Application must handle |
*INSERT with same PK is idempotent (upsert behavior)
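For the increment case, one option from the table is an LWT guard: read the current value, then write the new absolute value conditionally. A sketch below; the accounts table and its version column are assumptions added for this illustration.

# Idempotent alternative to "SET balance = balance + ?": conditional absolute write
row = session.execute(
    "SELECT balance, version FROM accounts WHERE id = %s", [account_id]).one()

result = session.execute(
    "UPDATE accounts SET balance = %s, version = %s "
    "WHERE id = %s IF version = %s",
    [row.balance + amount, row.version + 1, account_id, row.version])

if not result.was_applied:
    # Another writer got there first: re-read and try again (or give up)
    ...

A retry of this UPDATE after a timeout either applies cleanly or fails the IF check, so the outcome is always detectable.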
Circuit Breakers¶
Node-Level Circuit Breaker¶
Prevent overwhelming failing nodes: after repeated failures, the breaker stops routing requests to a node for a cooldown period, then lets a single test request through to probe recovery.
Implementation¶
import time

class NodeCircuitBreaker:
    def __init__(self, failure_threshold=5, open_duration=30):
        self.failures = 0
        self.state = "CLOSED"
        self.last_failure = None
        self.failure_threshold = failure_threshold
        self.open_duration = open_duration  # seconds to stay open

    def record_success(self):
        self.failures = 0
        self.state = "CLOSED"

    def record_failure(self):
        self.failures += 1
        self.last_failure = time.monotonic()
        if self.failures >= self.failure_threshold:
            self.state = "OPEN"

    def should_try(self):
        if self.state == "CLOSED":
            return True
        if self.state == "OPEN":
            if time.monotonic() - self.last_failure > self.open_duration:
                self.state = "HALF_OPEN"
                return True
            return False
        return True  # HALF_OPEN - allow a test request
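Usage wraps each request to a node; send_request and next_node are hypothetical stand-ins for the driver's transport layer and query plan:

breaker = NodeCircuitBreaker()

if breaker.should_try():
    try:
        result = send_request(node)   # hypothetical transport call
        breaker.record_success()
    except ConnectionError:
        breaker.record_failure()      # enough failures open the breaker
else:
    result = send_request(next_node)  # breaker open: use another node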
Handling Specific Failures¶
Connection Failures¶
Connection reset by peer:
1. Mark connection dead
2. Remove from pool
3. Retry on different connection
4. Schedule reconnection
Coordinator Failures¶
Coordinator crashes mid-request:
1. Connection closes
2. Request fails with error
3. Retry on different node
4. Idempotent operations safe
Partial Failures¶
Some replicas succeed, others fail:
Write to 3 replicas, 2 succeed:
- Client may see WriteTimeoutException
- But 2 copies exist
- Read at QUORUM will succeed
- Retry may create 4th copy (okay for idempotent)
Lightweight Transaction Failures¶
LWT requires special handling:
CAS operation timeout:
- May have succeeded
- May have failed
- Retry may see "already exists"
- Application must handle all cases
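In practice this means verifying the outcome after a CAS timeout rather than blindly retrying; a sketch using the Python driver's WriteTimeout exception (table and column names are placeholders):

from cassandra import WriteTimeout

try:
    rs = session.execute(
        "INSERT INTO users (id, name) VALUES (%s, %s) IF NOT EXISTS",
        [user_id, name])
    created = rs.was_applied
except WriteTimeout:
    # Outcome unknown: the CAS may or may not have been applied.
    # Re-read to determine the actual state.
    row = session.execute(
        "SELECT name FROM users WHERE id = %s", [user_id]).one()
    created = row is not None and row.name == name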
Timeout Configuration¶
Timeout Hierarchy¶
- Connection timeout: time to establish the TCP connection
- Request timeout: total time allowed for request completion
- Read timeout: time spent waiting for the coordinator's response
Configuration¶
# Typical timeout configuration (cassandra-driver 3.x)
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT

cluster = Cluster(
    connect_timeout=5,  # Connection establishment (seconds)
    execution_profiles={
        EXEC_PROFILE_DEFAULT: ExecutionProfile(request_timeout=12),  # Overall request timeout
    },
)
session = cluster.connect()

# Per-request override for a slow query
rows = session.execute("SELECT * FROM large_table", timeout=60)
Timeout Selection¶
| Workload | Request Timeout | Rationale |
|---|---|---|
| OLTP | 1-5s | Fast failure, retry elsewhere |
| Analytics | 60-300s | Long-running queries |
| Batch load | 30-60s | Large writes |
Monitoring Failure Handling¶
Key Metrics¶
| Metric | Healthy Range | Alert Threshold |
|---|---|---|
| Retry rate | < 1% | > 5% |
| Speculative execution rate | < 5% | > 20% |
| Circuit breaker opens | 0 | Any |
| Timeout rate | < 0.1% | > 1% |
Diagnostic Information¶
# Track failure handling stats
from collections import Counter

class FailureMetrics:
    def __init__(self):
        self.retries = Counter()         # keyed by (error_type, node)
        self.speculative = Counter()     # keyed by node
        self.circuit_opens = Counter()   # keyed by node
        self.errors_by_type = Counter()  # keyed by error type

    def on_retry(self, error_type, node):
        self.retries[(error_type, node)] += 1
        self.errors_by_type[error_type] += 1

    def on_speculative(self, node):
        self.speculative[node] += 1
Best Practices¶
Retry Policy Selection¶
| Use Case | Recommended Policy |
|---|---|
| General production | Default policy |
| Strict consistency | Fallthrough (handle in app) |
| High availability | Downgrading (with caution) |
| Idempotent workload | Aggressive retry |
Error Handling in Application¶
import logging

from cassandra import Unavailable, WriteTimeout

logger = logging.getLogger(__name__)

try:
    session.execute(statement)
except WriteTimeout as e:
    if statement.is_idempotent:
        logger.warning(f"Write timeout, may have succeeded: {e}")
        # Retry or verify
    else:
        logger.error(f"Write timeout, state unknown: {e}")
        # Manual intervention may be needed
except Unavailable as e:
    logger.error(f"Cluster unhealthy: {e}")
    # Alert operations team
Defensive Programming¶
- Mark idempotency explicitly - Don't rely on inference
- Set appropriate timeouts - Not too short, not too long
- Monitor failure rates - Catch issues early
- Test failure scenarios - Chaos engineering
- Document retry behavior - Operations team awareness
Related Documentation¶
- Load Balancing - Node selection for retries
- Async Connections - Connection management
- CQL Protocol - Error codes and handling