Reconnection Policy¶
The reconnection policy controls how the driver attempts to reconnect to nodes after connection failures. This policy affects recovery time after node failures and network partitions.
When Reconnection Occurs¶
The driver initiates reconnection when:
| Trigger | Description |
|---|---|
| Node marked DOWN | All connections to node failed |
| Connection pool depleted | All connections closed due to errors |
| STATUS_CHANGE event | Cassandra reports node status change |
| Startup discovery | Initial connection to contact points |
Reconnection Strategies¶
Constant Delay¶
The driver retries at a fixed interval, regardless of how many attempts have already failed:

```
Constant Reconnection (delay = 5s):

Time:     0s        5s        10s       15s       20s
          │         │         │         │         │
          ▼         ▼         ▼         ▼         ▼
       Attempt   Attempt   Attempt   Attempt   Attempt
          1         2         3         4         5
```
| Advantage | Disadvantage |
|---|---|
| Predictable timing | Too slow for short outages if the delay is long |
| Simple to understand | Too aggressive during long outages if the delay is short |
Configuration:
```java
// Java driver 3.x (in 4.x this is set in the driver configuration file);
// the delay is specified in milliseconds
import com.datastax.driver.core.policies.ConstantReconnectionPolicy;

ConstantReconnectionPolicy policy = new ConstantReconnectionPolicy(5000); // 5s
```
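The Python driver has an equivalent policy; `ConstantReconnectionPolicy` lives in `cassandra.policies` and takes its delay in seconds:

```python
# Python driver: delay is given in seconds
from cassandra.policies import ConstantReconnectionPolicy

policy = ConstantReconnectionPolicy(delay=5.0)
```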
Exponential Backoff (Recommended)¶
Starts with a short delay and doubles it after each failed attempt, up to a configured maximum:

```
Exponential Reconnection (base = 1s, max = 60s):

Attempt:       1    2    3    4     5     6     7
Delay before:  1s   2s   4s   8s    16s   32s   60s (capped)
Attempt at:    1s   3s   7s   15s   31s   63s   123s
```

Times are measured from the initial failure at 0s; attempt 7 fires at 123s because the uncapped 64s delay is clamped to the 60s maximum, and all later attempts stay at 60s intervals.
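The schedule follows `delay(n) = min(base × 2^(n−1), max)`. A minimal Python sketch (ignoring any jitter a particular driver implementation may add) reproduces the numbers above:

```python
def exponential_schedule(base=1.0, max_delay=60.0, attempts=7):
    """Yield the delay (seconds) before each reconnection attempt."""
    for n in range(attempts):
        yield min(base * (2 ** n), max_delay)

print(list(exponential_schedule()))  # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0]
```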
| Advantage | Disadvantage |
|---|---|
| Fast recovery from transient issues | Slower recovery from sustained outages (may wait up to the max delay after the outage ends) |
| Reduces load during prolonged outages | More complex to reason about |
| Adapts to outage duration | |
Configuration:
```java
// Java driver 3.x (in 4.x this is set in the driver configuration file);
// delays are specified in milliseconds
import com.datastax.driver.core.policies.ExponentialReconnectionPolicy;

ExponentialReconnectionPolicy policy =
    new ExponentialReconnectionPolicy(1000, 300000); // 1s base, 5min max
```
```python
# Python driver: delays are given in seconds
from cassandra.policies import ExponentialReconnectionPolicy

policy = ExponentialReconnectionPolicy(base_delay=1.0, max_delay=300.0)
```
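The policy takes effect only if it is passed to the `Cluster` when it is built; the contact point below is a placeholder:

```python
from cassandra.cluster import Cluster
from cassandra.policies import ExponentialReconnectionPolicy

cluster = Cluster(
    contact_points=["127.0.0.1"],  # placeholder contact point
    reconnection_policy=ExponentialReconnectionPolicy(base_delay=1.0,
                                                      max_delay=300.0),
)
session = cluster.connect()
```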
Reconnection in Common Scenarios¶
Node Restart¶
```
Single Node Restart (exponential backoff, base = 1s):

Time   Node State   Driver State          Reconnection
────────────────────────────────────────────────────────────────
0s     Running      UP, connected         -
5s     Stopping     Connection fails      Marks DOWN
6s     Stopped      -                     Attempt 1 (fail)
7s     Stopped      -                     Attempt 2 after 1s wait (fail)
9s     Starting     -                     Attempt 3 after 2s wait (fail)
12s    Ready        -                     -
13s    Ready        Attempt 4 succeeds    Marks UP (after 4s wait)
14s    Ready        Connections pooled    Available
```
Rolling Restart¶
During rolling restarts, multiple nodes cycle through DOWN/UP states:
```
Rolling Restart (3 nodes, exponential backoff):

Time   Node1        Node2        Node3        Notes
─────────────────────────────────────────────────────────
0s     UP           UP           UP           Normal operation
5s     restarting   UP           UP           Node1 DOWN, reconnecting
25s    UP           UP           UP           Node1 recovered
30s    UP           restarting   UP           Node2 DOWN, reconnecting
50s    UP           UP           UP           Node2 recovered
55s    UP           UP           restarting   Node3 DOWN, reconnecting
75s    UP           UP           UP           All nodes recovered
```
The backoff schedule is tracked per node and starts over once a node comes back up, so each node recovers independently.
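In the Python driver this per-outage reset falls out of the `ReconnectionPolicy` contract: the driver requests a fresh schedule each time it begins reconnecting to a node. A small wrapper (hypothetical, purely for illustration) makes the reset visible:

```python
from cassandra.policies import ReconnectionPolicy

class LoggingReconnectionPolicy(ReconnectionPolicy):
    """Delegates to an inner policy, logging each fresh schedule."""

    def __init__(self, inner):
        self.inner = inner

    def new_schedule(self):
        # Called when the driver starts reconnecting to a node,
        # so every outage begins again from the base delay.
        print("node DOWN: starting a fresh backoff schedule")
        return self.inner.new_schedule()
```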
Network Partition¶
```
Network Partition (application isolated from the DC):

Time   Network State      Driver State         Impact
────────────────────────────────────────────────────────────
0s     Connected          All nodes UP         Normal
5s     Partition starts   Connections fail     All nodes → DOWN
6s     Partitioned        Reconnect attempts   All fail
...    Partitioned        Backoff increasing   No requests possible
60s    Partition heals    Attempts succeed     Nodes → UP
61s    Connected          Connections pool     Normal operation
```
Recovery time after the partition heals depends on:
- Where each node's backoff schedule stands at the moment the partition heals (worst case, one full max delay elapses before the next attempt), as the sketch below illustrates
- The number of nodes (reconnection to different nodes proceeds in parallel)
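A quick way to quantify the "schedule at heal time" effect, reusing the base = 1s / max = 60s schedule from earlier (a sketch, not driver code; the heal time is an assumption):

```python
def attempt_times(base=1.0, max_delay=60.0, horizon=300.0):
    """Times (seconds from the first failure) at which attempts fire."""
    t, n = 0.0, 0
    while t <= horizon:
        t += min(base * (2 ** n), max_delay)
        n += 1
        yield t

heal = 60.0  # assume the partition heals 60s after the first failure
first_success = next(t for t in attempt_times() if t >= heal)
print(first_success)  # 63.0 -> about 3s of extra delay after the heal
```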
Connection vs Node Reconnection¶
Drivers distinguish between:
| Level | Trigger | Policy Used |
|---|---|---|
| Connection reconnection | Single connection fails | Usually immediate retry, then pool management |
| Node reconnection | All connections to node fail | Reconnection policy |
Connection Failure Handling:

```
Single connection fails:
│
├─► Other connections in pool still work
│     ├─► Node stays UP
│     └─► Driver opens a new connection (pool management)
│
└─► All connections fail
      └─► Node marked DOWN
            └─► Reconnection policy activated
```
Reconnection and Load Balancing Interaction¶
While a node is DOWN and reconnecting:
- Load balancing excludes node — No requests routed there
- Requests go to other nodes — May increase their load
- Successful reconnect — Node immediately available to load balancer
Load Distribution During Reconnection:

```
Normal (3 nodes):
  Node1: 33% of requests
  Node2: 33% of requests
  Node3: 33% of requests

Node2 DOWN:
  Node1: 50% of requests (+17%)
  Node2:  0% of requests (reconnecting)
  Node3: 50% of requests (+17%)

Node2 recovers:
  Node1: 33% of requests (back to normal)
  Node2: 33% of requests
  Node3: 33% of requests
```
Configuration Recommendations¶
| Scenario | Base Delay | Max Delay | Rationale |
|---|---|---|---|
| Low-latency requirement | 500ms | 30s | Fast recovery for transient issues |
| Standard production | 1s | 5min | Balance between recovery and load |
| Unstable network | 2s | 10min | Reduce reconnection load |
| Maintenance windows | 5s | 15min | Accommodate planned restarts |
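As a sketch of how these recommendations map onto Python driver policies (the dictionary keys are just labels for the table rows, not driver concepts):

```python
from cassandra.policies import ExponentialReconnectionPolicy

# Delays in seconds, mirroring the recommendations table.
RECONNECTION_POLICIES = {
    "low_latency": ExponentialReconnectionPolicy(base_delay=0.5, max_delay=30.0),
    "standard":    ExponentialReconnectionPolicy(base_delay=1.0, max_delay=300.0),
    "unstable":    ExponentialReconnectionPolicy(base_delay=2.0, max_delay=600.0),
    "maintenance": ExponentialReconnectionPolicy(base_delay=5.0, max_delay=900.0),
}
```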
Anti-Patterns¶
| Anti-Pattern | Problem |
|---|---|
| No max delay | Delay grows unboundedly; very slow recovery |
| Very short max delay | Excessive reconnection attempts during outages |
| Same delay for all environments | Dev settings may not suit production |
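To see why an uncapped delay is dangerous, consider what repeated doubling does after a couple dozen attempts (a back-of-the-envelope sketch):

```python
base = 1.0  # seconds
for attempt in (10, 20, 25):
    delay = base * 2 ** (attempt - 1)  # uncapped exponential backoff
    print(f"attempt {attempt}: next wait {delay:.0f}s (~{delay / 3600:.1f}h)")
# attempt 10: next wait 512s (~0.1h)
# attempt 20: next wait 524288s (~145.6h)
# attempt 25: next wait 16777216s (~4660.3h)
```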
Monitoring Reconnection¶
| Metric | Description | Warning Sign |
|---|---|---|
| Reconnection attempts | Count per node | Sustained attempts to single node |
| Reconnection successes | Successful reconnections | Low success rate indicates persistent issue |
| Time in DOWN state | Duration nodes spend unreachable | Prolonged DOWN states |
| Nodes in reconnecting state | Count of nodes currently reconnecting | Many nodes simultaneously |
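The Python driver can feed these metrics through a `HostStateListener`. A minimal sketch that records time spent in the DOWN state per node (the metrics backend is left out, and the contact point is a placeholder):

```python
import time

from cassandra.cluster import Cluster
from cassandra.policies import HostStateListener

class DownTimeTracker(HostStateListener):
    """Records how long each node spends in the DOWN state."""

    def __init__(self):
        self.down_since = {}

    def on_down(self, host):
        self.down_since[host.address] = time.monotonic()

    def on_up(self, host):
        started = self.down_since.pop(host.address, None)
        if started is not None:
            print(f"{host.address} was DOWN for {time.monotonic() - started:.1f}s")

    def on_add(self, host):
        pass

    def on_remove(self, host):
        self.down_since.pop(host.address, None)

cluster = Cluster(contact_points=["127.0.0.1"])  # placeholder contact point
cluster.register_listener(DownTimeTracker())
```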
Related Documentation¶
- Connection Management — Connection lifecycle and pooling
- Load Balancing Policy — How requests are routed during node recovery