Node Replacement¶
This section examines the architectural considerations for handling failed nodes in a Cassandra cluster. Understanding failure classification, recovery strategies, and the mechanics of node replacement is essential for designing resilient cluster topologies.
For operational procedures and step-by-step instructions, see Operations: Cluster Management.
Failure Classification¶
Types of Node Failures¶
Node failures fall into distinct categories, each requiring different recovery approaches:
| Failure Type | Characteristics | Recovery Approach |
|---|---|---|
| Transient | Node temporarily unavailable, data intact | Wait for restart, hints replay |
| Recoverable | Node can restart, possible data issues | Restart + repair |
| Permanent | Hardware failed, node cannot return | Replace or remove |
Failure Detection¶
Cassandra detects node failures through the gossip protocol's Phi Accrual Failure Detector. When a node stops responding to gossip messages, other nodes independently calculate a suspicion level (φ). Once φ exceeds the configured threshold (default: 8), the node is marked DOWN locally.
Key characteristics:

- Local determination: Each node independently decides whether a peer is DOWN
- Not gossiped: DOWN status is not propagated; each node must observe the failure directly
- Adaptive: Phi Accrual adjusts to network latency variations
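As a rough illustration of the mechanism, the sketch below computes φ from observed heartbeat inter-arrival times using the exponential-distribution approximation; the class and method names are illustrative and do not correspond to Cassandra's actual FailureDetector API.

```python
import math
import time

PHI_THRESHOLD = 8.0  # corresponds to the default conviction threshold

class PhiAccrualSketch:
    """Illustrative phi calculation from a sliding window of heartbeat intervals."""

    def __init__(self, window: int = 1000):
        self.intervals: list[float] = []      # recent inter-arrival times in seconds
        self.last_heartbeat: float | None = None
        self.window = window

    def heartbeat(self, now: float | None = None) -> None:
        now = time.monotonic() if now is None else now
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
            self.intervals = self.intervals[-self.window:]
        self.last_heartbeat = now

    def phi(self, now: float | None = None) -> float:
        # phi = -log10(P(silence lasts this long)), assuming exponentially
        # distributed inter-arrival times, i.e. phi = elapsed / (mean * ln 10)
        now = time.monotonic() if now is None else now
        if not self.intervals or self.last_heartbeat is None:
            return 0.0
        mean_interval = sum(self.intervals) / len(self.intervals)
        return (now - self.last_heartbeat) / (mean_interval * math.log(10))

    def is_convicted(self, now: float | None = None) -> bool:
        return self.phi(now) > PHI_THRESHOLD
```

Because each node maintains its own window of intervals, conviction is an inherently local decision, and sustained network jitter inflates the mean interval rather than immediately producing false convictions.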
See Gossip Protocol for failure detection mechanics.
Decision Framework¶
Replace vs Remove¶
When a node permanently fails, the architectural decision between replacement and removal affects cluster capacity, data distribution, and recovery time:
Comparison of Approaches¶
| Factor | Replace | Remove |
|---|---|---|
| Cluster capacity | Maintained | Reduced |
| Data movement | Stream to new node only | Redistribute across all remaining nodes |
| Token ownership | New node assumes dead node's tokens | Tokens redistributed to existing nodes |
| Disk usage impact | Isolated to new node | Increased on all remaining nodes |
| Time to complete | Longer (full data stream to one node) | Shorter (parallel redistribution) |
| Network impact | Concentrated streaming | Distributed streaming |
Capacity Constraints¶
Before choosing an approach, verify capacity constraints:
| Constraint | Replace | Remove |
|---|---|---|
| Minimum nodes | N ≥ RF after operation | N - 1 ≥ RF after operation |
| Disk headroom | New node needs capacity for its share | Remaining nodes need capacity for redistributed data |
| Network bandwidth | Streaming to single node | Parallel streaming between remaining nodes |
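A minimal sketch of these constraint checks follows; the function and parameter names are illustrative, and the even-redistribution assumption is a simplification.

```python
def can_replace(cluster_size: int, rf: int,
                new_node_free_bytes: int, dead_node_data_bytes: int) -> bool:
    # Replacement restores the original node count; only the new node
    # needs headroom for the dead node's full share of data.
    return cluster_size >= rf and new_node_free_bytes > dead_node_data_bytes


def can_remove(cluster_size: int, rf: int,
               survivor_free_bytes: list[int], dead_node_data_bytes: int) -> bool:
    # Removal shrinks the cluster by one node permanently.
    if cluster_size - 1 < rf:
        return False
    # Assume the dead node's data redistributes roughly evenly across survivors.
    extra_share = dead_node_data_bytes / (cluster_size - 1)
    return all(free > extra_share for free in survivor_free_bytes)
```

If neither check passes, the cluster needs additional capacity before either approach is safe.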
Replacement Architecture¶
Token Assumption Model¶
Node replacement operates on the principle of token assumption—the new node takes ownership of the failed node's token ranges without redistributing tokens across the cluster.
Advantages of token assumption:

- No token recalculation required
- Other nodes unaffected
- Predictable data movement
- Faster recovery than full rebalance
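A minimal sketch of the idea, using a simplified one-token-per-node ring (real clusters typically assign many vnodes per node); the addresses and token values are placeholders.

```python
# Simplified ring: one token per node.
ring = {"10.0.0.1": 0, "10.0.0.2": 100, "10.0.0.3": 200}

def replace_node(ring: dict, dead: str, replacement: str) -> dict:
    """Token assumption: the replacement inherits the dead node's token(s);
    no other node's ownership changes."""
    new_ring = dict(ring)
    new_ring[replacement] = new_ring.pop(dead)
    return new_ring

print(replace_node(ring, "10.0.0.2", "10.0.0.4"))
# {'10.0.0.1': 0, '10.0.0.3': 200, '10.0.0.4': 100}
```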
Replacement Process Flow¶
Replace Address Mechanism¶
The replace_address_first_boot JVM option instructs a new node to assume a dead node's identity:
| Option | Behavior | Recommendation |
|---|---|---|
| replace_address_first_boot | Cleared after successful first boot | Preferred—prevents accidental re-replacement |
| replace_address | Persists across restarts | Legacy—can cause issues on restart |
Architectural behavior:

1. New node contacts seeds with replacement intent
2. Cluster validates that the dead node is actually DOWN
3. New node receives the dead node's token assignments
4. Streaming begins from surviving replicas
5. Upon completion, the new node announces NORMAL status
6. The dead node's gossip state is eventually purged
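The replacement intent in step 1 is conveyed at startup through the cassandra.replace_address_first_boot system property. The small sketch below only assembles that flag; the address shown is a placeholder and must always be the dead node's address.

```python
def replacement_startup_flag(dead_node_address: str) -> str:
    # The value is always the *dead* node's address, regardless of the
    # new node's own IP (see the same-IP vs different-IP table below).
    return f"-Dcassandra.replace_address_first_boot={dead_node_address}"

print(replacement_startup_flag("10.0.0.2"))
# -Dcassandra.replace_address_first_boot=10.0.0.2
```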
Same-IP vs Different-IP Replacement¶
| Scenario | Token Handling | Host ID | Configuration |
|---|---|---|---|
| Same IP | Assumed from dead node | New ID generated | replace_address_first_boot with same IP |
| Different IP | Assumed from dead node | New ID generated | replace_address_first_boot with dead node's IP |
Both scenarios require the replace_address_first_boot option—the IP in the option always refers to the dead node's address, regardless of the new node's IP.
Removal Architecture¶
Token Redistribution Model¶
Node removal triggers token redistribution—the dead node's token ranges are reassigned to remaining nodes.
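The contrast with token assumption can be seen on the same simplified one-token-per-node ring used above; addresses and tokens are placeholders.

```python
# Simplified single-token ring; with vnodes, ownership is spread over many small ranges.
ring = {"10.0.0.1": 0, "10.0.0.2": 100, "10.0.0.3": 200}

def owned_ranges(ring: dict) -> dict:
    """Each node owns the range (previous token, its own token], wrapping at the ends."""
    ordered = sorted(ring.items(), key=lambda kv: kv[1])
    return {node: (ordered[i - 1][1], token) for i, (node, token) in enumerate(ordered)}

before = owned_ranges(ring)
after = owned_ranges({node: t for node, t in ring.items() if node != "10.0.0.2"})
# Before: 10.0.0.2 owns (0, 100]. After removal, its successor 10.0.0.3 owns (0, 200],
# absorbing that range; with vnodes, the absorbed load spreads across several survivors.
```

This is why removal increases disk usage on the remaining nodes, whereas replacement isolates the impact to the new node.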
Removal vs Decommission¶
| Aspect | removenode | decommission |
|---|---|---|
| Precondition | Node is DOWN | Node is UP and operational |
| Initiated from | Any live node | The departing node itself |
| Data source | Remaining replicas stream to each other | Departing node streams its data out |
| Coordination | Distributed among remaining nodes | Centralized on departing node |
| Use case | Unplanned failure | Planned capacity reduction |
Assassinate Operation¶
The assassinate operation forcibly removes a node from gossip state without data redistribution:
| Characteristic | Description |
|---|---|
| Purpose | Emergency removal of stuck nodes |
| Data handling | None—no streaming occurs |
| Risk | Potential data loss if replicas insufficient |
| Post-action | Full repair required on all nodes |
Use only when:
- Node is unresponsive but not fully DOWN
- removenode fails or hangs indefinitely
- Emergency cluster recovery is required
Multiple Failure Scenarios¶
Concurrent Failure Analysis¶
When multiple nodes fail, data availability depends on the relationship between failures and replication:
Quorum Impact Matrix¶
| Failed Replicas of a Range | RF=3 Availability | Recovery Strategy |
|---|---|---|
| 1 | QUORUM available (2 of 3) | Standard replacement |
| 2 | Only ONE available | Urgent, sequential replacement |
| 3 | Data unavailable for that range | Backup restoration required |
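A small sketch of this matrix; the helper names are illustrative.

```python
def quorum(rf: int) -> int:
    return rf // 2 + 1

def strongest_available_cl(rf: int, failed_replicas: int) -> str:
    """Strongest consistency level still achievable for a range that has
    lost `failed_replicas` of its `rf` replicas."""
    up = max(rf - failed_replicas, 0)
    if up == rf:
        return "ALL"
    if up >= quorum(rf):
        return "QUORUM"
    if up >= 1:
        return "ONE"
    return "UNAVAILABLE"

for failed in (1, 2, 3):
    print(failed, strongest_available_cl(3, failed))
# 1 QUORUM, 2 ONE, 3 UNAVAILABLE (matching the RF=3 matrix above)
```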
Recovery Ordering¶
When multiple nodes require replacement, sequence the recovery as follows (a sketch of this ordering follows the list):

1. Assess scope: Identify which token ranges are affected
2. Prioritize: Replace nodes affecting the most critical ranges first
3. Sequential execution: Complete one replacement before starting the next
4. Repair between: Run repair after each replacement to ensure consistency
5. Verify coverage: Confirm all ranges have sufficient replicas before proceeding
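A minimal sketch of that ordering, where the replace, repair, and coverage-check callables are placeholders for whatever tooling performs those steps.

```python
from typing import Callable

def recover(failed_nodes: list[str],
            criticality: dict[str, int],
            replace: Callable[[str], None],
            repair: Callable[[], None],
            has_sufficient_replicas: Callable[[], bool]) -> None:
    """Sequential recovery: most critical ranges first, one replacement at a time,
    repair between replacements, verify coverage before continuing."""
    for node in sorted(failed_nodes, key=lambda n: criticality[n], reverse=True):
        replace(node)        # complete one replacement before starting the next
        repair()             # reconcile replicas touched by this replacement
        if not has_sufficient_replicas():
            raise RuntimeError("insufficient replica coverage; stop and investigate")
```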
Data Consistency Considerations¶
Streaming During Replacement¶
During replacement, the new node receives data from surviving replicas:
| Source Selection | Criteria |
|---|---|
| Prefer local DC | Minimize cross-datacenter traffic |
| Prefer least loaded | Distribute streaming impact |
| Avoid concurrent streamers | Prevent resource contention |
Post-Replacement Consistency¶
After replacement completes, data consistency may require repair:
| Scenario | Repair Recommendation |
|---|---|
| Short downtime (< hint window) | Hints cover gap; repair optional |
| Extended downtime (> hint window) | Repair required for missed writes |
| Multiple failures | Full repair recommended |
| Consistency-critical data | Always run repair |
The hint window (default: 3 hours) determines whether hinted handoff can cover writes during downtime. Beyond this window, hints expire and repair becomes necessary.
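As a minimal sketch of that decision (the three-hour default corresponds to the hint window setting, max_hint_window_in_ms in older releases; the function and parameter names are illustrative):

```python
from datetime import datetime, timedelta

HINT_WINDOW = timedelta(hours=3)   # default hint window

def repair_required(down_since: datetime, restored_at: datetime,
                    multiple_failures: bool = False,
                    consistency_critical: bool = False) -> bool:
    # Hints only cover writes made while the node was down for less than the
    # hint window; beyond it, or in the stricter cases above, run repair.
    exceeded_window = (restored_at - down_since) > HINT_WINDOW
    return exceeded_window or multiple_failures or consistency_critical
```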
Related Documentation¶
- Node Lifecycle - Bootstrap, decommission, and state transitions
- Gossip Protocol - Failure detection mechanics
- Data Streaming - Streaming architecture during replacement
- Scaling Operations - Capacity planning after node removal
- Fault Tolerance - Failure scenarios and recovery patterns
- Operations: Cluster Management - Step-by-step procedures