Playbook: Replace a Dead Node¶

This playbook provides step-by-step instructions for replacing a failed Cassandra node that cannot be recovered.

Overview¶

Attribute	Value
Estimated Duration	30 minutes - 4 hours (depends on data size)
Risk Level	Medium
Requires Downtime	No (cluster remains available)
Prerequisites	Replacement hardware ready, network accessible

When to Use This Playbook¶

Node hardware has failed and cannot be recovered
Node has been down longer than max_hint_window_in_ms (default: 3 hours)
Corrupted data that cannot be repaired
Decommissioning and replacing simultaneously

Prerequisites¶

[ ] Replacement server provisioned with same specifications
[ ] Cassandra installed (same version as cluster)
[ ] Network connectivity verified to all cluster nodes
[ ] Firewall rules configured (ports 7000, 7001, 9042, 7199)
[ ] NTP synchronized
[ ] Sufficient disk space (at least equal to failed node)

Step 1: Confirm Node Status¶

1.1 Verify Node is Down¶

# On any live node
nodetool status

Expected output: Dead node shows as DN (Down/Normal):

Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load       Tokens  Owns   Host ID                               Rack
UN  10.0.0.1     256.12 KiB  256     ?      550e8400-e29b-41d4-a716-446655440000  rack1
DN  10.0.0.2     267.89 KiB  256     ?      660e8400-e29b-41d4-a716-446655440001  rack1  <-- Dead node
UN  10.0.0.3     245.34 KiB  256     ?      770e8400-e29b-41d4-a716-446655440002  rack1

1.2 Record Dead Node Information¶

Critical: Note these values from the dead node:

# Get Host ID of dead node
nodetool status | grep DN
# Record: Host ID = 660e8400-e29b-41d4-a716-446655440001

# Get IP address
# Record: IP = 10.0.0.2

1.3 Check Hints¶

# Check if hints exist for the dead node
nodetool getendpoints my_keyspace my_table some_partition_key

Step 2: Prepare Replacement Node¶

2.1 Install Cassandra¶

# On replacement node (10.0.0.4)
sudo apt-get update
sudo apt-get install cassandra
# OR
sudo yum install cassandra

2.2 Verify Version Match¶

# On replacement node
cassandra -v

# On existing node
nodetool version

Both must match.

2.3 Clear Data Directories¶

# On replacement node - ensure clean state
sudo rm -rf /var/lib/cassandra/data/*
sudo rm -rf /var/lib/cassandra/commitlog/*
sudo rm -rf /var/lib/cassandra/saved_caches/*
sudo rm -rf /var/lib/cassandra/hints/*

Step 3: Configure Replacement Node¶

3.1 Edit cassandra.yaml¶

sudo vi /etc/cassandra/cassandra.yaml

Critical settings (must match cluster, except IPs):

cluster_name: 'ProductionCluster'  # Must match exactly!

# Set to replacement node's IP
listen_address: 10.0.0.4
rpc_address: 10.0.0.4

# Same seeds as cluster (do not include dead node)
seed_provider:
  - class_name: org.apache.cassandra.locator.SimpleSeedProvider
    parameters:
      - seeds: "10.0.0.1,10.0.0.3"

# Same snitch as cluster
endpoint_snitch: GossipingPropertyFileSnitch

3.2 Configure Rack/DC¶

sudo vi /etc/cassandra/cassandra-rackdc.properties

dc=dc1        # Same as dead node
rack=rack1    # Same as dead node

3.3 Set JVM Options¶

Ensure JVM settings match other nodes in jvm.options or jvm11-server.options.

Step 4: Start Replacement with Replace Flag¶

4.1 Set Replace Address¶

Option A: Set in JVM options file

sudo vi /etc/cassandra/jvm-server.options
# Add this line:
-Dcassandra.replace_address_first_boot=10.0.0.2

Option B: Set via environment variable

export JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address_first_boot=10.0.0.2"

4.2 Start Cassandra¶

sudo systemctl start cassandra

4.3 Monitor Bootstrap Progress¶

# Watch the logs
tail -f /var/log/cassandra/system.log

# Check progress
nodetool netstats

# On another node, check status
nodetool status

Expected log messages:

INFO  [main] StorageService.java - Replacing a node with token(s): [-9223372036854775808, ...]
INFO  [main] StorageService.java - Nodes [/10.0.0.2] are marked dead
INFO  [main] Gossiper.java - Replacing /10.0.0.2 with /10.0.0.4

Step 5: Verify Replacement¶

5.1 Check Node Status¶

nodetool status

Expected: New node shows as UN:

Datacenter: dc1
===============
UN  10.0.0.1     256.12 KiB  256     33.3%  550e8400-e29b-41d4-a716-446655440000  rack1
UN  10.0.0.4     267.89 KiB  256     33.3%  880e8400-e29b-41d4-a716-446655440003  rack1  <-- New node
UN  10.0.0.3     245.34 KiB  256     33.4%  770e8400-e29b-41d4-a716-446655440002  rack1

Note: Dead node (10.0.0.2) is no longer listed.

5.2 Verify Data Streaming Complete¶

# On new node
nodetool netstats

Expected: No active streams:

Mode: NORMAL
Not sending any streams.
Not receiving any streams.

5.3 Check Gossip Info¶

nodetool gossipinfo

Verify new node appears with correct STATUS=NORMAL.

Step 6: Post-Replacement Cleanup¶

6.1 Remove Replace Flag¶

Important: Remove the replace flag to prevent issues on restart.

sudo vi /etc/cassandra/jvm-server.options
# Remove or comment out:
# -Dcassandra.replace_address_first_boot=10.0.0.2

6.2 Run Repair on New Node¶

# Run repair to ensure data consistency
nodetool repair -pr

6.3 Update Seed List (if applicable)¶

If the dead node was a seed, update all nodes:

# On all nodes
sudo vi /etc/cassandra/cassandra.yaml
# Update seeds list to exclude dead node, include new node if desired

6.4 Update Monitoring/Alerting¶

Update monitoring to track new node IP
Remove dead node from monitoring
Update documentation

Troubleshooting¶

Bootstrap Hangs¶

Symptom: Node stuck in UJ (Up/Joining) state.

# Check progress
nodetool netstats

# Check for errors
tail -100 /var/log/cassandra/system.log | grep -i error

Solutions: - Wait (large datasets take time) - Check network connectivity to other nodes - Check disk space on new node

Node Shows Wrong Tokens¶

Symptom: Token distribution uneven after replace.

# Check token distribution
nodetool ring

Solution: Run repair, then consider running nodetool cleanup on other nodes.

Streaming Failures¶

Symptom: "Streaming error" in logs.

grep -i "stream" /var/log/cassandra/system.log | tail -50

Solutions: - Check network connectivity - Increase streaming timeout - Restart bootstrap (clear data, try again)

Old Node Reappears¶

Symptom: Dead node shows up again after replacement.

# Remove via assassinate (use with caution)
nodetool assassinate 10.0.0.2

Rollback Procedure¶

If replacement fails and recovery is needed:

Stop the replacement node:
```
sudo systemctl stop cassandra
```

If original node can be recovered, bring it back:

# On original node
sudo systemctl start cassandra

Clear replacement node data:
```
sudo rm -rf /var/lib/cassandra/data/*
```
Run repair on recovered node:
```
nodetool repair
```

Checklist Summary¶

[ ] Confirmed node is dead (DN status)
[ ] Recorded dead node Host ID and IP
[ ] Prepared replacement node hardware
[ ] Installed matching Cassandra version
[ ] Configured cassandra.yaml with replace flag
[ ] Configured rack/DC properties
[ ] Started node and monitored bootstrap
[ ] Verified UN status and data streaming complete
[ ] Removed replace flag from configuration
[ ] Ran repair on new node
[ ] Updated seed list if needed
[ ] Updated monitoring and documentation