nodetool rebuild¶
Rebuilds data on a node by streaming from other datacenters, typically when adding nodes to a new datacenter.
Synopsis¶
nodetool [connection_options] rebuild [options] [source_datacenter]
Description¶
nodetool rebuild streams all data that belongs to this node from another datacenter. This is used when:
- Adding nodes to a new datacenter
- Recovering a node without using bootstrap
- Repopulating a datacenter after total loss
Unlike bootstrap, rebuild does not require the node to be in JOINING state and can be run on a node that's already part of the ring.
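You can confirm the node's state before running rebuild; the first line of nodetool netstats output reports the mode:
# A node in NORMAL mode can be rebuilt; JOINING means it is still bootstrapping
nodetool netstats | grep Mode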
Rebuild Streams from One Replica Only
The rebuild command streams data from a single replica for each token range, not from all replicas. This means:
- Data may be inconsistent if the source replica was not fully up-to-date
- Deleted data (tombstones) that only existed on other replicas will not be streamed
- The rebuilt node may have stale or missing data
Always run nodetool repair after rebuild completes to ensure full consistency with all replicas. The recommended workflow is:
- Run rebuild to quickly populate the node with data
- Run repair to synchronize with all replicas and resolve inconsistencies
This two-step approach is faster than repair alone for large datasets, as rebuild streams entire SSTables while repair performs Merkle tree comparisons.
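As a concrete sketch of the two-step workflow (assuming dc1 is the source datacenter):
# Step 1: bulk-populate the node by streaming SSTable data from dc1
nodetool rebuild dc1
# Step 2: reconcile with all replicas once streaming completes
nodetool repair -pr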
Arguments¶
| Argument | Description |
|---|---|
| source_datacenter | Datacenter to stream data from. If omitted, streams from all DCs |
Options¶
| Option | Description |
|---|---|
| -ks, --keyspace | Specific keyspace to rebuild |
| -ts, --tokens | Specific token ranges to rebuild |
| -s, --sources | Specific source nodes to stream from |
| --mode | Rebuild mode: ALL, NORMAL, or REFETCH |
When to Use¶
Adding New Datacenter¶
When expanding to a new datacenter:
# Step 1: Configure nodes in new DC
# Step 2: Update keyspace RF to include new DC
ALTER KEYSPACE my_keyspace WITH replication = {
'class': 'NetworkTopologyStrategy',
'dc1': 3,
'dc2': 3 -- New DC
};
# Step 3: On each node in new DC, rebuild from existing DC
nodetool rebuild dc1
Node Recovery Without Bootstrap¶
If a node lost data but is still in the ring:
nodetool rebuild
Not a Substitute for Repair
Rebuild streams data from other DCs. For single-DC clusters or to sync from same-DC replicas, use nodetool repair instead.
After Datacenter Recovery¶
After recovering all nodes in a datacenter that was completely down:
# On each recovered node
nodetool rebuild <source_dc>
When NOT to Use¶
Single Datacenter Clusters¶
Requires Multiple DCs
rebuild streams from other datacenters. For single-DC clusters:
# Use repair instead
nodetool repair -pr
Normal Bootstrap Scenarios¶
When adding nodes to an existing DC, use bootstrap (normal node startup) instead:
# Just start the node - bootstrap happens automatically
sudo systemctl start cassandra
While Node is Bootstrapping¶
Don't run rebuild on a node that's currently bootstrapping.
Rebuild Process¶
- New DC node calculates token ranges to receive
- New DC node requests data from Source DC nodes
- Source DC nodes stream SSTables to new DC node
- Once all data is received, new DC node resumes normal operations
Rebuild Behavior
The node streams data for ALL token ranges it owns from the source datacenter.
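To inspect the token ranges this node owns for a keyspace (and the endpoints that can serve them), nodetool describering is useful:
# List token ranges and their replica endpoints for the keyspace
nodetool describering my_keyspace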
Examples¶
Rebuild from Specific Datacenter¶
nodetool rebuild dc1
Streams all data this node should own from dc1.
Rebuild Specific Keyspace¶
nodetool rebuild -ks my_keyspace dc1
Rebuild from All DCs¶
nodetool rebuild
Streams from all available datacenters.
Monitor Progress¶
# Watch streaming progress
nodetool netstats
Multi-DC Expansion Workflow¶
Complete Process¶
# 1. Add nodes to new DC (don't start Cassandra yet)
# 2. Configure cassandra.yaml on new nodes:
# - Same cluster_name
# - Different dc/rack in GossipingPropertyFileSnitch
# 3. Start first node in new DC
sudo systemctl start cassandra
# 4. Update keyspace replication
ALTER KEYSPACE my_keyspace WITH replication = {
'class': 'NetworkTopologyStrategy',
'dc1': 3,
'dc2': 3
};
# 5. Rebuild on first node
nodetool rebuild dc1
# 6. Start remaining nodes in new DC one at a time
# 7. Run rebuild on each after it joins
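Steps 6 and 7 can be scripted once the first node is verified. A minimal sketch, assuming SSH access and hypothetical hostnames (dc2-node2, dc2-node3); confirm each node reaches UN in nodetool status before starting its rebuild:
for host in dc2-node2 dc2-node3; do
  ssh "$host" "sudo systemctl start cassandra"
  # Manually verify the node shows UN in 'nodetool status' before continuing
  read -p "Press Enter when $host is UN to start its rebuild... "
  ssh "$host" "nodetool rebuild dc1"
done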
Verification¶
# Check node status
nodetool status
# Verify data
nodetool tablestats my_keyspace | grep "Space used"
# Run repair to ensure consistency
nodetool repair -pr
Monitoring Rebuild¶
During Rebuild¶
# Streaming progress
nodetool netstats
# Thread pool activity
nodetool tpstats | grep -i stream
Estimated Duration¶
| Data Size | Network | Approximate Time |
|---|---|---|
| 100 GB | 1 Gbps | 15-30 minutes |
| 500 GB | 1 Gbps | 1-2 hours |
| 1 TB | 1 Gbps | 3-5 hours |
| 1 TB | 10 Gbps | 30-60 minutes |
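These figures follow from simple arithmetic, assuming roughly 50-80% effective link utilization after protocol overhead and throttling:
# 1 Gbps ≈ 125 MB/s raw; assume ~80 MB/s effective throughput
# 1 TB ≈ 1,000,000 MB → 1,000,000 / 80 ≈ 12,500 s ≈ 3.5 hours
# At ~55 MB/s effective, the same 1 TB takes about 5 hours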
Logs¶
tail -f /var/log/cassandra/system.log | grep -i rebuild
Common Issues¶
"No such datacenter"¶
ERROR: No such datacenter: dc2
The specified datacenter doesn't exist:
# Check available DCs
nodetool status
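Datacenter names are case-sensitive and must match the snitch configuration exactly, so copy the name verbatim from the output:
# DC names appear on the 'Datacenter:' lines
nodetool status | grep Datacenter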
Rebuild Stuck¶
If rebuild doesn't progress:
- Check streaming: nodetool netstats
- Check that source nodes are healthy: ssh <source_node> "nodetool status"
- Check network connectivity between DCs
- Check throughput settings: nodetool getstreamthroughput and nodetool getinterdcstreamthroughput
Insufficient Disk Space¶
Rebuild requires space for incoming data:
# Check disk space
df -h /var/lib/cassandra
# May need to clear old data or add storage
Rebuild Fails Midway¶
If rebuild fails partway through:
- Check logs for the error cause (see the search sketch below)
- Fix the issue
- Restart rebuild (it will re-stream needed data)
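One way to search the logs for the failure; the exact message text varies by version, so treat the pattern as a starting point:
# Look for recent streaming/rebuild errors in the system log
grep -iE "stream|rebuild" /var/log/cassandra/system.log | grep -i error | tail -20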
Rebuild vs. Other Operations¶
| Operation | Use Case |
|---|---|
| rebuild | Stream from other DCs to populate data |
| repair | Sync data between replicas |
| bootstrap | New node joining the cluster for the first time |
| removenode | Remove a dead node from the cluster |
Bootstrap vs. Rebuild¶
| Aspect | Bootstrap | Rebuild |
|---|---|---|
| When | New node joining | Existing node needs data |
| Auto-trigger | On first start | Manual command |
| State | JOINING | NORMAL |
| Source | Same DC (primary) | Other DCs |
Performance Considerations¶
Throttling¶
Control rebuild speed:
# Check current settings
nodetool getstreamthroughput
nodetool getinterdcstreamthroughput
# Increase for faster rebuild (values in megabits per second;
# the inter-DC setting governs cross-datacenter rebuild traffic)
nodetool setstreamthroughput 400
nodetool setinterdcstreamthroughput 400
Impact on Source DC¶
Source DC Load
Rebuild reads from source DC nodes, impacting their performance:
- Run during off-peak hours
- Consider throttling
- Monitor source DC latencies (see the example below)
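One quick check on a source node: nodetool proxyhistograms reports coordinator read/write latency percentiles, so rising values during rebuild indicate streaming pressure:
# Run on a source DC node while rebuild is streaming
nodetool proxyhistograms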
Best Practices¶
Rebuild Guidelines
- Plan for duration - Large datasets take hours
- Off-peak timing - Reduce impact on production
- One node at a time - Minimize cluster impact
- Monitor progress - Watch netstats continuously
- Verify afterward - Check tablestats and run repair
- Consider throttling - Balance speed vs. impact
Related Commands¶
| Command | Relationship |
|---|---|
| repair | Sync replicas within/across DCs |
| netstats | Monitor streaming progress |
| status | Check node/DC status |
| setstreamthroughput | Control streaming speed |