nodetool rebuild¶

Rebuilds the data on a node by streaming it from another datacenter. It is typically used to populate nodes in a newly added datacenter.

Synopsis¶

nodetool [connection_options] rebuild [options] [source_datacenter]

Description¶

nodetool rebuild streams all data that belongs to this node from another datacenter. This is used when:

Adding nodes to a new datacenter
Recovering a node without using bootstrap
Repopulating a datacenter after total loss

Unlike bootstrap, rebuild does not require the node to be in JOINING state and can be run on a node that's already part of the ring.

Rebuild Streams from One Replica Only

The rebuild command streams data from a single replica for each token range, not from all replicas. This means:

Data may be inconsistent if the source replica was not fully up-to-date
Tombstones (deletion markers) that exist only on other replicas will not be streamed, so deleted data can reappear on the rebuilt node
The rebuilt node may have stale or missing data

Always run nodetool repair after rebuild completes to ensure full consistency with all replicas. The recommended workflow is:

Run rebuild to quickly populate the node with data
Run repair to synchronize with all replicas and resolve inconsistencies

This two-step approach is faster than repair alone for large datasets: rebuild streams SSTable data in bulk, while repair must first build and compare Merkle trees.
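The two-step workflow above can be wrapped in a small shell function so that repair starts only when rebuild exits successfully. This is a minimal sketch: the function name `rebuild_then_repair` is hypothetical, and it assumes `nodetool` is on the `PATH` of the node being rebuilt. Any arguments after the datacenter name are passed through to `nodetool repair` unchanged.

```shell
# Hypothetical helper: run rebuild, then repair only if rebuild succeeded.
# Usage: rebuild_then_repair <source_dc> [repair arguments...]
rebuild_then_repair() {
    if [ "$#" -lt 1 ]; then
        echo "usage: rebuild_then_repair <source_dc> [repair args...]" >&2
        return 2
    fi
    source_dc=$1
    shift

    # Step 1: populate the node quickly from the source datacenter.
    if ! nodetool rebuild "$source_dc"; then
        echo "rebuild from $source_dc failed; skipping repair" >&2
        return 1
    fi

    # Step 2: reconcile with all replicas. Remaining arguments go to repair.
    nodetool repair "$@"
}
```

For example, `rebuild_then_repair dc1 my_keyspace` would run `nodetool rebuild dc1` followed by `nodetool repair my_keyspace`.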

Arguments¶

Argument	Description
`source_datacenter`	Datacenter to stream data from. If omitted, streams from all DCs

Options¶

Option	Description
`-ks, --keyspace`	Specific keyspace to rebuild
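`-ts, --tokens`	Specific token ranges to rebuild, in the form `(start_token_1,end_token_1],(start_token_2,end_token_2]`
`-s, --sources`	Specific source nodes to stream from when `-ts` is used, as a comma-separated list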
`--mode`	Rebuild mode: ALL, NORMAL, REFETCH (not present in every Cassandra distribution; confirm with `nodetool help rebuild`)

When to Use¶

Adding New Datacenter¶

When expanding to a new datacenter:

# Step 1: Configure nodes in new DC
# Step 2: Update keyspace RF to include new DC (CQL, run in cqlsh)
ALTER KEYSPACE my_keyspace WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'dc1': 3,
    'dc2': 3  -- New DC
};

# Step 3: On each node in new DC, rebuild from existing DC
nodetool rebuild dc1

Node Recovery Without Bootstrap¶

If a node lost data but is still in the ring:

nodetool rebuild

Not a Substitute for Repair

Rebuild streams data from other DCs. For single-DC clusters or to sync from same-DC replicas, use nodetool repair instead.

After Datacenter Recovery¶

After recovering all nodes in a datacenter that was completely down:

# On each recovered node
nodetool rebuild <source_dc>

When NOT to Use¶

Single Datacenter Clusters¶

Requires Multiple DCs

rebuild streams from other datacenters. For single-DC clusters:

# Use repair instead
nodetool repair -pr

Normal Bootstrap Scenarios¶

When adding nodes to an existing DC, use bootstrap (normal node startup) instead:

# Just start the node - bootstrap happens automatically
sudo systemctl start cassandra

While Node is Bootstrapping¶

Don't run rebuild on a node that's currently bootstrapping. Bootstrap is already streaming the node's data; wait for it to finish.

Rebuild Process¶

The new DC node calculates the token ranges it needs to receive
The new DC node requests the data for those ranges from source DC nodes
The source DC nodes stream SSTable data to the new DC node
Once all data is received, the rebuild command returns; the node stays in NORMAL state throughout

Rebuild Behavior

The node streams data for ALL token ranges it owns from the source datacenter.

Examples¶

Rebuild from Specific Datacenter¶

nodetool rebuild dc1

Streams all data this node should own from dc1.

Rebuild Specific Keyspace¶

nodetool rebuild -ks my_keyspace dc1

Rebuild from All DCs¶

nodetool rebuild

Streams from all available datacenters.

Monitor Progress¶

# Watch streaming progress
nodetool netstats
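Rather than re-running `nodetool netstats` by hand, the output can be polled until no inbound streams remain. The sketch below is illustrative only: both function names are hypothetical, and it assumes that inbound streams appear in the `netstats` output as lines containing the word `Receiving`, which may vary between Cassandra versions.

```shell
# Hypothetical helper: read `nodetool netstats` output on stdin and print
# the lines describing inbound streams. Exits non-zero when there are none.
inbound_stream_lines() {
    grep 'Receiving'
}

# Hypothetical polling loop: report inbound streams until none remain.
# Usage: wait_for_inbound_streams [poll_interval_seconds]
wait_for_inbound_streams() {
    interval=${1:-60}
    # The pipeline's exit status is that of grep, so the loop ends
    # as soon as no "Receiving" line is printed.
    while nodetool netstats | inbound_stream_lines; do
        sleep "$interval"
    done
    echo "No inbound streams reported"
}
```

An absence of inbound streams does not by itself prove the rebuild succeeded, so the exit status of the `nodetool rebuild` command and the logs are still worth checking.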

Multi-DC Expansion Workflow¶

Complete Process¶

# 1. Add nodes to new DC (don't start Cassandra yet)

# 2. Configure the new nodes:
#    - Same cluster_name (cassandra.yaml)
#    - New dc/rack in cassandra-rackdc.properties (GossipingPropertyFileSnitch)

# 3. Start first node in new DC
sudo systemctl start cassandra

# 4. Update keyspace replication (CQL, run in cqlsh)
ALTER KEYSPACE my_keyspace WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'dc1': 3,
    'dc2': 3
};

# 5. Rebuild on first node
nodetool rebuild dc1

# 6. Start remaining nodes in new DC one at a time
# 7. Run rebuild on each after it joins
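Steps 6 and 7 can be driven from an admin host once the remaining nodes have joined the ring. The following is a rough sketch under several assumptions: the function name `rebuild_nodes_sequentially` is hypothetical, the host names are placeholders, SSH access to each node is already configured, and `nodetool` is on the remote `PATH`.

```shell
# Hypothetical helper: rebuild new-DC nodes one at a time over SSH.
# Usage: rebuild_nodes_sequentially <source_dc> <host> [<host>...]
rebuild_nodes_sequentially() {
    source_dc=$1
    shift
    for host in "$@"; do
        echo "Rebuilding $host from $source_dc"
        # Stop at the first failure so the problem can be investigated
        # before any further nodes start streaming.
        if ! ssh "$host" "nodetool rebuild $source_dc"; then
            echo "rebuild failed on $host" >&2
            return 1
        fi
    done
}
```

A call such as `rebuild_nodes_sequentially dc1 dc2-node2 dc2-node3` processes the hosts strictly in order, which keeps the streaming load on the source datacenter limited to one rebuilding node at a time.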

Verification¶

# Check node status
nodetool status

# Verify data
nodetool tablestats my_keyspace | grep "Space used"

# Run repair to ensure consistency
# (-pr repairs only each node's primary ranges, so run it on every node)
nodetool repair -pr

Monitoring Rebuild¶

During Rebuild¶

# Streaming progress
nodetool netstats

# Thread pool activity
nodetool tpstats | grep -i stream

Estimated Duration¶

Data Size	Network	Approximate Time
100 GB	1 Gbps	15-30 minutes
500 GB	1 Gbps	1-2 hours
1 TB	1 Gbps	3-5 hours
1 TB	10 Gbps	30-60 minutes
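A lower bound for any row in the table follows from dividing the data size by the link speed. The sketch below does that arithmetic in shell; the function name `min_rebuild_minutes` is hypothetical, and it assumes decimal units (1 GB = 8000 megabits) and integer inputs. Real rebuilds take longer than this bound, since streaming throttles, disk I/O, and concurrent traffic all reduce effective throughput.

```shell
# Hypothetical helper: lower-bound transfer time in whole minutes.
# Usage: min_rebuild_minutes <data_size_gb> <throughput_mbps>
min_rebuild_minutes() {
    size_gb=$1
    mbps=$2
    # 1 GB = 8000 megabits (decimal units); time = size / rate.
    seconds=$(( size_gb * 8000 / mbps ))
    # Round up to the next whole minute.
    echo $(( (seconds + 59) / 60 ))
}

min_rebuild_minutes 100 1000     # 100 GB over 1 Gbps: prints 14
min_rebuild_minutes 1000 10000   # 1 TB over 10 Gbps: prints 14
```

If streaming is throttled below the link speed, passing the throttle value as the second argument gives a more realistic bound.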

Logs¶

tail -f /var/log/cassandra/system.log | grep -i rebuild

Common Issues¶

"No such datacenter"¶

ERROR: No such datacenter: dc2

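The specified datacenter name doesn't match any datacenter in the cluster. Names are case-sensitive, so compare against the exact spelling shown by the cluster: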

# Check available DCs
nodetool status
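The datacenter names can be extracted from the `nodetool status` output and checked before starting a rebuild. This is a sketch only: both function names are hypothetical, and it relies on each datacenter section of the output beginning with a `Datacenter: <name>` line.

```shell
# Hypothetical helper: print the datacenter names found in
# `nodetool status` output supplied on stdin, one per line.
list_datacenters() {
    awk '$1 == "Datacenter:" { print $2 }'
}

# Hypothetical check: succeed only if the exact name is present.
# Usage: nodetool status | datacenter_exists <name>
datacenter_exists() {
    list_datacenters | grep -Fxq -- "$1"
}
```

For instance, `nodetool status | datacenter_exists dc1 && nodetool rebuild dc1` would skip the rebuild when `dc1` is misspelled or differs in case.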

Rebuild Stuck¶

If rebuild doesn't progress:

Check streaming:
```
nodetool netstats
```
Check source nodes are healthy:
```
ssh <source_node> "nodetool status"
```
Check network connectivity between DCs

Check throughput settings:
```
nodetool getstreamthroughput
nodetool getinterdcstreamthroughput
```

Insufficient Disk Space¶

Rebuild requires space for incoming data:

# Check disk space
df -h /var/lib/cassandra

# May need to clear old data or add storage
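The disk check can be turned into a pre-flight test that refuses to start when free space is below a chosen threshold. In this sketch the function name `has_free_space` is hypothetical, and the threshold is an estimate the operator supplies (for example, the expected data size plus headroom for compaction); nothing here derives it automatically.

```shell
# Hypothetical helper: succeed if <dir> has at least <required_gib> GiB free.
# Usage: has_free_space <dir> <required_gib>
has_free_space() {
    dir=$1
    required_kb=$(( $2 * 1024 * 1024 ))
    # -P forces the portable one-line-per-filesystem format;
    # column 4 of the second line is the available space in KiB.
    available_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$available_kb" -ge "$required_kb" ]
}
```

A guarded invocation might look like `has_free_space /var/lib/cassandra 500 && nodetool rebuild dc1`.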

Rebuild Fails Midway¶

If rebuild fails partway through:

Check logs for error cause
Fix the issue
Restart rebuild (it will re-stream needed data)
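For failures that turn out to be transient, such as a brief network interruption, the restart step can be automated with a bounded retry. This is a sketch, not a substitute for reading the logs: the function name `rebuild_with_retry` and its default attempt count and delay are hypothetical choices.

```shell
# Hypothetical helper: retry rebuild up to <max_attempts> times.
# Usage: rebuild_with_retry <source_dc> [max_attempts] [delay_seconds]
rebuild_with_retry() {
    source_dc=$1
    max_attempts=${2:-3}
    delay=${3:-300}
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        echo "rebuild attempt $attempt of $max_attempts"
        if nodetool rebuild "$source_dc"; then
            return 0
        fi
        attempt=$(( attempt + 1 ))
        # Pause before retrying, unless the attempts are used up.
        if [ "$attempt" -le "$max_attempts" ]; then
            sleep "$delay"
        fi
    done
    echo "rebuild still failing after $max_attempts attempts" >&2
    return 1
}
```

If every attempt fails, the function returns non-zero, at which point the cause needs to be found in the logs and fixed before trying again.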

Rebuild vs. Other Operations¶

Operation	Use Case
`rebuild`	Stream from other DCs to populate data
`repair`	Sync data between replicas
`bootstrap`	New node joining cluster for first time
`removenode`	Remove dead node from cluster

Bootstrap vs. Rebuild¶

Aspect	Bootstrap	Rebuild
When	New node joining	Existing node needs data
Auto-trigger	On first start	Manual command
State	JOINING	NORMAL
Source	Same DC (primary)	Other DCs

Performance Considerations¶

Throttling¶

Control rebuild speed:

# Check current settings
nodetool getstreamthroughput
nodetool getinterdcstreamthroughput

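# Raise the caps for a faster rebuild, or lower them to protect the source DC
# (values are in megabits per second; 0 disables throttling)
nodetool setstreamthroughput 400
nodetool setinterdcstreamthroughput 100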

Impact on Source DC¶

Source DC Load

Rebuild reads from source DC nodes, impacting their performance:

Run during off-peak hours
Consider throttling
Monitor source DC latencies

Best Practices¶

Rebuild Guidelines

Plan for duration - Large datasets take hours
Off-peak timing - Reduce impact on production
One node at a time - Minimize cluster impact
Monitor progress - Watch netstats continuously
Verify afterward - Check tablestats and run repair
Consider throttling - Balance speed vs. impact

Related Commands¶

Command	Relationship
repair	Sync replicas within/across DCs
netstats	Monitor streaming progress
status	Check node/DC status
setstreamthroughput	Control streaming speed