Scaling Operations¶
This section covers horizontal scaling procedures for Cassandra clusters, including capacity planning, node addition and removal strategies, and operational best practices for maintaining cluster balance during topology changes.
Scaling Model¶
Horizontal vs Vertical Scaling¶
Cassandra is designed for horizontal scaling—adding more nodes rather than larger nodes:
| Approach | Advantages | Disadvantages |
|---|---|---|
| Horizontal | Linear scaling, fault tolerance | Data movement required |
| Vertical | No data movement | Hardware limits, no added fault tolerance |
Capacity Planning¶
Key Metrics¶
Before scaling, assess current cluster state:
| Metric | Target Range | Action if Exceeded |
|---|---|---|
| Disk usage | < 50% | Add nodes before 60% |
| CPU utilization | < 60% | Add nodes or optimize queries |
| Memory pressure | Heap < 80% | Tune JVM or add nodes |
| Read latency (p99) | < 50ms | Add nodes or review data model |
| Write latency (p99) | < 25ms | Add nodes or reduce batch size |
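Heap pressure can be spot-checked from the shell. The sketch below is a rough illustration that parses the Heap Memory line printed by nodetool info; the exact output layout varies between Cassandra versions, so treat the field positions as an assumption.
# Rough sketch: heap usage as a percentage, parsed from nodetool info
# Assumes a line of the form "Heap Memory (MB) : <used> / <total>" (layout varies by version)
nodetool info | awk -F: '/^Heap Memory/ {
    split($2, m, "/")
    printf "Heap used: %.0f%%\n", 100 * m[1] / m[2]
}'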
Data Distribution Formula¶
For a cluster with N nodes and replication factor RF:
Data per node = (Total data × RF) / N
Example:
- Total unique data: 1 TB
- RF: 3
- Nodes: 6
- Data per node: (1TB × 3) / 6 = 500 GB
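The same arithmetic is easy to script for capacity reviews. A minimal shell sketch using the example figures above (1 TB treated as 1000 GB):
# Minimal sketch: data per node, using the example figures
TOTAL_GB=1000   # total unique data (1 TB)
RF=3            # replication factor
NODES=6         # nodes in the cluster
echo "Data per node: $(( TOTAL_GB * RF / NODES )) GB"   # prints 500 GB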
Scaling Calculations¶
Adding nodes:
New data per node = Current data per node × (N / (N + new_nodes))
Example: Adding 3 nodes to 6-node cluster
- Current: 500 GB per node
- New: 500 GB × (6 / 9) = 333 GB per node
Data movement estimation:
Data to stream = (Total data × RF × new_nodes) / (N + new_nodes)
Example: Adding 3 nodes
- Data to stream: (1TB × 3 × 3) / 9 = 1 TB total
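Both estimates can be computed the same way before a change window. A minimal shell sketch with the example figures:
# Minimal sketch: effect of adding 3 nodes to a 6-node cluster (example figures)
TOTAL_GB=1000     # total unique data (1 TB)
RF=3
CURRENT_NODES=6
NEW_NODES=3
echo "Per node now:   $(( TOTAL_GB * RF / CURRENT_NODES )) GB"                             # 500 GB
echo "Per node after: $(( TOTAL_GB * RF / (CURRENT_NODES + NEW_NODES) )) GB"               # 333 GB
echo "To stream:      $(( TOTAL_GB * RF * NEW_NODES / (CURRENT_NODES + NEW_NODES) )) GB"   # 1000 GB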
Scaling Up (Adding Nodes)¶
Planning Node Addition¶
Addition Best Practices¶
| Practice | Rationale |
|---|---|
| Add nodes in multiples of RF | Maintains a balanced replica distribution |
| Add one node at a time | Reduces streaming and compaction stress on the cluster |
| Wait for NORMAL (UN) | Ensures each bootstrap completes before the next node joins |
| Schedule during low traffic | Minimize impact |
| Monitor existing nodes | Watch for compaction backlog |
Multi-Datacenter Scaling¶
When scaling across datacenters:
# Ensure proportional capacity per DC
# Example: RF=3 per DC
DC1: 6 nodes (3 replicas distributed across 6)
DC2: 6 nodes (3 replicas distributed across 6)
# Add 2 nodes to each DC to maintain balance
DC1: 8 nodes
DC2: 8 nodes
Adding Nodes Procedure¶
# Step 1: Verify cluster health
nodetool status # All UN
# Step 2: Configure new node
# cassandra.yaml with cluster settings
# Ensure same cluster_name, seeds, snitch
# Step 3: Start new node
sudo systemctl start cassandra
# Step 4: Monitor bootstrap
watch -n 5 'nodetool netstats'
# Step 5: Wait for completion (status shows UN)
nodetool status
# Step 6: Run repair on new node
nodetool repair -pr
# Step 7: (If adding multiple) Repeat for next node
# Step 8: After all new nodes have joined, run cleanup on each pre-existing node
#         to remove data it no longer owns
nodetool cleanup
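Step 4 can also be scripted instead of watched interactively. A rough sketch that polls until the new node reports UN; the IP address is a hypothetical placeholder, and the grep relies on status lines beginning with the two-letter state code:
# Rough sketch: wait for a joining node to reach UN (IP is a hypothetical placeholder)
NEW_NODE_IP=10.0.1.10
until nodetool status | grep "$NEW_NODE_IP" | grep -q '^UN'; do
    echo "$(date) node $NEW_NODE_IP still joining..."
    sleep 60
done
echo "Node $NEW_NODE_IP has reached UN"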
vnodes and Token Distribution¶
With vnodes (recommended), token allocation is automatic:
# cassandra.yaml
num_tokens: 16 # Default in Cassandra 4.0+ (older releases default to 256); provides good distribution
# Optional: Allocate tokens for specific keyspace
allocate_tokens_for_keyspace: my_keyspace
vnodes advantages:
- Automatic token allocation
- Better load distribution
- Faster streaming (parallel)
- Easier operations
Scaling Down (Removing Nodes)¶
Planning Node Removal¶
Removal Constraints¶
Before removing nodes, verify:
| Constraint | Check | Consequence if Violated |
|---|---|---|
| RF satisfied | Nodes remaining ≥ RF | Decommission blocked |
| Disk capacity | Remaining nodes have space | Disk full errors |
| Performance | Load can be handled | Latency increase |
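The RF constraint in the first row can be checked mechanically before starting. A rough single-datacenter sketch; the RF and the number of nodes being removed are values you supply, and multi-DC clusters should check each DC separately:
# Rough sketch: confirm enough nodes remain to satisfy RF after removal (single DC)
RF=3          # replication factor of the keyspace (assumption)
REMOVING=1    # number of nodes you plan to decommission
UP_NODES=$(nodetool status | grep -c '^UN')
if [ $(( UP_NODES - REMOVING )) -lt "$RF" ]; then
    echo "Removal would leave fewer than RF=$RF nodes up; do not proceed"
else
    echo "OK: $(( UP_NODES - REMOVING )) nodes would remain (RF=$RF)"
fi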
Removing Nodes Procedure¶
# Step 1: Verify remaining capacity
# Calculate new disk usage per node
# Step 2: Check RF requirements
nodetool describering <keyspace>
# Step 3: On node being removed, run decommission
nodetool decommission
# This:
# - Streams all data to remaining nodes
# - Takes significant time (hours for large datasets)
# - Cannot be cancelled
# Step 4: Monitor progress
watch -n 10 'nodetool netstats'
# Step 5: Verify completion
nodetool status
# Removed node should no longer appear
# Step 6: Update seed list if node was a seed
# Edit cassandra.yaml on all nodes
# Rolling restart (optional but recommended)
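For Step 1, the capacity projection uses the same formula as in the capacity planning section. A minimal sketch with illustrative figures:
# Minimal sketch for Step 1: projected data per node after removing one node (illustrative figures)
TOTAL_GB=1000   # total unique data (1 TB)
RF=3
NODES=6
REMOVING=1
echo "Projected data per node: $(( TOTAL_GB * RF / (NODES - REMOVING) )) GB"   # 600 GB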
Rebalancing¶
When Rebalancing is Needed¶
Token rebalancing may be necessary when:
- Historical uneven additions created imbalance
- Single-token clusters need redistribution
- Significant data skew exists
Rebalancing with vnodes¶
With vnodes, adding/removing nodes automatically rebalances. Manual intervention is rarely needed.
Detecting Imbalance¶
# Check data distribution
nodetool status
# Look for significant variation in "Load" column
# Example of imbalanced cluster:
# UN 10.0.1.1 500 GB ...
# UN 10.0.1.2 800 GB ... # 60% more data
# UN 10.0.1.3 450 GB ...
# Check token ownership
nodetool ring
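For larger clusters, the spread of the Load column can be summarized rather than eyeballed. A rough sketch that assumes every node reports its load in the same unit and that the load value is the third whitespace-separated field of each UN line (verify against your version's output):
# Rough sketch: min/max of the Load column from nodetool status
# Assumes uniform load units and the load value in field 3 (verify for your version)
nodetool status | awk '/^UN/ {
    load = $3 + 0
    if (min == 0 || load < min) min = load
    if (load > max) max = load
} END {
    if (min > 0) printf "Load min: %s  max: %s  spread: %.0f%%\n", min, max, 100 * (max - min) / min
}'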
Addressing Imbalance¶
| Imbalance Cause | Solution |
|---|---|
| Partition key hotspot | Review data model |
| Token distribution | Add nodes (vnodes auto-balance) |
| Uneven node specs | Standardize hardware |
Scaling Strategies¶
Double the Cluster¶
Adding a number of nodes equal to the current node count gives the most even rebalancing:
Initial: 3 nodes, 1TB each
Add: 3 nodes
Result: 6 nodes, 500GB each
Data movement: 50% of data moves (optimal)
Incremental Addition¶
Adding fewer nodes than a full doubling is possible but rebalances less evenly:
Initial: 3 nodes, 1TB each
Add: 1 node
Result: 4 nodes, 750GB each
Data movement: 25% of data moves
Resulting load: may remain somewhat uneven across nodes, even with vnodes
Multi-Region Scaling¶
Guidelines:
- Scale regions proportionally
- Maintain RF per region
- Consider latency requirements
Streaming Impact¶
During Scale-Up¶
Adding nodes creates streaming load:
| Impact | Mitigation |
|---|---|
| Network bandwidth | Throttle with stream_throughput_outbound_megabits_per_sec |
| Disk I/O | Schedule during low-traffic hours |
| CPU (compaction) | Monitor pending compactions |
| Memory | Ensure heap headroom |
# cassandra.yaml - Streaming throttle
stream_throughput_outbound_megabits_per_sec: 200 # Adjust based on network
inter_dc_stream_throughput_outbound_megabits_per_sec: 50
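The throttle can also be adjusted at runtime with nodetool, which avoids a restart; the change does not persist, and the value's unit follows the corresponding cassandra.yaml option for your version. The numbers below are illustrative:
# Adjust streaming throttles at runtime (not persisted across restarts)
nodetool getstreamthroughput              # show the current limit
nodetool setstreamthroughput 200          # tighten during business hours (illustrative value)
nodetool setinterdcstreamthroughput 50    # cross-DC streaming limit (illustrative value)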
During Scale-Down¶
Decommissioning creates similar streaming load, plus:
- Remaining nodes receive additional data
- Compaction increases post-streaming
Operational Checklist¶
Before Scaling¶
- [ ] Current cluster health (all nodes UN)
- [ ] No pending repairs or streaming
- [ ] Disk usage < 50% on all nodes
- [ ] Network bandwidth available for streaming
- [ ] Maintenance window scheduled
- [ ] Backups current (if applicable)
During Scaling¶
- [ ] One node at a time
- [ ] Monitor streaming progress
- [ ] Watch for errors in logs
- [ ] Track compaction backlog
- [ ] Verify each node reaches NORMAL
After Scaling¶
- [ ] All nodes show UN status
- [ ] Data distribution is balanced
- [ ] Performance metrics nominal
- [ ] Repair completed on new nodes
- [ ] Seed list updated (if applicable)
- [ ] Documentation updated
Monitoring During Scaling¶
Key Metrics to Watch¶
# Streaming progress
nodetool netstats
# Compaction status
nodetool compactionstats
# Thread pool status
nodetool tpstats
# Overall status
nodetool status
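For long-running bootstraps or decommissions it can help to log these snapshots instead of watching them interactively. A minimal sketch; the log path and interval are examples:
# Minimal sketch: append timestamped snapshots every 5 minutes (log path is an example)
while true; do
    {
        date
        nodetool netstats
        nodetool compactionstats
        nodetool tpstats
    } >> /var/log/cassandra/scaling-watch.log 2>&1
    sleep 300
done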
Alert Thresholds During Scaling¶
| Metric | Warning | Critical |
|---|---|---|
| Streaming stuck | No progress for 30 min | No progress for 1 hour |
| Pending compactions | > 50 | > 100 |
| Dropped messages | > 0.1% | > 1% |
| Disk usage | > 70% | > 80% |
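The pending-compactions thresholds are straightforward to check from the shell. A rough sketch that parses the pending tasks line from nodetool compactionstats; the exact wording of that line can differ between versions:
# Rough sketch: compare pending compactions against the thresholds above
# Assumes a "pending tasks: N" line in the output (wording varies by version)
PENDING=$(nodetool compactionstats | awk -F: '/pending tasks/ {print $2 + 0; exit}')
PENDING=${PENDING:-0}
if [ "$PENDING" -gt 100 ]; then
    echo "CRITICAL: $PENDING pending compactions"
elif [ "$PENDING" -gt 50 ]; then
    echo "WARNING: $PENDING pending compactions"
else
    echo "OK: $PENDING pending compactions"
fi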
Related Documentation¶
- Node Lifecycle - Bootstrap and decommission details
- Data Streaming - Streaming mechanics
- Partitioning - Token distribution
- Seeds and Discovery - Seed management during scaling