Inter-Node Data Streaming¶
Data streaming is the mechanism by which Cassandra nodes transfer SSTable data directly between each other. This inter-node data transfer occurs during cluster topology changes, repair operations, and failure recovery scenarios. Understanding the streaming subsystem is essential for capacity planning, operational troubleshooting, and performance optimization.
Overview¶
What is Streaming?¶
Streaming is Cassandra's bulk data transfer protocol for moving SSTable segments between nodes. Unlike the normal read/write path that operates on individual rows, streaming transfers entire SSTable files or portions thereof at the file level.
When Streaming Occurs¶
| Operation | Direction | Trigger |
|---|---|---|
| Bootstrap | Existing → New node | New node joins cluster |
| Decommission | Leaving → Remaining nodes | Node removal initiated |
| Repair | Bidirectional between replicas | Manual or scheduled repair |
| Rebuild | Existing → Rebuilt node | nodetool rebuild command |
| Host replacement | Existing → Replacement node | Dead node replaced |
| Hinted handoff | Coordinator → Recovered node | Node recovers after failure |
Streaming Architecture¶
Protocol Stack¶
Cassandra's streaming protocol operates as a separate subsystem from the CQL native protocol: streaming traffic flows over the inter-node ports rather than the client-facing port.
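For orientation, these are the relevant `cassandra.yaml` port settings with their defaults:

```yaml
# Client CQL traffic (native protocol) -- not used for streaming
native_transport_port: 9042

# Inter-node messaging and streaming
storage_port: 7000

# Legacy dedicated TLS inter-node port
ssl_storage_port: 7001
```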
Streaming Session Lifecycle¶
A streaming session progresses through five distinct phases:
| Phase | Operations |
|---|---|
| INITIALIZED | Session created, peers identified |
| PREPARING | Token ranges calculated, SSTable selection, streaming plan exchanged |
| STREAMING | File transfers in progress, progress tracking |
| COMPLETE | All transfers successful, SSTables integrated |
| FAILED | Error occurred, partial cleanup, retry may follow |
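The lifecycle can be pictured as a small state machine. The following is an illustrative sketch (class and transition names are invented for the example, not Cassandra's internals): phases advance strictly forward, and any active phase may drop to FAILED.

```python
# Illustrative state machine for a streaming session (not Cassandra's code)
from enum import Enum, auto

class StreamPhase(Enum):
    INITIALIZED = auto()
    PREPARING = auto()
    STREAMING = auto()
    COMPLETE = auto()
    FAILED = auto()

# Phases advance in order; any non-terminal phase may fail
TRANSITIONS = {
    StreamPhase.INITIALIZED: {StreamPhase.PREPARING, StreamPhase.FAILED},
    StreamPhase.PREPARING:   {StreamPhase.STREAMING, StreamPhase.FAILED},
    StreamPhase.STREAMING:   {StreamPhase.COMPLETE, StreamPhase.FAILED},
    StreamPhase.COMPLETE:    set(),   # terminal
    StreamPhase.FAILED:      set(),   # terminal; a retry starts a new session
}

class StreamSession:
    def __init__(self, peer: str):
        self.peer = peer
        self.phase = StreamPhase.INITIALIZED

    def advance(self, next_phase: StreamPhase) -> None:
        if next_phase not in TRANSITIONS[self.phase]:
            raise ValueError(f"illegal transition {self.phase} -> {next_phase}")
        self.phase = next_phase

session = StreamSession("10.0.1.2")
session.advance(StreamPhase.PREPARING)   # plan exchanged
session.advance(StreamPhase.STREAMING)   # file transfers in progress
session.advance(StreamPhase.COMPLETE)    # all transfers succeeded
```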
Zero-Copy Streaming (Cassandra 4.0+)¶
Cassandra 4.0 introduced zero-copy streaming, which transfers entire SSTable components without deserialization:
| Characteristic | Traditional | Zero-Copy |
|---|---|---|
| CPU usage | High (ser/deser) | Minimal |
| Heap pressure | Significant | Negligible |
| Throughput | Limited by CPU | Limited by network/disk |
| Compatibility | All SSTables | Same-version SSTables only |
| TLS support | Yes | No |
Zero-Copy Requirements and Limitations
Zero-copy streaming has specific requirements that, when not met, cause an automatic fallback to traditional streaming (the sketch after this list illustrates the decision):

- SSTable format compatibility: Source and target nodes must use compatible SSTable formats. During rolling upgrades with format changes, traditional streaming is used.
- Inter-node encryption (TLS): Zero-copy streaming is disabled when inter-node encryption is enabled. TLS requires data to pass through the encryption layer, necessitating memory copies for encryption/decryption operations. Clusters with `server_encryption_options` enabled will always use traditional streaming.
- Configuration: Zero-copy must be enabled via `stream_entire_sstables: true` (the default in 4.0+).
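A minimal sketch of that fallback decision, assuming a simplified context object (names and structure are illustrative, not Cassandra's internals):

```python
# Hedged sketch: zero-copy is used only when every condition holds;
# otherwise the session falls back to traditional streaming.
from dataclasses import dataclass

@dataclass
class StreamContext:
    stream_entire_sstables: bool    # cassandra.yaml setting
    internode_tls_enabled: bool     # server_encryption_options in effect
    source_sstable_version: str     # e.g. "nb"
    target_sstable_version: str

def use_zero_copy(ctx: StreamContext) -> bool:
    if not ctx.stream_entire_sstables:
        return False    # feature disabled in configuration
    if ctx.internode_tls_enabled:
        return False    # TLS needs memory copies for encryption
    if ctx.source_sstable_version != ctx.target_sstable_version:
        return False    # e.g. mid rolling upgrade with a format change
    return True

print(use_zero_copy(StreamContext(True, False, "nb", "nb")))   # True
print(use_zero_copy(StreamContext(True, True,  "nb", "nb")))   # False: TLS
```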
For security-conscious deployments requiring TLS, account for the additional CPU and memory overhead of traditional streaming during capacity planning for bootstrap, decommission, and repair operations.
Bootstrap¶
Bootstrap is the process by which a new node joins the cluster and receives its share of data from existing nodes.
Bootstrap Sequence¶
A bootstrapping node starts with `auto_bootstrap: true`, announces itself in gossip as JOINING, calculates its tokens, streams its ranges from existing replicas, and only then transitions to NORMAL and begins serving reads.
Token Range Calculation¶
During bootstrap, the new node must determine which token ranges it will own. With vnodes, the node claims `num_tokens` positions on the ring, and each position owns the range between it and the preceding token (see the sketch below).
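The following sketch shows the idea with purely random tokens (real clusters can use smarter token allocation strategies); it is illustrative only:

```python
# Simplified vnode token assignment and range ownership (illustrative)
import random

RING_MIN, RING_MAX = -(2**63), 2**63 - 1   # Murmur3Partitioner token space

def random_tokens(num_tokens: int) -> list[int]:
    return sorted(random.randint(RING_MIN, RING_MAX) for _ in range(num_tokens))

def owned_ranges(ring: dict[int, str], node: str) -> list[tuple[int, int]]:
    """Each token owns the range (previous_token, token]; index 0 wraps."""
    tokens = sorted(ring)
    return [(tokens[i - 1], t)
            for i, t in enumerate(tokens) if ring[t] == node]

ring: dict[int, str] = {}
for existing in ("node1", "node2", "node3"):
    for t in random_tokens(4):       # num_tokens: 4 for brevity
        ring[t] = existing

for t in random_tokens(4):           # tokens chosen by the joining node
    ring[t] = "new-node"

# These are the ranges the new node must stream from existing replicas
print(owned_ranges(ring, "new-node"))
```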
Streaming Source Selection¶
The bootstrap coordinator selects source nodes based on:
- Token ownership: Nodes currently owning required ranges
- Replica set: For each range, any replica can serve as source
- Node state: Only UP nodes considered
- Load balancing: Distribute streaming load across sources
| Selection Criterion | Rationale |
|---|---|
| Prefer local datacenter | Lower network latency |
| Prefer least-loaded nodes | Minimize impact on production |
| Avoid nodes already streaming | Prevent overload |
| Round-robin across replicas | Balance source load |
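A minimal sketch of per-range source selection under these criteria, assuming any live replica can serve a range (function and variable names are illustrative):

```python
# Illustrative source selection: prefer the local DC, round-robin over
# equivalent candidates to spread streaming load across sources.
def pick_sources(ranges_to_replicas, up_nodes, local_dc, dc_of):
    counters: dict[tuple, int] = {}
    plan = {}
    for rng, replicas in sorted(ranges_to_replicas.items()):
        candidates = [r for r in replicas if r in up_nodes]  # UP nodes only
        if not candidates:
            raise RuntimeError(f"no live replica for range {rng}")
        local = [r for r in candidates if dc_of[r] == local_dc]
        pool = local or candidates            # prefer local datacenter
        i = counters.get(tuple(pool), 0)
        plan[rng] = pool[i % len(pool)]       # round-robin across replicas
        counters[tuple(pool)] = i + 1
    return plan

dc_of = {"a": "dc1", "b": "dc1", "c": "dc2"}
plan = pick_sources({"r1": ["a", "b", "c"], "r2": ["a", "b", "c"]},
                    up_nodes={"a", "b", "c"}, local_dc="dc1", dc_of=dc_of)
print(plan)   # {'r1': 'a', 'r2': 'b'} -- rotated within the local DC
```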
Bootstrap Configuration¶
```yaml
# cassandra.yaml bootstrap-related parameters

# Number of concurrent streaming connections per source host
streaming_connections_per_host: 1

# Outbound throughput limit (Mb/s, 0 = unlimited)
stream_throughput_outbound_megabits_per_sec: 200

# Enable zero-copy streaming
stream_entire_sstables: true

# Keep-alive period for streaming sessions (seconds)
streaming_keep_alive_period_in_secs: 300
```
Decommission¶
Decommission is the orderly removal of a node from the cluster, streaming all locally-owned data to remaining nodes before shutdown.
Decommission Sequence¶
A decommissioning node announces itself as LEAVING via gossip, streams every range it owns to the nodes taking over ownership, announces LEFT, and then shuts down.
Range Redistribution¶
During decommission, the departing node's ranges must be redistributed: for each range, ownership shifts to the node that becomes the new replica once the leaver is removed from the ring (see the sketch below).
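A sketch of that ownership shift using a simplified SimpleStrategy-style placement (the next RF distinct nodes clockwise replicate a token); this is illustrative, not Cassandra's placement code:

```python
# When a node leaves, its replica slot shifts to the next node clockwise,
# so that node must receive the departing node's data for the range.
def replicas_for(ring: list[tuple[int, str]], token: int, rf: int) -> list[str]:
    ordered = sorted(ring)
    start = next((i for i, (t, _) in enumerate(ordered) if t >= token), 0)
    out: list[str] = []
    for k in range(len(ordered)):
        node = ordered[(start + k) % len(ordered)][1]
        if node not in out:
            out.append(node)
        if len(out) == rf:
            break
    return out

ring = [(0, "n1"), (100, "n2"), (200, "n3"), (300, "n4")]
print(replicas_for(ring, 150, rf=2))                               # ['n3', 'n4']
print(replicas_for([e for e in ring if e[1] != "n3"], 150, rf=2))  # ['n4', 'n1']
# n1 becomes a new replica for this range, so n3 streams it to n1 on leave
```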
Decommission vs RemoveNode¶
| Operation | Use Case | Data Handling |
|---|---|---|
| `decommission` | Node healthy, orderly removal | Streams data before leaving |
| `removenode` | Node dead/unrecoverable | No streaming; repair required after |
| `assassinate` | Force remove stuck node | Emergency only; data loss possible |
Decommission Requirements
Decommission can only proceed if the remaining cluster can still satisfy the replication factor. Attempting to decommission a node when fewer than RF nodes would remain results in an error.
Repair Streaming¶
Repair operations use streaming to synchronize data between replicas that have diverged.
Repair Types and Streaming¶
| Repair Type | Streaming Behavior |
|---|---|
| Full repair | Compare all data, stream differences |
| Incremental repair | Compare only unrepaired SSTables |
| Preview repair | Calculate differences only, no streaming |
| Subrange repair | Repair specific token ranges |
Merkle Tree Exchange¶
Before streaming, repair uses Merkle trees, hierarchical hash trees built over each replica's data per token range, to identify divergent ranges while exchanging only compact digests rather than the data itself (see the sketch below).
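A flattened sketch of the idea, assuming one digest per token subrange rather than a full binary tree (real Merkle trees let replicas descend only into mismatching branches):

```python
# Per-subrange digests: replicas exchange hashes and stream only the
# subranges whose digests disagree. Illustrative, not Cassandra's code.
import hashlib

def range_digests(partitions: dict[int, bytes],
                  boundaries: list[int]) -> dict[tuple[int, int], str]:
    digests = {}
    for lo, hi in zip(boundaries, boundaries[1:]):
        h = hashlib.md5()
        for token in sorted(t for t in partitions if lo < t <= hi):
            h.update(partitions[token])          # hash partition contents
        digests[(lo, hi)] = h.hexdigest()
    return digests

boundaries = [0, 250, 500, 750, 1000]
replica_a = {10: b"v1", 300: b"v2",       600: b"v3"}
replica_b = {10: b"v1", 300: b"v2-stale", 600: b"v3"}

da = range_digests(replica_a, boundaries)
db = range_digests(replica_b, boundaries)
divergent = [rng for rng in da if da[rng] != db[rng]]
print(divergent)   # [(250, 500)] -- only this subrange needs streaming
```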
Streaming During Repair¶
Once divergent ranges are identified, streaming is bidirectional: each replica sends its version of the mismatched ranges to the others, and the received data is merged with local data during reads and compaction.
Incremental Repair Optimization¶
Incremental repair tracks which SSTables have been repaired, reducing future repair scope:
| SSTable State | Description | Repair Behavior |
|---|---|---|
| Unrepaired | Not yet included in any repair | Included in next repair |
| Pending | Currently being repaired | Excluded from new repairs |
| Repaired | Successfully repaired | Excluded from incremental repair |
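A short sketch of candidate selection for an incremental session, using an illustrative stand-in for SSTable repair metadata:

```python
# Only SSTables that are neither repaired nor pending in another session
# are fed into an incremental repair. Field names are illustrative.
from dataclasses import dataclass

@dataclass
class SSTable:
    name: str
    repaired_at: int            # 0 means unrepaired
    pending_repair: bool = False

def incremental_candidates(sstables: list[SSTable]) -> list[SSTable]:
    return [s for s in sstables
            if s.repaired_at == 0 and not s.pending_repair]

tables = [
    SSTable("nb-1-big", repaired_at=1700000000),              # repaired: skip
    SSTable("nb-2-big", repaired_at=0),                       # include
    SSTable("nb-3-big", repaired_at=0, pending_repair=True),  # pending: skip
]
print([s.name for s in incremental_candidates(tables)])       # ['nb-2-big']
```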
Hinted Handoff¶
Hinted handoff is a lightweight streaming mechanism that delivers missed writes to nodes that were temporarily unavailable.
Hint Storage and Delivery¶
When a replica is unavailable, the coordinator stores a hint locally (subject to the hint window); once gossip reports the replica UP again, the stored hints are replayed to it.
Hint Structure¶
Hints are stored locally on the coordinator node as flat files under `hints_directory`, grouped per target host.
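Conceptually, a stored hint carries just enough to replay the missed write and to expire stale entries; the sketch below is an illustrative layout, not the on-disk format:

```python
# Illustrative hint record: target, creation time, and the raw mutation.
from dataclasses import dataclass
import time, uuid

@dataclass
class Hint:
    target_host_id: uuid.UUID   # replica that missed the write
    created_at_ms: int          # checked against max_hint_window_in_ms
    mutation: bytes             # serialized write to replay

def is_expired(hint: Hint, max_hint_window_ms: int = 10_800_000) -> bool:
    return time.time() * 1000 - hint.created_at_ms > max_hint_window_ms

h = Hint(uuid.uuid4(), int(time.time() * 1000), b"<serialized mutation>")
print(is_expired(h))   # False: still inside the default 3-hour window
```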
Hint Configuration¶
```yaml
# cassandra.yaml hint parameters

# Enable/disable hinted handoff
hinted_handoff_enabled: true

# Maximum time to store hints (default: 3 hours)
max_hint_window_in_ms: 10800000

# Directory for hint files
hints_directory: /var/lib/cassandra/hints

# Hint delivery throttle (KB/s per delivery thread)
hinted_handoff_throttle_in_kb: 1024

# Maximum hint delivery threads
max_hints_delivery_threads: 2

# Hint compression
hints_compression:
    - class_name: LZ4Compressor
```
Hint Delivery Streaming¶
Unlike full SSTable streaming used in bootstrap and repair, hint delivery uses a lighter-weight mutation replay mechanism. Hints are streamed as individual mutations rather than file segments.
Delivery mechanism details:
| Aspect | Description |
|---|---|
| Transport | Uses standard inter-node messaging (not dedicated streaming port) |
| Serialization | Hints deserialized and sent as mutation messages |
| Ordering | Delivered in timestamp order (oldest first) |
| Batching | Multiple hints may be batched per network round-trip |
| Retries | Failed deliveries retried with exponential backoff |
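Putting the table together, a minimal sketch of a per-target delivery loop with throttling and exponential backoff; `send_mutation` is a hypothetical stand-in for the messaging service, not a real API:

```python
# Hints replay oldest-first, rate-limited to roughly the configured
# KB/s budget, with exponential backoff on delivery failure.
import time

def deliver_hints(hints, send_mutation,
                  throttle_kb_per_s=1024, max_retries=5) -> bool:
    """hints: list of (created_at_ms, mutation_bytes) for one target."""
    budget = throttle_kb_per_s * 1024          # bytes per second
    for _, mutation in sorted(hints):          # oldest first
        delay = 1.0
        for _ in range(max_retries):
            if send_mutation(mutation):
                time.sleep(len(mutation) / budget)   # crude rate limiting
                break
            time.sleep(delay)                  # exponential backoff
            delay *= 2
        else:
            return False                       # target still down; retry later
    return True                                # all delivered; file removable

# Usage with a stand-in transport that always succeeds:
print(deliver_hints([(1, b"m1"), (0, b"m0")], lambda m: True))   # True
```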
Hint Streaming Parameters¶
The following parameters control hint delivery throughput and resource usage:
| Parameter | Default | Description |
|---|---|---|
| `hinted_handoff_throttle_in_kb` | 1024 | Maximum throughput per delivery thread (KB/s) |
| `max_hints_delivery_threads` | 2 | Concurrent delivery threads |
| `hints_flush_period_in_ms` | 10000 | How often hint buffers flush to disk |
| `max_hints_file_size_in_mb` | 128 | Maximum size per hint file |
Throttling calculation:

```
Effective hint throughput = hinted_handoff_throttle_in_kb × max_hints_delivery_threads

Example with defaults: 1024 KB/s × 2 threads = 2048 KB/s ≈ 2 MB/s total delivery capacity
```
```yaml
# cassandra.yaml - hint delivery tuning

# Increase for faster hint delivery (impacts production traffic)
hinted_handoff_throttle_in_kb: 2048

# More threads for parallel delivery to multiple recovering nodes
max_hints_delivery_threads: 4

# Smaller files for more granular cleanup
max_hints_file_size_in_mb: 64
```
Hint Delivery Process¶
| Phase | Operation |
|---|---|
| Detection | Gossip announces target node UP |
| Scheduling | HintsDispatcher assigns delivery thread |
| Reading | Hints read from local hint files in timestamp order |
| Throttling | Delivery rate limited by hinted_handoff_throttle_in_kb |
| Streaming | Mutations sent via messaging service |
| Application | Target node applies mutations to memtable |
| Acknowledgment | Target confirms receipt |
| Cleanup | Delivered hints deleted from source |
Hint Delivery vs SSTable Streaming¶
| Characteristic | Hint Delivery | SSTable Streaming |
|---|---|---|
| Data unit | Individual mutations | SSTable file segments |
| Transport | Messaging service | Dedicated streaming protocol |
| Throughput | KB/s (throttled) | MB/s to GB/s |
| CPU usage | Moderate (deserialization) | Low (zero-copy) or high (traditional) |
| Use case | Small data volumes, short outages | Large data volumes, topology changes |
| Port | `storage_port` (messaging service) | `storage_port` |
Hint Window Limitations
Hints are only stored for max_hint_window_in_ms duration (default: 3 hours). Nodes down longer than this window will not receive hints and require repair to restore consistency. For extended outages, full repair is necessary.
Streaming Internals¶
File Transfer Protocol¶
SSTable streaming operates on file segments: each SSTable component is read and transferred in fixed-size chunks, allowing progress tracking and throttling per segment (see the sketch below).
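A sketch of the chunked transfer loop, with 64 KB segments matching the default in the table that follows (`send_segment` is a hypothetical transport callback):

```python
# Walk a file in fixed-size segments so progress can be tracked and
# throttled per chunk. Illustrative only.
import io

SEGMENT_SIZE = 64 * 1024

def stream_file(source: io.BufferedIOBase, send_segment) -> int:
    """Returns total bytes sent."""
    total = 0
    while True:
        chunk = source.read(SEGMENT_SIZE)
        if not chunk:                  # end of file component
            return total
        send_segment(chunk)
        total += len(chunk)

data = io.BytesIO(b"x" * (150 * 1024))   # fake 150 KB SSTable component
sent = stream_file(data, send_segment=lambda c: None)
print(sent, "bytes in", -(-sent // SEGMENT_SIZE), "segments")   # 3 segments
```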
Segment Size and Buffering¶
| Parameter | Default | Description |
|---|---|---|
| Segment size | 64 KB | Chunk size for file transfer |
| Send buffer | 1 MB | Outbound buffering per session |
| Receive buffer | 4 MB | Inbound buffering per session |
| Max concurrent transfers | 1 per host | Parallelism limit |
Compression¶
Streaming data is compressed in transit; SSTables that are already compressed on disk can be transferred in their compressed form, avoiding recompression on the wire.
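For context, general inter-node wire compression is governed by the `internode_compression` setting (this example shows a real cassandra.yaml option; whether it affects a given streaming path depends on version and SSTable compression):

```yaml
# cassandra.yaml -- compress inter-node traffic: all, dc (cross-DC only),
# or none
internode_compression: dc
```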
Progress Tracking¶
Streaming progress is tracked at multiple granularities:
```bash
# View active streams
nodetool netstats

# Detailed streaming information, human-readable units
nodetool netstats -H
```

Sample per-session progress:

```
Mode: JOINING
/10.0.1.2
    Receiving 15 files, 1.2 GB total. Already received 8 files, 650 MB
/10.0.1.3
    Receiving 12 files, 980 MB total. Already received 12 files, 980 MB
```
JMX Metrics¶
```
# Streaming metrics (JMX)
org.apache.cassandra.metrics:type=Streaming,name=TotalIncomingBytes
org.apache.cassandra.metrics:type=Streaming,name=TotalOutgoingBytes
org.apache.cassandra.metrics:type=Streaming,name=ActiveStreams
org.apache.cassandra.metrics:type=Streaming,name=StreamingTime
```
Performance Considerations¶
Network Impact¶
Streaming operations can saturate network capacity:
| Operation | Impact | Mitigation |
|---|---|---|
| Bootstrap | Sustained high throughput | Schedule during low-traffic periods |
| Decommission | Sustained high throughput | Rate limit with `stream_throughput_outbound_megabits_per_sec` |
| Repair | Variable, depends on divergence | Use incremental repair |
| Hints | Lower throughput | Generally minimal impact |
Disk I/O Impact¶
Streaming performs large sequential reads on the source and large sequential writes on the target, competing with compaction and the live read/write path for disk bandwidth. Received SSTables also trigger additional compaction once the session completes.
Memory Pressure¶
| Streaming Mode | Heap Usage | Recommendation |
|---|---|---|
| Zero-copy | Minimal | Preferred when compatible |
| Traditional | Significant | Monitor GC during large operations |
Throttling Configuration¶
```yaml
# Limit streaming to prevent impact on production workload

# Outbound throughput limit (Mb/s)
stream_throughput_outbound_megabits_per_sec: 200

# Inter-datacenter streaming limit (Mb/s)
inter_dc_stream_throughput_outbound_megabits_per_sec: 25

# Compaction throughput (affects post-streaming compaction)
compaction_throughput_mb_per_sec: 64
```
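Streaming limits can also be adjusted at runtime, without a restart, using standard nodetool commands:

```bash
# Inspect and adjust outbound streaming throughput (Mb/s; 0 = unlimited)
nodetool getstreamthroughput
nodetool setstreamthroughput 100

# Inter-datacenter streaming limit
nodetool setinterdcstreamthroughput 50
```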
Operational Procedures¶
Monitoring Streaming¶
```bash
# Active streams summary
nodetool netstats

# Streaming with progress, human-readable units
nodetool netstats -H

# Bootstrap progress
nodetool describecluster | grep -A5 "Bootstrapping"

# Repair session progress (4.0+)
nodetool repair_admin list
```
Troubleshooting¶
| Symptom | Possible Cause | Resolution |
|---|---|---|
| Streaming stuck | Network partition | Check connectivity between nodes |
| Slow streaming | Disk I/O saturation | Reduce throttle, check disk health |
| Streaming failures | Timeout | Increase `streaming_keep_alive_period_in_secs` (replaces `streaming_socket_timeout_in_ms` in 4.0+) |
| OOM during streaming | Traditional mode on large data | Enable zero-copy or increase heap |
Recovery from Failed Streaming¶
```bash
# If bootstrap fails
# Option 1: Resume the interrupted bootstrap (keeps already-streamed data)
nodetool bootstrap resume

# Option 2: Wipe and start fresh
sudo rm -rf /var/lib/cassandra/data/* /var/lib/cassandra/commitlog/*
# Ensure cassandra.yaml has auto_bootstrap: true, then restart the node

# If decommission fails
nodetool decommission   # Retry

# If repair streaming fails
nodetool repair -pr keyspace   # Retry affected ranges
```
Related Documentation¶
- Replica Synchronization - Anti-entropy repair details
- Consistency - Consistency levels and guarantees
- Gossip Protocol - Cluster state dissemination
- Compaction - Post-streaming compaction