Replica Synchronization¶
In distributed systems, replicas can diverge when they receive updates at different times—due to network partitions, node failures, or message ordering. Synchronization mechanisms detect these divergences and propagate missing information to restore convergence.
These mechanisms are sometimes called "anti-entropy" in distributed systems literature. The term originates from thermodynamics, where entropy measures disorder. In Cassandra's context, these processes reduce "disorder" by ensuring all replicas eventually hold identical data.
Overview¶
Cassandra employs three mechanisms to maintain replica convergence:
| Mechanism | Function | Trigger | Scope |
|---|---|---|---|
| Hinted handoff | Deferred write delivery | Write to unavailable replica | Single mutation |
| Read reconciliation | Divergence detection during reads | Query execution | Single partition |
| Merkle tree synchronization | Full dataset comparison | Scheduled maintenance | Token range |
These mechanisms operate at different timescales and granularities, collectively providing eventual consistency while preserving availability during partial failures.
Hinted Handoff¶
Hinted handoff handles writes to temporarily unavailable replicas by storing the write locally and replaying it when the replica recovers.
How It Works¶
When a replica targeted by a write is down, or fails to acknowledge within the write timeout, the coordinator stores a hint locally: a copy of the mutation together with the ID of the intended replica. When gossip reports that the replica is back online, the coordinator replays its stored hints to that replica (throttled to avoid overwhelming it) and then removes them.
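The lifecycle can be sketched as a small model. The following Python sketch is purely illustrative and uses a hypothetical in-memory HintStore rather than Cassandra's actual hint file format:
# Illustrative model of the hint lifecycle; not Cassandra's real hint format
import time
from collections import defaultdict

HINT_WINDOW_SECONDS = 3 * 60 * 60           # mirrors max_hint_window_in_ms

class HintStore:
    """Hypothetical stand-in for the coordinator's hints directory."""
    def __init__(self):
        self.hints = defaultdict(list)      # target replica -> [(created_at, mutation)]

    def record(self, target, mutation):
        """Store the mutation locally when the target replica does not acknowledge."""
        self.hints[target].append((time.time(), mutation))

    def replay(self, target, send):
        """Called when gossip reports the target replica is up again."""
        for created_at, mutation in self.hints.pop(target, []):
            if time.time() - created_at <= HINT_WINDOW_SECONDS:
                send(target, mutation)      # deliver the deferred write
            # hints older than the window are dropped; repair must cover that gap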
Configuration¶
# cassandra.yaml
# Enable/disable hinted handoff
hinted_handoff_enabled: true
# How long to store hints (default 3 hours)
# Hints older than this are dropped
max_hint_window_in_ms: 10800000
# Throttle hint delivery to avoid overwhelming recovering nodes
hinted_handoff_throttle_in_kb: 1024
# Maximum delivery threads
max_hints_delivery_threads: 2
# Hint storage directory
hints_directory: /var/lib/cassandra/hints
Hint Window¶
The max_hint_window_in_ms setting is critical:
Node down for 2 hours:
- Hints accumulated for 2 hours
- Node recovers
- All hints replayed ✓
Node down for 5 hours (default window = 3 hours):
- Hints accumulated for first 3 hours
- After 3 hours, new writes stop generating hints
- Node recovers
- Only 3 hours of hints replayed
- 2 hours of writes are missing ✗
Solution: Run repair to recover missing data
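As a minimal sketch of the arithmetic above (assuming the default window):
# Minimal sketch: does an outage fit inside the hint window?
MAX_HINT_WINDOW_MS = 10_800_000             # 3 hours (default)

def hints_cover_outage(outage_ms: int) -> bool:
    return outage_ms <= MAX_HINT_WINDOW_MS

print(hints_cover_outage(2 * 3_600_000))    # True  -> hint replay alone restores the replica
print(hints_cover_outage(5 * 3_600_000))    # False -> run repair after recovery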
Hints and Consistency Level ANY¶
Consistency level ANY counts hints as acknowledgments:
All replicas down:
- Write cannot reach any replica
- Coordinator stores hint locally
- Returns SUCCESS to client
DANGER:
- If coordinator crashes before hint delivery → DATA LOST
- Hint is only on coordinator, not replicated
ANY should almost never be used for important data.
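For illustration only, this is how a write at consistency level ANY might be issued with the DataStax Python driver; the keyspace and table names are hypothetical:
# Hypothetical table; demonstrates setting ConsistencyLevel.ANY per statement
from cassandra.cluster import Cluster
from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('my_keyspace')

stmt = SimpleStatement(
    "INSERT INTO users (user_id, name) VALUES (123, 'Alice')",
    consistency_level=ConsistencyLevel.ANY,  # a stored hint alone can satisfy this write
)
session.execute(stmt)  # may report success even if no replica holds the data yet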
Monitoring Hints¶
# Check hint accumulation
nodetool tpstats | grep Hint
# View hint files
ls -la /var/lib/cassandra/hints/
# Check hinted handoff status
nodetool statushandoff
JMX Metrics:
org.apache.cassandra.metrics:type=Storage,name=TotalHints
org.apache.cassandra.metrics:type=HintsService,name=HintsSucceeded
org.apache.cassandra.metrics:type=HintsService,name=HintsFailed
org.apache.cassandra.metrics:type=HintsService,name=HintsTimedOut
Hint Limitations¶
| Limitation | Implication |
|---|---|
| Hints expire | Long outages require repair |
| Hints use disk | Large hint backlog consumes storage |
| Single copy | Coordinator failure loses hints |
| Not for schema | DDL changes do not use hints |
Read Reconciliation¶
During read operations, the coordinator may receive different versions of the same data from multiple replicas. When this divergence is detected, the coordinator determines the authoritative version (based on timestamp) and propagates it to replicas holding stale data.
How It Works¶
QUORUM read discovers inconsistency:
Client reads user_id=123 with QUORUM (2 of 3 replicas):
N1: user_id=123 → name='Alice', timestamp=1000
N2: user_id=123 → name='Alicia', timestamp=2000 ← Newer
Coordinator:
1. Compares timestamps
2. Returns 'Alicia' (newer) to client
3. Sends repair to N1: "update to 'Alicia' with ts=2000"
After repair:
N1: user_id=123 → name='Alicia', timestamp=2000 ← Fixed
N2: user_id=123 → name='Alicia', timestamp=2000
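Conceptually, the coordinator applies last-write-wins resolution on the timestamps. A minimal sketch of that logic (real reconciliation happens per cell, not per row):
# Minimal last-write-wins sketch; illustrative only
from typing import NamedTuple

class Version(NamedTuple):
    replica: str
    value: str
    timestamp: int

def reconcile(responses):
    """Return the newest version and the replicas that need a repair mutation."""
    winner = max(responses, key=lambda v: v.timestamp)
    stale = [v.replica for v in responses if v.timestamp < winner.timestamp]
    return winner, stale

winner, stale = reconcile([
    Version('N1', 'Alice', 1000),
    Version('N2', 'Alicia', 2000),
])
print(winner.value, stale)   # Alicia ['N1']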
Reconciliation Modes¶
Blocking Reconciliation (read_repair = 'BLOCKING', the default):
Divergence resolution completes BEFORE the result is returned to the client.
- Higher read latency when divergence is detected
- Replicas contacted by the read are consistent immediately afterwards
No Reconciliation (read_repair = 'NONE'):
The newest version is still returned to the client, but stale replicas are not updated.
- Lower read latency
- Divergence persists until Merkle tree synchronization runs
Prior to Cassandra 4.0, tables could additionally trigger probabilistic background reconciliation via read_repair_chance and dclocal_read_repair_chance; those options were removed in 4.0.
Configuration¶
-- Per-table reconciliation mode (Cassandra 4.0+)
ALTER TABLE my_table WITH read_repair = 'BLOCKING';
-- or
ALTER TABLE my_table WITH read_repair = 'NONE';
There are no cassandra.yaml settings for read reconciliation; it is configured per table as shown above.
Limitations¶
| Limitation | Implication |
|---|---|
| Query-driven | Only data that is read undergoes reconciliation |
| Single partition scope | Does not propagate across partition boundaries |
| Requires multiple replicas | CL=ONE reads contact only one replica (no comparison possible) |
Merkle Tree Synchronization¶
Merkle tree synchronization (nodetool repair) is the definitive mechanism for achieving replica convergence across the entire dataset. Unlike hinted handoff and read reconciliation, which operate opportunistically, this process systematically compares and synchronizes all data within a token range.
Why Scheduled Synchronization Is Necessary¶
Scenarios where hinted handoff and read reconciliation are insufficient:
1. Node down longer than hint window
- Hints expired
- Data never written to that replica
2. Data never read
- Read reconciliation only affects data that is queried
- Cold data remains divergent
3. Deleted data (tombstones)
- Tombstones must propagate to all replicas
- If tombstone doesn't reach a replica before gc_grace_seconds,
deleted data can resurrect
4. Schema changes during outage
- Hints do not carry schema (DDL) changes
- Schema catches up via gossip when the node rejoins; data synchronization then restores convergence of the data itself
The Merkle Tree Algorithm¶
Merkle trees, introduced by Ralph Merkle (Merkle, R., 1987, "A Digital Signature Based on a Conventional Encryption Function"), enable efficient comparison of large datasets by hierarchically hashing data segments. When two replicas exchange only their root hashes, a mismatch indicates divergence somewhere in the dataset. By recursively comparing child hashes, the algorithm identifies exactly which segments differ—requiring only O(log n) comparisons rather than comparing every record.
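A simplified sketch of the comparison (a binary hash tree over fixed token buckets; Cassandra's actual implementation differs in detail):
# Simplified Merkle comparison over hashed buckets; illustrative only
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_tree(buckets):
    """buckets: power-of-two list of serialized segments -> list of levels, root last."""
    level = [h(b) for b in buckets]
    levels = [level]
    while len(level) > 1:
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def diff(levels_a, levels_b, level=None, index=0):
    """Descend only where hashes differ; return indexes of mismatched leaf buckets."""
    if level is None:
        level = len(levels_a) - 1            # start at the root
    if levels_a[level][index] == levels_b[level][index]:
        return []
    if level == 0:
        return [index]
    return diff(levels_a, levels_b, level - 1, 2 * index) + \
           diff(levels_a, levels_b, level - 1, 2 * index + 1)

a = build_tree([b'seg0', b'seg1', b'seg2', b'seg3'])
b = build_tree([b'seg0', b'seg1', b'SEG2', b'seg3'])
print(diff(a, b))   # [2] -> only that token range needs streaming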
Synchronization Modes¶
Full Synchronization:
nodetool repair -full keyspace_name
- Compares all data regardless of previous synchronization state
- Required after topology changes (node additions, removals)
- Most thorough, highest resource consumption
Incremental Synchronization:
nodetool repair keyspace_name
- Compares only data written since last synchronization
- SSTables are marked as "repaired" or "unrepaired"
- Lower overhead for regular maintenance
Subrange Synchronization:
nodetool repair -st <start_token> -et <end_token> keyspace_name
- Limits comparison to specific token range
- Enables parallelization across nodes
The gc_grace_seconds Constraint¶
Tombstones (deletion markers) have a finite lifespan defined by gc_grace_seconds. This parameter creates a critical constraint on synchronization frequency.
gc_grace_seconds defines tombstone retention period.
Default: 864000 (10 days)
CRITICAL CONSTRAINT:
Synchronization must complete on every node within gc_grace_seconds.
Failure scenario:
Day 0: Application deletes row on N1, N2 (tombstone created)
N3 is unavailable, does not receive tombstone
Day 11: Tombstones expire on N1, N2 (gc_grace = 10 days)
Compaction purges tombstones
Day 12: N3 returns to service
N3 retains the "deleted" row (no tombstone received)
Reconciliation propagates N3's data to N1, N2
DELETED DATA REAPPEARS
Prevention: Complete synchronization cycle within gc_grace_seconds
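A minimal sketch of this scheduling constraint, assuming an arbitrary safety buffer:
# Minimal sketch: a repair cycle must finish comfortably inside gc_grace_seconds
GC_GRACE_SECONDS = 864_000            # 10 days (default)
SAFETY_BUFFER_SECONDS = 3 * 86_400    # assumed buffer for failed or rerun sessions

def max_safe_cycle_seconds(gc_grace: int = GC_GRACE_SECONDS) -> int:
    return gc_grace - SAFETY_BUFFER_SECONDS

print(max_safe_cycle_seconds() / 86_400)   # 7.0 days, matching the schedule below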
Scheduling Synchronization¶
Recommended schedule:
For gc_grace_seconds = 10 days:
- Complete full cycle every 7 days (provides buffer for failures)
- Synchronization can run in parallel across nodes
- Stagger node synchronization to distribute I/O load
Example rotation for 6-node cluster:
Day 1: Synchronize node 1
Day 2: Synchronize node 2
Day 3: Synchronize node 3
Day 4: Synchronize node 4
Day 5: Synchronize node 5
Day 6: Synchronize node 6
Day 7: Rest
Repeat
Commands¶
# Full synchronization of keyspace
nodetool repair -full my_keyspace
# Synchronize single table
nodetool repair my_keyspace my_table
# Parallel mode (multiple nodes simultaneously)
nodetool repair -par my_keyspace
# Sequential mode (one node at a time, lower impact)
nodetool repair -seq my_keyspace
# Check synchronization status
nodetool repair_admin list
# Cancel running synchronization
nodetool repair_admin cancel <repair_id>
# View synchronization history
nodetool repair_admin summary
Monitoring¶
# During synchronization
nodetool compactionstats # Shows validation compaction
nodetool netstats # Shows data streaming
# Check pending operations
nodetool repair_admin list
# JMX metrics
org.apache.cassandra.metrics:type=Repair,name=*
Resource Impact¶
| Resource | Cause | Mitigation |
|---|---|---|
| CPU | Merkle tree computation | Schedule during low-traffic periods |
| Disk I/O | Reading data for hashing | Throttle validation with nodetool setcompactionthroughput |
| Network | Streaming divergent data | Throttle with nodetool setstreamthroughput |
| Memory | Merkle tree storage | Reduce -Dcassandra.repair_session_max_tree_depth |
Automation Tools¶
AxonOps¶
AxonOps provides automated synchronization scheduling and monitoring, eliminating manual management overhead.
Reaper (Open Source)¶
Apache Cassandra Reaper provides:
- Scheduled synchronization campaigns
- Incremental mode support
- Web UI for monitoring
- Segment-level parallelism
Manual Scheduling¶
For environments without automation:
# Cron job example (stagger keyspaces across days of the week)
0 2 * * 0 nodetool repair -full keyspace1 # Sunday
0 2 * * 1 nodetool repair -full keyspace2 # Monday
# etc.
Troubleshooting¶
Hints Accumulating¶
# Check hint backlog
ls -la /var/lib/cassandra/hints/
# If hints are not being delivered:
# 1. Check target node is reachable
nodetool status
# 2. Check hint delivery threads
nodetool tpstats | grep Hint
# 3. Force hint delivery (use with caution):
#    restart the node holding the hints to re-trigger replay on startup
Synchronization Failures¶
| Error | Cause | Resolution |
|---|---|---|
| "Repair session failed" | Node unreachable | Ensure all replicas are available |
| "Merkle tree building failed" | Disk/memory pressure | Reduce scope, add resources |
| "Streaming failed" | Network issues | Check connectivity, increase timeouts |
| "Validation compaction stuck" | Large partitions | Increase compaction_throughput_mb_per_sec |
Data Resurrection¶
Symptom: Deleted data reappears
Diagnosis:
1. Check gc_grace_seconds (see the sketch below): Is it shorter than the synchronization interval?
2. Check synchronization history: Was a cycle missed during outage?
3. Check tombstone count: Are tombstones being purged prematurely?
Resolution:
1. Run full synchronization on affected keyspace
2. Adjust gc_grace_seconds if needed
3. Ensure regular synchronization schedule
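For diagnosis step 1, gc_grace_seconds can be read from system_schema with the Python driver; the keyspace and table names below are placeholders:
# Read gc_grace_seconds for a table from system_schema (placeholder names)
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])
session = cluster.connect()

row = session.execute(
    "SELECT gc_grace_seconds FROM system_schema.tables "
    "WHERE keyspace_name = %s AND table_name = %s",
    ('my_keyspace', 'my_table'),
).one()
print(row.gc_grace_seconds)   # compare against the synchronization interval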
Best Practices¶
Hinted Handoff¶
| Practice | Rationale |
|---|---|
| Keep enabled | Handles transient failures automatically |
| Set appropriate window | Match to expected outage duration |
| Monitor hint accumulation | Large backlogs indicate availability problems |
| Do not rely solely on hints | Scheduled synchronization is still required |
Read Reconciliation¶
| Practice | Rationale |
|---|---|
| Use QUORUM for important reads | Enables divergence detection |
| Monitor reconciliation rate | High rate indicates systemic problems |
| Do not depend solely on reads | Cold data requires scheduled synchronization |
Merkle Tree Synchronization¶
| Practice | Rationale |
|---|---|
| Complete within gc_grace_seconds | Prevents data resurrection |
| Use incremental mode for regular maintenance | Lower resource consumption |
| Use full mode after topology changes | Ensures complete convergence |
| Automate with AxonOps or Reaper | Eliminates human error |
| Monitor duration trends | Detects degradation early |
Related Documentation¶
- Distributed Data Overview - How synchronization fits in the distributed architecture
- Consistency - How consistency levels interact with convergence
- Tombstones - Deletion markers and gc_grace_seconds
- Repair Operations - Operational procedures for synchronization