
Replica Synchronization

In distributed systems, replicas can diverge when they receive updates at different times—due to network partitions, node failures, or message ordering. Synchronization mechanisms detect these divergences and propagate missing information to restore convergence.

These mechanisms are sometimes called "anti-entropy" in distributed systems literature. The term originates from thermodynamics, where entropy measures disorder. In Cassandra's context, these processes reduce "disorder" by ensuring all replicas eventually hold identical data.


Overview

Cassandra employs three mechanisms to maintain replica convergence:

Mechanism                   | Function                           | Trigger                       | Scope
Hinted handoff              | Deferred write delivery            | Write to unavailable replica  | Single mutation
Read reconciliation         | Divergence detection during reads  | Query execution               | Single partition
Merkle tree synchronization | Full dataset comparison            | Scheduled maintenance         | Token range

These mechanisms operate at different timescales and granularities, collectively providing eventual consistency while preserving availability during partial failures.


Hinted Handoff

Hinted handoff handles writes to temporarily unavailable replicas by storing the write locally and replaying it when the replica recovers.

How It Works

1. The coordinator sends the write to all replicas for the partition.
2. One replica is unreachable, so the write to it fails.
3. The coordinator stores a hint locally (the target replica plus the mutation).
4. The failed replica later rejoins the cluster.
5. Gossip reports it as up, and the stored hints are replayed to it.
6. The recovered replica applies the mutations and catches up.
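
In code terms, the coordinator's behaviour amounts to "store locally, replay on recovery". A minimal Python sketch of that idea (purely illustrative; the class and method names are invented and this is not Cassandra's implementation):

import time

class Coordinator:
    """Toy coordinator that stores hints for down replicas and replays them later."""

    def __init__(self, hint_window_seconds=3 * 3600):
        self.hint_window_seconds = hint_window_seconds
        self.hints = {}  # replica -> list of (mutation, created_at)

    def write(self, replicas, mutation, is_up):
        for replica in replicas:
            if is_up(replica):
                self.deliver(replica, mutation)  # normal write path
            else:
                # Replica unreachable: keep the mutation locally as a hint
                self.hints.setdefault(replica, []).append((mutation, time.time()))

    def on_replica_recovered(self, replica):
        """Replay stored hints once gossip reports the replica as up again."""
        now = time.time()
        for mutation, created_at in self.hints.pop(replica, []):
            if now - created_at <= self.hint_window_seconds:
                self.deliver(replica, mutation)
            # hints older than the window are dropped; repair must close the gap

    def deliver(self, replica, mutation):
        print(f"deliver {mutation!r} to {replica}")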

Configuration

# cassandra.yaml

# Enable/disable hinted handoff
hinted_handoff_enabled: true

# How long to store hints (default 3 hours)
# Hints older than this are dropped
max_hint_window_in_ms: 10800000

# Throttle hint delivery to avoid overwhelming recovering nodes
hinted_handoff_throttle_in_kb: 1024

# Maximum delivery threads
max_hints_delivery_threads: 2

# Hint storage directory
hints_directory: /var/lib/cassandra/hints
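
As a quick operational check, the relevant settings can be read straight from cassandra.yaml. A small sketch, assuming PyYAML is installed and the file lives at the path shown (adjust for your installation):

import yaml  # pip install pyyaml

# Path is an example; use the cassandra.yaml of the node you are checking
with open("/etc/cassandra/cassandra.yaml") as f:
    conf = yaml.safe_load(f)

window_ms = conf.get("max_hint_window_in_ms", 10_800_000)
print("hinted handoff enabled:", conf.get("hinted_handoff_enabled", True))
print("hint window: %.1f hours" % (window_ms / 3_600_000))
print("delivery throttle (KB/s):", conf.get("hinted_handoff_throttle_in_kb", 1024))
print("delivery threads:", conf.get("max_hints_delivery_threads", 2))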

Hint Window

The max_hint_window_in_ms setting is critical:

Node down for 2 hours:
- Hints accumulated for 2 hours
- Node recovers
- All hints replayed ✓

Node down for 5 hours (default window = 3 hours):
- Hints accumulated for first 3 hours
- After 3 hours, new writes stop generating hints
- Node recovers
- Only 3 hours of hints replayed
- 2 hours of writes are missing ✗

Solution: Run repair to recover missing data
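
The window behaviour above boils down to a single comparison: hints are generated only while the target's downtime is still inside max_hint_window_in_ms. A sketch of that rule (illustrative, not Cassandra's code):

def should_store_hint(downtime_seconds, max_hint_window_ms=10_800_000):
    """New writes generate hints only while the replica's downtime is inside the window."""
    return downtime_seconds * 1000 < max_hint_window_ms

print(should_store_hint(2 * 3600))  # True  -> node down 2 h, hints still generated
print(should_store_hint(5 * 3600))  # False -> node down 5 h, writes no longer hinted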

Hints and Consistency Level ANY

Consistency level ANY counts hints as acknowledgments:

All replicas down:
- Write cannot reach any replica
- Coordinator stores hint locally
- Returns SUCCESS to client

DANGER:
- If coordinator crashes before hint delivery → DATA LOST
- Hint is only on coordinator, not replicated

ANY should almost never be used for important data.
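
For completeness, this is roughly how a client would request ANY with the DataStax Python driver; the contact point, keyspace, and table are placeholders, and per the warning above this level should be reserved for data you can afford to lose:

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])          # placeholder contact point
session = cluster.connect("my_keyspace")  # placeholder keyspace

# ANY is satisfied even if no replica applies the write: a hint stored on the
# coordinator counts as the acknowledgment, so a coordinator crash before
# delivery silently loses the write.
stmt = SimpleStatement(
    "INSERT INTO users (user_id, name) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.ANY,
)
session.execute(stmt, (123, "Alice"))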

Monitoring Hints

# Check hint accumulation
nodetool tpstats | grep Hint

# View hint files
ls -la /var/lib/cassandra/hints/

# Check hinted handoff status
nodetool statushandoff

# Resume hint delivery if it has been paused
nodetool resumehandoff

JMX Metrics:

org.apache.cassandra.metrics:type=Storage,name=TotalHints
org.apache.cassandra.metrics:type=HintsService,name=HintsSucceeded
org.apache.cassandra.metrics:type=HintsService,name=HintsFailed
org.apache.cassandra.metrics:type=HintsService,name=HintsTimedOut

Hint Limitations

Limitation     | Implication
Hints expire   | Long outages require repair
Hints use disk | Large hint backlog consumes storage
Single copy    | Coordinator failure loses hints
Not for schema | DDL changes do not use hints

Read Reconciliation

During read operations, the coordinator may receive different versions of the same data from multiple replicas. When this divergence is detected, the coordinator determines the authoritative version (based on timestamp) and propagates it to replicas holding stale data.

How It Works

QUORUM read discovers inconsistency:

Client reads user_id=123 with QUORUM (2 of 3 replicas):

N1: user_id=123 → name='Alice', timestamp=1000
N2: user_id=123 → name='Alicia', timestamp=2000  ← Newer

Coordinator:
1. Compares timestamps
2. Returns 'Alicia' (newer) to client
3. Sends repair to N1: "update to 'Alicia' with ts=2000"

After repair:
N1: user_id=123 → name='Alicia', timestamp=2000  ← Fixed
N2: user_id=123 → name='Alicia', timestamp=2000
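
The coordinator's merge rule is plain last-write-wins on the write timestamp. A minimal sketch of that step, using the example above (illustrative only, not the server's code path):

def reconcile(responses):
    """responses: list of (replica, value, timestamp) tuples for one cell.

    Returns the value to send to the client plus the repair mutations the
    coordinator would push to replicas holding older data.
    """
    _, win_value, win_ts = max(responses, key=lambda r: r[2])  # newest timestamp wins
    repairs = [
        (replica, win_value, win_ts)
        for replica, value, ts in responses
        if ts < win_ts or value != win_value
    ]
    return win_value, repairs


value, repairs = reconcile([("N1", "Alice", 1000), ("N2", "Alicia", 2000)])
print(value)    # Alicia
print(repairs)  # [('N1', 'Alicia', 2000)] -> repair mutation for the stale replica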

Reconciliation Modes

Blocking reconciliation (read_repair = 'BLOCKING', the default):

Divergence resolution completes BEFORE the result is returned to the client.
- Higher read latency when a mismatch is found
- Monotonic reads: the repaired value is visible to subsequent reads

Disabled reconciliation (read_repair = 'NONE', available from Cassandra 4.0):

The newest value is still returned to the client, but no repair mutations are sent.
- Lower read latency
- Stale replicas remain divergent until Merkle tree synchronization runs

Note: the probabilistic background read repair controlled by read_repair_chance
in earlier releases was removed in Cassandra 4.0.

Configuration

-- Per-table reconciliation mode (introduced in Cassandra 4.0)
ALTER TABLE my_table WITH read_repair = 'BLOCKING';
-- or
ALTER TABLE my_table WITH read_repair = 'NONE';

-- There is no cassandra.yaml setting for read reconciliation; the pre-4.0
-- probabilistic options read_repair_chance and dclocal_read_repair_chance
-- were removed in Cassandra 4.0.

Limitations

Limitation                 | Implication
Query-driven               | Only data that is read undergoes reconciliation
Single partition scope     | Does not propagate across partition boundaries
Requires multiple replicas | CL=ONE reads contact only one replica (no comparison possible)

Merkle Tree Synchronization

Merkle tree synchronization (nodetool repair) is the definitive mechanism for achieving replica convergence across the entire dataset. Unlike hinted handoff and read reconciliation, which operate opportunistically, this process systematically compares and synchronizes all data within a token range.

Why Scheduled Synchronization Is Necessary

Scenarios where hinted handoff and read reconciliation are insufficient:

1. Node down longer than hint window
   - Hints expired
   - Data never written to that replica

2. Data never read
   - Read reconciliation only affects data that is queried
   - Cold data remains divergent

3. Deleted data (tombstones)
   - Tombstones must propagate to all replicas
   - If tombstone doesn't reach a replica before gc_grace_seconds,
     deleted data can resurrect

4. Schema changes during outage
   - Hints do not carry DDL; schema is propagated separately
   - Repair brings the data itself back into agreement once the node rejoins

The Merkle Tree Algorithm

Merkle trees, introduced by Ralph Merkle (Merkle, R., 1987, "A Digital Signature Based on a Conventional Encryption Function"), enable efficient comparison of large datasets by hierarchically hashing data segments. When two replicas exchange only their root hashes, a mismatch indicates divergence somewhere in the dataset. By recursively comparing child hashes, the algorithm identifies exactly which segments differ—requiring only O(log n) comparisons rather than comparing every record.

The comparison proceeds in three phases:

1. Each replica builds a Merkle tree over its data for the token range
   (a "validation compaction" that reads and hashes the data).
2. The trees are compared, starting from the root hashes; matching branches
   are skipped and mismatched branches are descended.
3. Only the row ranges whose leaf hashes differ are streamed between replicas.
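
A compact Python sketch of the comparison itself. This is purely illustrative: Cassandra builds its trees per token range during validation compaction and uses different hashing and tree depths, but the descend-on-mismatch logic is the same.

import hashlib

def build_tree(leaf_hashes):
    """Build a Merkle tree bottom-up (assumes a power-of-two number of segments)."""
    levels = [leaf_hashes]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([hashlib.sha256(prev[i] + prev[i + 1]).digest()
                       for i in range(0, len(prev), 2)])
    return levels  # levels[0] = leaves, levels[-1] = [root]

def diff_segments(tree_a, tree_b):
    """Compare two trees top-down; return indices of leaf segments that differ."""
    if tree_a[-1] == tree_b[-1]:
        return []                                   # roots match: replicas are in sync
    suspects = [0]                                  # node indices at the current level
    for level in range(len(tree_a) - 2, -1, -1):
        next_suspects = []
        for idx in suspects:
            for child in (2 * idx, 2 * idx + 1):
                if tree_a[level][child] != tree_b[level][child]:
                    next_suspects.append(child)     # descend only into mismatched branches
        suspects = next_suspects
    return suspects                                 # only these segments need streaming

def leaf(data):
    return hashlib.sha256(data.encode()).digest()

# Two replicas with 8 segments each, diverging only in segment 5
replica_a = build_tree([leaf(f"segment-{i}") for i in range(8)])
replica_b = build_tree([leaf("segment-5-diverged") if i == 5 else leaf(f"segment-{i}")
                        for i in range(8)])
print(diff_segments(replica_a, replica_b))  # [5]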

Synchronization Modes

Full Synchronization:

nodetool repair -full keyspace_name
  • Compares all data regardless of previous synchronization state
  • Required after topology changes (node additions, removals)
  • Most thorough, highest resource consumption

Incremental Synchronization:

nodetool repair keyspace_name
  • Compares only data written since last synchronization
  • SSTables are marked as "repaired" or "unrepaired"
  • Lower overhead for regular maintenance

Subrange Synchronization:

nodetool repair -st <start_token> -et <end_token> keyspace_name
  • Limits comparison to specific token range
  • Enables parallelization across nodes
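
Tools like Reaper compute subranges from each node's actual token ownership; the sketch below simply splits the full Murmur3 ring into equal slices to show the shape of the commands (the keyspace name is a placeholder):

# Murmur3Partitioner token space: [-2**63, 2**63 - 1]
MIN_TOKEN, MAX_TOKEN = -2**63, 2**63 - 1

def split_range(start, end, pieces):
    """Split a token range into roughly equal subranges for parallel subrange repair."""
    step = (end - start) // pieces
    bounds = [start + i * step for i in range(pieces)] + [end]
    return list(zip(bounds[:-1], bounds[1:]))

for st, et in split_range(MIN_TOKEN, MAX_TOKEN, 4):
    print(f"nodetool repair -st {st} -et {et} my_keyspace")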

The gc_grace_seconds Constraint

Tombstones (deletion markers) are retained for at least gc_grace_seconds; after that, compaction is free to purge them. This parameter therefore places a critical constraint on synchronization frequency.

gc_grace_seconds defines tombstone retention period.
Default: 864000 (10 days)

CRITICAL CONSTRAINT:
Synchronization must complete on every node within gc_grace_seconds.

Failure scenario:

Day 0: Application deletes row on N1, N2 (tombstone created)
       N3 is unavailable, does not receive tombstone

Day 11: Tombstones expire on N1, N2 (gc_grace = 10 days)
        Compaction purges tombstones

Day 12: N3 returns to service
        N3 retains the "deleted" row (no tombstone received)
        Reconciliation propagates N3's data to N1, N2
        DELETED DATA REAPPEARS

Prevention: Complete synchronization cycle within gc_grace_seconds

Scheduling Synchronization

Recommended schedule:

For gc_grace_seconds = 10 days:
- Complete full cycle every 7 days (provides buffer for failures)
- Synchronization can run in parallel across nodes
- Stagger node synchronization to distribute I/O load

Example rotation for 6-node cluster:
Day 1: Synchronize node 1
Day 2: Synchronize node 2
Day 3: Synchronize node 3
Day 4: Synchronize node 4
Day 5: Synchronize node 5
Day 6: Synchronize node 6
Day 7: Rest
Repeat
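
A quick arithmetic check that a schedule like the one above leaves slack before tombstones become purgeable (the function name is invented for the example):

def repair_margin_days(gc_grace_seconds, repair_cycle_days):
    """Days of buffer between the end of a full repair cycle and tombstone purge."""
    return gc_grace_seconds / 86400 - repair_cycle_days

print(repair_margin_days(864000, 7))   # 3.0  -> 3 days of buffer with default gc_grace
print(repair_margin_days(864000, 12))  # -2.0 -> cycle too long, resurrection risk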

Commands

# Full synchronization of keyspace
nodetool repair -full my_keyspace

# Synchronize single table
nodetool repair my_keyspace my_table

# Parallel repair across all replicas is the default in current releases

# Datacenter-parallel mode (datacenters repaired in parallel)
nodetool repair -dcpar my_keyspace

# Sequential mode (one replica at a time, lower impact)
nodetool repair -seq my_keyspace

# Check synchronization status
nodetool repair_admin list

# Cancel running synchronization
nodetool repair_admin cancel <repair_id>

# View synchronization history
nodetool repair_admin summary

Monitoring

# During synchronization
nodetool compactionstats  # Shows validation compaction
nodetool netstats         # Shows data streaming

# Check pending operations
nodetool repair_admin list

# JMX metrics
org.apache.cassandra.metrics:type=Repair,name=*

Resource Impact

Resource | Cause                    | Mitigation
CPU      | Merkle tree computation  | Schedule during low-traffic periods
Disk I/O | Reading data for hashing | Throttle validation with nodetool setcompactionthroughput
Network  | Streaming divergent data | Throttle with nodetool setstreamthroughput
Memory   | Merkle tree storage      | Reduce -Dcassandra.repair_session_max_tree_depth

Automation Tools

AxonOps

AxonOps provides automated synchronization scheduling and monitoring, eliminating manual management overhead.

Reaper (Open Source)

Reaper for Apache Cassandra provides:

  • Scheduled synchronization campaigns
  • Incremental mode support
  • Web UI for monitoring
  • Segment-level parallelism

Manual Scheduling

For environments without automation:

# Cron job example (one keyspace per day, run at 02:00)
0 2 * * 0 nodetool repair -full keyspace1  # Sunday
0 2 * * 1 nodetool repair -full keyspace2  # Monday
# etc.

Troubleshooting

Hints Accumulating

# Check hint backlog
ls -la /var/lib/cassandra/hints/

# If hints are not being delivered:
# 1. Check target node is reachable
nodetool status

# 2. Check hint delivery threads
nodetool tpstats | grep Hint

# 3. Check that hint delivery has not been paused, and resume it if needed
nodetool statushandoff
nodetool resumehandoff

Synchronization Failures

Error                         | Cause                | Resolution
"Repair session failed"       | Node unreachable     | Ensure all replicas are available
"Merkle tree building failed" | Disk/memory pressure | Reduce scope, add resources
"Streaming failed"            | Network issues       | Check connectivity, increase timeouts
"Validation compaction stuck" | Large partitions     | Increase compaction_throughput_mb_per_sec

Data Resurrection

Symptom: Deleted data reappears

Diagnosis:
1. Check gc_grace_seconds: Is it shorter than synchronization interval?
2. Check synchronization history: Was a cycle missed during outage?
3. Check tombstone count: Are tombstones being purged prematurely?

Resolution:
1. Run full synchronization on affected keyspace
2. Adjust gc_grace_seconds if needed
3. Ensure regular synchronization schedule

Best Practices

Hinted Handoff

Practice                    | Rationale
Keep enabled                | Handles transient failures automatically
Set appropriate window      | Match to expected outage duration
Monitor hint accumulation   | Large backlogs indicate availability problems
Do not rely solely on hints | Scheduled synchronization is still required

Read Reconciliation

Practice                       | Rationale
Use QUORUM for important reads | Enables divergence detection
Monitor reconciliation rate    | High rate indicates systemic problems
Do not depend solely on reads  | Cold data requires scheduled synchronization

Merkle Tree Synchronization

Practice                                     | Rationale
Complete within gc_grace_seconds             | Prevents data resurrection
Use incremental mode for regular maintenance | Lower resource consumption
Use full mode after topology changes         | Ensures complete convergence
Automate with AxonOps or Reaper              | Eliminates human error
Monitor duration trends                      | Detects degradation early