Repair Concepts¶
This page explains the fundamental concepts behind Cassandra repair operations, including how the repair process works internally, what triggers the need for repair, and the mechanisms Cassandra uses to detect and resolve data inconsistencies.
What is Repair?¶
Repair is Cassandra's anti-entropy mechanism for synchronizing data across replica nodes. In a distributed system where writes may not reach all replicas (due to node failures, network partitions, or hint expiration), repair ensures that all replicas eventually contain identical data.
| State | Node A | Node B | Node C |
|---|---|---|---|
| Before Repair | Row 1: v1 | Row 1: v1 | Row 1: v1 |
| | Row 2: v2 | Row 2: v1 ⚠️ stale | Row 2: v2 |
| | Row 3: v3 | Row 3: v3 | Row 3: ❌ missing |
| After Repair | Row 1: v1 | Row 1: v1 | Row 1: v1 |
| | Row 2: v2 | Row 2: v2 ✓ | Row 2: v2 |
| | Row 3: v3 | Row 3: v3 | Row 3: v3 ✓ |
Why Repair is Necessary¶
Sources of Data Inconsistency¶
Data inconsistencies arise from several scenarios in distributed systems:
Write Path Failures
| Scenario | Description | Recovery Mechanism |
|---|---|---|
| Node unavailable during write | Write succeeds on available replicas only | Hinted handoff, read repair, anti-entropy repair |
| Network partition | Replicas in different partitions receive different writes | Anti-entropy repair |
| Coordinator timeout | Write acknowledged but some replicas slow | Read repair, anti-entropy repair |
Recovery Limitations
| Scenario | Description | Recovery Mechanism |
|---|---|---|
| Hints expired | Node was down longer than max_hint_window_in_ms (default 3 hours) | Anti-entropy repair only |
| Hinted handoff disabled | Hints not stored for unavailable replicas | Anti-entropy repair only |
Operational Events
| Scenario | Description | Recovery Mechanism |
|---|---|---|
| New node bootstrap | Node joins but may not have all data | Streaming + repair |
| Node replacement | Replacement node needs data from replicas | Streaming + repair |
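Whether hints can cover a replica outage at all depends on hinted handoff being enabled and on the outage staying within the hint window. A minimal check of the former on a given node (assuming nodetool is available on that host) is:

```bash
# Report whether hinted handoff is currently enabled on this node.
# If it is disabled, or an outage exceeds max_hint_window_in_ms,
# only anti-entropy repair will bring the replica back in sync.
nodetool statushandoff
```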
The gc_grace_seconds Constraint¶
The most critical aspect of repair scheduling is the relationship with gc_grace_seconds. This parameter defines how long tombstones (deletion markers) are retained before garbage collection.
Zombie data resurrection scenario:
- Data is deleted on Node A, creating a tombstone
- Tombstone replicates to Node B
- Node C is down and misses the delete
- After gc_grace_seconds, the tombstone is garbage collected from A and B
- Node C comes back online with the original (pre-delete) data
- Without the tombstone, the deleted data "resurrects" during read repair
Prevention: Run repair on all nodes within gc_grace_seconds to ensure tombstones propagate before deletion.
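For example, the repair deadline for a given table can be read straight from the schema. A minimal check, assuming cqlsh connectivity and using my_keyspace/my_table as placeholder names:

```bash
# Read gc_grace_seconds for one table (Cassandra's default is 864000 s = 10 days).
# Every replica must be repaired within this many seconds of a deletion,
# otherwise tombstones may be collected before they reach all nodes.
cqlsh -e "SELECT gc_grace_seconds FROM system_schema.tables
          WHERE keyspace_name = 'my_keyspace' AND table_name = 'my_table';"
```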
How Repair Works¶
The Merkle Tree Process¶
Cassandra uses Merkle trees (hash trees) to efficiently detect differences between replicas without comparing every row.
Merkle tree segments:
The token range being repaired is divided into segments, with each segment represented as a leaf node in the Merkle tree. By default, Cassandra creates approximately 32,768 (2^15) segments per repair session. The tree depth is controlled by repair_session_max_tree_depth (default: 18 in Cassandra 4.0+).
Streaming granularity:
When a mismatch is detected, the entire segment is streamed—not individual rows. This means a single differing row causes the entire segment to be transferred:
| Table Size / Node | Segments | Min Stream Unit (1 segment) |
|---|---|---|
| 100 GB | 32,768 | ~3 MB |
| 500 GB | 32,768 | ~15 MB |
| 1 TB | 32,768 | ~30 MB |
For example, with a 500 GB table, if one row is inconsistent, the entire ~15 MB segment containing that row must be streamed. If inconsistencies are spread across many segments, streaming volumes increase proportionally.
Merkle tree comparison and streaming:
- Compare root hashes—if they match, replicas are identical (no streaming needed)
- If root hashes differ, compare child hashes recursively
- Drill down only into subtrees with mismatching hashes
- At the leaf level, stream the entire segment for any mismatching hash
Repair Session Lifecycle¶
A repair session moves through three phases: validation, in which each replica builds a Merkle tree for the requested ranges; comparison, in which the repair coordinator compares the trees to find mismatching segments; and synchronization, in which differing segments are streamed between replicas. For incremental repair, anti-compaction runs after the session completes.
Repair Types¶
Full Repair¶
Full repair is the original repair mechanism in Cassandra. It compares all data in the specified token ranges across all replicas, regardless of whether the data has been previously repaired.
How it works:
- The repair coordinator builds a Merkle tree from all SSTables in the repair range
- Each replica builds its own Merkle tree from all its SSTables
- Trees are compared to identify differences
- Differing data is streamed between replicas to synchronize
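As a concrete invocation (my_keyspace is a placeholder), a full repair of the node's primary ranges on Cassandra 4.0+ might look like the following; the -full flag is required there because incremental repair is the default:

```bash
# Full repair restricted to this node's primary token ranges (-pr).
# On 4.0+ the -full flag must be given explicitly, otherwise the
# repair runs in incremental mode.
nodetool repair -full -pr my_keyspace
```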
Advantages:
- Simple and reliable - no complex state tracking
- Guarantees complete consistency check across all data
- No risk of repaired/unrepaired state corruption
- Works correctly regardless of previous repair history
- Required after certain failure scenarios
Disadvantages:
- Re-validates already-consistent data unnecessarily
- Longer duration as data volume grows
- Higher resource consumption (CPU, memory, network, disk I/O)
When to use full repair:
- After node replacement or rebuild
- After recovering from data corruption
- When incremental repair state is suspect or corrupted
- Before major version upgrades
- As periodic validation (e.g., monthly) alongside incremental repairs
Incremental Repair¶
Incremental repair was introduced in Cassandra 2.1 via CASSANDRA-5351 to address the scalability limitations of full repair. It tracks which SSTables have been previously repaired and only validates new (unrepaired) data.
History and evolution:
Incremental repair had a troubled history in early versions. While the concept was sound, the implementation suffered from numerous bugs that could lead to data inconsistency, silent corruption of the repaired/unrepaired state, and operational challenges. Many operators avoided incremental repair entirely in versions prior to 4.0, preferring the slower but more reliable full repair.
| Version | Status | Notes |
|---|---|---|
| 2.1 | Introduced | Initial implementation; significant bugs and edge cases |
| 2.2 - 3.x | Problematic | Ongoing fixes but still unreliable for production use; many operators avoided it |
| 4.0+ | Production ready | Major rework; became default behavior; full repair requires -full flag |
Recommendation: For clusters running Cassandra 4.0 or later, incremental repair is the recommended approach for routine maintenance. For earlier versions, evaluate carefully and consider using full repair if stability is a concern.
How it works:
- Each SSTable has a repairedAt metadata field (0 = unrepaired, timestamp = repaired)
- During incremental repair, only SSTables with repairedAt = 0 are included in Merkle tree generation
- After successful repair, participating SSTables are marked with a repairedAt timestamp
- Subsequent repairs skip already-repaired SSTables
- Anti-compaction separates repaired and unrepaired data when SSTables contain both
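A sketch of the corresponding operations on Cassandra 4.0+ (keyspace name and SSTable path are placeholders): a plain repair runs incrementally by default, and the repairedAt marking can be inspected afterwards with the sstablemetadata tool.

```bash
# Incremental repair of this node's primary ranges (the default mode on 4.0+).
nodetool repair -pr my_keyspace

# Inspect an SSTable's repair state afterwards; a "Repaired at" value of 0
# means the SSTable is still considered unrepaired.
sstablemetadata /var/lib/cassandra/data/my_keyspace/my_table-*/nb-1-big-Data.db
```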
Advantages:
- Faster execution - only validates new data since last repair
- Lower resource consumption for routine maintenance
- Scales better with large datasets
- Enables more frequent repair cycles
- Reduces repair window, making it easier to complete within gc_grace_seconds
Disadvantages:
- Anti-compaction overhead after repair completion (see below)
- More complex operational model to understand and troubleshoot
Anti-compaction considerations:
After incremental repair completes, Cassandra runs anti-compaction to split SSTables that contain both repaired and unrepaired data. This process:
- Reads the SSTable and writes two new SSTables (one repaired, one unrepaired)
- Consumes disk I/O and temporary disk space (up to 2x the SSTable size during the split)
- Adds to compaction pending tasks
- Can delay the start of normal compaction work
Operational guidance:
- Monitor CompactionManager pending tasks during and after repair
- Ensure sufficient disk headroom (anti-compaction temporarily increases disk usage)
- On I/O-constrained systems, consider scheduling repairs during low-traffic periods
- The nodetool compactionstats command shows anti-compaction progress, as in the example below
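For instance, anti-compaction can be watched alongside regular compactions while a repair drains its backlog; the -H flag prints sizes in human-readable units:

```bash
# Poll compaction and anti-compaction activity every 30 seconds after a repair.
watch -n 30 nodetool compactionstats -H
```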
When to use incremental repair:
- Routine scheduled maintenance (default choice for Cassandra 4.0+)
- Clusters with large data volumes where full repair is impractical
- When repair must complete within tight time windows
Comparison Summary¶
| Aspect | Full Repair | Incremental Repair |
|---|---|---|
| Scope | All data in range | Only unrepaired SSTables |
| SSTable marking | Does not modify SSTable metadata | Marks SSTables with repairedAt timestamp |
| Duration | Longer (proportional to total data) | Shorter (proportional to new data) |
| Resource usage | Higher | Lower for routine runs |
| Complexity | Simple | Requires state tracking |
| Use case | Recovery, validation, periodic full check | Regular maintenance |
| Default (4.0+) | Must specify -full flag | Default behavior |
Repaired vs Unrepaired SSTables¶
Incremental repair tracks repair state at the SSTable level through the repairedAt metadata field described above: SSTables with a zero value are unrepaired and will be included in the next incremental repair, while SSTables carrying a timestamp are skipped.
Token Ranges and Repair Scope¶
Understanding Token Ranges¶
Cassandra partitions data across nodes using a token ring. Each node is responsible for specific token ranges.
Primary Range Repair (-pr)¶
The -pr flag limits repair to only the primary token ranges owned by the node, that is, the ranges assigned directly by its tokens rather than the ranges it holds as a non-primary replica.
Recommendation: Always use -pr for routine maintenance. Running -pr on each node in sequence ensures every range is repaired exactly once.
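A minimal sketch of that rotation for a small cluster, with placeholder hostnames and keyspace; each node repairs only its own primary ranges, so the runs together cover every range exactly once:

```bash
# Repair each node's primary ranges one node at a time. Together the runs
# cover every token range in the cluster exactly once.
# Requires remote JMX access to each node.
for host in node1.example.com node2.example.com node3.example.com; do
  nodetool -h "$host" repair -pr my_keyspace
done
```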
Repair Coordination¶
Behavior Without Keyspace or Table Specification¶
When repair is run without specifying tables, Cassandra iterates through every table in the keyspace; when no keyspace is given either, it iterates through every keyspace on the node.
Important considerations:
- Tables are repaired sequentially by default
- Use -j <threads> to repair multiple tables in parallel
- Large tables dominate repair duration
- Consider repairing critical tables separately
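For example (keyspace, table names, and thread count are placeholders), job threads and explicit table lists can be combined with -pr as follows:

```bash
# Repair up to four tables of the keyspace in parallel; -j accepts 1-4.
nodetool repair -pr -j 4 my_keyspace

# Repair only selected critical tables, listed after the keyspace name.
nodetool repair -pr my_keyspace users orders
```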
Anti-Compaction¶
After incremental repair, Cassandra performs anti-compaction to separate repaired and unrepaired data.
Anti-compaction ensures clean separation between repaired and unrepaired data, enabling efficient future incremental repairs.
Paxos Repairs¶
While standard repairs reconcile user table data across replicas, Paxos repairs specifically reconcile the Paxos state used by lightweight transactions (LWTs). LWTs are statements that include IF conditions (such as INSERT ... IF NOT EXISTS or UPDATE ... IF column = value), which provide linearizable consistency guarantees.
Paxos repairs maintain LWT linearizability and correctness, especially across topology changes such as bootstrap, decommission, replace, and move operations.
When Paxos Repairs Are Required¶
Paxos repairs are only relevant for keyspaces that use LWTs. For keyspaces that never use LWTs, Paxos state does not affect correctness, and operators MAY safely skip Paxos repairs for those keyspaces.
In Cassandra 4.1+, Paxos repairs run automatically every 5 minutes by default. Operators SHOULD ensure Paxos repairs run regularly on clusters where LWTs are in use. See Paxos Repairs in the Repair Strategies guide for operational details.
Paxos Repairs and Topology Changes¶
In Cassandra 4.1 and later, a Paxos repair gate runs before certain topology changes complete (for example, node bootstrap). This gate ensures that Paxos state is consistent across all replicas for the affected token ranges before the topology change finalizes.
If Paxos repair cannot complete for the affected ranges and keyspaces—for example, because nodes are overloaded, have very large partitions, or some replicas are unavailable—the topology change MUST fail to avoid violating LWT correctness guarantees.
Operators MAY encounter errors such as PaxosCleanupException with message CANCELLED when overloaded replicas cannot finish Paxos cleanup within the allowed time. This typically indicates that the cluster is under too much load or that specific partitions are too large for Paxos cleanup to complete successfully.
Paxos v2¶
Cassandra 4.1+ introduces Paxos v2, an updated Paxos implementation for lightweight transactions. Paxos v2 provides several improvements:
- Reduced network round-trips for LWT reads and writes
- Improved behavior under contention when multiple clients compete for the same partition
- Works in conjunction with regular Paxos repairs and Paxos state purging
Paxos v2 is selected via the paxos_variant setting in cassandra.yaml (values: v1 or v2).
To safely take full advantage of Paxos v2, operators MUST ensure:
- Regular Paxos repairs are running on all nodes
- Paxos state purging is configured appropriately (see Paxos-related cassandra.yaml configuration in the Repair Strategies guide)
Detailed configuration options and upgrade guidance are covered in the Repair Strategies documentation.
Next Steps¶
- Options Reference - Detailed explanation of all repair options
- Repair Strategies - Real-world implementation scenarios
- Scheduling Guide - Planning repair schedules within gc_grace_seconds