Distributed Data¶
Cassandra distributes data across multiple nodes to achieve fault tolerance and horizontal scalability. This section covers the three interconnected mechanisms that govern how data is distributed, replicated, and accessed consistently across the cluster.
Theoretical Foundations¶
Cassandra's distributed architecture derives from Amazon's Dynamo paper (DeCandia et al., 2007, "Dynamo: Amazon's Highly Available Key-value Store"), which introduced techniques for building highly available distributed systems.
| Dynamo Concept | Purpose | Cassandra Implementation |
|---|---|---|
| Consistent hashing | Distribute data across nodes | Token ring with partitioners |
| Virtual nodes | Even distribution, incremental scaling | vnodes (num_tokens) |
| Replication | Fault tolerance | Configurable replication factor |
| Sloppy quorum | Availability during failures | Hinted handoff |
| Vector clocks | Conflict resolution | Timestamps (last-write-wins) |
| Merkle trees | Efficient synchronization | Merkle tree synchronization |
| Gossip protocol | Failure detection, membership | Gossip protocol |
| CAP theorem | Consistency/availability trade-off | Tunable consistency per operation |
Cassandra implements most Dynamo concepts but makes different trade-offs in some areas.
Timestamps vs Vector Clocks
Cassandra uses timestamps for conflict resolution rather than vector clocks, choosing simplicity over preserving all conflicting versions. This means concurrent writes to the same key result in last-write-wins semantics based on timestamp, rather than preserving both versions for application-level resolution.
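The effect can be sketched in a few lines of plain Python (an illustration of the rule, not Cassandra's internal code): given divergent replica versions of the same cell, the version with the highest write timestamp wins (exact timestamp ties are broken deterministically).

```python
from dataclasses import dataclass

@dataclass
class CellVersion:
    value: str
    timestamp: int  # microseconds since epoch, as used for CQL write timestamps

def last_write_wins(versions: list[CellVersion]) -> CellVersion:
    """Resolve conflicting replica versions by keeping the highest write timestamp."""
    return max(versions, key=lambda v: v.timestamp)

# Two concurrent writes to the same cell, seen on different replicas:
replica_a = CellVersion(value="alice@old.example", timestamp=1_700_000_000_000_000)
replica_b = CellVersion(value="alice@new.example", timestamp=1_700_000_000_000_500)

print(last_write_wins([replica_a, replica_b]).value)  # alice@new.example
```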
Masterless Architecture¶
Cassandra's defining architectural characteristic is its masterless (or "peer-to-peer") design. Every node in a Cassandra cluster is identical in function—there is no primary, leader, master, or coordinator node that holds special responsibility for the cluster. Unlike systems that require distinct node roles (primary/replica, leader/follower, master/slave), Cassandra nodes are homogeneous: same configuration, same binary, same capabilities. This architectural simplicity translates directly to operational simplicity—adding capacity means deploying identical nodes, and any node can be replaced without special failover procedures.
What Masterless Means¶
In a traditional master-based system, all writes funnel through a single primary node, which replicates them to secondaries; if the primary fails, writes stall until a new primary is elected. In Cassandra's masterless design, any node can accept reads and writes for any partition, so there is no write bottleneck and no failover window.
In Cassandra:
- Any node can serve as coordinator for any request
- The coordinator role is assigned per-request, not per-cluster
- No node holds special metadata or routing responsibility
- The cluster continues operating if any node (or multiple nodes) fails
Comparison with Master-Based Systems¶
| System | Architecture | Write Path | Failure Behavior |
|---|---|---|---|
| Cassandra | Masterless | Any node accepts writes for any partition | No failover needed; remaining nodes continue |
| MongoDB | Primary/Secondary | Writes to primary only; primary replicates to secondaries | Election required; brief write unavailability |
| MySQL (Group Replication) | Single-primary or Multi-primary | Single-primary: one node; Multi-primary: any node | Primary election on failure |
| PostgreSQL (Streaming) | Primary/Standby | Primary only; standbys are read-only | Manual or automatic failover required |
| CockroachDB | Raft consensus | Leader per range; leader accepts writes | Raft leader election per range |
| TiDB | Raft consensus | Leader per region; leader accepts writes | Raft leader election per region |
| Redis Cluster | Primary/Replica per slot | Primary for each hash slot | Failover election per slot |
How Coordinator Selection Works¶
Cassandra drivers establish connections to all nodes in the local datacenter (and optionally remote DCs). The driver maintains a connection pool to each node and load balances requests across them. For each request:
1. Driver selects a coordinator node (load balancing across all connected nodes)
2. That node becomes the COORDINATOR for this request
3. Coordinator determines which nodes hold the data (using token ring)
4. Coordinator forwards request to appropriate replica nodes
5. Coordinator collects responses and returns result to client
Request 1: Client → Node A (coordinator) → Nodes B, C, D (replicas)
Request 2: Client → Node C (coordinator) → Nodes A, E, F (replicas)
Request 3: Client → Node B (coordinator) → Nodes C, D, A (replicas)
Each request can use a different coordinator.
No node is "special" or required for cluster operation.
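A minimal client-side sketch with the Python driver (the contact points, the dc1 datacenter name, and the my_keyspace keyspace are placeholder assumptions): the driver's load-balancing policy picks a coordinator per request, and with token awareness it prefers a node that is itself a replica of the partition.

```python
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
from cassandra.policies import DCAwareRoundRobinPolicy, TokenAwarePolicy

# Token-aware, DC-aware load balancing: requests go to local-DC nodes, and the
# chosen coordinator is usually a replica of the requested partition.
profile = ExecutionProfile(
    load_balancing_policy=TokenAwarePolicy(DCAwareRoundRobinPolicy(local_dc="dc1"))
)
cluster = Cluster(
    contact_points=["10.0.0.1", "10.0.0.2"],
    execution_profiles={EXEC_PROFILE_DEFAULT: profile},
)
session = cluster.connect("my_keyspace")

# Each execute() may use a different coordinator; no node is required.
row = session.execute("SELECT name FROM users WHERE id = %s", (123,)).one()
```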
Benefits of Masterless Design¶
| Benefit | Description |
|---|---|
| No single point of failure | Any node can fail without affecting cluster availability |
| No failover delay | No election process; operations continue immediately |
| Linear write scalability | All nodes accept writes; adding nodes increases write capacity |
| Simpler operations | No primary/replica distinction to manage |
| Geographic distribution | Each datacenter is autonomous; no cross-DC leader election |
Trade-offs of Masterless Design¶
| Trade-off | Description | Mitigation |
|---|---|---|
| Conflict resolution | Concurrent writes to same key can conflict | Last-write-wins (timestamps); LWT for critical operations |
| No single source of truth | No authoritative primary for reads | Quorum reads; repair for convergence |
| Coordination overhead | Each request requires multi-node coordination | Token-aware routing; LOCAL_* consistency levels |
| Complexity in ordering | No global write ordering | Per-partition ordering; application-level sequencing |
Consistency Guarantees: Master-Based vs Masterless¶
Master-based architectures achieve consistency through a single authoritative node—but at the cost of availability during failures and a write throughput ceiling. Cassandra provides the same guarantees when needed, while allowing flexibility to optimize for availability or performance when strong consistency is not required.
| Requirement | Master-Based Approach | Cassandra Approach |
|---|---|---|
| Strong consistency | All writes through primary (single point of failure, write bottleneck) | SERIAL consistency via Paxos (no single point of failure, scales horizontally) |
| Conflict resolution | Primary is authoritative (unavailable during failover) | Last-write-wins by default; LWT for compare-and-set when needed |
| Read-your-writes | Read from primary (adds latency, primary overload risk) | QUORUM reads + writes (R + W > N), load balanced across replicas |
Key Difference
Master-based systems force strong consistency at all times with corresponding availability trade-offs. Cassandra allows per-query tuning—strong consistency for financial transactions, eventual consistency for analytics or caching.
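A hedged sketch of this per-query tuning with the Python driver (connection details and keyspace name are placeholder assumptions): a QUORUM write for strong consistency, and a lightweight transaction whose Paxos round runs at SERIAL consistency.

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(["10.0.0.1"]).connect("my_keyspace")

# Strong consistency for a critical write: QUORUM contributes W=2 toward R + W > N with RF=3.
write = SimpleStatement(
    "INSERT INTO users (id, name) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.QUORUM,
)
session.execute(write, (123, "Alice"))

# Compare-and-set via a lightweight transaction (Paxos). The serial consistency
# level governs the Paxos phase; the regular level governs the commit.
lwt = SimpleStatement(
    "INSERT INTO users (id, name) VALUES (%s, %s) IF NOT EXISTS",
    consistency_level=ConsistencyLevel.QUORUM,
    serial_consistency_level=ConsistencyLevel.SERIAL,
)
applied = session.execute(lwt, (123, "Alice")).was_applied
```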
Multi-Datacenter Masterless Operation¶
Cassandra's masterless design extends across datacenters:
- Single cluster spans multiple datacenters with topology-aware replication
- No cross-DC leader election required
- LOCAL_QUORUM enables DC-local consistency without cross-DC coordination
- Cluster continues operating if one DC fails entirely
Multi-DC Latency Advantage
Systems using Raft or Paxos consensus across datacenters require cross-DC communication for every write, adding latency proportional to the distance between datacenters. Cassandra with LOCAL_QUORUM avoids this cross-DC round-trip for most operations.
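A brief sketch, again with the Python driver and placeholder connection details: routing to the local datacenter and using LOCAL_QUORUM, so a write waits only on a quorum of local-DC replicas and never on a cross-DC round trip.

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
from cassandra.policies import DCAwareRoundRobinPolicy, TokenAwarePolicy

# LOCAL_QUORUM: acknowledgments come only from replicas in the local DC (dc1 here),
# while replication to remote DCs proceeds asynchronously.
local_profile = ExecutionProfile(
    load_balancing_policy=TokenAwarePolicy(DCAwareRoundRobinPolicy(local_dc="dc1")),
    consistency_level=ConsistencyLevel.LOCAL_QUORUM,
)
cluster = Cluster(["10.0.0.1"],
                  execution_profiles={EXEC_PROFILE_DEFAULT: local_profile})
session = cluster.connect("my_keyspace")
session.execute("INSERT INTO users (id, name) VALUES (%s, %s)", (123, "Alice"))
```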
Core Concepts¶
Partitioning¶
Partitioning determines which node stores a given piece of data. Cassandra uses consistent hashing to map partition keys to tokens, and tokens to nodes on a ring.
Partition Key → Hash Function → Token → Node(s)
Example:
"user:123" → Murmur3Hash → -7509452495886106294 → Node B
The partitioner (hash function) ensures even data distribution regardless of key patterns. This prevents hot spots where sequential keys would otherwise concentrate on a single node.
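A toy sketch of the idea in Python (illustrative only, not Cassandra's Murmur3 partitioner or vnode logic): hash the partition key to a token, then walk the ring to the first node whose token is greater than or equal to it, wrapping around at the end.

```python
import bisect
import hashlib

# Toy ring: token -> node. Real Cassandra uses Murmur3 over the signed 64-bit
# token range and assigns many tokens (vnodes) per node.
RING = sorted([(-6_000_000_000_000_000_000, "node_a"),
               (-1_000_000_000_000_000_000, "node_b"),
               (3_000_000_000_000_000_000, "node_c"),
               (7_000_000_000_000_000_000, "node_d")])
TOKENS = [token for token, _ in RING]

def token_for(partition_key: str) -> int:
    # Fold a hash into a signed 64-bit token (stand-in for Murmur3).
    digest = hashlib.md5(partition_key.encode()).digest()
    return int.from_bytes(digest[:8], "big", signed=True)

def owner(partition_key: str) -> str:
    token = token_for(partition_key)
    idx = bisect.bisect_left(TOKENS, token) % len(RING)  # wrap around the ring
    return RING[idx][1]

print(owner("user:123"))
```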
See Partitioning for details on consistent hashing, the token ring, and partitioner options.
Replication¶
Replication copies each partition to multiple nodes for fault tolerance. The replication factor (RF) determines how many copies exist.
The replication strategy determines how replicas are placed—whether they respect datacenter and rack boundaries to survive infrastructure failures.
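For illustration, replication is declared per keyspace. A minimal sketch using the Python driver (the keyspace name and the dc1/dc2 datacenter names are placeholder assumptions and must match the names reported by the snitch):

```python
from cassandra.cluster import Cluster

session = Cluster(["10.0.0.1"]).connect()

# Three replicas in each datacenter, placed rack-aware by NetworkTopologyStrategy.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS my_keyspace
    WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3, 'dc2': 3}
""")
```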
See Replication for details on strategies, snitches, and configuration.
Consistency¶
Consistency determines how many replicas must acknowledge reads and writes. Because replicas may temporarily diverge, the consistency level controls the trade-off between consistency, availability, and latency.
Write with QUORUM (RF=3):
- Send write to all 3 replicas
- Wait for 2 acknowledgments (majority)
- Return success to client
Read with QUORUM (RF=3):
- Contact 2 replicas
- Compare responses, return newest
- Propagate newest version to stale replicas
Strong Consistency Formula
The formula R + W > N (reads + writes > replication factor) guarantees that reads see the latest writes. With RF=3, using QUORUM (2) for both reads and writes satisfies this: 2 + 2 = 4 > 3.
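A short sketch of the formula in practice with the Python driver (connection details are placeholder assumptions): with RF=3, QUORUM on both sides gives W=2 and R=2, so every read quorum overlaps every write quorum and observes the latest write.

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(["10.0.0.1"]).connect("my_keyspace")

# QUORUM write (W=2 of 3 replicas must acknowledge).
write = SimpleStatement("UPDATE users SET name = %s WHERE id = %s",
                        consistency_level=ConsistencyLevel.QUORUM)
session.execute(write, ("Alice", 123))

# QUORUM read (R=2): R + W = 4 > 3, so at least one contacted replica has the write.
read = SimpleStatement("SELECT name FROM users WHERE id = %s",
                       consistency_level=ConsistencyLevel.QUORUM)
print(session.execute(read, (123,)).one().name)
```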
See Consistency for details on consistency levels and guarantees.
Replica Synchronization¶
When replicas diverge due to failures or timing differences, synchronization mechanisms detect the divergence and propagate missing data to restore convergence:
| Mechanism | Trigger | Function |
|---|---|---|
| Hinted handoff | Write to unavailable replica | Deferred delivery when replica recovers |
| Read reconciliation | Query execution | Propagates newest version to stale replicas |
| Merkle tree synchronization | Scheduled maintenance | Full dataset comparison and convergence |
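A toy sketch of the Merkle-tree idea in Python (illustrative only, not Cassandra's repair implementation): each replica hashes its token ranges into a tree, the roots are compared first, and only the divergent ranges need to be streamed.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_levels(leaf_hashes: list[bytes]) -> list[list[bytes]]:
    """Build the tree bottom-up; levels[0] is the leaves, levels[-1] is the root."""
    levels = [leaf_hashes]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([h(prev[i] + prev[i + 1]) for i in range(0, len(prev), 2)])
    return levels

def diff_ranges(a_leaves: list[bytes], b_leaves: list[bytes]) -> list[int]:
    """Return indexes of leaf ranges whose hashes disagree between two replicas."""
    a, b = merkle_levels(a_leaves), merkle_levels(b_leaves)
    if a[-1] == b[-1]:   # roots match: replicas are already in sync
        return []
    # A real implementation descends the tree to localize differences;
    # here we simply scan the leaf level after the root check.
    return [i for i, (x, y) in enumerate(zip(a[0], b[0])) if x != y]

# Four token ranges per replica; replica B holds stale data in range 2.
replica_a = [h(b"range0:v1"), h(b"range1:v1"), h(b"range2:v2"), h(b"range3:v1")]
replica_b = [h(b"range0:v1"), h(b"range1:v1"), h(b"range2:v1"), h(b"range3:v1")]
print(diff_ranges(replica_a, replica_b))  # [2] -> only this range needs streaming
```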
See Replica Synchronization for details on convergence mechanisms.
How They Work Together¶
A single write operation exercises all three mechanisms, with replica synchronization stepping in when failures leave replicas divergent:
CONSISTENCY QUORUM;   -- consistency is set per request (here via cqlsh; drivers set it per statement)
INSERT INTO users (id, name) VALUES (123, 'Alice');
1. PARTITIONING
Coordinator hashes partition key:
token(123) = -7509452495886106294
2. REPLICATION
Look up replicas for this token:
RF=3, NetworkTopologyStrategy → Nodes A, B, C (different racks)
3. CONSISTENCY
Send write to all replicas, wait for QUORUM (2):
Node A: ACK ✓
Node B: ACK ✓
Node C: (still writing, but QUORUM met)
Return SUCCESS to client
4. ANTI-ENTROPY
If Node C was temporarily down:
- Coordinator stores hint
- When C recovers, hint is delivered
- If hints expire, scheduled synchronization restores convergence
CAP Theorem¶
The CAP theorem, formulated by Eric Brewer in 2000 and proven by Gilbert and Lynch in 2002 (Gilbert & Lynch, 2002, "Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services"), states that a distributed data store can provide at most two of three guarantees simultaneously:
The Three Properties¶
| Property | Definition | Implication |
|---|---|---|
| Consistency (C) | All nodes see the same data at the same time | Reads always return the most recent write |
| Availability (A) | Every request receives a non-error response | No request times out or returns failure |
| Partition Tolerance (P) | System continues operating despite network partitions | Nodes can be split into groups that cannot communicate |
Partition Tolerance Is Not Optional¶
Network Partitions Are Inevitable
In real distributed systems, network partitions are inevitable—switches fail, cables are cut, datacenters lose connectivity. A system that cannot tolerate partitions is not a distributed system; it is a single-node system with remote storage.
Therefore, the practical choice is between:
- CP (Consistency + Partition Tolerance): During a partition, reject operations that cannot guarantee consistency
- AP (Availability + Partition Tolerance): During a partition, continue accepting operations even if nodes may diverge
Database Classification¶
| Category | Behavior During Partition | Examples |
|---|---|---|
| CP | Reject operations that cannot be consistently applied | PostgreSQL, MySQL, MongoDB (default), CockroachDB, Spanner |
| AP | Accept operations; resolve conflicts later | Cassandra (default), DynamoDB, Riak, CouchDB |
Where Cassandra Sits¶
Cassandra is typically classified as AP—it prioritizes availability over consistency by default. However, this classification oversimplifies Cassandra's capabilities.
Tunable, Not Fixed
Unlike most databases that are permanently CP or AP, Cassandra allows choosing the consistency-availability trade-off on a per-operation basis. The same cluster can serve AP workloads (analytics, caching) and CP workloads (transactions, user data) simultaneously.
Default behavior (AP):
- Writes succeed if any replica is available
- Reads return data even if replicas disagree
- Conflicts resolved by timestamp (last-write-wins)
With tunable consistency, Cassandra can behave as CP:
- QUORUM reads and writes ensure R + W > N (overlapping quorums)
- ALL requires all replicas to respond
- SERIAL provides linearizable consistency via Paxos
Tunable Consistency: Per-Operation CAP Position¶
Unlike databases that enforce a single consistency model, Cassandra allows choosing the consistency-availability trade-off for each operation:
| Consistency Level | CAP Position | Behavior During Partition |
|---|---|---|
| ANY | AP | Write succeeds if any node (including coordinator) receives it |
| ONE | AP | Write/read succeeds if one replica responds |
| QUORUM | CP | Requires majority of replicas; may reject if quorum unavailable |
| ALL | CP | Requires all replicas; rejects if any replica unavailable |
| SERIAL | CP | Linearizable via Paxos; rejects if consensus cannot be reached |
Practical implications:
Partition splits cluster: Nodes {A, B} | {C, D, E}
RF = 3, replicas on nodes A, C, E
With QUORUM (requires 2 of 3 replicas):
- Left side (A): Can reach 1 replica → QUORUM fails
- Right side (C, E): Can reach 2 replicas → QUORUM succeeds
- Result: CP behavior, partial availability
With ONE:
- Both sides can reach at least 1 replica
- Result: AP behavior, full availability, possible inconsistency
CAP During Normal Operation¶
CAP Only Applies During Partitions
The CAP theorem applies specifically during network partitions. During normal operation (no partitions), Cassandra provides both consistency and availability—the trade-off only manifests when partitions occur.
| State | Consistency | Availability | Notes |
|---|---|---|---|
| Normal operation | ✓ (with QUORUM) | ✓ | No trade-off required |
| During partition | Choose one | Choose one | CAP trade-off applies |
| After partition heals | ✓ (eventually) | ✓ | Repair restores consistency |
PACELC: Beyond CAP¶
The PACELC theorem (Abadi, 2012) extends CAP to address behavior during normal operation:
- If a Partition occurs (P): trade off Availability (A) vs Consistency (C)
- Else (E), during normal operation: trade off Latency (L) vs Consistency (C)
During normal operation, the trade-off is between latency and consistency:
| System | During Partition (PAC) | Normal Operation (ELC) |
|---|---|---|
| Cassandra (ONE) | PA | EL (low latency, eventual consistency) |
| Cassandra (QUORUM) | PC | EC (higher latency, strong consistency) |
| PostgreSQL | PC | EC |
| DynamoDB | PA | EL |
Cassandra's tunable consistency allows choosing different PACELC positions for different operations within the same cluster.
Documentation Structure¶
| Section | Description |
|---|---|
| Partitioning | Consistent hashing, token ring, partitioners |
| Replication | Strategies, snitches, replication factor |
| Consistency | Consistency levels, guarantees, LWT |
| Consensus and Paxos | Consensus algorithms and Paxos for lightweight transactions |
| Replica Synchronization | Hinted handoff, read reconciliation, Merkle trees |
| Secondary Index Queries | Distributed query execution with indexes |
| Materialized Views | Distributed MV coordination and consistency challenges |
| Data Streaming | Bootstrap, decommission, repair, and hinted handoff streaming |
| Counters | Distributed counting, CRDTs, counter operations |
Related Documentation¶
- Gossip Protocol - Cluster membership and failure detection
- Storage Engine - How data is stored on each node
- Indexes - Secondary index types and selection
- Operations - Maintenance and operational procedures