Cassandra Architecture¶
Cassandra's architecture explains its behavior. The partition key determines which node stores data—get it wrong, and queries become slow or impossible. Deletes write tombstones instead of removing data immediately—ignore this, and deleted records can reappear. Nodes can disagree on data temporarily—skip repair, and that disagreement becomes permanent.
The design combines Amazon Dynamo's distribution approach (masterless ring, gossip protocol, tunable consistency) with Google BigTable's storage approach (LSM-tree, SSTables, memtables). Understanding both sides leads to better decisions about data modeling and operations.
Architecture Overview¶
Apache Cassandra is a distributed, peer-to-peer database designed for:
- High Availability: No single point of failure
- Linear Scalability: Add nodes to increase capacity
- Geographic Distribution: Multi-datacenter replication
- Tunable Consistency: Balance between consistency and availability
Core Concepts¶
Ring Architecture¶
Cassandra organizes nodes in a logical ring structure using consistent hashing:
- Each node owns a range of tokens on the ring
- Data is assigned to nodes based on partition key hash
- Data is replicated to multiple nodes for fault tolerance
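A toy sketch in Python (illustrative only, not Cassandra's implementation) shows how these properties fall out of a hash ring: each node owns the range of tokens ending at its own token, a key's hash selects the owner, and additional replicas follow clockwise. The hash function, node names, and replication factor below are stand-ins chosen for the example.

```python
import bisect
import hashlib


class TokenRing:
    """Toy hash ring: each node owns the token range ending at its token."""

    def __init__(self, nodes, replication_factor=3):
        # Illustrative token assignment; Cassandra assigns tokens via
        # configuration or virtual nodes (vnodes), which this sketch ignores.
        self.rf = replication_factor
        self.ring = sorted((self.token(n), n) for n in nodes)
        self.tokens = [t for t, _ in self.ring]

    @staticmethod
    def token(key: str) -> int:
        # Stand-in hash; Cassandra's default partitioner uses MurmurHash3.
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big", signed=True)

    def replicas_for(self, partition_key: str):
        """Primary replica owns the token; the rest follow clockwise."""
        t = self.token(partition_key)
        start = bisect.bisect_left(self.tokens, t) % len(self.ring)
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(self.rf)]


ring = TokenRing(["node1", "node2", "node3", "node4", "node5"], replication_factor=3)
print(ring.replicas_for("user:42"))   # e.g. ['node3', 'node4', 'node5']
```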
Partitioner and Hash Calculation¶
The partitioner is responsible for computing a hash (token) value from the partition key. Cassandra uses this hash to determine data placement on both write and read operations:
- Murmur3Partitioner (default): Uses the MurmurHash3 algorithm, producing a 64-bit hash value. Token range spans from -2^63 to 2^63-1.
- RandomPartitioner: Uses MD5 hashing, producing a 128-bit hash. Token range spans from 0 to 2^127-1.
Write operations: When inserting data, the coordinator node computes hash(partition_key) to produce a token value. This token maps to a specific position on the ring, identifying the primary replica node. Additional replicas are selected by traversing the ring clockwise.
Read operations: The same hash calculation occurs during SELECT queries. The coordinator computes hash(partition_key) from the WHERE clause to locate the exact nodes holding the requested data.
This deterministic hashing ensures that:
- The same partition key always maps to the same token
- Any node can calculate which nodes own a given partition
- No central directory or lookup service is required
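Continuing the toy ring above (still an illustration, not the real code), the placement calculation is a pure function of the partition key and the ring layout, so every node that holds the same ring metadata reaches the same answer:

```python
# Any node holding the same ring metadata computes identical placement.
nodes = ["node1", "node2", "node3", "node4", "node5"]
ring_on_node1 = TokenRing(nodes)
ring_on_node4 = TokenRing(nodes)

key = "user:42"
assert TokenRing.token(key) == TokenRing.token(key)                        # same key, same token
assert ring_on_node1.replicas_for(key) == ring_on_node4.replicas_for(key)  # same replica set
# Hence any node can act as coordinator for a read or write of this key,
# with no central directory or lookup service involved.
```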
Partition Key¶
The partition key determines which node stores the data:
```sql
CREATE TABLE users (
    user_id UUID,      -- Partition key
    name    TEXT,
    email   TEXT,
    PRIMARY KEY (user_id)
);
```
How it works:
1. Partition key value is hashed: hash(user_id) → token
2. Token maps to a token range
3. Node owning that range stores the data
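As a hedged sketch of what this looks like from a client, the DataStax Python driver performs the same hash(partition_key) calculation on the client side when a token-aware load-balancing policy is configured, and routes requests directly to a replica. The contact point and datacenter name are placeholders, the users table is assumed to live in the my_app keyspace used elsewhere on this page, and newer driver versions may prefer configuring the policy through an execution profile.

```python
import uuid

from cassandra.cluster import Cluster
from cassandra.policies import DCAwareRoundRobinPolicy, TokenAwarePolicy

# Placeholder contact point and datacenter name.
cluster = Cluster(
    ["127.0.0.1"],
    load_balancing_policy=TokenAwarePolicy(DCAwareRoundRobinPolicy(local_dc="dc1")),
)
session = cluster.connect("my_app")   # assumed keyspace containing the users table

insert = session.prepare("INSERT INTO users (user_id, name, email) VALUES (?, ?, ?)")
select = session.prepare("SELECT name, email FROM users WHERE user_id = ?")

user_id = uuid.uuid4()
# The driver computes hash(user_id) and routes both statements to the same replicas.
session.execute(insert, (user_id, "Ada", "ada@example.com"))
row = session.execute(select, (user_id,)).one()
print(row.name, row.email)

cluster.shutdown()
```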
Replication¶
Data is replicated across multiple nodes for fault tolerance:
```sql
CREATE KEYSPACE my_app WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'dc1': 3,   -- 3 copies in dc1
    'dc2': 3    -- 3 copies in dc2
};
```
Replication Factor (RF): The number of copies of each row stored in the cluster; with NetworkTopologyStrategy it is specified per datacenter.
Documentation Structure¶
Data Distribution¶
- Partitioning - How data is distributed across nodes
- Replication - SimpleStrategy vs NetworkTopologyStrategy
- Consistency Levels - Tuning consistency vs availability
- Replica Synchronization - Repair, hinted handoff, read repair
Memory Management¶
- JVM - Java Virtual Machine configuration and garbage collection
- Cassandra Memory - Heap, off-heap, and page cache
- Linux - Kernel settings, swap, THP, and NUMA
Storage Engine¶
- Storage Engine Overview - How Cassandra stores data
- Write Path - Memtables and commit log
- Read Path - How reads are served
- SSTables - On-disk storage format
- Tombstones - How deletes work
Compaction¶
- Compaction Overview - Why compaction matters
- STCS - Size-Tiered Compaction Strategy
- LCS - Leveled Compaction Strategy
- TWCS - Time-Window Compaction Strategy
- UCS - Unified Compaction Strategy (5.0+)
Cluster Communication¶
- Gossip Protocol - How nodes discover each other
Client Connections¶
- Client Connection Architecture - How clients communicate with Cassandra
- CQL Protocol - Binary wire protocol specification
- Async Connections - Connection pooling and multiplexing
- Authentication - SASL authentication mechanisms
- Load Balancing - Request routing policies
- Prepared Statements - Statement preparation lifecycle
- Pagination - Query result paging
- Throttling - Rate limiting and backpressure
- Compression - Protocol compression
- Failure Handling - Retry and speculative execution
Key Architecture Components¶
Write Path¶
- Commit Log: Write-ahead log for durability
- Memtable: In-memory sorted structure
- SSTable: Immutable on-disk file (created when memtable flushes)
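A deliberately simplified Python sketch of that flow (an illustration, not the actual storage engine): writes are appended to a log for durability, applied to an in-memory table, and flushed to an immutable, sorted structure once the memtable is full.

```python
class ToyStorageEngine:
    """Illustrative LSM-style write path: commit log -> memtable -> SSTable."""

    def __init__(self, flush_threshold=4):
        self.commit_log = []      # append-only write-ahead log (durability)
        self.memtable = {}        # in-memory table (Cassandra keeps it sorted)
        self.sstables = []        # immutable, sorted runs (on disk in reality)
        self.flush_threshold = flush_threshold

    def write(self, partition_key, value):
        # 1. Append to the commit log first so the write survives a crash.
        self.commit_log.append((partition_key, value))
        # 2. Apply to the memtable; newer writes shadow older ones.
        self.memtable[partition_key] = value
        # 3. Flush when the memtable is "full" (the real threshold is size-based).
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # An SSTable is an immutable, sorted snapshot of the memtable.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable.clear()
        self.commit_log.clear()   # flushed data no longer needs replay


engine = ToyStorageEngine()
for i in range(5):
    engine.write(f"user:{i}", {"name": f"user{i}"})
print(len(engine.sstables), "SSTable(s);", len(engine.memtable), "row(s) still in memtable")
```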
Read Path¶
- Memtable: The most recent writes are read from memory first
- Bloom Filter: Skips SSTables that cannot contain the requested partition
- Key Cache / Partition Index: Locates the partition within each remaining SSTable
- Merge: Results from the memtable and SSTables are merged, newest value wins
SSTable Structure¶
SSTable files:

```
├── Data.db             # Actual row data
├── Index.db            # Partition index
├── Filter.db           # Bloom filter
├── Summary.db          # Index summary (sample of the partition index)
├── Statistics.db       # SSTable metadata and statistics
├── CompressionInfo.db  # Compression metadata
└── TOC.txt             # Table of contents (list of components)
```
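A conceptual Python sketch of how a read might consult these components (hedged, not the actual on-disk code): the Bloom filter rules out SSTables that cannot contain the partition, the summary and index would narrow the seek into Data.db, and results from the memtable and the remaining SSTables are merged.

```python
from dataclasses import dataclass, field


@dataclass
class ToySSTable:
    """Conceptual stand-in for one SSTable and its companion files."""
    data: dict                                # Data.db: partition_key -> row
    bloom: set = field(default_factory=set)   # Filter.db: probabilistic key set

    def __post_init__(self):
        # A real Bloom filter allows false positives but never false negatives;
        # a plain set models only the "never misses a real key" property.
        self.bloom = set(self.data)

    def maybe_contains(self, key):
        return key in self.bloom              # Filter.db check, before any disk seek

    def read(self, key):
        # In reality Summary.db and Index.db narrow down the byte offset in Data.db.
        return self.data.get(key)


def read_partition(memtable, sstables, key):
    """Collect all versions of a partition; real reads merge them by timestamp."""
    versions = []
    if key in memtable:                       # newest data lives in memory
        versions.append(memtable[key])
    for sst in sstables:
        if sst.maybe_contains(key):           # Bloom filter skips irrelevant SSTables
            row = sst.read(key)
            if row is not None:
                versions.append(row)
    return versions


tables = [ToySSTable({"user:1": {"name": "Ada"}}), ToySSTable({"user:2": {"name": "Grace"}})]
print(read_partition({"user:1": {"name": "Ada L."}}, tables, "user:1"))
```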
Consistency Model¶
Cassandra offers tunable consistency per operation:
Consistency Levels¶
| Level | Reads | Writes |
|---|---|---|
| ONE | 1 replica | 1 replica |
| QUORUM | ⌊RF/2⌋ + 1 | ⌊RF/2⌋ + 1 |
| LOCAL_QUORUM | Quorum in local DC | Quorum in local DC |
| ALL | All replicas | All replicas |
Strong Consistency Formula¶
For strong consistency (read-your-writes):
R + W > RF
Where:
R = Read consistency level (number of replicas)
W = Write consistency level (number of replicas)
RF = Replication factor
Example with RF=3:
- QUORUM reads (2) + QUORUM writes (2): 2 + 2 = 4 > 3 ✓
- ONE reads (1) + ONE writes (1): 1 + 1 = 2, which is not greater than 3 ✗
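A few lines of Python make the arithmetic explicit (assuming RF=3 and single-datacenter quorums):

```python
def quorum(rf: int) -> int:
    """Replicas required for QUORUM: floor(RF/2) + 1."""
    return rf // 2 + 1


def read_your_writes(r: int, w: int, rf: int) -> bool:
    """Strong consistency holds when read and write replica sets must overlap."""
    return r + w > rf


RF = 3
print(quorum(RF))                                    # 2
print(read_your_writes(quorum(RF), quorum(RF), RF))  # True:  2 + 2 = 4 > 3
print(read_your_writes(1, 1, RF))                    # False: 1 + 1 = 2, not > 3
print(read_your_writes(1, RF, RF))                   # True:  writes at ALL, reads at ONE
```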
Fault Tolerance¶
By storing multiple copies of data across nodes, Cassandra tolerates node failures without loss of service or data. As described in the Consistency Model section, operations succeed as long as enough replicas respond to satisfy the consistency level—allowing nodes, racks, or entire datacenters to fail while maintaining availability.
For details on how Cassandra detects failures and synchronizes replicas after recovery, see Replica Synchronization.
Multi-Datacenter Replication¶
NetworkTopologyStrategy enables per-datacenter replication:
```sql
CREATE KEYSPACE production WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'us-east': 3,
    'eu-west': 3
};
```
With LOCAL_QUORUM, each datacenter serves reads and writes independently, so the cluster survives node failures, cross-datacenter network partitions, or the loss of an entire datacenter without sacrificing availability or durability in the surviving datacenter.
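As a hedged sketch of how an application deployed in one datacenter might use this keyspace with the DataStax Python driver: the DC-aware policy keeps routing in the local datacenter, and LOCAL_QUORUM requires acknowledgement only from local replicas, so a remote datacenter outage does not block the request. The contact point is a placeholder, the users table is assumed to exist in this keyspace, and newer driver versions may prefer configuring the policy through an execution profile.

```python
import uuid

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.policies import DCAwareRoundRobinPolicy, TokenAwarePolicy
from cassandra.query import SimpleStatement

# Placeholder contact point; local_dc matches the 'us-east' datacenter above.
cluster = Cluster(
    ["10.0.0.1"],
    load_balancing_policy=TokenAwarePolicy(DCAwareRoundRobinPolicy(local_dc="us-east")),
)
session = cluster.connect("production")

# LOCAL_QUORUM: 2 of the 3 us-east replicas must respond; eu-west is not consulted,
# so a full eu-west outage or a cross-DC partition does not block this query.
query = SimpleStatement(
    "SELECT name, email FROM users WHERE user_id = %s",   # assumes a users table here
    consistency_level=ConsistencyLevel.LOCAL_QUORUM,
)
row = session.execute(query, (uuid.uuid4(),)).one()        # None unless the id exists
cluster.shutdown()
```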
Performance Characteristics¶
Write Performance¶
| Factor | Impact |
|---|---|
| Commit log sync | periodic (default) favors throughput; batch waits for the fsync before acknowledging |
| Memtable size | Larger = fewer flushes |
| Compaction | Background CPU/IO |
| Replication | More replicas = more writes |
Read Performance¶
| Factor | Impact |
|---|---|
| Partition size | Smaller is better |
| SSTable count | Fewer is better |
| Bloom filters | Reduce disk reads |
| Caching | Key/row cache hit rate |
| Data model | Query-driven design |
Further Reading¶
Deep Dives¶
- Data Distribution Deep Dive
- Compaction Strategies Explained
- Consistency Levels Guide
- Storage Engine Internals
Related Topics¶
- Data Modeling - Design for Cassandra
- Performance Tuning - Optimize cluster performance
- Operations - Day-to-day management