Cassandra Architecture¶
Cassandra's architecture explains its behavior. The partition key determines which node stores data—get it wrong, and queries become slow or impossible. Deletes write tombstones instead of removing data immediately—ignore this, and deleted records can reappear. Nodes can disagree on data temporarily—skip repair, and that disagreement becomes permanent.
The design combines Amazon Dynamo's distribution approach (masterless ring, gossip protocol, tunable consistency) with Google BigTable's storage approach (LSM-tree, SSTables, memtables). Understanding both sides leads to better decisions about data modeling and operations.
Architecture Overview¶
Apache Cassandra is a distributed, peer-to-peer database designed for:
- High Availability: No single point of failure
- Linear Scalability: Add nodes to increase capacity
- Geographic Distribution: Multi-datacenter replication
- Tunable Consistency: Balance between consistency and availability
Core Concepts¶
Ring Architecture¶
Cassandra organizes nodes in a logical ring structure using consistent hashing:
- Each node owns a range of tokens on the ring
- Data is assigned to nodes based on partition key hash
- Data is replicated to multiple nodes for fault tolerance
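A toy sketch in Python (illustrative only, not Cassandra's implementation) shows how these properties fall out of a hash ring: each node owns the range of tokens ending at its own token, a key's hash selects the owner, and additional replicas follow clockwise. The hash function, node names, and replication factor below are stand-ins chosen for the example.

```python
import bisect
import hashlib


class TokenRing:
    """Toy hash ring: each node owns the token range ending at its token."""

    def __init__(self, nodes, replication_factor=3):
        # Illustrative token assignment; Cassandra assigns tokens via
        # configuration or virtual nodes (vnodes), which this sketch ignores.
        self.rf = replication_factor
        self.ring = sorted((self.token(n), n) for n in nodes)
        self.tokens = [t for t, _ in self.ring]

    @staticmethod
    def token(key: str) -> int:
        # Stand-in hash; Cassandra's default partitioner uses MurmurHash3.
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big", signed=True)

    def replicas_for(self, partition_key: str):
        """Primary replica owns the token; the rest follow clockwise."""
        t = self.token(partition_key)
        start = bisect.bisect_left(self.tokens, t) % len(self.ring)
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(self.rf)]


ring = TokenRing(["node1", "node2", "node3", "node4", "node5"], replication_factor=3)
print(ring.replicas_for("user:42"))   # e.g. ['node3', 'node4', 'node5']
```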
Partitioner and Hash Calculation¶
The partitioner is responsible for computing a hash (token) value from the partition key. Cassandra uses this hash to determine data placement on both write and read operations:
- Murmur3Partitioner (default): Uses the MurmurHash3 algorithm, producing a 64-bit hash value. Token range spans from -2^63 to 2^63-1.
- RandomPartitioner: Uses MD5 hashing, producing a 128-bit hash. Token range spans from 0 to 2^127-1.
Write operations: When inserting data, the coordinator node computes hash(partition_key) to produce a token value. This token maps to a specific position on the ring, identifying the primary replica node. Additional replicas are selected by traversing the ring clockwise.
Read operations: The same hash calculation occurs during SELECT queries. The coordinator computes hash(partition_key) from the WHERE clause to locate the exact nodes holding the requested data.
This deterministic hashing ensures that:
- The same partition key always maps to the same token
- Any node can calculate which nodes own a given partition
- No central directory or lookup service is required
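Continuing the toy ring above (still an illustration, not the real code), the placement calculation is a pure function of the partition key and the ring layout, so every node that holds the same ring metadata reaches the same answer:

```python
# Any node holding the same ring metadata computes identical placement.
nodes = ["node1", "node2", "node3", "node4", "node5"]
ring_on_node1 = TokenRing(nodes)
ring_on_node4 = TokenRing(nodes)

key = "user:42"
assert TokenRing.token(key) == TokenRing.token(key)                        # same key, same token
assert ring_on_node1.replicas_for(key) == ring_on_node4.replicas_for(key)  # same replica set
# Hence any node can act as coordinator for a read or write of this key,
# with no central directory or lookup service involved.
```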
Partition Key¶
The partition key determines which node stores the data:
```sql
CREATE TABLE users (
    user_id UUID,      -- Partition key
    name    TEXT,
    email   TEXT,
    PRIMARY KEY (user_id)
);
```
How it works:
1. Partition key value is hashed: hash(user_id) → token
2. Token maps to a token range
3. Node owning that range stores the data
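As a hedged sketch of what this looks like from a client, the DataStax Python driver performs the same hash(partition_key) calculation on the client side when a token-aware load-balancing policy is configured, and routes requests directly to a replica. The contact point and datacenter name are placeholders, the users table is assumed to live in the my_app keyspace used elsewhere on this page, and newer driver versions may prefer configuring the policy through an execution profile.

```python
import uuid

from cassandra.cluster import Cluster
from cassandra.policies import DCAwareRoundRobinPolicy, TokenAwarePolicy

# Placeholder contact point and datacenter name.
cluster = Cluster(
    ["127.0.0.1"],
    load_balancing_policy=TokenAwarePolicy(DCAwareRoundRobinPolicy(local_dc="dc1")),
)
session = cluster.connect("my_app")   # assumed keyspace containing the users table

insert = session.prepare("INSERT INTO users (user_id, name, email) VALUES (?, ?, ?)")
select = session.prepare("SELECT name, email FROM users WHERE user_id = ?")

user_id = uuid.uuid4()
# The driver computes hash(user_id) and routes both statements to the same replicas.
session.execute(insert, (user_id, "Ada", "ada@example.com"))
row = session.execute(select, (user_id,)).one()
print(row.name, row.email)

cluster.shutdown()
```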
Replication¶
Data is replicated across multiple nodes for fault tolerance:
```sql
CREATE KEYSPACE my_app WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'dc1': 3,   -- 3 copies in dc1
    'dc2': 3    -- 3 copies in dc2
};
```
Replication Factor (RF): The number of copies of each row stored in the cluster; with NetworkTopologyStrategy it is specified per datacenter.
Documentation Structure¶
Data Distribution¶
- Partitioning - How data is distributed across nodes
- Replication - SimpleStrategy vs NetworkTopologyStrategy
- Consistency Levels - Tuning consistency vs availability
- Replica Synchronization - Repair, hinted handoff, read repair
Memory Management¶
- JVM - Java Virtual Machine configuration and garbage collection
- Cassandra Memory - Heap, off-heap, and page cache
- Linux - Kernel settings, swap, THP, and NUMA
Storage Engine¶
- Storage Engine Overview - How Cassandra stores data
- Write Path - Memtables and commit log
- Read Path - How reads are served
- SSTables - On-disk storage format
- Tombstones - How deletes work
Compaction¶
- Compaction Overview - Why compaction matters
- STCS - Size-Tiered Compaction Strategy
- LCS - Leveled Compaction Strategy
- TWCS - Time-Window Compaction Strategy
- UCS - Unified Compaction Strategy (5.0+)
Cluster Communication¶
- Gossip Protocol - How nodes discover each other
Client Connections¶
- Client Connection Architecture - How clients communicate with Cassandra
- CQL Protocol - Binary wire protocol specification
- Async Connections - Connection pooling and multiplexing
- Authentication - SASL authentication mechanisms
- Load Balancing - Request routing policies
- Prepared Statements - Statement preparation lifecycle
- Pagination - Query result paging
- Throttling - Rate limiting and backpressure
- Compression - Protocol compression
- Failure Handling - Retry and speculative execution
Key Architecture Components¶
Write Path¶
- Commit Log: Write-ahead log for durability
- Memtable: In-memory sorted structure
- SSTable: Immutable on-disk file (created when memtable flushes)
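A deliberately simplified Python sketch of that flow (an illustration, not the actual storage engine): writes are appended to a log for durability, applied to an in-memory table, and flushed to an immutable, sorted structure once the memtable is full.

```python
class ToyStorageEngine:
    """Illustrative LSM-style write path: commit log -> memtable -> SSTable."""

    def __init__(self, flush_threshold=4):
        self.commit_log = []      # append-only write-ahead log (durability)
        self.memtable = {}        # in-memory table (Cassandra keeps it sorted)
        self.sstables = []        # immutable, sorted runs (on disk in reality)
        self.flush_threshold = flush_threshold

    def write(self, partition_key, value):
        # 1. Append to the commit log first so the write survives a crash.
        self.commit_log.append((partition_key, value))
        # 2. Apply to the memtable; newer writes shadow older ones.
        self.memtable[partition_key] = value
        # 3. Flush when the memtable is "full" (the real threshold is size-based).
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # An SSTable is an immutable, sorted snapshot of the memtable.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable.clear()
        self.commit_log.clear()   # flushed data no longer needs replay


engine = ToyStorageEngine()
for i in range(5):
    engine.write(f"user:{i}", {"name": f"user{i}"})
print(len(engine.sstables), "SSTable(s);", len(engine.memtable), "row(s) still in memtable")
```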
Read Path¶
- Memtable: The most recent writes are read from memory first
- Bloom Filter: Skips SSTables that cannot contain the requested partition
- Key Cache / Partition Index: Locates the partition within each remaining SSTable
- Merge: Results from the memtable and SSTables are merged, newest value wins
SSTable Structure¶
SSTable files:

```
├── Data.db             # Actual row data
├── Index.db            # Partition index
├── Filter.db           # Bloom filter
├── Summary.db          # Index summary (sample of the partition index)
├── Statistics.db       # SSTable metadata and statistics
├── CompressionInfo.db  # Compression metadata
└── TOC.txt             # Table of contents (list of components)
```
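A conceptual Python sketch of how a read might consult these components (hedged, not the actual on-disk code): the Bloom filter rules out SSTables that cannot contain the partition, the summary and index would narrow the seek into Data.db, and results from the memtable and the remaining SSTables are merged.

```python
from dataclasses import dataclass, field


@dataclass
class ToySSTable:
    """Conceptual stand-in for one SSTable and its companion files."""
    data: dict                                # Data.db: partition_key -> row
    bloom: set = field(default_factory=set)   # Filter.db: probabilistic key set

    def __post_init__(self):
        # A real Bloom filter allows false positives but never false negatives;
        # a plain set models only the "never misses a real key" property.
        self.bloom = set(self.data)

    def maybe_contains(self, key):
        return key in self.bloom              # Filter.db check, before any disk seek

    def read(self, key):
        # In reality Summary.db and Index.db narrow down the byte offset in Data.db.
        return self.data.get(key)


def read_partition(memtable, sstables, key):
    """Collect all versions of a partition; real reads merge them by timestamp."""
    versions = []
    if key in memtable:                       # newest data lives in memory
        versions.append(memtable[key])
    for sst in sstables:
        if sst.maybe_contains(key):           # Bloom filter skips irrelevant SSTables
            row = sst.read(key)
            if row is not None:
                versions.append(row)
    return versions


tables = [ToySSTable({"user:1": {"name": "Ada"}}), ToySSTable({"user:2": {"name": "Grace"}})]
print(read_partition({"user:1": {"name": "Ada L."}}, tables, "user:1"))
```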
Consistency Model¶
Cassandra offers tunable consistency per operation:
Consistency Levels¶
| Level | Reads | Writes |
|---|---|---|
| ONE | 1 replica | 1 replica |
| QUORUM | ⌊RF/2⌋ + 1 | ⌊RF/2⌋ + 1 |
| LOCAL_QUORUM | Quorum in local DC | Quorum in local DC |
| ALL | All replicas | All replicas |
Strong Consistency Formula¶
For strong consistency (read-your-writes):
R + W > RF
Where:
R = Read consistency level (number of replicas)
W = Write consistency level (number of replicas)
RF = Replication factor
Example with RF=3:
- QUORUM reads (2) + QUORUM writes (2): 2 + 2 = 4 > 3 ✓
- ONE reads (1) + ONE writes (1): 1 + 1 = 2, which is not greater than 3 ✗
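A few lines of Python make the arithmetic explicit (assuming RF=3 and single-datacenter quorums):

```python
def quorum(rf: int) -> int:
    """Replicas required for QUORUM: floor(RF/2) + 1."""
    return rf // 2 + 1


def read_your_writes(r: int, w: int, rf: int) -> bool:
    """Strong consistency holds when read and write replica sets must overlap."""
    return r + w > rf


RF = 3
print(quorum(RF))                                    # 2
print(read_your_writes(quorum(RF), quorum(RF), RF))  # True:  2 + 2 = 4 > 3
print(read_your_writes(1, 1, RF))                    # False: 1 + 1 = 2, not > 3
print(read_your_writes(1, RF, RF))                   # True:  writes at ALL, reads at ONE
```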
Fault Tolerance¶
By storing multiple copies of data across nodes, Cassandra tolerates node failures without loss of service or data. As described in the Consistency Model section, operations succeed as long as enough replicas respond to satisfy the consistency level—allowing nodes, racks, or entire datacenters to fail while maintaining availability.
For details on how Cassandra detects failures and synchronizes replicas after recovery, see Replica Synchronization.
Multi-Datacenter Replication¶
NetworkTopologyStrategy enables per-datacenter replication:
```sql
CREATE KEYSPACE production WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'us-east': 3,
    'eu-west': 3
};
```
With LOCAL_QUORUM, each datacenter serves reads and writes independently, so the cluster survives node failures, cross-datacenter network partitions, or the loss of an entire datacenter without sacrificing availability or durability in the surviving datacenter.
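As a hedged sketch of how an application deployed in one datacenter might use this keyspace with the DataStax Python driver: the DC-aware policy keeps routing in the local datacenter, and LOCAL_QUORUM requires acknowledgement only from local replicas, so a remote datacenter outage does not block the request. The contact point is a placeholder, the users table is assumed to exist in this keyspace, and newer driver versions may prefer configuring the policy through an execution profile.

```python
import uuid

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.policies import DCAwareRoundRobinPolicy, TokenAwarePolicy
from cassandra.query import SimpleStatement

# Placeholder contact point; local_dc matches the 'us-east' datacenter above.
cluster = Cluster(
    ["10.0.0.1"],
    load_balancing_policy=TokenAwarePolicy(DCAwareRoundRobinPolicy(local_dc="us-east")),
)
session = cluster.connect("production")

# LOCAL_QUORUM: 2 of the 3 us-east replicas must respond; eu-west is not consulted,
# so a full eu-west outage or a cross-DC partition does not block this query.
query = SimpleStatement(
    "SELECT name, email FROM users WHERE user_id = %s",   # assumes a users table here
    consistency_level=ConsistencyLevel.LOCAL_QUORUM,
)
row = session.execute(query, (uuid.uuid4(),)).one()        # None unless the id exists
cluster.shutdown()
```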
Performance Characteristics¶
Write Performance¶
| Factor | Impact |
|---|---|
| Commit log sync | periodic (default) favors throughput; batch waits for the fsync before acknowledging |
| Memtable size | Larger = fewer flushes |
| Compaction | Background CPU/IO |
| Replication | More replicas = more writes |
Read Performance¶
| Factor | Impact |
|---|---|
| Partition size | Smaller is better |
| SSTable count | Fewer is better |
| Bloom filters | Reduce disk reads |
| Caching | Key/row cache hit rate |
| Data model | Query-driven design |
Further Reading¶
Deep Dives¶
- Data Distribution Deep Dive
- Compaction Strategies Explained
- Consistency Levels Guide
- Storage Engine Internals
Related Topics¶
- Data Modeling - Design for Cassandra
- Performance Tuning - Optimize cluster performance
- Operations - Day-to-day management