
Cassandra Architecture

Cassandra's architecture explains its behavior. The partition key determines which node stores data—get it wrong, and queries become slow or impossible. Deletes write tombstones instead of removing data immediately—ignore this, and deleted records can reappear. Nodes can disagree on data temporarily—skip repair, and that disagreement becomes permanent.

The design combines Amazon Dynamo's distribution approach (masterless ring, gossip protocol, tunable consistency) with Google BigTable's storage approach (LSM-tree, SSTables, memtables). Understanding both sides leads to better decisions about data modeling and operations.

Architecture Overview

Apache Cassandra is a distributed, peer-to-peer database designed for:

  • High Availability: No single point of failure
  • Linear Scalability: Add nodes to increase capacity
  • Geographic Distribution: Multi-datacenter replication
  • Tunable Consistency: Balance between consistency and availability

[Diagram: a six-node Cassandra cluster in which every node communicates with every other node over the gossip protocol (peer-to-peer, no single master)]


Core Concepts

Ring Architecture

Cassandra organizes nodes in a logical ring structure using consistent hashing:

  1. Each node owns a range of tokens on the ring
  2. Data is assigned to nodes based on partition key hash
  3. Data is replicated to multiple nodes for fault tolerance

Partitioner and Hash Calculation

The partitioner is responsible for computing a hash (token) value from the partition key. Cassandra uses this hash to determine data placement on both write and read operations:

  • Murmur3Partitioner (default): Uses the MurmurHash3 algorithm, producing a 64-bit hash value. Token range spans from -2^63 to 2^63-1.
  • RandomPartitioner: Uses MD5 hashing, producing a 128-bit hash. Token range spans from 0 to 2^127-1.

Write operations: When inserting data, the coordinator node computes hash(partition_key) to produce a token value. This token maps to a specific position on the ring, identifying the primary replica node. Additional replicas are selected by traversing the ring clockwise.

Read operations: The same hash calculation occurs during SELECT queries. The coordinator computes hash(partition_key) from the WHERE clause to locate the exact nodes holding the requested data.

This deterministic hashing ensures that:

  • The same partition key always maps to the same token
  • Any node can calculate which nodes own a given partition
  • No central directory or lookup service is required
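This determinism is easy to observe from cqlsh: the built-in token() function returns the partitioner's hash for a row's partition key. A minimal sketch against the users table defined in the next section (the token values themselves depend on your data):

-- Show the Murmur3 token each partition key hashes to
SELECT token(user_id), user_id, name FROM users LIMIT 3;

-- The same user_id always hashes to the same token, so any node can
-- work out which replicas own it without consulting a directory.

On the operations side, nodetool getendpoints <keyspace> <table> <key> performs the same calculation from the command line and prints the replica addresses for a given key.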

[Diagram: token ring with consistent hashing, showing four nodes A through D each owning a contiguous token range (simplified to 0-100; the actual Murmur3 range is -2^63 to 2^63-1)]

Partition Key

The partition key determines which node stores the data:

CREATE TABLE users (
    user_id UUID,          -- Partition key
    name TEXT,
    email TEXT,
    PRIMARY KEY (user_id)
);

How it works:

  1. The partition key value is hashed: hash(user_id) → token
  2. The token falls within one node's token range
  3. The node owning that range stores the data
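This is also why CQL queries are expected to restrict the partition key: without it, the coordinator cannot compute a token and route the request to specific replicas. A short illustration (the UUID literal is arbitrary):

-- Routed directly to the replicas that own this partition
SELECT name, email FROM users
WHERE user_id = 62c36092-82a1-3a00-93d1-46196ee77204;

-- Rejected by default: no partition key, so every node would have to be
-- scanned (the error suggests ALLOW FILTERING, which is rarely a good idea)
SELECT name, email FROM users WHERE name = 'Alice';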

Replication

Data is replicated across multiple nodes for fault tolerance:

CREATE KEYSPACE my_app WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'dc1': 3,   -- 3 copies in dc1
    'dc2': 3    -- 3 copies in dc2
};

Replication Factor (RF): the number of copies of each partition. With NetworkTopologyStrategy, RF is set per datacenter (3 in each datacenter above, 6 copies cluster-wide).
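Replication settings live in the keyspace definition and can be inspected at any time from cqlsh; for example, for the keyspace created above:

-- Full keyspace definition, including the replication map
DESCRIBE KEYSPACE my_app;

-- Or read it straight from the schema tables
SELECT keyspace_name, replication
FROM system_schema.keyspaces
WHERE keyspace_name = 'my_app';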


Documentation Structure

Data Distribution

Memory Management

  • JVM - Java Virtual Machine configuration and garbage collection
  • Cassandra Memory - Heap, off-heap, and page cache
  • Linux - Kernel settings, swap, THP, and NUMA

Storage Engine

Compaction

  • Compaction Overview - Why compaction matters
  • STCS - Size-Tiered Compaction Strategy
  • LCS - Leveled Compaction Strategy
  • TWCS - Time-Window Compaction Strategy
  • UCS - Unified Compaction Strategy (5.0+)

Cluster Communication

Client Connections


Key Architecture Components

Write Path

[Diagram: write path from the client through the commit log and memtable to SSTables on disk]

  1. Commit Log: Write-ahead log for durability
  2. Memtable: In-memory sorted structure
  3. SSTable: Immutable on-disk file (created when memtable flushes)
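The commit log is what makes an acknowledged write durable before any memtable flush has happened. It can be bypassed per keyspace with the durable_writes option; the sketch below uses a hypothetical scratch keyspace and is for illustration only, since disabling it risks losing the most recent writes on a crash:

-- durable_writes = false skips the commit log for this keyspace
CREATE KEYSPACE scratch_data WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'dc1': 3
} AND durable_writes = false;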

Read Path

[Diagram: read path across the memtable, caches, bloom filters, and SSTables]

  1. Memtable: Checked first for the most recently written data
  2. Bloom filter: Lets the node skip SSTables that cannot contain the partition
  3. Key cache / partition index: Locates the partition's position inside each candidate SSTable
  4. Merge: Versions from the memtable and SSTables are merged by timestamp to build the result
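Request tracing in cqlsh is a convenient way to watch this path on a live node. The exact trace activities vary by version, so treat the comments below as illustrative:

TRACING ON;

SELECT name, email FROM my_app.users
WHERE user_id = 62c36092-82a1-3a00-93d1-46196ee77204;

-- The trace printed after the result lists each step the coordinator and
-- replicas took, such as reading the memtable, skipping SSTables via
-- bloom filters, and merging data from memtables and SSTables.

TRACING OFF;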

SSTable Structure

SSTable Files:
├── Data.db             # Actual data
├── Index.db            # Partition index
├── Filter.db           # Bloom filter
├── Summary.db          # Index summary
├── Statistics.db       # Table statistics
├── CompressionInfo.db  # Compression info
└── TOC.txt             # Table of contents

Consistency Model

Cassandra offers tunable consistency per operation:

Consistency Levels

Level          Reads                  Writes
ONE            1 replica              1 replica
QUORUM         ⌊RF/2⌋ + 1             ⌊RF/2⌋ + 1
LOCAL_QUORUM   Quorum in local DC     Quorum in local DC
ALL            All replicas           All replicas
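In cqlsh, the consistency level is set per session with the CONSISTENCY command (drivers expose the same setting per statement or per session):

-- Apply LOCAL_QUORUM to subsequent requests in this cqlsh session
CONSISTENCY LOCAL_QUORUM;

-- Show the level currently in effect
CONSISTENCY;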

Strong Consistency Formula

For strong consistency (read-your-writes):

R + W > RF

Where:
R = Read consistency level (number of replicas)
W = Write consistency level (number of replicas)
RF = Replication factor

Example with RF=3:

  • QUORUM reads (2) + QUORUM writes (2): 2 + 2 = 4 > 3 ✓
  • ONE reads (1) + ONE writes (1): 1 + 1 = 2 < 3 ✗


Fault Tolerance

By storing multiple copies of data across nodes, Cassandra tolerates node failures without loss of service or data. As described in the Consistency Model section, operations succeed as long as enough replicas respond to satisfy the consistency level—allowing nodes, racks, or entire datacenters to fail while maintaining availability.

For details on how Cassandra detects failures and synchronizes replicas after recovery, see Replica Synchronization.

Multi-Datacenter Replication

NetworkTopologyStrategy enables per-datacenter replication:

CREATE KEYSPACE production WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'us-east': 3,
    'eu-west': 3
};

With LOCAL_QUORUM, each datacenter can satisfy reads and writes using only its local replicas, so the cluster keeps serving requests through remote node failures, inter-datacenter network partitions, or the loss of an entire datacenter.
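Adding a datacenter later is an ALTER KEYSPACE away. The sketch below adds a hypothetical ap-south datacenter; the new replicas still have to be populated afterwards (typically with nodetool rebuild, followed by repair) before they are consistent:

ALTER KEYSPACE production WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'us-east': 3,
    'eu-west': 3,
    'ap-south': 3   -- hypothetical new datacenter
};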


Performance Characteristics

Write Performance

Factor            Impact
Commit log sync   periodic (default) vs batch
Memtable size     Larger = fewer flushes
Compaction        Background CPU/IO
Replication       More replicas = more writes
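Compaction behavior is configured per table, so its background CPU/IO cost can be tuned where it matters. A sketch switching one table to LCS (whether LCS fits depends on the workload; see the compaction pages above):

ALTER TABLE my_app.users
WITH compaction = {'class': 'LeveledCompactionStrategy'};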

Read Performance

Factor           Impact
Partition size   Smaller is better
SSTable count    Fewer is better
Bloom filters    Reduce disk reads
Caching          Key/row cache hit rate
Data model       Query-driven design
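Query-driven design usually means one table per query pattern, with the partition key chosen so each query reads a single, reasonably sized partition. A hypothetical example for "latest orders for a user":

-- Partitioned by user, clustered by time so recent orders come back first
CREATE TABLE orders_by_user (
    user_id    UUID,
    order_time TIMESTAMP,
    order_id   UUID,
    total      DECIMAL,
    PRIMARY KEY ((user_id), order_time, order_id)
) WITH CLUSTERING ORDER BY (order_time DESC, order_id DESC);

-- Served from a single partition
SELECT order_id, total FROM orders_by_user
WHERE user_id = 62c36092-82a1-3a00-93d1-46196ee77204
LIMIT 10;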

Further Reading

Deep Dives