Replication¶
Replication is fundamental to Cassandra's architecture. Every write is automatically copied to multiple nodes with no special configuration, no external tools, no application logic required. This built-in redundancy means nodes can fail, disks can die, and entire datacenters can go offline, yet the data remains available and intact.
Unlike traditional databases that treat replication as an add-on feature, Cassandra was designed from the ground up with replication as a core primitive. The system assumes failures will happen and handles them transparently. A node crashes during a write? The replicas have the data. Network partitions a datacenter? The other datacenters continue serving requests. This design enables true 24/7 availability without the operational complexity of failover procedures.
The replication factor (RF) determines how many copies of each partition exist, and the replication strategy determines where those copies are placed across the cluster topology.
Replication Factor¶
The replication factor specifies how many nodes store a copy of each partition.
Choosing Replication Factor¶
| RF | Fault Tolerance | Trade-off |
|---|---|---|
| 1 | None—any node failure loses data | No redundancy |
| 2 | Single node failure | No quorum possible with one node down |
| 3 | Single node failure with quorum | Production minimum (recommended) |
| 5 | Two node failures with quorum | Critical data requiring extreme durability |
| >5 | Diminishing returns | Rarely justified |
RF = 3 is the production standard:
RF = 3 with QUORUM:
- One node down: Still have quorum (2 of 3)
- Can run repair with one node down
- Balances durability, availability, and storage cost
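The quorum arithmetic behind these numbers is easy to verify. A minimal Python sketch (illustrative only, not Cassandra code) that computes the quorum size and tolerated failures for each RF:
# Quorum is a strict majority of replicas: floor(RF/2) + 1
def quorum(rf: int) -> int:
    return rf // 2 + 1

# Nodes that can be down while QUORUM reads/writes still succeed
def failures_tolerated(rf: int) -> int:
    return rf - quorum(rf)

for rf in (1, 2, 3, 5):
    print(f"RF={rf}: quorum={quorum(rf)}, tolerates {failures_tolerated(rf)} node(s) down")
# RF=1: quorum=1, tolerates 0 node(s) down
# RF=2: quorum=2, tolerates 0 node(s) down
# RF=3: quorum=2, tolerates 1 node(s) down
# RF=5: quorum=3, tolerates 2 node(s) down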
RF and Cluster Size¶
Rule: RF ≤ nodes_in_smallest_dc
If DC has 2 nodes, max RF = 2 (each partition on both nodes)
If DC has 3 nodes, RF = 3 means every node has every partition
If DC has 10 nodes, RF = 3 means each partition on 3 of 10 nodes
ANTI-PATTERN:
DC with 3 nodes, RF = 5 ← Cannot place 5 replicas on 3 nodes
Cassandra places only 3 replicas but still computes quorum from RF=5
QUORUM needs ⌊5/2⌋+1 = 3 replicas, so any single node failure causes Unavailable exceptions
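Both the rule and the anti-pattern are simple to check programmatically. A sketch, assuming a hypothetical dc_sizes map of datacenter name to node count:
# Flag any DC whose replication factor exceeds its node count
def validate_rf(replication, dc_sizes):
    warnings = []
    for dc, rf in replication.items():
        nodes = dc_sizes.get(dc, 0)
        if rf > nodes:
            warnings.append(
                f"{dc}: RF={rf} but only {nodes} nodes; QUORUM needs "
                f"{rf // 2 + 1} replicas, so any node failure causes Unavailable"
            )
    return warnings

print(validate_rf({"dc1": 5}, {"dc1": 3}))
# ['dc1: RF=5 but only 3 nodes; QUORUM needs 3 replicas, so any node failure causes Unavailable']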
Replication Strategies¶
SimpleStrategy (Development Only)¶
SimpleStrategy places replicas on consecutive nodes around the ring with no awareness of racks or datacenters:
Algorithm (sketched in code below):
- Hash partition key → token
- Find node that owns this token (primary replica)
- Walk clockwise, place replicas on next (RF-1) nodes
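A minimal sketch of this walk, assuming a toy ring of (token, node) pairs; real Cassandra uses Murmur3 tokens and vnodes, so this is illustrative only:
from bisect import bisect_left

def simple_strategy_replicas(token, ring, rf):
    # ring: (token, node) pairs sorted by token; each node owns the
    # range ending at its token, wrapping around past the last token
    tokens = [t for t, _ in ring]
    start = bisect_left(tokens, token) % len(ring)
    replicas = []
    i = start
    while len(replicas) < min(rf, len(ring)):
        node = ring[i % len(ring)][1]
        if node not in replicas:  # with vnodes a node can appear twice
            replicas.append(node)
        i += 1
    return replicas

ring = [(0, "A"), (25, "B"), (50, "C"), (75, "D")]
print(simple_strategy_replicas(30, ring, 3))  # ['C', 'D', 'A']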
Rack Unawareness
SimpleStrategy has no rack awareness. Consecutive ring nodes may all sit on the same rack; if that rack loses power, all replicas are lost.
-- SimpleStrategy configuration
CREATE KEYSPACE dev_keyspace WITH replication = {
'class': 'SimpleStrategy',
'replication_factor': 3
};
Never use SimpleStrategy in production—it has no rack awareness.
NetworkTopologyStrategy (Production Standard)¶
NetworkTopologyStrategy (NTS) places replicas while respecting datacenter and rack boundaries.
Algorithm (for each datacenter; sketched in code below):
- Hash partition key → token
- Find node in this DC that owns token (primary replica)
- Walk clockwise, selecting nodes on different racks
- Continue until RF replicas placed in this DC
- Repeat for each DC
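A simplified sketch of the rack-aware walk for a single DC, assuming a toy ring of (token, node, rack) entries (real NTS handles vnodes and multiple DCs, so this is illustrative only):
from bisect import bisect_left

def nts_replicas(token, ring, rf):
    # Walk clockwise, preferring nodes on racks that do not yet hold a replica
    tokens = [t for t, _, _ in ring]
    start = bisect_left(tokens, token) % len(ring)
    replicas, racks_used, skipped = [], set(), []
    for step in range(len(ring)):
        _, node, rack = ring[(start + step) % len(ring)]
        if node in replicas:
            continue
        if rack in racks_used:
            skipped.append(node)  # revisit if we run out of new racks
            continue
        replicas.append(node)
        racks_used.add(rack)
        if len(replicas) == rf:
            return replicas
    # Fewer racks than RF: fill the remaining slots from skipped nodes
    return replicas + skipped[: rf - len(replicas)]

ring = [(0, "A", "r1"), (25, "B", "r1"), (50, "C", "r2"), (75, "D", "r3")]
print(nts_replicas(10, ring, 3))  # ['B', 'C', 'D'], one replica per rack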
-- NetworkTopologyStrategy configuration
CREATE KEYSPACE production WITH replication = {
'class': 'NetworkTopologyStrategy',
'dc1': 3,
'dc2': 3
};
-- Single DC with rack awareness
CREATE KEYSPACE single_dc WITH replication = {
'class': 'NetworkTopologyStrategy',
'datacenter1': 3
};
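These statements are normally run from cqlsh, but any driver can issue them. A sketch using the DataStax Python driver (pip package cassandra-driver); the contact point and DC names are placeholders:
from cassandra.cluster import Cluster

# Connect to any one node; the driver discovers the rest of the cluster
cluster = Cluster(["10.0.1.1"])
session = cluster.connect()

# DC names must exactly match what the snitch reports (case-sensitive)
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS production
    WITH replication = {
        'class': 'NetworkTopologyStrategy',
        'dc1': 3,
        'dc2': 3
    }
""")
cluster.shutdown()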
NTS Replica Placement Algorithm¶
For a partition with token T in a DC with RF=3:
| Step | Action | Result |
|---|---|---|
| 1 | Find first node clockwise from T | Replica 1 (e.g., Rack A) |
| 2 | Walk clockwise, find node on different rack | Replica 2 (Rack B or C) |
| 3 | Continue clockwise, find node on third rack | Replica 3 |
Rack availability impact:
| Racks Available | Replica Distribution |
|---|---|
| 3+ racks | Full diversity—each replica on different rack |
| 2 racks | Two replicas share a rack |
| 1 rack | All replicas on same rack (no diversity) |
Rack Diversity
With RF=3, at least 3 racks are needed for full rack diversity.
Snitches: Topology Awareness¶
Why Snitches Exist¶
NetworkTopologyStrategy places replicas on different racks to survive hardware failures—but Cassandra has no inherent knowledge of physical infrastructure. IP addresses alone reveal nothing about which nodes share a rack, power supply, or network switch.
The snitch solves this problem by mapping physical infrastructure to logical Cassandra topology. Given any node's IP address, the snitch returns that node's datacenter and rack. This mapping enables:
| Function | How Snitch Enables It |
|---|---|
| Replica placement | NTS uses rack information to spread replicas across failure domains |
| Request routing | Coordinators prefer nodes in the local datacenter for lower latency |
| Consistency enforcement | LOCAL_QUORUM identifies which nodes are "local" via datacenter membership |
Without accurate snitch configuration, Cassandra cannot distinguish between nodes in the same rack versus different racks, potentially placing all replicas in a single failure domain.
Configuration Considerations¶
The snitch must be configured during initial cluster deployment, before starting the node for the first time. Once a node joins the cluster with a particular datacenter and rack assignment, changing this topology is operationally complex and requires careful coordination (see Snitch Configuration Issues).
Two categories of snitches exist:
- Manual configuration — The administrator explicitly defines each node's datacenter and rack (e.g., GossipingPropertyFileSnitch)
- Automatic detection — The snitch queries cloud provider metadata APIs to determine topology (e.g., Ec2Snitch, GoogleCloudSnitch)
GossipingPropertyFileSnitch is recommended for most deployments because it provides full flexibility: topology names can match organizational conventions, nodes can be moved between logical racks without infrastructure changes, and the configuration works identically across on-premises, cloud, and hybrid environments.
Available Snitches¶
| Snitch | Use Case | Topology Source |
|---|---|---|
| GossipingPropertyFileSnitch | Production (recommended) | Local properties file |
| Ec2Snitch | AWS single region | EC2 metadata API |
| Ec2MultiRegionSnitch | AWS multi-region | EC2 metadata API + public IPs |
| GoogleCloudSnitch | Google Cloud Platform | GCE metadata API |
| AzureSnitch | Microsoft Azure | Azure metadata API |
| SimpleSnitch | Single-node development | None (all nodes in same DC/rack) |
| PropertyFileSnitch | Legacy | Central topology file (deprecated) |
How Snitches Work¶
Each node runs a snitch implementation that:
- Determines local topology — On startup, the snitch identifies the local node's datacenter and rack (from configuration file or cloud metadata API)
- Propagates via gossip — The local topology is included in gossip messages, so all nodes learn each other's DC/rack membership
- Resolves queries — When Cassandra needs to know any node's location, it queries the snitch (which returns cached gossip data for remote nodes)
Snitch query flow:
Application: getDatacenter(10.0.1.5) → "us-east"
getRack(10.0.1.5) → "rack-a"
Internal lookup:
Local node? → Read from configuration
Remote node? → Return cached gossip state
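Cassandra's snitches are Java classes, but the lookup flow above can be modeled in a few lines of Python (a toy model, not Cassandra's implementation):
class SnitchModel:
    def __init__(self, local_ip, local_dc, local_rack):
        # Local topology comes from configuration
        # (e.g., cassandra-rackdc.properties or a cloud metadata API)
        self.local_ip, self.local_dc, self.local_rack = local_ip, local_dc, local_rack
        # Remote topology is learned via gossip and cached: ip -> (dc, rack)
        self.gossip_cache = {}

    def get_datacenter(self, ip):
        if ip == self.local_ip:
            return self.local_dc           # local node: read configuration
        return self.gossip_cache[ip][0]    # remote node: cached gossip state

    def get_rack(self, ip):
        if ip == self.local_ip:
            return self.local_rack
        return self.gossip_cache[ip][1]

snitch = SnitchModel("10.0.1.5", "us-east", "rack-a")
snitch.gossip_cache["10.0.2.7"] = ("us-west", "rack-b")
print(snitch.get_datacenter("10.0.1.5"))  # us-east
print(snitch.get_rack("10.0.2.7"))        # rack-b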
GossipingPropertyFileSnitch (Recommended)¶
Each node reads its own DC/rack from a local file, then gossips it to others:
# cassandra.yaml
endpoint_snitch: GossipingPropertyFileSnitch
# conf/cassandra-rackdc.properties
dc=us-east-1
rack=rack-a
# Optional: prefer_local=true (use the private address for intra-DC connections)
Why GossipingPropertyFileSnitch is recommended:
| Advantage | Description |
|---|---|
| Simple configuration | One file per node |
| Universal | Works anywhere (cloud, on-prem, containers) |
| No dependencies | No external services required |
| Automatic propagation | Topology shared via gossip |
Cloud Snitches¶
Ec2Snitch (AWS Single Region)¶
# cassandra.yaml
endpoint_snitch: Ec2Snitch
Automatically detects:
- Datacenter: AWS region (e.g., us-east-1)
- Rack: Availability zone (e.g., us-east-1a)
Keyspace must use region name:
CREATE KEYSPACE my_ks WITH replication = {
'class': 'NetworkTopologyStrategy',
'us-east-1': 3 -- Must match EC2 region name
};
Ec2MultiRegionSnitch (AWS Multi-Region)¶
# cassandra.yaml
endpoint_snitch: Ec2MultiRegionSnitch
# REQUIRED: Node's public IP for cross-region communication
broadcast_address: <public_ip>
broadcast_rpc_address: <public_ip>
# Listen on the node's private IP
listen_address: <private_ip>
Critical requirement: Security groups must allow cross-region traffic on the following ports (a connectivity check is sketched after the list):
- Port 7000 (inter-node)
- Port 7001 (inter-node SSL)
- Port 9042 (native transport, if clients cross regions)
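A plain TCP connect test is enough to confirm the security groups actually pass this traffic. A sketch (the remote IP is a placeholder):
import socket

# Attempt a TCP connection; True means the port is reachable
def port_open(host, port, timeout=3.0):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

remote_node = "203.0.113.10"  # placeholder: public IP of a node in the other region
for port in (7000, 7001, 9042):
    print(f"port {port}: {'open' if port_open(remote_node, port) else 'BLOCKED'}")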
GoogleCloudSnitch (GCP)¶
# cassandra.yaml
endpoint_snitch: GoogleCloudSnitch
Automatically detects:
- Datacenter: <project>:<region> (e.g., myproject:us-central1)
- Rack: Zone (e.g., us-central1-a)
Snitch Configuration Issues¶
Problem 1: Changing snitches on existing cluster
WRONG: Simply changing the snitch class
What happens:
- Node restarts with the new snitch
- Reports a different DC/rack name
- Replica placement is recalculated cluster-wide
- Existing data is no longer where the cluster expects it (wrong!)
CORRECT: Change snitch, then change topology step by step
1. Stop node
2. Change snitch in cassandra.yaml
3. Update cassandra-rackdc.properties to SAME DC/rack as before
4. Restart
5. Repeat for all nodes
6. Only then update DC/rack names one at a time
Problem 2: Inconsistent DC/rack names
Node 1: dc=US-EAST, rack=rack1
Node 2: dc=us-east, rack=rack1 ← Different case!
Node 3: dc=us_east, rack=rack1 ← Different format!
Result: Cassandra sees 3 different DCs
Replication is completely wrong
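Mismatches like these are easy to catch programmatically. A sketch that groups DC names differing only in case or separator; the input map is hypothetical (e.g., parsed from nodetool status output):
from collections import defaultdict

def find_conflicting_dc_names(node_dcs):
    # Normalize each reported DC name, then flag any normalized
    # name that maps back to more than one distinct spelling
    groups = defaultdict(set)
    for node, dc in node_dcs.items():
        groups[dc.lower().replace("_", "-")].add(dc)
    return {key: names for key, names in groups.items() if len(names) > 1}

nodes = {"10.0.1.1": "US-EAST", "10.0.1.2": "us-east", "10.0.1.3": "us_east"}
print(find_conflicting_dc_names(nodes))
# {'us-east': {'US-EAST', 'us-east', 'us_east'}} (set order may vary)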
Always verify topology:
nodetool status
# Should show expected DC names and node distribution
nodetool describecluster
# Shows DC info and schema agreement
Multi-DC Replication Patterns¶
Active-Active¶
Both datacenters serve traffic with full replication:
CREATE KEYSPACE active_active WITH replication = {
'class': 'NetworkTopologyStrategy',
'us-east': 3,
'us-west': 3
};
| Characteristic | Value |
|---|---|
| Consistency | LOCAL_QUORUM for low latency |
| Total storage | 6× raw data |
| Failure tolerance | Either DC can serve all traffic |
Three-Region Global¶
Global distribution with local consistency:
CREATE KEYSPACE global WITH replication = {
'class': 'NetworkTopologyStrategy',
'us-east': 3,
'us-west': 3,
'eu-west': 3
};
| Characteristic | Value |
|---|---|
| Consistency | LOCAL_QUORUM for regional, QUORUM for global |
| Total storage | 9× raw data |
| Use case | Global applications with regional users |
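The storage multipliers in these tables are simply the sum of each DC's replication factor; a one-line helper makes the arithmetic explicit:
# Total copies of each partition = sum of per-DC replication factors
def storage_multiplier(replication):
    return sum(replication.values())

print(storage_multiplier({"us-east": 3, "us-west": 3}))                # 6 (active-active)
print(storage_multiplier({"us-east": 3, "us-west": 3, "eu-west": 3}))  # 9 (three-region)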
Analytics Replica¶
Separate datacenter for analytics workloads:
CREATE KEYSPACE with_analytics WITH replication = {
'class': 'NetworkTopologyStrategy',
'production': 3,
'analytics': 2
};
| Characteristic | Value |
|---|---|
| Analytics DC | Runs Spark jobs, never serves production traffic |
| Lower RF | Acceptable for read-only analytics |
Changing Replication¶
Increasing Replication Factor¶
Increasing RF is operationally simple but has significant consequences that require careful planning.
-- Current: RF=2, Target: RF=3
-- Step 1: Alter keyspace (changes metadata only)
ALTER KEYSPACE my_keyspace WITH replication = {
'class': 'NetworkTopologyStrategy',
'datacenter1': 3
};
# Step 2: Run repair to stream data to new replicas on all nodes
nodetool repair -full my_keyspace
# This streams data to the third replica for each partition
# Can take hours/days depending on data size
Critical warning: ALTER KEYSPACE changes metadata immediately, but the newly assigned replicas contain no data. You must run a full repair to stream data from the existing replicas, and that repair can take hours to days depending on data volume.
During this repair window, queries may fail or return incomplete data:
| Issue | Consequence |
|---|---|
| New replicas are empty | Reads from new replicas return no data |
| QUORUM uses new RF | QUORUM now requires ⌊3/2⌋+1 = 2 nodes, but only 2 have data |
| Read repair is insufficient | Only helps for rows that are read; most data remains missing |
This operation requires careful planning and should be scheduled during low-traffic periods with appropriate consistency level adjustments.
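The quorum shift is worth spelling out: the quorum size does not change going from RF=2 to RF=3, but the replica set now includes an empty node. A quick check with the majority formula used earlier (a sketch, not Cassandra internals):
def quorum(rf):
    return rf // 2 + 1

old_rf, new_rf, populated_replicas = 2, 3, 2  # third replica is empty until repair
print(f"quorum before: {quorum(old_rf)} of {old_rf}")  # 2 of 2
print(f"quorum after:  {quorum(new_rf)} of {new_rf}")  # 2 of 3
# A quorum of 2 can now be satisfied by one populated replica plus the
# empty one, so results depend on replica merging and read repair until
# 'nodetool repair' finishes streaming data to the new replica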
Decreasing Replication Factor¶
-- Current: RF=3, Target: RF=2
-- Step 1: Alter keyspace
ALTER KEYSPACE my_keyspace WITH replication = {
'class': 'NetworkTopologyStrategy',
'datacenter1': 2
};
# Step 2: Run cleanup to remove extra replicas
nodetool cleanup my_keyspace
# This deletes data that nodes no longer own
# Required on every node
Adding a Datacenter¶
# Step 1: Configure new DC nodes
# cassandra.yaml: Same cluster_name, correct seeds
# cassandra-rackdc.properties: Correct DC/rack names
# Step 2: Start new nodes (they join empty)
-- Step 3: Update keyspace to include new DC
ALTER KEYSPACE my_keyspace WITH replication = {
'class': 'NetworkTopologyStrategy',
'dc1': 3,
'dc2': 3 -- New DC
};
# Step 4: Rebuild new DC from existing DC
# Run on EACH node in the new DC:
nodetool rebuild -- dc1
# Streams all data from dc1 to the new node
# Faster than repair (streams only, no comparisons)
Removing a Datacenter¶
-- Step 1: Update keyspace to remove DC
ALTER KEYSPACE my_keyspace WITH replication = {
'class': 'NetworkTopologyStrategy',
'dc1': 3
-- dc2 removed
};
# Step 2: Run repair on remaining DC
nodetool repair -full my_keyspace
# Step 3: Decommission nodes in removed DC
nodetool decommission # On each node in dc2
# Step 4: Update seed list to remove dc2 nodes
Monitoring Replication¶
Cluster Topology¶
# Node status and ownership
nodetool status my_keyspace
# Output:
# Datacenter: dc1
# ==============
# Status=Up/Down
# |/ State=Normal/Leaving/Joining/Moving
# -- Address Load Tokens Owns (effective) Rack
# UN 10.0.1.1 256 GB 16 33.3% rack1
# UN 10.0.1.2 248 GB 16 33.3% rack2
# UN 10.0.1.3 252 GB 16 33.3% rack3
Streaming Status¶
# Current streaming operations
nodetool netstats
# Shows:
# - Receiving streams (from other nodes)
# - Sending streams (to other nodes)
# - Progress percentage
Troubleshooting¶
Unavailable Exceptions¶
Error: Not enough replicas available for query at consistency QUORUM
(2 required but only 1 alive)
| Cause | Diagnosis | Resolution |
|---|---|---|
| Nodes down | nodetool status shows DN | Restart nodes or lower CL |
| RF > nodes | Keyspace RF higher than DC size | Lower RF or add nodes |
| Network partition | Some nodes unreachable | Fix network |
Uneven Data Distribution¶
nodetool status shows:
Node 1: 100 GB
Node 2: 500 GB ← Much larger
Node 3: 120 GB
| Cause | Diagnosis | Resolution |
|---|---|---|
| Hot partitions | Check nodetool tablestats | Redesign partition keys |
| Uneven tokens | Check nodetool ring | Rebalance or use vnodes |
| Late joiner | Node joined after data loaded | Run repair |
Missing Data After Node Replacement¶
# Check if replacement completed
nodetool netstats # Look for ongoing streams
# Run repair to ensure data is complete
nodetool repair -full my_keyspace
Best Practices¶
| Area | Recommendation |
|---|---|
| Strategy | Always use NetworkTopologyStrategy (even for single DC) |
| Replication factor | RF=3 minimum for production |
| Rack distribution | Distribute nodes across at least RF racks |
| Multi-DC | Same RF across DCs for active-active |
| Snitch | Use GossipingPropertyFileSnitch for portability |
| Naming | Use consistent DC/rack naming (case-sensitive) |
| New DCs | Use nodetool rebuild (faster than repair) |
Related Documentation¶
- Distributed Data Overview - How partitioning, replication, and consistency work together
- Partitioning - How data is distributed to nodes
- Consistency - How consistency levels interact with replication
- Replica Synchronization - How replicas converge