Apache Cassandra® Documentation¶

Production-grade reference for architecture, CQL, and operations.

Documentation Scope¶

This reference documentation covers Apache Cassandra versions 4.0 through 5.x, with emphasis on production deployments. Cassandra 5.0 (September 2024) introduced major features including Storage-Attached Indexes (SAI), Vector Search, and Unified Compaction Strategy (UCS).

Version Range	Java Requirement	Documentation Status
3.11.x	Java 8	Legacy reference
4.0.x	Java 8/11	Supported
4.1.x	Java 8/11	Fully Documented
5.0.x	Java 11/17	Current (5.0.6)

Legend: ✅ Production Ready | ⚠️ Limited/Deprecated | ❌ Not Supported

What's New¶

Cassandra 5.0.6 (October 2025) - Current Release¶

Bug fixes and stability improvements

Cassandra 5.0.0 (September 2024) - Major Release¶

Storage-Attached Indexes (SAI) (CEP-7) - Efficient secondary indexing within storage layer
Vector data type and search (CEP-30) - Approximate nearest neighbor searching via SAI
Unified Compaction Strategy (UCS) (CEP-26) - Adaptive compaction replacing multiple strategies
Trie memtables (CEP-19) - Trie-based in-memory data structures
Trie SSTables (CEP-25) - Trie-indexed SSTable format
Dynamic Data Masking (CEP-20) - Selective redaction of sensitive data at query time
Java 17 support - recommended for Cassandra 5.0
TTL and writetime on collections/UDTs - Extended metadata for complex types
CIDR-based authorizer (CEP-33) - Network-based access control
New math functions: abs, exp, log, log10, round

Cassandra 4.1.10 (September 2025)¶

Bug fixes and stability improvements

Cassandra 4.1.0 (December 2022)¶

Paxos v2 - Enhanced lightweight transaction protocol
Guardrails - Operational safety boundaries and limits
Partition denylist - Block access to problematic partitions
Top partition tracking - Per-table monitoring of hot partitions
Native transport rate limiting - Request throughput controls
Client-side password hashing - Enhanced authentication security
Pluggable memtables - Custom memtable implementations

Cassandra 4.0.19 (October 2025)¶

Bug fixes and stability improvements

Cassandra 4.0.0 (July 2021)¶

Virtual tables - System information via CQL queries
Audit logging - Comprehensive query audit trail
Full query logging - Capture all queries for replay
Incremental repair improvements - More efficient anti-entropy
Zero-copy streaming - Faster data transfer between nodes
Java 11 support - Modern JVM compatibility

Apache Cassandra is a widely adopted distributed database, but much of its operational and architectural knowledge has historically lived in mailing lists, conference talks, and tribal knowledge rather than formal documentation.

This documentation provides a comprehensive, production-focused reference for Apache Cassandra, covering storage engine internals, compaction strategies, indexing, CQL semantics, data modeling, and operational tooling. Content is designed for developers, operators, and architects building and maintaining Cassandra deployments at scale.

This documentation complements the Official Apache Cassandra Documentation, providing deeper explanations of behavioral contracts, failure semantics, and practical guidance for real-world deployments.

Apache Cassandra is a distributed NoSQL database designed for extreme scale, exceptional performance, and continuous availability. There is no master node—every node can handle reads and writes, so the failure of any single node (or even an entire datacenter) does not take down the database.

Cassandra excels at write-heavy workloads, time-series data, and applications requiring geographic distribution. Cassandra is less suited for complex queries, ad-hoc analytics, or workloads requiring strong consistency with frequent cross-partition transactions.

About This Documentation¶

This documentation serves as a comprehensive reference for Apache Cassandra, covering architecture, configuration, operations, data modeling, CQL, and troubleshooting. The goal is to provide complete, accurate, and practical guidance for developers, operators, and architects working with Cassandra in production environments.

Principle	Description
Source Code Verified	Configuration options, default values, and behavior are cross-referenced against the Cassandra source code to ensure accuracy
CEP Aligned	New features reference their corresponding Cassandra Enhancement Proposals (CEPs) for design rationale and implementation details
Version Aware	Documentation notes version-specific differences between Cassandra 4.x and 5.x releases
Operationally Focused	Content prioritizes practical operational guidance derived from production experience

Topics are organized for both learning and reference. New users can follow the Getting Started guides sequentially, while experienced operators can use the detailed reference sections for specific configuration options, JMX metrics, and operational procedures.

What is Apache Cassandra?¶

History and Origins¶

Cassandra was created at Facebook in 2007 by Avinash Lakshman and Prashant Malik to power Facebook's Inbox Search feature—a system requiring high write throughput across hundreds of millions of users with strict latency requirements. Lakshman, a co-author of Amazon's Dynamo paper, brought distributed systems expertise that shaped Cassandra's architecture.

Year	Milestone
2007	Development begins at Facebook
2008	Open sourced under Apache License 2.0 (July)
2009	Enters Apache Incubator (March)
2010	Graduates to Apache Top-Level Project (February)
2011	Cassandra 1.0 released
2014	Cassandra 2.0 introduces lightweight transactions
2016	Cassandra 3.0 brings materialized views and SASI
2021	Cassandra 4.0 after extensive testing focus
2024	Cassandra 5.0 introduces vectors, SAI, and UCS

The project is licensed under the Apache License 2.0, permitting commercial use, modification, and distribution.

Design Influences¶

Cassandra's design draws from two foundational distributed systems papers: Google's BigTable (2006) provided the storage model—SSTables, memtables, and the LSM-tree architecture. Amazon's Dynamo (2007) provided the distribution model—consistent hashing, gossip-based cluster membership, and tunable consistency levels.

Performance Characteristics¶

Cassandra delivers exceptional performance at scale:

Metric	Typical Performance	Notes
Write Throughput	100,000+ writes/sec per node	Sequential I/O to commit log; parallel memtable inserts
Read Latency (P99)	1-5 ms	With proper data modeling and warm caches
Write Latency (P99)	1-2 ms	Commit log append + memtable insert
Scalability	Linear to 1000+ nodes	Proven in production at petabyte scale

Performance derives from Cassandra's architecture:

Log-structured writes: All writes append sequentially to the commit log, avoiding random disk seeks
Memtable buffering: Recent writes held in memtables before flushing to disk
Parallel execution: Requests distributed across nodes; no single bottleneck
Token-aware routing: Drivers send requests directly to replica nodes, avoiding extra network hops

Fault Tolerance¶

Cassandra is designed to survive failures at every level:

Failure Scenario	Cassandra Behavior
Single node failure	Remaining replicas serve requests; hinted handoff queues writes for recovery
Rack failure	Rack-aware replication ensures replicas exist in other racks
Datacenter failure	Multi-DC replication provides geographic redundancy; traffic fails over automatically
Network partition	Nodes continue serving requests independently; reconciliation occurs on recovery

Unlike primary-replica databases that fail over to a standby, Cassandra has no failover—all nodes are active and capable of serving any request. This eliminates failover latency and split-brain scenarios.

Key Features¶

Feature	Description
Distributed Architecture	Data is automatically distributed across multiple nodes
Linear Scalability	Add capacity by adding nodes with no downtime
High Availability	No single point of failure; survives node and datacenter failures
Tunable Consistency	Choose consistency level per operation
Multi-Datacenter Replication	Built-in support for geographically distributed clusters
Flexible Schema	Wide-column store with support for complex data types

Common Misconceptions¶

Understanding what Cassandra is not helps set appropriate expectations.

Misconception	Reality
"Cassandra is eventually consistent"	Cassandra offers tunable consistency. With `QUORUM` reads and writes, strong consistency is achieved. "Eventually consistent" only applies when using weaker consistency levels like `ONE`.
"Cassandra doesn't support transactions"	Cassandra supports lightweight transactions (LWT) using Paxos for compare-and-set operations. Accord, a general-purpose distributed transaction protocol, is under active development for a future release. LWT provides linearizable consistency for specific use cases, though not ACID transactions across arbitrary rows.
"Cassandra can't do joins"	Correct—by design. Cassandra optimizes for fast reads at scale by denormalizing data. Model data according to query patterns rather than normalizing and joining at read time.
"Cassandra is only for write-heavy workloads"	Cassandra handles read-heavy workloads effectively when data is modeled correctly. The key is designing tables around query patterns, not write patterns.
"Cassandra requires expensive hardware"	Cassandra runs effectively on both commodity hardware and high-end servers. Modern Cassandra scales well both horizontally (adding nodes) and vertically (larger instances with more CPU cores and memory).
"Cassandra is hard to operate"	Modern tooling such as AxonOps automates most operational tasks. The learning curve exists, but operational complexity is manageable with proper tooling and training.
"Data modeling is too difficult"	Query-first modeling is different from relational modeling, not harder. Once the principles are understood (partition keys, clustering columns, denormalization), modeling becomes straightforward. Tools like AxonOps Workbench provide visual data modeling assistance.
"Cassandra loses data"	Data loss occurs from misconfiguration (improper `gc_grace_seconds`, skipped repairs) or hardware failures beyond the replication factor—not from Cassandra itself. With proper operations, Cassandra provides strong durability guarantees.
"Cassandra is an in-memory database"	Cassandra is a persistent, disk-based database. While memtables buffer recent writes in memory, all data is durably written to the commit log immediately and flushed to SSTables on disk. Memory caches improve read performance but are not the primary storage.

Getting Started¶

New to Cassandra? Begin with installation and initial configuration.

Installation

Install Cassandra on Linux, Docker, or Kubernetes environments.

Installation Guide
First Cluster

Create and configure a first Cassandra cluster step by step.

First Cluster
Client Drivers

Connect applications using Java, Python, Go, and other drivers.

Driver Setup
CQL Quickstart

Learn Cassandra Query Language basics with hands-on examples.

CQL Quickstart

Architecture¶

Understand Cassandra's distributed architecture and storage engine.

Architecture Overview

Distributed architecture fundamentals, gossip protocol, and cluster topology.

Architecture Overview
Data Distribution

Partitioning, token rings, and virtual nodes (vnodes) explained.

Data Distribution
Replication

Replication strategies, consistency levels, and fault tolerance.

Replication
Storage Engine

Memtables, SSTables, commit log, and write path internals.

Storage Engine
Compaction

STCS, LCS, TWCS, and UCS compaction strategies explained.

Compaction Strategies

CQL Reference¶

Complete Cassandra Query Language documentation.

CQL Overview

CQL language reference and query syntax fundamentals.

CQL Overview
Data Types

Native, collection, and user-defined types reference.

Data Types
DDL Commands

CREATE, ALTER, DROP statements for schema management.

DDL Commands
DML Commands

SELECT, INSERT, UPDATE, DELETE for data manipulation.

DML Commands
Indexing

Secondary indexes, SASI, and Storage-Attached Indexing (SAI).

Indexing
Functions

Built-in and user-defined functions reference.

Functions

Data Modeling¶

Design effective Cassandra data models.

Data Modeling Guide

Query-first design methodology and denormalization patterns.

Data Modeling Guide
Key Concepts

Partition keys, clustering columns, and primary key design.

Key Concepts
Anti-Patterns

Common data modeling mistakes and how to avoid them.

Anti-Patterns

Operations¶

Production deployment, monitoring, and maintenance procedures.

Cluster Management

Add, remove, replace, and decommission nodes safely.

Cluster Management
Backup & Restore

Snapshots, incremental backups, and disaster recovery.

Backup & Restore
Repair

Anti-entropy repair to maintain data consistency.

Repair
Configuration

cassandra.yaml, JVM options, and snitch configuration.

Configuration
Maintenance

Routine maintenance tasks and operational procedures.

Maintenance

Monitoring & Performance¶

Monitor clusters and optimize performance.

Monitoring

JMX metrics, key metrics to track, and alerting strategies.

Monitoring Guide
JMX Reference

500+ metrics with thresholds and 30 MBeans documented.

JMX Reference
Performance Tuning

Hardware sizing, JVM tuning, and OS optimization.

Performance Guide
Query Optimization

Write efficient queries and avoid performance pitfalls.

Query Optimization

Security¶

Authentication, authorization, and encryption for Cassandra deployments.

Authentication

Internal authentication, LDAP integration, and Kerberos.

Authentication
Authorization

Role-based access control and permission management.

Authorization
Encryption

TLS for client and internode encryption, encryption at rest.

Encryption

Tools¶

Essential Cassandra command-line and administration tools.

nodetool

Cluster management commands for operations and diagnostics.

nodetool Reference
cqlsh

Interactive CQL shell for queries and schema management.

cqlsh Reference
CQLAI

Modern AI-powered CQL shell with intelligent assistance.

CQLAI
cassandra-stress

Load testing and benchmarking tool for Cassandra.

cassandra-stress

Troubleshooting¶

Diagnostic procedures and solutions for common issues.

Diagnosis

Root cause analysis procedures and diagnostic workflows.

Diagnosis Guide
Log Analysis

Interpreting logs, log patterns, and log configuration.

Log Analysis
Common Errors

ReadTimeout, WriteTimeout, and other common errors explained.

Troubleshooting Guide

Quick Reference¶

Reference

Quick reference for configuration, metrics, and commands.

Reference

Quick Links¶

By Experience Level¶

Beginners: Installation → First Cluster → CQL Quickstart

Developers: Data Modeling → CQL Reference → Drivers

Operators: Operations → Monitoring → Troubleshooting

Performance Engineers: JMX Metrics → Performance Tuning → Benchmarking

Common Tasks¶

Task	Documentation
Install Cassandra	Installation Guide
Design a data model	Data Modeling Guide
Fix timeout errors	ReadTimeoutException
Manage cluster nodes	Cluster Management
Configure backups	Backup Guide
Monitor the cluster	Monitoring Guide
Tune performance	Performance Guide

Version Compatibility¶

Supported Versions¶

Version	Release Date	End of Support	Status
5.0.x	September 2024	Until 5.3.0 release	Current
4.1.x	December 2022	Until 5.2.0 release	Supported
4.0.x	July 2021	Until 5.1.0 release	Supported
3.11.x	June 2017	Unmaintained	Legacy

Upgrade Path

Direct upgrades skipping major versions are not supported. To upgrade from 3.11.x to 5.0.x:

Upgrade 3.11.x → 4.0.x
Upgrade 4.0.x → 4.1.x
Upgrade 4.1.x → 5.0.x

Documentation Conventions

This documentation uses RFC 2119 terminology (must, should, may) to indicate requirement levels. Version-specific behaviors are explicitly noted with the applicable Cassandra version range.

Contributing¶

This documentation is maintained by AxonOps. Found an error or want to contribute? Visit the GitHub repository.