Indexes¶
Indexes in Cassandra provide efficient data access patterns beyond partition key lookups. Understanding the available index types, their implementation differences, and appropriate use cases is essential for query optimization.
Background¶
The Primary Key Index¶
Every Cassandra table has an implicit primary key index. This index is fundamental to Cassandra's data model and requires no additional configuration.
Partition Key Index: Each SSTable contains an index mapping partition key tokens to their data positions. This enables O(log n) lookups within an SSTable.
Clustering Column Ordering: Within a partition, data is physically sorted by clustering columns. Range queries on clustering columns are efficient because they read contiguous disk regions.
-- Primary key enables these efficient queries:
CREATE TABLE events (
sensor_id uuid,
event_time timestamp,
reading double,
PRIMARY KEY (sensor_id, event_time)
);
-- Partition key lookup: O(log n) per SSTable
SELECT * FROM events WHERE sensor_id = ?;
-- Clustering range: sequential read within partition
SELECT * FROM events WHERE sensor_id = ? AND event_time > ?;
The Need for Secondary Indexes¶
Primary key indexes only support queries that include the partition key. Without secondary indexes, queries on non-key columns require full table scans—scanning every partition across all nodes.
-- Without secondary index: requires ALLOW FILTERING (full scan)
SELECT * FROM events WHERE reading > 100.0 ALLOW FILTERING;
-- With secondary index: targeted lookup
CREATE INDEX ON events (reading);
SELECT * FROM events WHERE reading > 100.0;
Evolution of Secondary Indexes in Cassandra¶
Cassandra has developed multiple secondary index implementations over time:
| Version | Index Type | Status |
|---|---|---|
| 0.7 (2011) | Secondary Index (2i) | Legacy, still supported |
| 3.4 (2016) | SASI | Experimental, limited support |
| 5.0 (2023) | SAI | Recommended for new deployments |
Each generation addressed limitations of its predecessors while introducing new capabilities and trade-offs.
Feature Availability by Version¶
| Feature | Cassandra 3.x | Cassandra 4.x | Cassandra 5.0+ |
|---|---|---|---|
| Secondary Index (2i) | ✅ | ✅ | ✅ |
| SASI (experimental) | ✅ 3.4+ | ✅ | ✅ |
| SAI | ❌ | ❌ | ✅ |
| Vector search | ❌ | ❌ | ✅ |
| Equality queries | ✅ | ✅ | ✅ |
| Range queries | SASI only | SASI only | SAI/SASI |
| LIKE prefix | SASI only | SASI only | SAI/SASI |
| LIKE contains | SASI only | SASI only | SAI/SASI |
| Collection indexing | 2i only | 2i only | 2i/SAI |
| ANN (vector) | ❌ | ❌ | SAI only |
Version Recommendation
- Cassandra 5.0+: Use SAI for all new indexes
- Cassandra 4.x: Use 2i for equality, SASI for range (with caution)
- Cassandra 3.x: Use 2i for equality, SASI for range (with caution)
- Upgrading: Plan migration from SASI to SAI when moving to 5.0
Index Architecture Comparison¶
How Secondary Indexes Work¶
All Cassandra secondary indexes share a common principle: they create a mapping from indexed column values to partition keys. The implementation of this mapping differs significantly between index types.
Storage Location Differences¶
The primary architectural distinction between index types is where index data is stored:
Separate Tables (2i): Legacy secondary indexes store index data in hidden tables. These tables have their own SSTables and compact independently from base table data.
SSTable-Attached (SASI, SAI): Modern indexes attach index data directly to base table SSTables. Index data compacts together with base table data, maintaining consistency.
Index Type Comparison¶
| Characteristic | Secondary Index (2i) | SASI | SAI |
|---|---|---|---|
| Storage | Separate hidden table | Attached to SSTable | Attached to SSTable |
| Cassandra Version | 0.7+ | 3.4+ | 5.0+ |
| Status | Legacy | Experimental | Recommended |
| Query Types | Equality only | Equality, range, LIKE | Equality, range, LIKE |
| Numeric Range | No | Yes | Yes |
| Text Search | No | PREFIX, CONTAINS | Yes |
| AND Queries | Scatter-gather | Single-pass | Single-pass |
| Write Overhead | Medium | Medium | Low |
| Cardinality Handling | Poor at extremes | Better | Best |
| Production Ready | Yes (with caveats) | No | Yes |
Operator Support Matrix¶
This table shows exactly which CQL operators are supported by each index type.
| Operator | Secondary Index (2i) | SASI | SAI | Notes |
|---|---|---|---|---|
= (equality) |
✅ | ✅ | ✅ | All index types |
> |
❌ | ✅ | ✅ | Requires SPARSE mode for SASI |
>= |
❌ | ✅ | ✅ | Requires SPARSE mode for SASI |
< |
❌ | ✅ | ✅ | Requires SPARSE mode for SASI |
<= |
❌ | ✅ | ✅ | Requires SPARSE mode for SASI |
LIKE 'prefix%' |
❌ | ✅ | ✅ | PREFIX mode default for SASI |
LIKE '%substring%' |
❌ | ✅ | ⚠️ | SASI: CONTAINS mode; SAI: requires analyzer |
LIKE '%suffix' |
❌ | ❌ | ❌ | Not supported by any index |
IN |
✅ | ✅ | ✅ | Multiple equality values |
CONTAINS (collection) |
✅ | ❌ | ✅ | Collection element search |
CONTAINS KEY (map) |
✅ | ❌ | ✅ | Map key search |
!= (not equal) |
❌ | ❌ | ❌ | Not supported by any index |
OR (cross-column) |
❌ | ❌ | ❌ | Application-level union required |
Legend: ✅ Supported | ⚠️ Partial/Conditional | ❌ Not Supported
Data Type Support¶
| Data Type | Secondary Index (2i) | SASI | SAI |
|---|---|---|---|
text / varchar |
✅ | ✅ | ✅ |
int / bigint |
✅ | ✅ | ✅ |
float / double |
✅ | ✅ | ✅ |
decimal |
✅ | ✅ | ✅ |
timestamp |
✅ | ✅ | ✅ |
date / time |
✅ | ✅ | ✅ |
uuid / timeuuid |
✅ | ✅ | ✅ |
boolean |
✅ | ⚠️ | ✅ |
inet |
✅ | ✅ | ✅ |
blob |
✅ | ⚠️ | ✅ |
list<T> |
✅ | ❌ | ✅ |
set<T> |
✅ | ❌ | ✅ |
map<K,V> |
✅ | ❌ | ✅ |
vector<float, N> |
❌ | ❌ | ✅ |
frozen<T> |
✅ | ⚠️ | ✅ |
SASI Collection Support
SASI does not support indexing collections. Use SAI or denormalized tables for collection queries.
Choosing an Index Type¶
Query Execution¶
Single-Index Query¶
When a query uses one index, all index types follow a similar pattern:
- Query coordinator identifies relevant nodes
- Each node queries its local index
- Index returns matching partition keys
- Node reads base table partitions
- Results returned to coordinator
Multi-Index Query (AND)¶
Queries with multiple indexed predicates differ significantly:
Secondary Index (2i): Executes each predicate separately, intersects results at coordinator. Creates scatter-gather pattern with potential for large intermediate result sets.
SASI / SAI: Intersects predicates within each SSTable before returning results. More efficient for multi-predicate queries.
-- Multi-predicate query
SELECT * FROM users WHERE city = 'NYC' AND age > 25;
-- 2i: Two separate index lookups, coordinator intersection
-- SAI: Single-pass intersection per SSTable
Performance Considerations¶
Write Path Impact¶
All secondary indexes add overhead to the write path:
| Index Type | Write Overhead | Reason |
|---|---|---|
| 2i | Medium | Separate table mutation |
| SASI | Medium | Index structure update |
| SAI | Low | Optimized append-only design |
Read Path Characteristics¶
| Query Type | 2i | SASI | SAI |
|---|---|---|---|
| Single equality | Fair | Good | Good |
| Multiple AND | Poor | Good | Good |
| Range | N/A | Good | Good |
| High selectivity | Poor | Fair | Good |
| Low selectivity | Poor | Fair | Fair |
Anti-Patterns¶
Do Not Index These Columns
The following patterns will cause performance problems or outright failures. These are not recommendations—they are hard constraints.
High Cardinality Columns¶
Never Index UUIDs, Timestamps, or Unique Identifiers
Problem: Index size equals or exceeds base table size. Every query contacts all nodes to find one row.
Symptoms: Query latency worse than full table scan, excessive disk usage, coordinator timeouts.
Instead: Include the column in the partition key or create a denormalized lookup table.
-- DO NOT DO THIS
CREATE INDEX ON events (event_id); -- event_id is UUID
-- INSTEAD: Make it the partition key
CREATE TABLE events_by_id (
event_id uuid PRIMARY KEY,
...
);
Low Cardinality Columns¶
Never Index Boolean or Low-Enum Columns Without Partition Key
Problem: Each index entry points to millions of rows. Single query returns unbounded results.
Symptoms: Memory exhaustion, GC storms, query timeouts, coordinator OOM.
Instead: Partition by the low-cardinality value, or always combine with partition key restriction.
-- DO NOT DO THIS
CREATE INDEX ON users (is_active); -- Returns 50% of all rows
-- INSTEAD: Partition by status
CREATE TABLE users_by_status (
is_active boolean,
user_id uuid,
PRIMARY KEY (is_active, user_id)
);
Frequently Updated Columns¶
Avoid Indexing Columns That Change Often
Problem: Every update requires index delete + insert. Tombstones accumulate rapidly.
Symptoms: Growing read latency, tombstone warnings, compaction pressure.
Instead: Store mutable state separately or accept query trade-offs.
-- PROBLEMATIC
CREATE INDEX ON sessions (last_activity); -- Updated every request
-- Every update creates a tombstone in the index
-- After 1M updates: 1M tombstones to scan
Global Queries Without Partition Restriction¶
Avoid Index-Only Queries in Large Clusters
Problem: Query contacts ALL nodes, latency = slowest node, no locality benefit.
Acceptable: Combined with partition key (restricts to one node).
Problematic: Global queries in large clusters with high throughput requirements.
-- SLOW: Contacts all nodes
SELECT * FROM users WHERE city = 'NYC';
-- FAST: Restricted to one partition
SELECT * FROM users WHERE region = 'us-east' AND city = 'NYC';
Operational Considerations¶
Index Building¶
# Check index build progress
nodetool describecluster
# Rebuild index (SAI)
nodetool rebuild_index keyspace table index_name
# View index status
nodetool tablestats keyspace.table
Monitoring¶
# JMX metrics for index performance
org.apache.cassandra.metrics:type=Index,scope=*,name=*
# Per-table index metrics
org.apache.cassandra.metrics:type=Table,keyspace=*,scope=*,name=IndexSummaryOffHeapMemoryUsed
Related Documentation¶
- Secondary Indexes (2i) - Legacy index implementation
- SASI - SSTable Attached Secondary Index
- SAI - Storage Attached Index (recommended)
- Read Path - Query execution details
- Compaction - Impact on index maintenance