CQL Indexing Reference¶

Indexes in Cassandra do not work like relational indexes, and that surprises many. A relational index is a global structure—query it, and all matching rows are returned. A Cassandra secondary index is local to each node—query it without a partition key, and Cassandra has to ask every node in the cluster. That is an all-node scatter query, and it does not scale.

This makes secondary indexes useful only in specific situations: low-cardinality columns (status fields, categories) or queries that always include the partition key. For high-cardinality lookups like email addresses, a separate denormalized table is usually more appropriate.

Cassandra 5.0 introduced SAI (Storage-Attached Indexes), which handle more use cases efficiently. Materialized views offer another option—automated denormalization at the cost of write amplification.

This guide covers when each approach makes sense.

Behavioral Guarantees¶

What Indexes Guarantee¶

Index updates are applied synchronously as part of the write path
Indexes are local to each node (each node indexes only its own data)
Queries with partition key plus indexed column contact only partition replicas
Index build is asynchronous; queries may return partial results during build
DROP INDEX removes the index immediately from schema

What Indexes Do NOT Guarantee¶

Undefined Behavior

The following behaviors are undefined and must not be relied upon:

Global ordering: Index queries without partition key return results in undefined order
Latency bounds: Scatter-gather queries have unbounded latency proportional to cluster size
Memory usage: Large result sets from index queries may exhaust coordinator memory
Build time: Index build duration depends on data volume; no progress or completion guarantee
Result completeness during build: Queries may miss data on nodes still building the index
Performance consistency: Query performance varies significantly based on data distribution

Index Type Comparison Contract¶

Index Type	Query Types	Cardinality	Production Status
Legacy (2i)	Equality only	Low	Supported
SASI	Equality, range, LIKE	Medium	Deprecated
SAI	Equality, range, LIKE, ANN	Any	Recommended (5.0+)

Query Execution Contract¶

Query Pattern	Nodes Contacted	Performance
`pk = ? AND indexed = ?`	RF replicas	Fast
`indexed = ?`	All nodes	Slow (scatter-gather)
`indexed = ? LIMIT n`	All nodes (early termination)	Variable
`indexed > ? AND indexed < ?`	All nodes (SAI/SASI only)	Variable

Index Selection Contract¶

Scenario	Recommended Index	Alternative
Low cardinality + partition key	Legacy 2i or SAI	None needed
High cardinality lookup	SAI	Denormalized table
Text search (prefix/suffix)	SAI	Application-side filtering
Vector similarity	SAI with ANN	External vector database
Frequent alternative access pattern	Materialized view	Application-managed table

Failure Semantics¶

Failure Mode	Outcome	Client Action
Index build incomplete	Partial results	Wait for build or query specific partitions
Node unavailable	Results from available nodes only	Retry or accept partial results
Query timeout	Partial results or exception	Add partition key or reduce scope
Index not found	Query fails	Create index or use ALLOW FILTERING

Version-Specific Behavior¶

Version	Behavior
All	Legacy secondary indexes (2i)
3.4+	SASI indexes (experimental)
4.0+	SASI officially not recommended for production
5.0+	SAI (Storage-Attached Indexes) as recommended default (CEP-7)

Index Types Overview¶

Type	Best For	Limitations
Secondary Index	Low cardinality, occasional queries	Full cluster scan
SAI (Cassandra 5.0+)	General purpose, high cardinality	Requires Cassandra 5.0+
SASI (deprecated)	Prefix/suffix search	Being replaced by SAI
Materialized View	Frequent alternative queries	Write amplification

Secondary Indexes¶

Creating Secondary Indexes¶

-- Basic secondary index
CREATE INDEX ON users (email);

-- Named index
CREATE INDEX users_email_idx ON users (email);

-- Index on collection values
CREATE INDEX ON users (tags);  -- For set/list

-- Index on map keys
CREATE INDEX ON users (KEYS(preferences));

-- Index on map values
CREATE INDEX ON users (VALUES(preferences));

-- Index on map entries
CREATE INDEX ON users (ENTRIES(preferences));

-- Index on full frozen collection
CREATE INDEX ON users (FULL(address));

Querying with Secondary Indexes¶

-- Query by indexed column
SELECT * FROM users WHERE email = '[email protected]';

-- With collection index
SELECT * FROM users WHERE tags CONTAINS 'premium';

-- With map index
SELECT * FROM users WHERE preferences CONTAINS KEY 'theme';
SELECT * FROM users WHERE preferences CONTAINS 'dark';
SELECT * FROM users WHERE preferences['theme'] = 'dark';

When to Use Secondary Indexes¶

Good use cases: - Low cardinality columns (status, country) - Infrequent queries - Combined with partition key - Small result sets expected

Bad use cases: - High cardinality columns (user_id, email) - Frequently queried columns - Large result sets - Primary access pattern

Secondary Index Limitations¶

-- These queries require ALLOW FILTERING or fail:

-- No partition key, no index
SELECT * FROM users WHERE age > 30;  -- Error

-- Range query on indexed column
SELECT * FROM users WHERE age > 30 ALLOW FILTERING;  -- Slow

-- Multiple conditions without partition key
SELECT * FROM users WHERE status = 'active' AND country = 'US';
-- Slow even if both indexed

Collection Index Behavior¶

Secondary indexes on collections have special query semantics:

Collection Type	Index Type	Query Operator	Example
SET/LIST	Values	`CONTAINS`	`WHERE tags CONTAINS 'premium'`
MAP	`KEYS()`	`CONTAINS KEY`	`WHERE preferences CONTAINS KEY 'theme'`
MAP	`VALUES()`	`CONTAINS`	`WHERE preferences CONTAINS 'dark'`
MAP	`ENTRIES()`	`[]` accessor	`WHERE preferences['theme'] = 'dark'`
FROZEN	`FULL()`	`=` (full match)	`WHERE address = {street: '...', city: '...'}`

Collection Index Restrictions

CONTAINS and CONTAINS KEY only work with indexed collections
Cannot combine multiple CONTAINS on same collection in one query
Each element in collection is indexed separately (storage overhead)
FULL() index requires exact match of entire frozen collection
Collection indexes still result in scatter-gather without partition key

-- ERROR: Cannot use multiple CONTAINS on same column
SELECT * FROM users
WHERE tags CONTAINS 'premium' AND tags CONTAINS 'verified';

-- WORKAROUND: Filter in application or use intersection

High-Cardinality Performance Pitfalls¶

Avoid Secondary Indexes on High-Cardinality Columns

Secondary indexes on high-cardinality columns (many unique values) cause severe performance problems:

Issue	Impact
Index size	Approaches table size; one entry per unique value
Read amplification	Must check index on every node
Hotspots	Popular values create uneven load
Tombstone buildup	Deleted values leave tombstones in index
Memory pressure	Large indexes consume heap on each node

Example of problematic indexing:

-- BAD: High cardinality
CREATE INDEX ON users (email);        -- Unique per user
CREATE INDEX ON orders (order_id);    -- Unique per order
CREATE INDEX ON events (timestamp);   -- High cardinality

-- BETTER: Low cardinality
CREATE INDEX ON users (account_type); -- Few values: 'free', 'premium', 'enterprise'
CREATE INDEX ON orders (status);      -- Few values: 'pending', 'shipped', 'delivered'

For high-cardinality lookups, use:

Denormalized tables with the lookup column as partition key
SAI indexes (Cassandra 5.0+) which handle cardinality better
Application-side caching for frequently accessed values

Storage-Attached Indexes (SAI)¶

SAI is the next-generation indexing in Cassandra 5.0+, offering better performance and more capabilities than secondary indexes.

Creating SAI Indexes¶

-- Basic SAI index
CREATE CUSTOM INDEX ON users (email)
USING 'StorageAttachedIndex';

-- Named SAI index
CREATE CUSTOM INDEX users_email_sai ON users (email)
USING 'StorageAttachedIndex';

-- SAI with options
CREATE CUSTOM INDEX ON users (description)
USING 'StorageAttachedIndex'
WITH OPTIONS = {
    'case_sensitive': 'false',
    'normalize': 'true'
};

-- Numeric index for range queries
CREATE CUSTOM INDEX ON products (price)
USING 'StorageAttachedIndex';

SAI Query Capabilities¶

-- Equality queries
SELECT * FROM users WHERE email = '[email protected]';

-- Range queries (numeric)
SELECT * FROM products WHERE price >= 100 AND price < 500;

-- AND queries with multiple SAI columns
SELECT * FROM products
WHERE category = 'electronics'
  AND price < 1000
  AND in_stock = true;

-- Combined with partition key (most efficient)
SELECT * FROM orders
WHERE user_id = ?
  AND status = 'pending';

SAI vs Secondary Index¶

Feature	Secondary Index	SAI
High cardinality	Poor	Good
Range queries	No	Yes
Multiple conditions	Poor	Better
Write overhead	Lower	Higher
Text search	No	Limited
Availability	All versions	5.0+

SAI Limitations¶

Still scans all nodes for queries without partition key
Not a replacement for proper data modeling
Higher storage overhead than secondary indexes
Text search is basic (no fuzzy matching)

SASI Indexes (Deprecated)¶

SASI (SSTable Attached Secondary Index) is deprecated but still available. Use SAI instead for new deployments.

Creating SASI Indexes¶

-- Prefix search index
CREATE CUSTOM INDEX ON users (username)
USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = {
    'mode': 'PREFIX',
    'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer',
    'case_sensitive': 'false'
};

-- Contains search index
CREATE CUSTOM INDEX ON users (bio)
USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = {
    'mode': 'CONTAINS',
    'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer',
    'case_sensitive': 'false'
};

-- Numeric sparse index
CREATE CUSTOM INDEX ON orders (total)
USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = {
    'mode': 'SPARSE'
};

SASI Query Patterns¶

-- Prefix search (LIKE 'prefix%')
SELECT * FROM users WHERE username LIKE 'john%';

-- Contains search (LIKE '%substring%')
SELECT * FROM users WHERE bio LIKE '%developer%';

-- Range queries
SELECT * FROM orders WHERE total > 100 AND total < 500;

Materialized Views¶

Materialized views automatically maintain denormalized copies of data with different primary keys.

Creating Materialized Views¶

-- Base table
CREATE TABLE users (
    user_id UUID PRIMARY KEY,
    username TEXT,
    email TEXT,
    country TEXT,
    created_at TIMESTAMP
);

-- Materialized view by email
CREATE MATERIALIZED VIEW users_by_email AS
    SELECT * FROM users
    WHERE email IS NOT NULL AND user_id IS NOT NULL
    PRIMARY KEY (email, user_id);

-- View with filtered data
CREATE MATERIALIZED VIEW active_users_by_country AS
    SELECT user_id, username, country, created_at FROM users
    WHERE country IS NOT NULL
      AND user_id IS NOT NULL
      AND status = 'active'
    PRIMARY KEY (country, created_at, user_id)
    WITH CLUSTERING ORDER BY (created_at DESC);

Querying Materialized Views¶

-- Query the view like a regular table
SELECT * FROM users_by_email WHERE email = '[email protected]';

SELECT * FROM active_users_by_country
WHERE country = 'US'
ORDER BY created_at DESC
LIMIT 100;

Materialized View Rules¶

Requirements: - All base table primary key columns must be in the view's primary key - One new column can be added to the partition key - All non-null filters must use IS NOT NULL - View primary key must include base table primary key

-- Base table
CREATE TABLE orders (
    order_id UUID,
    user_id UUID,
    status TEXT,
    amount DECIMAL,
    created_at TIMESTAMP,
    PRIMARY KEY ((order_id))
);

-- Valid view: adds user_id to partition key
CREATE MATERIALIZED VIEW orders_by_user AS
    SELECT * FROM orders
    WHERE user_id IS NOT NULL AND order_id IS NOT NULL
    PRIMARY KEY ((user_id), created_at, order_id)
    WITH CLUSTERING ORDER BY (created_at DESC);

-- Invalid: order_id not in primary key
-- CREATE MATERIALIZED VIEW invalid_view AS
--     SELECT * FROM orders
--     WHERE user_id IS NOT NULL
--     PRIMARY KEY ((user_id), created_at);  -- Missing order_id!

Materialized View Limitations¶

Limitation	Impact
Write amplification	Every write to base table writes to all views
Eventual consistency	Views may lag behind base table
Limited transformations	Cannot use functions or aggregates
Repair complexity	Views need repair too
Schema changes	Cannot alter view; must drop and recreate

When to Use Materialized Views¶

Good use cases: - Small number of views per table - Low write throughput - Need guaranteed consistency with base table - Simple alternative access patterns

Avoid when: - High write throughput - Many views needed - Complex transformations required - Critical latency requirements

Index Management¶

Listing Indexes¶

-- List all indexes
SELECT * FROM system_schema.indexes;

-- Indexes for specific keyspace
SELECT * FROM system_schema.indexes WHERE keyspace_name = 'my_keyspace';

-- Indexes for specific table
SELECT * FROM system_schema.indexes
WHERE keyspace_name = 'my_keyspace' AND table_name = 'users';

Dropping Indexes¶

-- Drop by name
DROP INDEX IF EXISTS users_email_idx;

-- Drop by keyspace.name
DROP INDEX my_keyspace.users_email_idx;

Rebuilding Indexes¶

# Rebuild all indexes on a table
nodetool rebuild_index my_keyspace users

# Rebuild specific index
nodetool rebuild_index my_keyspace users users_email_idx

Managing Materialized Views¶

-- List views
SELECT * FROM system_schema.views WHERE keyspace_name = 'my_keyspace';

-- Drop view
DROP MATERIALIZED VIEW IF EXISTS users_by_email;

-- Cannot alter views; must drop and recreate

Index Best Practices¶

Secondary Index Guidelines¶

DO:
✓ Use for low-cardinality columns
✓ Combine with partition key when possible
✓ Use for occasional queries
✓ Index columns queried with equality

DON'T:
✗ Index high-cardinality columns (use SAI instead)
✗ Rely on indexes for primary access patterns
✗ Use for range queries (use SAI instead)
✗ Create many indexes on single table

SAI Guidelines (5.0+)¶

DO:
✓ Use for high-cardinality columns
✓ Use for range queries
✓ Combine multiple SAI indexes in queries
✓ Use with partition key for best performance

DON'T:
✗ Use as replacement for good data model
✗ Query without partition key for large datasets
✗ Expect full-text search capabilities

Materialized View Guidelines¶

DO:
✓ Limit to 2-3 views per table
✓ Use for guaranteed consistency needs
✓ Use for simple access pattern changes
✓ Include all base table PK columns

DON'T:
✗ Create many views per table
✗ Use with high write throughput
✗ Use for complex data transformations
✗ Forget to repair views

Index Selection Decision Tree¶

Need to query by non-PK column?
│
├── Is this a frequent/primary access pattern?
│   └── Yes: Create a new table (denormalization)
│
├── Need guaranteed consistency with base table?
│   └── Yes: Consider Materialized View
│
├── Running Cassandra 5.0+?
│   ├── Yes: Use SAI
│   │   ├── High cardinality: SAI
│   │   ├── Range queries: SAI
│   │   └── Multiple conditions: SAI
│   │
│   └── No: Secondary Index or SASI
│       ├── Low cardinality: Secondary Index
│       ├── Prefix/contains search: SASI
│       └── High cardinality: Consider data model change
│
└── Is performance acceptable?
    ├── Yes: Keep current solution
    └── No: Redesign data model

Next Steps¶

Data Modeling - Query-first design
DML Reference - Query syntax
Performance Tuning - Optimization
Anti-Patterns - What to avoid