SASI (SSTable Attached Secondary Index)¶
SASI (SSTable Attached Secondary Index) was introduced in Cassandra 3.4 (2016) as an experimental indexing system providing range queries and text search capabilities. While offering significant improvements over legacy secondary indexes, SASI remains experimental and has known limitations.
Experimental Status
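SASI is marked as experimental in Cassandra. Since Cassandra 4.0, creating SASI indexes is disabled by default and must be enabled explicitly in cassandra.yaml (enable_sasi_indexes: true, renamed sasi_indexes_enabled in 4.1). For production workloads on Cassandra 5.0+, use SAI instead.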
Background and History¶
Origins¶
SASI was developed by Apple and contributed to Apache Cassandra in 2016 (CASSANDRA-10661). The design addressed key limitations of legacy secondary indexes:
- Inability to perform range queries
- No text search capabilities
- Scatter-gather query patterns
- Separate index table compaction
Design Goals¶
SASI introduced several architectural improvements:
- SSTable attachment: Index data stored alongside base table SSTables
- Range queries: Support for inequality operators (>, <, >=, <=)
- Text search: PREFIX and CONTAINS operations
- Single-pass intersection: Multiple predicates evaluated together
Experimental Status¶
Despite its capabilities, SASI has remained experimental since introduction:
- Complex codebase with limited maintainership
- Memory management concerns during queries
- Known bugs in edge cases
- Superseded by SAI in Cassandra 5.0
Architecture¶
SSTable-Attached Storage¶
Unlike legacy secondary indexes, which use separate hidden tables, SASI stores index data as additional components of each SSTable, so every SSTable carries its own index files.
Benefits of SSTable attachment:
- Index compacts with base data
- No separate compaction coordination
- Index lifecycle matches data lifecycle
- Reduced storage overhead
Index Modes¶
SASI supports three index modes optimized for different data types:
| Mode | Data Type | Query Support | Use Case |
|---|---|---|---|
| PREFIX | Text | LIKE 'abc%' | String prefix matching |
| CONTAINS | Text | LIKE '%abc%' | Full-text search |
| SPARSE | Numeric | >, <, >=, <= | Range queries |
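As a rough aid for choosing a mode, the following Python sketch classifies a LIKE pattern by the position of its % wildcards and checks it against a simplified support matrix. The function names and the matrix itself are assumptions made for illustration. The matrix follows the table above and ignores analyzer effects, so treat it as a mental model rather than SASI's actual validation logic.

```python
def like_match_type(pattern: str) -> str:
    """Classify a LIKE pattern by where its % wildcards sit."""
    leading = pattern.startswith("%")
    trailing = len(pattern) > 1 and pattern.endswith("%")
    start = 1 if leading else 0
    end = len(pattern) - 1 if trailing else len(pattern)
    body = pattern[start:end]
    # Empty bodies and wildcards in the middle are rejected in this sketch.
    if not body or "%" in body:
        raise ValueError(f"unsupported LIKE pattern: {pattern!r}")
    if leading and trailing:
        return "CONTAINS"
    if trailing:
        return "PREFIX"
    if leading:
        return "SUFFIX"
    return "EXACT"


# Assumed, simplified support matrix (not read from Cassandra itself).
MODE_SUPPORT = {
    "PREFIX": {"PREFIX", "EXACT"},
    "CONTAINS": {"PREFIX", "SUFFIX", "CONTAINS", "EXACT"},
    "SPARSE": set(),  # numeric mode, no LIKE patterns
}


def mode_supports(mode: str, pattern: str) -> bool:
    """Return True if the given index mode can serve the LIKE pattern."""
    return like_match_type(pattern) in MODE_SUPPORT[mode]


print(like_match_type("john%"))              # PREFIX
print(mode_supports("PREFIX", "%database%"))  # False
```

Under this model, a column queried with LIKE '%abc%' needs a CONTAINS index, while a column queried only with LIKE 'abc%' can use the default PREFIX mode.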
Query Execution¶
SASI queries iterate through the SSTables on each replica and apply predicates locally, against each SSTable's own index, before returning results.
Single-pass intersection: Multiple SASI predicates are intersected within each SSTable iteration, avoiding the scatter-gather pattern of legacy indexes for multi-predicate queries.
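The idea of intersecting predicates inside each SSTable can be sketched in Python. This is a toy model, not SASI's implementation: the per-SSTable index is assumed to be a plain dictionary of column to term to sorted row ids, and the names intersect_sorted, matching_rows, and query are hypothetical.

```python
from typing import Callable, Dict, List

# Toy per-SSTable index: column -> term -> ascending list of row ids.
SSTableIndex = Dict[str, Dict[object, List[int]]]


def intersect_sorted(a: List[int], b: List[int]) -> List[int]:
    """Merge-intersect two ascending lists in a single pass."""
    i = j = 0
    out: List[int] = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def matching_rows(index: SSTableIndex, column: str,
                  predicate: Callable[[object], bool]) -> List[int]:
    """Collect row ids whose indexed term satisfies the predicate."""
    rows = set()
    for term, ids in index.get(column, {}).items():
        if predicate(term):
            rows.update(ids)
    return sorted(rows)


def query(sstables: List[SSTableIndex],
          predicates: Dict[str, Callable[[object], bool]]) -> List[int]:
    """Intersect all predicates within each SSTable, then concatenate."""
    results: List[int] = []
    for index in sstables:
        matches = None
        for column, predicate in predicates.items():
            rows = matching_rows(index, column, predicate)
            matches = rows if matches is None else intersect_sorted(matches, rows)
        results.extend(matches or [])
    return results


sstables = [
    {"age": {30: [1, 2], 20: [3]}, "city": {"San Jose": [2, 3], "Boston": [1]}},
    {"age": {40: [7]}, "city": {"Santa Fe": [7]}},
]
print(query(sstables, {
    "age": lambda v: v > 25,
    "city": lambda v: v.startswith("San"),
}))  # prints [2, 7]
```

The point of the sketch is that only rows satisfying every predicate leave an SSTable, so no per-predicate result sets have to be merged afterwards.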
Configuration¶
Creating SASI Indexes¶
-- Basic SASI index (PREFIX mode for text)
CREATE CUSTOM INDEX ON users (email)
USING 'org.apache.cassandra.index.sasi.SASIIndex';
-- CONTAINS mode for text search
-- (StandardAnalyzer takes tokenization_* options; case_sensitive belongs to NonTokenizingAnalyzer)
CREATE CUSTOM INDEX ON users (bio)
USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = {
'mode': 'CONTAINS',
'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer',
'tokenization_normalize_lowercase': 'true'
};
-- SPARSE mode for numeric ranges
CREATE CUSTOM INDEX ON events (timestamp)
USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = { 'mode': 'SPARSE' };
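-- PREFIX mode with case insensitivity
-- (case_sensitive is an option of NonTokenizingAnalyzer, so the analyzer must be set)
CREATE CUSTOM INDEX ON products (name)
USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = {
'mode': 'PREFIX',
'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer',
'case_sensitive': 'false'
};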
Configuration Options¶
| Option | Values | Default | Description |
|---|---|---|---|
| mode | PREFIX, CONTAINS, SPARSE | PREFIX | Index mode for query types |
| case_sensitive | true, false | true | Case sensitivity for text |
| analyzed | true, false | false | Enable text analysis |
| analyzer_class | class name | - | Custom analyzer for tokenization |
| max_compaction_flush_memory_in_mb | integer | 1024 | Memory limit during compaction |
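When indexes are created from scripts, the options above can be assembled programmatically. The following Python helper is a minimal sketch: the function name sasi_index_cql is hypothetical, it only validates the mode value, and it assumes table, column, and option strings come from trusted input, since nothing is escaped.

```python
from typing import Dict, Optional

SASI_CLASS = "org.apache.cassandra.index.sasi.SASIIndex"
VALID_MODES = {"PREFIX", "CONTAINS", "SPARSE"}


def sasi_index_cql(table: str, column: str,
                   options: Optional[Dict[str, str]] = None) -> str:
    """Build a CREATE CUSTOM INDEX statement for a SASI index."""
    options = dict(options or {})
    # PREFIX is the default mode when none is given.
    mode = options.get("mode", "PREFIX")
    if mode not in VALID_MODES:
        raise ValueError(f"unknown SASI mode: {mode!r}")
    stmt = f"CREATE CUSTOM INDEX ON {table} ({column}) USING '{SASI_CLASS}'"
    if options:
        # Sort keys so the generated statement is deterministic.
        body = ", ".join(f"'{k}': '{v}'" for k, v in sorted(options.items()))
        stmt += f" WITH OPTIONS = {{ {body} }}"
    return stmt + ";"


print(sasi_index_cql("events", "timestamp", {"mode": "SPARSE"}))
```

Rejecting unknown modes before the statement reaches the cluster gives a clearer error than a failed schema change.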
Analyzer Options¶
For text search with CONTAINS mode:
-- Standard analyzer (word tokenization with optional stemming and stop-word removal)
CREATE CUSTOM INDEX ON articles (content)
USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = {
'mode': 'CONTAINS',
'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer',
'tokenization_enable_stemming': 'true',
'tokenization_locale': 'en',
'tokenization_skip_stop_words': 'true'
};
-- Non-tokenizing analyzer (exact substring matching)
CREATE CUSTOM INDEX ON logs (message)
USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = {
'mode': 'CONTAINS',
'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer',
'case_sensitive': 'false'
};
Query Capabilities¶
Text Queries¶
-- PREFIX: Starts with
SELECT * FROM users WHERE email LIKE 'john%';
-- CONTAINS: Substring anywhere
SELECT * FROM articles WHERE content LIKE '%database%';
-- Case insensitive (if configured)
SELECT * FROM products WHERE name LIKE 'Apple%';
Range Queries¶
-- Greater than
SELECT * FROM events WHERE timestamp > '2024-01-01';
-- Less than or equal
SELECT * FROM sensors WHERE reading <= 100.0;
-- Range (requires two conditions)
SELECT * FROM events
WHERE timestamp >= '2024-01-01'
AND timestamp < '2024-02-01';
Combined Queries¶
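-- Multiple SASI indexes
-- (CQL still requires ALLOW FILTERING when several indexed columns are restricted)
SELECT * FROM users
WHERE age > 25
AND city LIKE 'New%'
AND status = 'active'
ALLOW FILTERING;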
-- SASI with partition key (most efficient)
SELECT * FROM events
WHERE sensor_id = ?
AND timestamp > '2024-01-01';
Limitations¶
Memory Consumption¶
SASI queries can consume significant memory.
Problem:
- CONTAINS mode builds in-memory structures
- Large result sets are held in memory
- Intermediate results are not streamed
Symptoms:
- GC pressure during queries
- OOM errors on large datasets
- Query timeouts
Mitigation:
- Use LIMIT clauses
- Combine with partition key restrictions
- Tune max_compaction_flush_memory_in_mb (this bounds memory used for index builds during compaction, not query memory)
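The first two mitigations can be checked mechanically before a query is sent. The following Python function is a hypothetical, regex-based guard. It is not a CQL parser and assumes simple, single-statement queries.

```python
import re
from typing import List


def sasi_query_warnings(cql: str, partition_key: str) -> List[str]:
    """Flag SASI queries that lack a LIMIT or a partition key restriction."""
    warnings: List[str] = []
    # Accept both literal limits and bind markers (LIMIT 100, LIMIT ?).
    if not re.search(r"\bLIMIT\s+(\d+|\?)", cql, re.IGNORECASE):
        warnings.append("no LIMIT clause")
    # Look for "<partition_key> =" or "<partition_key> IN".
    key = re.escape(partition_key)
    if not re.search(rf"\b{key}\s*(=|IN\b)", cql, re.IGNORECASE):
        warnings.append("no partition key restriction")
    return warnings


print(sasi_query_warnings(
    "SELECT * FROM articles WHERE content LIKE '%database%'",
    "article_id",
))
```

A check like this fits in a code review script or a thin wrapper around the driver, where unbounded CONTAINS queries can be rejected before they reach the cluster.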
Tombstone Handling¶
SASI has known issues with tombstones:
- Problem: Tombstones are not always filtered correctly, so deleted data may appear in results
- Impact: Consistency issues in rare cases
CONTAINS Mode Overhead¶
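CONTAINS mode is built on suffix arrays rather than fixed-size n-grams: every suffix of each indexed term is stored, so a substring can be matched as the prefix of some suffix. This creates storage overhead:
Example: String "database"
Suffixes: database, atabase, tabase, abase, base, ase, se, e
Storage impact: The number of index terms grows with the length of each value, so long strings are disproportionately expensive
Query impact: More index entries to scan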
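To get a feel for how substring indexing multiplies index terms, the following Python sketch enumerates every suffix of a value, which is the expansion a suffix-array-based index such as SASI's CONTAINS mode relies on. The function names are hypothetical. The character count assumes each suffix is written out in full, so it is an upper-bound illustration, not a measurement of on-disk size.

```python
from typing import List, Tuple


def suffixes(term: str) -> List[str]:
    """Return every non-empty suffix of the term, longest first."""
    return [term[i:] for i in range(len(term))]


def expansion(term: str) -> Tuple[int, int]:
    """Return (number of index terms, total characters across them)."""
    terms = suffixes(term)
    return len(terms), sum(len(t) for t in terms)


print(suffixes("database"))
print(expansion("database"))  # prints (8, 36)
```

A term of length n yields n suffixes and n(n+1)/2 characters in this model, which is why long free-text columns are the costly case for CONTAINS indexes.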
Experimental Bugs¶
Known issues include:
- Memory leaks under specific query patterns
- Incorrect results with certain predicate combinations
- Performance degradation with high tombstone ratios
- Compaction issues with large indexes
Benefits¶
Range Query Support¶
Primary advantage over legacy secondary indexes:
-- Not possible with 2i, possible with SASI
SELECT * FROM metrics WHERE value > 100.0;
SELECT * FROM logs WHERE timestamp >= '2024-01-01';
Text Search¶
Built-in text search without external systems:
-- Substring search
SELECT * FROM products WHERE description LIKE '%wireless%';
-- Prefix search
SELECT * FROM users WHERE name LIKE 'John%';
Efficient Multi-Predicate Queries¶
Single-pass intersection reduces overhead:
-- 2i: Two scatter-gather operations + coordinator merge
-- SASI: Single pass through SSTables with local intersection
SELECT * FROM users
WHERE age > 25 AND city LIKE 'San%'
ALLOW FILTERING;
SSTable Integration¶
Index lifecycle matches data:
- Compacts together
- Deleted together
- No orphaned index entries
When to Use SASI¶
Acceptable Use Cases¶
| Scenario | Rationale |
|---|---|
| Development/testing with range queries | Faster than data model redesign |
| Low-traffic text search | Avoids external search system |
| Cassandra 3.4 - 4.x without SAI | Only built-in index type with range query support |
| Proof of concept | Validate query patterns before SAI migration |
Avoid SASI When¶
| Scenario | Alternative |
|---|---|
| Production Cassandra 5.0+ | Use SAI |
| High-throughput queries | Denormalized tables |
| Large CONTAINS searches | External search (Elasticsearch) |
| Mission-critical workloads | SAI or data model redesign |
Migration to SAI¶
For Cassandra 5.0+, migrate SASI indexes to SAI:
-- Drop SASI index
DROP INDEX IF EXISTS users_email_idx;
-- Create SAI index
CREATE INDEX users_email_idx ON users (email)
USING 'sai';
-- SAI with text analysis
CREATE INDEX users_bio_idx ON users (bio)
USING 'sai'
WITH OPTIONS = {
'index_analyzer': 'standard'
};
Migration considerations:
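- SAI syntax differs from SASI: indexes are created with CREATE INDEX ... USING 'sai', and the SASI mode option has no counterpart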
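- Some SASI analyzers have no SAI equivalent, and SAI in Cassandra 5.0 does not support LIKE, so PREFIX and CONTAINS queries may need to be redesigned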
- Test query patterns after migration
- SAI is production-ready; SASI is not
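For clusters with many SASI indexes, the drop-and-recreate statements can be generated rather than written by hand. The following Python helper is a sketch under stated assumptions: the name sasi_to_sai_cql is hypothetical, identifiers are taken as trusted input and not escaped, and any SAI options passed in are emitted verbatim without validation.

```python
from typing import Dict, List, Optional


def sasi_to_sai_cql(index_name: str, table: str, column: str,
                    sai_options: Optional[Dict[str, str]] = None) -> List[str]:
    """Return the CQL statements that replace a SASI index with an SAI index."""
    statements = [f"DROP INDEX IF EXISTS {index_name};"]
    create = f"CREATE INDEX {index_name} ON {table} ({column}) USING 'sai'"
    if sai_options:
        # Sort keys so the generated statement is deterministic.
        body = ", ".join(f"'{k}': '{v}'" for k, v in sorted(sai_options.items()))
        create += f" WITH OPTIONS = {{ {body} }}"
    statements.append(create + ";")
    return statements


for stmt in sasi_to_sai_cql("users_email_idx", "users", "email"):
    print(stmt)
```

Queries on the column are not served by an index between the DROP and the completion of the new index build, so the generated statements are best run during a maintenance window or after the application has stopped issuing those queries.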
Monitoring¶
Index Status¶
# Check index build status (index builds are reported alongside compaction tasks)
nodetool compactionstats
# Table statistics
nodetool tablestats keyspace.table
Warning Signs¶
| Symptom | Likely Cause | Action |
|---|---|---|
| High GC during queries | Memory pressure | Add LIMIT, restrict partition |
| Slow CONTAINS queries | Large CONTAINS index | Consider external search |
| Inconsistent results | Tombstone bugs | Verify with full scan |
| Compaction failures | Memory limits | Tune max_compaction_flush_memory_in_mb |
Related Documentation¶
- Index Overview - Index type comparison
- Secondary Indexes (2i) - Legacy indexes
- SAI - Recommended replacement (Cassandra 5.0+)
- Compaction - SSTable lifecycle