Materialized Views¶

Materialized Views (MVs) provide automatic denormalization in Cassandra, maintaining synchronized copies of data with different primary keys. While offering query flexibility, MVs have significant operational considerations that require careful evaluation.

Production Considerations

Materialized Views have known consistency and performance limitations. Evaluate carefully before production use. Many deployments prefer application-managed denormalization.

Background and History¶

Origins¶

Materialized Views were introduced in Cassandra 3.0 (2015) to address a common challenge: Cassandra's data model requires designing tables around query patterns, often leading to manual denormalization where the same data is written to multiple tables.

MVs automate this denormalization by having Cassandra maintain additional tables that are synchronized with a base table.

Design Goals¶

The MV implementation aimed to provide:

Automatic synchronization: Base table changes propagate to views
Query flexibility: Different primary key for alternate access patterns
Consistency: View data eventually consistent with base table
Reduced application complexity: No manual multi-table writes

Evolution and Status¶

MVs have had a complex production history:

Version	Status	Notes
3.0 (2015)	Introduced	Initial implementation
3.x	Experimental	Known consistency issues discovered
4.0	Improved	Bug fixes, but still has caveats
4.1+	Stable with caveats	Production use requires understanding limitations

The feature remains supported but is not recommended for all use cases.

Architecture¶

How Materialized Views Work¶

When a base table row is modified, Cassandra automatically updates corresponding rows in all materialized views:

View Update Mechanism¶

The MV update process involves several steps:

Key points:

Write arrives at coordinator
Coordinator computes what MV rows need to change
Base table mutation applied
For updates/deletes: current row read to determine old MV key
MV mutations sent asynchronously to MV replicas
Client acknowledgment based on base table consistency level

Partition Key Mapping¶

MVs require that the base table's partition key be part of the view's primary key:

-- Base table
CREATE TABLE users (
    user_id uuid PRIMARY KEY,
    email text,
    city text,
    age int
);

-- Valid MV: email as partition key, user_id as clustering
CREATE MATERIALIZED VIEW users_by_email AS
    SELECT * FROM users
    WHERE email IS NOT NULL AND user_id IS NOT NULL
    PRIMARY KEY (email, user_id);

-- Valid MV: city as partition key, user_id as clustering
CREATE MATERIALIZED VIEW users_by_city AS
    SELECT * FROM users
    WHERE city IS NOT NULL AND user_id IS NOT NULL
    PRIMARY KEY (city, user_id);

-- INVALID: base table PK not in view PK
CREATE MATERIALIZED VIEW invalid_view AS
    SELECT * FROM users
    WHERE city IS NOT NULL
    PRIMARY KEY (city);  -- Missing user_id

This constraint ensures Cassandra can: - Locate MV rows from base table rows - Delete MV rows when base rows are deleted - Maintain referential integrity

Creating Materialized Views¶

Basic Syntax¶

CREATE MATERIALIZED VIEW [IF NOT EXISTS] keyspace.view_name AS
    SELECT column1, column2, ...
    FROM keyspace.base_table
    WHERE column1 IS NOT NULL [AND column2 IS NOT NULL ...]
    PRIMARY KEY (new_partition_key, clustering_columns);

Examples¶

-- View for querying users by email
CREATE MATERIALIZED VIEW users_by_email AS
    SELECT user_id, email, name, city, created_at
    FROM users
    WHERE email IS NOT NULL AND user_id IS NOT NULL
    PRIMARY KEY (email, user_id);

-- View with compound partition key
CREATE MATERIALIZED VIEW orders_by_status_date AS
    SELECT *
    FROM orders
    WHERE status IS NOT NULL
      AND order_date IS NOT NULL
      AND order_id IS NOT NULL
    PRIMARY KEY ((status, order_date), order_id);

-- View with different clustering order
CREATE MATERIALIZED VIEW events_by_type_desc AS
    SELECT *
    FROM events
    WHERE event_type IS NOT NULL
      AND event_time IS NOT NULL
      AND event_id IS NOT NULL
    PRIMARY KEY (event_type, event_time, event_id)
    WITH CLUSTERING ORDER BY (event_time DESC, event_id ASC);

WHERE Clause Requirements¶

The WHERE clause must include IS NOT NULL for all primary key columns:

-- All PK columns must have IS NOT NULL
CREATE MATERIALIZED VIEW mv AS
    SELECT * FROM base
    WHERE pk_col1 IS NOT NULL      -- Required
      AND pk_col2 IS NOT NULL      -- Required
      AND ck_col1 IS NOT NULL      -- Required
    PRIMARY KEY (pk_col1, pk_col2, ck_col1);

-- Additional filtering allowed (but use cautiously)
CREATE MATERIALIZED VIEW active_users AS
    SELECT * FROM users
    WHERE status IS NOT NULL
      AND user_id IS NOT NULL
      AND status = 'active'        -- Additional filter
    PRIMARY KEY (status, user_id);

View Options¶

CREATE MATERIALIZED VIEW mv AS
    SELECT * FROM base_table
    WHERE ...
    PRIMARY KEY (...)
    WITH
        -- Clustering order
        CLUSTERING ORDER BY (col DESC)
        -- Compaction strategy
        AND compaction = {'class': 'LeveledCompactionStrategy'}
        -- Other table properties
        AND gc_grace_seconds = 864000
        AND bloom_filter_fp_chance = 0.01;

Querying Materialized Views¶

MVs are queried like regular tables:

-- Query view by its partition key
SELECT * FROM users_by_email WHERE email = '[email protected]';

-- Range query on clustering column
SELECT * FROM events_by_type
WHERE event_type = 'error'
  AND event_time > '2024-01-01';

-- All standard CQL query capabilities apply
SELECT user_id, name FROM users_by_city
WHERE city = 'NYC'
LIMIT 100;

Benefits¶

Automatic Synchronization¶

Primary advantage: Cassandra handles denormalization automatically.

-- Single write to base table
INSERT INTO users (user_id, email, city) VALUES (uuid(), '[email protected]', 'NYC');

-- Automatically visible in all views
SELECT * FROM users_by_email WHERE email = '[email protected]';  -- Returns row
SELECT * FROM users_by_city WHERE city = 'NYC';        -- Returns row

Query Flexibility¶

Enable queries on non-partition-key columns:

-- Base table: query by user_id only
SELECT * FROM users WHERE user_id = ?;

-- MV: query by email
SELECT * FROM users_by_email WHERE email = ?;

-- MV: query by city
SELECT * FROM users_by_city WHERE city = ?;

Reduced Application Complexity¶

Application code writes once; Cassandra maintains consistency:

# Without MV: application manages multiple tables
session.execute("INSERT INTO users ...")
session.execute("INSERT INTO users_by_email ...")
session.execute("INSERT INTO users_by_city ...")
# Must handle partial failures

# With MV: single write
session.execute("INSERT INTO users ...")
# Cassandra propagates to views

Limitations and Drawbacks¶

Consistency Challenges¶

MVs provide eventual consistency, not strong consistency:

Scenario: Write to base table, immediately query view

Timeline:
T0: INSERT INTO users (user_id=X, email='[email protected]')
T1: Client receives acknowledgment (base table written)
T2: SELECT * FROM users_by_email WHERE email='[email protected]'
T3: MV mutation arrives at replica

Problem: If T2 < T3, query returns empty result

Known consistency issues:

View may temporarily lag base table
In rare cases, view and base table may diverge
Repair needed to reconcile inconsistencies

Write Amplification¶

Each base table write generates additional writes:

Base table write: 1 mutation
+ 1 MV: 2 mutations total
+ 2 MVs: 3 mutations total
+ N MVs: N+1 mutations total

For updates (not inserts):
- Read-before-write required
- Old row read to compute old MV key
- Delete old MV row + insert new MV row

Impact:

Increased write latency
Higher disk I/O
More network traffic

Performance Overhead¶

MVs add overhead to both reads and writes:

Operation	Overhead
Insert	1 additional mutation per MV
Update	Read-before-write + 2 mutations per MV
Delete	Read-before-write + 1 delete per MV
Base table read	None
View read	None (same as table read)

Repair Complexity¶

MVs complicate repair operations:

# Repair base table AND views
nodetool repair keyspace base_table
nodetool repair keyspace view_name

# Views should be repaired after base table
# Order matters for consistency

Tombstone Interactions¶

Tombstones in MVs have unique challenges:

Deleting base row creates tombstones in all MVs
gc_grace_seconds applies independently
Tombstone scan issues can affect views

Limited Schema Flexibility¶

MV schema is constrained:

Cannot add columns to existing MV
Cannot change MV primary key
Must drop and recreate for schema changes

When to Use Materialized Views¶

Acceptable Use Cases¶

Scenario	Rationale
Low write volume	Write amplification acceptable
Strong eventual consistency acceptable	Read-your-writes not required
Simple alternate access pattern	Single column lookup
Development/prototyping	Faster iteration

Avoid Materialized Views When¶

Scenario	Alternative
High write throughput	Application-managed denormalization
Strong consistency required	Application-managed with logged batches
Complex query patterns	Secondary indexes (SAI)
Frequent schema changes	Separate tables
Many views per table	Reconsider data model

Alternatives¶

Application-Managed Denormalization¶

Write to multiple tables in application code:

# Write to both tables explicitly
batch = BatchStatement()
batch.add(insert_user_stmt, [user_id, email, city])
batch.add(insert_user_by_email_stmt, [email, user_id, city])
session.execute(batch)

Trade-offs:

Aspect	MV	Application-Managed
Consistency	Eventual	Configurable
Complexity	Lower	Higher
Control	Less	Full
Performance	Overhead	Optimizable
Schema changes	Difficult	Flexible

Secondary Indexes (SAI)¶

For Cassandra 5.0+, SAI often provides better alternative:

-- Instead of MV for email lookup
CREATE INDEX ON users (email) USING 'sai';

-- Query directly
SELECT * FROM users WHERE email = '[email protected]';

SAI advantages over MV:

No write amplification
Simpler operations
Better consistency
Lower storage overhead

MV advantages over SAI:

Full partition key control
Different clustering order
Wider query patterns

Operations¶

Creating Views on Existing Tables¶

When creating a view on a table with data:

-- View creation triggers build process
CREATE MATERIALIZED VIEW users_by_email AS ...;

-- Monitor build progress
nodetool viewbuildstatus

The build process: 1. Scans base table 2. Populates view incrementally 3. View queries return partial results during build

Monitoring View Health¶

# View build status
nodetool viewbuildstatus

# View metrics
nodetool tablestats keyspace.view_name

# Check for inconsistencies (requires repair)
nodetool repair keyspace base_table
nodetool repair keyspace view_name

Dropping Views¶

-- Drop view (does not affect base table)
DROP MATERIALIZED VIEW IF EXISTS users_by_email;

Repair Procedure¶

Maintain view consistency with regular repair:

# 1. Repair base table first
nodetool repair keyspace base_table

# 2. Repair each view
nodetool repair keyspace view_name

# For full cluster repair
nodetool repair -pr keyspace base_table
nodetool repair -pr keyspace view_name

Best Practices¶

Design Guidelines¶

Limit views per table: 1-2 views maximum
Consider SAI first: For Cassandra 5.0+
Understand consistency model: Accept eventual consistency
Plan for repair: Include views in repair schedule
Monitor build times: Large tables take time to build

Schema Design¶

-- Good: Simple alternate lookup
CREATE MATERIALIZED VIEW users_by_email AS
    SELECT user_id, email, name FROM users
    WHERE email IS NOT NULL AND user_id IS NOT NULL
    PRIMARY KEY (email, user_id);

-- Consider carefully: Compound partition key
CREATE MATERIALIZED VIEW orders_by_status_date AS
    SELECT * FROM orders
    WHERE status IS NOT NULL AND ...
    PRIMARY KEY ((status, order_date), order_id);
-- May create hot partitions

-- Avoid: Many columns that change
-- Each column change = view update

Operational Guidelines¶

Practice	Rationale
Regular repair	Maintain consistency
Monitor build status	Track view population
Test failover scenarios	Understand consistency during failures
Benchmark write impact	Quantify overhead
Have rollback plan	Know how to disable if issues arise

Index Overview - Alternative query patterns
SAI - Recommended for Cassandra 5.0+
Secondary Indexes - Legacy indexing
Write Path - How writes propagate
Compaction - View compaction