
Staged Event-Driven Architecture (SEDA)

Cassandra uses Staged Event-Driven Architecture (SEDA) to manage concurrency and resource allocation. Operations are decomposed into discrete stages, each with its own thread pool and queue. This design enables controlled parallelism, natural backpressure, and fine-grained performance tuning.


Why SEDA?

When Cassandra was designed (2008), servers typically had 2-8 CPU cores. Thread-per-request models were impractical: a server handling 10,000 concurrent requests would need 10,000 threads, but with only 4-8 cores most of those threads would sit idle while each consumed ~1 MB of stack memory and added context-switching overhead.

SEDA addressed this by:

  1. Matching threads to cores - Each stage has a small, bounded thread pool sized for available CPU
  2. Queues absorb concurrency - 10,000 requests become queue entries, not threads
  3. Work decomposition - Breaking requests into stages allows pipeline parallelism
  4. Natural backpressure - Full queues signal upstream to slow down
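
The mechanics fit in a few lines. Below is a minimal sketch of a SEDA-style stage using only java.util.concurrent; the Stage class is illustrative, not Cassandra's implementation (the real stages live in org.apache.cassandra.concurrent):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// A stage = bounded queue + fixed-size thread pool.
public final class Stage {
    private final BlockingQueue<Runnable> queue;
    private final ThreadPoolExecutor pool;

    public Stage(int threads, int queueCapacity) {
        this.queue = new ArrayBlockingQueue<>(queueCapacity);
        // Threads are sized to cores, not to concurrent requests.
        this.pool = new ThreadPoolExecutor(threads, threads,
                0L, TimeUnit.MILLISECONDS, queue);
        // Start the workers so they poll the queue immediately.
        this.pool.prestartAllCoreThreads();
    }

    // Blocking enqueue: 10,000 in-flight requests become queue entries,
    // not threads, and a full queue slows the submitter down.
    public void submit(Runnable task) throws InterruptedException {
        queue.put(task);
    }
}
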
| Era  | Typical Cores | Thread-per-Request                | SEDA                         |
|------|---------------|-----------------------------------|------------------------------|
| 2008 | 4-8           | 10K threads = disaster            | 4-8 threads per stage        |
| 2015 | 16-32         | Still problematic                 | Scales with core count       |
| 2024 | 64-128        | Context switching still expensive | Thread pools sized to cores  |



Architecture Overview

Request Flow Through Stages

A CQL request flows through multiple stages from arrival to response:

[Diagram: CQL request flow through stages]

Stage Components

Each stage consists of a queue and a thread pool:

[Diagram: a stage's queue feeding its thread pool]

| Component   | Metric             | Description                              |
|-------------|--------------------|------------------------------------------|
| Queue       | pending_tasks      | Tasks waiting for a thread               |
| Thread Pool | active_tasks       | Currently executing tasks                |
| Thread Pool | active_tasks_limit | Pool size (max concurrent)               |
| Counter     | completed_tasks    | Total completed since startup            |
| Counter     | blocked_tasks      | Tasks blocked on a full downstream queue |
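
For intuition, here is roughly how those metrics map onto a plain java.util.concurrent.ThreadPoolExecutor; the mapping is approximate, since Cassandra instruments its own executor implementations:

import java.util.concurrent.ThreadPoolExecutor;

// Approximate metric mapping for a ThreadPoolExecutor-backed stage
// like the sketch above.
static void reportStage(String name, ThreadPoolExecutor pool) {
    System.out.printf("%s active=%d limit=%d pending=%d completed=%d%n",
            name,
            pool.getActiveCount(),          // active_tasks
            pool.getMaximumPoolSize(),      // active_tasks_limit
            pool.getQueue().size(),         // pending_tasks
            pool.getCompletedTaskCount());  // completed_tasks
    // blocked_tasks has no direct executor counterpart: Cassandra counts
    // submitters stuck waiting to enqueue into a full downstream queue.
}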

Stage Categories

Request Path Stages

These stages are on the critical path for every request:

| Stage                     | Default Threads | Purpose                    | Tuning Parameter                     |
|---------------------------|-----------------|----------------------------|--------------------------------------|
| Native-Transport-Requests | 128             | CQL protocol handling      | native_transport_max_threads         |
| ReadStage                 | 32              | Local read execution       | concurrent_reads                     |
| MutationStage             | 32              | Local write execution      | concurrent_writes                    |
| CounterMutationStage      | 32              | Counter write execution    | concurrent_counter_writes            |
| ViewMutationStage         | 32              | Materialized view updates  | concurrent_materialized_view_writes  |
| RequestResponseStage      | (shared)        | Coordinator logic          | -                                    |

Storage Stages

Background stages for data persistence:

| Stage               | Default Threads | Purpose                     | Tuning Parameter        |
|---------------------|-----------------|-----------------------------|-------------------------|
| MemtableFlushWriter | 2               | Flush memtables to SSTables | memtable_flush_writers  |
| MemtablePostFlush   | 1               | Post-flush cleanup          | -                       |
| CompactionExecutor  | 2-4             | Compact SSTables            | concurrent_compactors   |
| ValidationExecutor  | 1               | Repair validation           | concurrent_validations  |

Cluster Communication Stages

Inter-node coordination:

| Stage                 | Purpose                    | When Active                  |
|-----------------------|----------------------------|------------------------------|
| GossipStage           | Cluster state propagation  | Always (background)          |
| AntiEntropyStage      | Repair coordination        | During repairs               |
| MigrationStage        | Schema changes             | During schema modifications  |
| HintsDispatcher       | Hint delivery              | While down nodes catch up    |
| InternalResponseStage | Internal coordination      | Various                      |

Auxiliary Stages

| Stage                    | Purpose                  |
|--------------------------|--------------------------|
| SecondaryIndexManagement | Index maintenance        |
| CacheCleanupExecutor     | Cache eviction           |
| Sampler                  | Query sampling           |
| PendingRangeCalculator   | Token range computation  |

Backpressure Mechanism

When a stage's queue fills, Cassandra applies backpressure to upstream stages. This prevents memory exhaustion and cascading failures.

Backpressure Flow

[Diagram: a full queue in Stage C propagating backpressure through Stage B to Stage A]

When Stage C's queue fills, Stage B blocks trying to enqueue. Stage B's queue then fills, blocking Stage A. Eventually the client experiences increased latency or receives OverloadedException.
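
The cascade can be reproduced with the illustrative Stage class sketched earlier; the stage sizes and the simulated slow work below are made up for the sketch:

// stageB is slow with a tiny queue, so backpressure becomes visible.
public static void main(String[] args) throws InterruptedException {
    Stage stageB = new Stage(1, 2);   // downstream: 1 thread, queue of 2
    Stage stageA = new Stage(4, 100); // upstream

    for (int i = 0; i < 1_000; i++) {
        // Once stageA's queue fills, this call blocks: the "client"
        // (this thread) now feels the backpressure directly.
        stageA.submit(() -> {
            try {
                // When stageB's queue is full, put() blocks here,
                // tying up a stageA worker and filling stageA's queue.
                stageB.submit(() -> sleepQuietly(10)); // simulate slow I/O
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
    }
}

static void sleepQuietly(long millis) {
    try {
        Thread.sleep(millis);
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
    }
}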

Blocked Tasks

When a stage cannot enqueue work to the next stage:

  1. The submitting thread blocks
  2. blocked_tasks metric increments
  3. If persistent, blocked_tasks_all_time accumulates
  4. Eventually, backpressure reaches the client
-- Find stages experiencing backpressure (CQL has no OR, so filter on one
-- condition at a time; virtual tables accept ALLOW FILTERING)
SELECT name, pending_tasks, blocked_tasks, blocked_tasks_all_time
FROM system_views.thread_pools
WHERE blocked_tasks > 0 ALLOW FILTERING;

Native Transport Backpressure (Cassandra 4.0+)

The native transport layer has memory-based backpressure:

# cassandra.yaml

# Total bytes allowed in flight (node-wide)
native_transport_max_concurrent_requests_in_bytes: 0  # unlimited by default

# Bytes allowed per client IP
native_transport_max_concurrent_requests_in_bytes_per_ip: 0  # unlimited

# Response when overloaded
# false: Apply TCP backpressure (recommended)
# true: Return OverloadedException immediately
native_transport_throw_on_overload: false

When limits are exceeded:

  • throw_on_overload: false (default): the node stops reading from client sockets, TCP buffers fill, and clients naturally slow down
  • throw_on_overload: true: the node returns OverloadedException immediately, and clients must handle it
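
With throw_on_overload: true, the client has to be ready to catch the error. One way to do that, sketched with the DataStax Java driver 4.x (the attempt count and backoff values are arbitrary, and the driver's configured retry policy may also act on this error):

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.ResultSet;
import com.datastax.oss.driver.api.core.servererrors.OverloadedException;

// Retry with exponential backoff when the node reports overload.
static ResultSet executeWithBackoff(CqlSession session, String cql)
        throws InterruptedException {
    long backoffMillis = 50;
    for (int attempt = 0; attempt < 5; attempt++) {
        try {
            return session.execute(cql);
        } catch (OverloadedException e) {
            // The node shed the request; give its queues time to drain.
            Thread.sleep(backoffMillis);
            backoffMillis *= 2;
        }
    }
    throw new IllegalStateException("node still overloaded after retries");
}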

Metrics Reference

Per-Stage Metrics

Each stage exposes these metrics via JMX and virtual tables:

| Metric                 | Type    | Description                                              |
|------------------------|---------|----------------------------------------------------------|
| active_tasks           | Gauge   | Currently executing tasks                                |
| active_tasks_limit     | Gauge   | Thread pool size (max concurrent)                        |
| pending_tasks          | Gauge   | Tasks queued waiting for execution                       |
| blocked_tasks          | Gauge   | Tasks currently blocked (waiting to enqueue downstream)  |
| blocked_tasks_all_time | Counter | Total blocked tasks since startup                        |
| completed_tasks        | Counter | Total completed tasks since startup                      |

JMX Path

org.apache.cassandra.metrics:type=ThreadPools,path=<category>,scope=<stage>,name=<metric>

Examples:
org.apache.cassandra.metrics:type=ThreadPools,path=request,scope=ReadStage,name=PendingTasks
org.apache.cassandra.metrics:type=ThreadPools,path=request,scope=MutationStage,name=ActiveTasks
org.apache.cassandra.metrics:type=ThreadPools,path=internal,scope=CompactionExecutor,name=CompletedTasks
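
The same attributes can be read programmatically with the JDK's javax.management API alone; localhost, the default JMX port 7199, and unauthenticated JMX are assumptions in this sketch:

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Read one stage metric over JMX.
public static void main(String[] args) throws Exception {
    JMXServiceURL url = new JMXServiceURL(
            "service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi");
    try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
        MBeanServerConnection mbs = connector.getMBeanServerConnection();
        ObjectName pending = new ObjectName(
                "org.apache.cassandra.metrics:type=ThreadPools,"
                + "path=request,scope=ReadStage,name=PendingTasks");
        // Gauges expose their reading through the "Value" attribute.
        System.out.println("ReadStage pending_tasks = "
                + mbs.getAttribute(pending, "Value"));
    }
}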

Monitoring Commands

# All thread pool stats
nodetool tpstats

# Via virtual table
cqlsh -e "SELECT name, active_tasks, pending_tasks, blocked_tasks
          FROM system_views.thread_pools;"

# Via JMX
nodetool sjk mx -mg -b "org.apache.cassandra.metrics:type=ThreadPools,path=request,scope=ReadStage,name=PendingTasks" -f Value

Key Thresholds

| Stage                     | Metric        | Warning       | Critical |
|---------------------------|---------------|---------------|----------|
| MutationStage             | pending_tasks | > 0 sustained | > 10     |
| MutationStage             | blocked_tasks | > 0           | Any      |
| ReadStage                 | pending_tasks | > 50          | > 100    |
| ReadStage                 | blocked_tasks | > 0           | Any      |
| MemtableFlushWriter       | pending_tasks | > 2           | > 5      |
| CompactionExecutor        | pending_tasks | > 50          | > 100    |
| Native-Transport-Requests | pending_tasks | > 500         | > 1000   |

Configuration Tuning

Thread Pool Sizing

# cassandra.yaml

# Request path stages
concurrent_reads: 32              # ReadStage threads
concurrent_writes: 32             # MutationStage threads
concurrent_counter_writes: 32     # CounterMutationStage threads
concurrent_materialized_view_writes: 32

# Storage stages
memtable_flush_writers: 2         # MemtableFlushWriter threads
concurrent_compactors: 4          # CompactionExecutor threads
concurrent_validations: 1         # ValidationExecutor threads

# Native transport
native_transport_max_threads: 128

Sizing Guidelines

| Stage               | Sizing Consideration                                                   |
|---------------------|------------------------------------------------------------------------|
| ReadStage           | Match to disk parallelism. NVMe: 32-64. HDD: 8-16.                     |
| MutationStage       | Similar to ReadStage. Memory-bound, not disk-bound.                    |
| MemtableFlushWriter | 2-4 typically. More if flushing falls behind.                          |
| CompactionExecutor  | Balance with read/write load. Too many threads impact foreground ops.  |

Dynamic Tuning

Some settings can be changed at runtime:

# Change compaction threads
nodetool setconcurrentcompactors 8

# Check current setting
nodetool getconcurrentcompactors

Troubleshooting

Identifying Bottleneck Stages

-- Find the problem stage (CQL cannot ORDER BY a regular column or
-- combine predicates with OR; sort client-side if needed)
SELECT name,
       active_tasks,
       pending_tasks,
       blocked_tasks,
       completed_tasks
FROM system_views.thread_pools
WHERE pending_tasks > 0 ALLOW FILTERING;

Common Issues

MutationStage Blocked

Symptoms: blocked_tasks > 0, write latency spikes

Cause: Downstream stage (usually MemtableFlushWriter) cannot keep up

Resolution:

  1. Check disk I/O: iostat -x 1
  2. Increase flush writers: memtable_flush_writers: 4
  3. Verify the commit log is on a fast disk

ReadStage Backed Up

Symptoms: High pending_tasks, read latency increases

Cause: Disk I/O bottleneck, large partitions, or tombstone scanning

Resolution:

  1. Check partition sizes
  2. Review tombstone metrics
  3. Verify the key cache hit ratio
  4. Consider faster storage or more nodes

CompactionExecutor Behind

Symptoms: pending_tasks > 100, growing disk usage, increasing read latency

Cause: Write rate exceeds compaction throughput

Resolution:

  1. Increase concurrent_compactors
  2. Review the compaction strategy
  3. Check that compactions are completing (not stuck)

Native Transport Saturated

Symptoms: Native-Transport-Requests high pending, client timeouts

Cause: More requests than the node can handle

Resolution:

  1. Add nodes to distribute load
  2. Implement client-side throttling
  3. Review query patterns for inefficiencies


Historical Context

SEDA was introduced in the 2001 SOSP paper by Matt Welsh, David Culler, and Eric Brewer at UC Berkeley. The paper addressed a fundamental problem of that era: servers had very few CPU cores (often 1-4) but needed to handle thousands of concurrent connections.

Cassandra adopted SEDA from its earliest versions (2008) when 4-8 core servers were common. The architecture allowed Cassandra to handle massive concurrency without creating thousands of threads.

Why SEDA remains relevant today:

Even with 64-128 core servers, SEDA provides benefits beyond core matching:

  • Graceful degradation - Under overload, throughput plateaus rather than collapsing
  • Resource isolation - Slow reads don't block writes, compaction doesn't block queries
  • Visibility - Per-stage metrics pinpoint exactly where bottlenecks occur
  • Tuning granularity - Disk-bound stages can have different thread counts than CPU-bound stages