ReadTimeoutException¶

ReadTimeoutException is raised when the coordinator does not receive enough replica responses to satisfy the requested consistency level within the configured read timeout.

Error Message¶

com.datastax.driver.core.exceptions.ReadTimeoutException: Cassandra timeout during read query
at consistency LOCAL_QUORUM (2 responses were required but only 1 replica responded)

Or in logs:

ERROR [ReadStage-2] ReadCallback.java:123 - Read timeout: 5000 ms

Symptoms¶

Client receives ReadTimeoutException
High read latencies in nodetool proxyhistograms
Increased ReadTimeout count in JMX metrics
Slow or unresponsive queries

Diagnosis¶

Step 1: Check Read Latency¶

# Overall read latencies
nodetool proxyhistograms

# Per-table latencies
nodetool tablehistograms my_keyspace my_table

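What to look for: p99 read latencies approaching or exceeding the read timeout (read_request_timeout_in_ms, default 5000 ms). Note that nodetool reports these latencies in microseconds.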
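As a rough sketch of that comparison, the following Python helper pulls the 99% read latency out of captured `nodetool proxyhistograms` output and converts it to milliseconds. It assumes the row label is `99%` and that read latency is the first numeric column, in microseconds; check this against the output of your own Cassandra version. The function names are illustrative and not part of any Cassandra tooling.

```python
from typing import Optional


def p99_read_latency_ms(proxyhistograms_output: str) -> Optional[float]:
    """Return the 99th percentile read latency in ms, or None if not found.

    Assumption: the row starts with "99%" and the first numeric column is
    the read latency in microseconds.
    """
    for line in proxyhistograms_output.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "99%":
            return float(fields[1]) / 1000.0
    return None


def exceeds_timeout(p99_ms: float, timeout_ms: float = 5000.0) -> bool:
    """True when the p99 read latency is at or above the read timeout."""
    return p99_ms >= timeout_ms
```

The default of 5000 ms mirrors the default timeout mentioned above; pass the value from your own cassandra.yaml if it differs.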

Step 2: Check Table Statistics¶

nodetool tablestats my_keyspace.my_table

Key metrics:

Metric	Warning Level	Indicates
SSTable count	> 20	Compaction lag
Partition size (max)	> 100MB	Large partitions
Tombstones per slice (max)	> 1000	Too many tombstones
Bloom filter false positive ratio	> 0.01	Inefficient bloom filters
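The thresholds in this table can be checked mechanically. The sketch below parses captured `nodetool tablestats` output for a single table and returns the values that exceed the warning levels. The mapping from table rows to tablestats labels is an assumption (for example, partition size is read from "Compacted partition maximum bytes" and tombstones from "Maximum tombstones per slice (last five minutes)"), and 100MB is taken as 100 * 1024 * 1024 bytes. Labels can differ between Cassandra versions, so treat this as a starting point.

```python
# Warning levels from the table above, keyed by the tablestats label that is
# assumed to carry each metric.
WARN_LIMITS = {
    "SSTable count": 20,
    "Compacted partition maximum bytes": 100 * 1024 * 1024,
    "Maximum tombstones per slice (last five minutes)": 1000,
    "Bloom filter false ratio": 0.01,
}


def tablestats_warnings(tablestats_output: str) -> dict:
    """Return {label: value} for every metric above its warning level.

    Expects the output for one table; with several tables in the input,
    later values overwrite earlier ones.
    """
    warnings = {}
    for line in tablestats_output.splitlines():
        key, sep, value = line.strip().partition(":")
        if not sep or key not in WARN_LIMITS:
            continue
        try:
            number = float(value.strip())
        except ValueError:
            continue  # Skip values that are not plain numbers.
        if number > WARN_LIMITS[key]:
            warnings[key] = number
    return warnings
```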

Step 3: Check for Tombstones¶

# Enable query tracing
cqlsh> TRACING ON;
cqlsh> SELECT * FROM my_table WHERE pk = 'value';

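Look for: trace events that report tombstone counts, such as "Read N live rows and M tombstone cells" (the exact wording varies by Cassandra version).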
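When a trace is long, scanning it by eye is error-prone. A minimal sketch, assuming the trace has been saved as text and that tombstone counts appear as a number directly followed by the word "tombstone" (the message wording is an assumption and varies by version), is to extract the largest count:

```python
import re

# Matches "<number> tombstone", e.g. "2500 tombstone cells".
_TOMBSTONE_RE = re.compile(r"(\d+)\s+tombstone")


def max_tombstones_in_trace(trace_text: str) -> int:
    """Return the largest tombstone count mentioned in the trace, or 0."""
    counts = [int(m.group(1)) for m in _TOMBSTONE_RE.finditer(trace_text)]
    return max(counts, default=0)
```

A result in the thousands for a single-partition read is a strong hint that tombstones are contributing to the timeout.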

Step 4: Check Thread Pool Status¶

nodetool tpstats

Warning signs:

- ReadStage pending > 0 for extended periods
- ReadStage blocked > 0
- High All time blocked count
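To track these values over time rather than reading them once, the ReadStage row can be extracted from captured `nodetool tpstats` output. This sketch assumes the column order Active, Pending, Completed, Blocked, All time blocked after the pool name; verify it against the header line your version prints. The function name is illustrative.

```python
from typing import Optional


def read_stage_stats(tpstats_output: str) -> Optional[dict]:
    """Return the ReadStage counters as a dict, or None if the row is missing.

    Assumed column order after the pool name:
    Active, Pending, Completed, Blocked, All time blocked.
    """
    for line in tpstats_output.splitlines():
        fields = line.split()
        if len(fields) >= 6 and fields[0] == "ReadStage":
            active, pending, completed, blocked, all_time_blocked = map(
                int, fields[1:6]
            )
            return {
                "active": active,
                "pending": pending,
                "completed": completed,
                "blocked": blocked,
                "all_time_blocked": all_time_blocked,
            }
    return None
```

Sampling this every few seconds and checking whether `pending` stays above zero across consecutive samples distinguishes a sustained backlog from a momentary spike.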

Step 5: Check Disk I/O¶

iostat -x 1 5

Warning signs:

- %util > 80%
- await > 10ms
- High r_await (read wait time)
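The following sketch applies the first two warning signs to captured `iostat -x` output. Because the column set differs between sysstat versions, it locates columns by the header line that starts with "Device" rather than by position. Using `r_await` when no `await` column exists, with the same 10 ms limit, is an assumption made here and not something iostat defines.

```python
def busy_devices(iostat_output, util_limit=80.0, await_limit_ms=10.0):
    """Return a sorted list of devices whose %util or await exceeds the limits."""
    columns = None
    flagged = set()
    for line in iostat_output.splitlines():
        fields = line.split()
        if not fields:
            columns = None  # A blank line ends the current device table.
            continue
        if fields[0].startswith("Device"):
            columns = fields[1:]
            continue
        if columns is None or len(fields) != len(columns) + 1:
            continue  # Not a device row (for example the avg-cpu section).
        try:
            row = dict(zip(columns, map(float, fields[1:])))
        except ValueError:
            continue
        # Assumption: fall back to r_await when there is no await column.
        wait = row.get("await", row.get("r_await", 0.0))
        if row.get("%util", 0.0) > util_limit or wait > await_limit_ms:
            flagged.add(fields[0])
    return sorted(flagged)
```

With `iostat -x 1 5`, the first report shows averages since boot, so a device flagged only in that first report may not be busy right now.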

Step 6: Check Garbage Collection¶

grep "GC pause" /var/log/cassandra/gc.log | tail -20

Warning signs:

- GC pauses > 500ms
- Frequent full GC

Root Causes¶

1. Large Partitions¶

Symptoms: Queries against specific partitions are slow, while queries against other partitions in the same table are fast.

Verify:

nodetool tablestats my_keyspace.my_table | grep -i partition

Solution:

- Redesign the data model to use smaller partitions
- Use bucketing (e.g., by time period)
- Set partition size alerts
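Bucketing by time period means adding a time-derived component to the partition key, so that one logical entity is spread across many bounded partitions. A minimal sketch on the application side, assuming a hypothetical `sensor_id` identifier and one bucket per UTC day (both choices are illustrative, not prescribed by this guide):

```python
from datetime import datetime, timezone


def day_bucket(ts: datetime) -> str:
    """Return the UTC calendar day of a timezone-aware timestamp."""
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%d")


def partition_key(sensor_id: str, ts: datetime) -> tuple:
    """Build a composite partition key of (sensor_id, day bucket).

    sensor_id is a hypothetical identifier used only for illustration.
    """
    return (sensor_id, day_bucket(ts))
```

The bucket size is a trade-off: smaller buckets keep partitions small but require more partitions to be queried for a given time range.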

2. Too Many Tombstones¶

Symptoms: Queries become slower after deletes, slow range queries.

Verify:

-- Run with tracing
TRACING ON;
SELECT * FROM my_table WHERE partition_key = 'x';
-- Look for trace events that report tombstone counts

Solution:

- Reduce gc_grace_seconds (only after ensuring that repair completes on every node within the new gc_grace_seconds window)
- Avoid deletes if possible (use TTL instead)
- Run compaction to purge tombstones

nodetool compact my_keyspace my_table

3. High SSTable Count¶

Symptoms: Read latency increases over time, compaction pending.

Verify:

nodetool compactionstats
nodetool tablestats my_keyspace.my_table | grep "SSTable count"

Solution:

- Increase compaction throughput (the value is in MB/s)

nodetool setcompactionthroughput 128

- Force compaction

nodetool compact my_keyspace my_table

4. Slow Disk I/O¶

Symptoms: All operations slow, high disk utilization.

Verify:

iostat -x 1 10

Solution:

- Use SSDs (strongly recommended for production)
- Check for disk issues
- Reduce concurrent operations

5. Network Issues¶

Symptoms: Only some nodes show timeouts, inter-node latency high.

Verify:

# Check streaming activity and inter-node message backlog
nodetool netstats

# Ping test
ping -c 10 <other-node-ip>

Solution:

- Check network infrastructure
- Verify firewall rules
- Check for packet loss

6. GC Pressure¶

Symptoms: Intermittent timeouts, correlated with GC pauses.

Verify:

grep -E "GC|pause" /var/log/cassandra/gc.log | tail -50

Solution:

- Tune heap size (typically 8-31GB)
- Adjust GC settings
- Check for memory leaks

7. Overloaded Cluster¶

Symptoms: All operations slow, high thread pool utilization.

Verify:

nodetool tpstats

Solution:

- Add nodes
- Reduce traffic
- Optimize queries

Resolution¶

Immediate Actions¶

Increase timeout (temporary fix):

# cassandra.yaml (Cassandra 4.1 and later name this setting read_request_timeout, e.g. 10000ms)
read_request_timeout_in_ms: 10000

Reduce consistency level (if acceptable):

CONSISTENCY LOCAL_ONE;
SELECT * FROM my_table WHERE pk = 'x';

Force compaction (if SSTable count high):
```
nodetool compact my_keyspace my_table
```

Long-term Fixes¶

Fix data model:
Smaller partitions (< 100MB)
Avoid unbounded partition growth
Use appropriate clustering columns

Tune compaction:

-- For time-series data
ALTER TABLE my_keyspace.my_table
WITH compaction = {'class': 'TimeWindowCompactionStrategy',
                   'compaction_window_unit': 'HOURS',
                   'compaction_window_size': '1'};

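Add capacity:
Add nodes if cluster is overloaded
Scale horizontally

Optimize queries:
Add indexes only where they match the access pattern (secondary indexes on high-cardinality columns can themselves cause slow reads)
Use LIMIT clause
Avoid full table scans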

Prevention¶

Monitoring¶

Set up alerts for:

Metric	Warning	Critical
Read latency p99	> 50ms	> 500ms
SSTable count	> 20	> 50
ReadStage pending	> 0 (sustained)	> 10
GC pause time	> 200ms	> 500ms
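These thresholds translate directly into a small classification function that a monitoring script could call. The sketch below encodes the table above; the metric key names are made up for illustration, and the "sustained" qualifier on ReadStage pending is not modeled, so callers would need to apply it themselves (for example by requiring several consecutive samples).

```python
# (warning, critical) limits taken from the table above.
# Key names are illustrative, not real Cassandra metric names.
THRESHOLDS = {
    "read_latency_p99_ms": (50, 500),
    "sstable_count": (20, 50),
    "read_stage_pending": (0, 10),
    "gc_pause_ms": (200, 500),
}


def alert_level(metric: str, value: float) -> str:
    """Classify a metric value as "ok", "warning", or "critical"."""
    warning, critical = THRESHOLDS[metric]
    if value > critical:
        return "critical"
    if value > warning:
        return "warning"
    return "ok"
```

For example, `alert_level("sstable_count", 35)` returns `"warning"` because 35 is above 20 but not above 50.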

Best Practices¶

Design for your queries: Model data based on access patterns
Keep partitions small: Target < 100MB per partition
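Limit tombstones: Prefer TTL to explicit DELETE when possible (expired data still becomes tombstones, but TimeWindowCompactionStrategy can drop fully expired SSTables as a whole)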
Monitor proactively: Track read latencies over time
Run repairs: Keep data consistent across replicas

Related¶

WriteTimeoutException
Diagnosis Guide
