WriteTimeoutException Troubleshooting¶

Symptoms¶

WriteTimeoutException in application logs
Error message: Cassandra timeout during SIMPLE write query
Client receives timeout error before operation completes
Consistency level not achieved within timeout period

Error Message Examples¶

com.datastax.driver.core.exceptions.WriteTimeoutException:
Cassandra timeout during SIMPLE write query at consistency QUORUM
(2 replicas were required but only 1 acknowledged the write)

WriteTimeout: Error from server: code=1100 [Coordinator node timed out waiting
for replica nodes' responses] message="Operation timed out - received only
1 responses." info={'received_responses': 1, 'required_responses': 2,
'consistency': 'QUORUM', 'write_type': 'SIMPLE'}

Common Causes¶

1. Overloaded Nodes¶

Replicas are too busy to process writes in time.

Diagnosis:

# Check MutationStage pending/blocked
nodetool tpstats | grep -E "Pool|Mutation"

# Check for dropped messages
nodetool tpstats | grep -E "MUTATION|Dropped"

# High MutationStage pending indicates overload

Resolution: - Scale cluster (add nodes) - Reduce write throughput - Increase concurrent_writes in cassandra.yaml - Check for hot partitions

2. Network Issues¶

Network latency or partition between coordinator and replicas.

Diagnosis:

# Check network latency
ping <replica_ip>

# Check gossip state
nodetool gossipinfo | grep -A5 <replica_ip>

# Look for UnreachableMembers
nodetool status

Resolution: - Fix network connectivity issues - Check firewall rules (port 7000) - Verify inter-node networking

3. Replica Node Down or Unresponsive¶

Not enough live replicas to satisfy consistency level.

Diagnosis:

# Check cluster status
nodetool status

# Look for Down (D) or unavailable nodes
# Check if nodes are joining/leaving (J/L)

Resolution: - Bring down nodes back up - Lower consistency level (if acceptable) - Replace dead nodes

4. Disk I/O Bottleneck¶

Commit log writes are slow.

Diagnosis:

# Check disk I/O
iostat -xz 1 5

# Check commit log directory latency
# Look for high await times

# Check for disk full
df -h /var/lib/cassandra

Resolution: - Use SSD/NVMe for commit log - Separate commit log from data directory - Increase disk capacity

5. Large Writes or Batches¶

Individual writes or batches too large.

Diagnosis:

# Check for batch size warnings
grep -i "batch" /var/log/cassandra/system.log

# Check write sizes in table stats
nodetool tablestats my_keyspace.table | grep -i write

Resolution: - Reduce batch sizes - Split large writes into smaller operations - Review data model for write amplification

6. GC Pauses¶

Long garbage collection pauses blocking writes.

Diagnosis:

# Check GC logs
tail -100 /var/log/cassandra/gc.log

# Look for long pauses (> 500ms)
grep -E "pause.*[0-9]{4,}ms" /var/log/cassandra/gc.log

Resolution: - Tune GC settings - Reduce heap size if too large - Consider ZGC (JDK 17+)

Write Types¶

The write_type in the error indicates what kind of write failed:

Write Type	Description	Common Issues
`SIMPLE`	Regular INSERT/UPDATE	Node overload, network
`BATCH`	Batch operation	Batch too large, spanning partitions
`UNLOGGED_BATCH`	Unlogged batch	Same as BATCH
`COUNTER`	Counter update	Counter replica issues
`BATCH_LOG`	Batch log write	Batch log nodes unavailable
`CAS`	Lightweight transaction	Paxos timeout, contention
`VIEW`	Materialized view	View replicas unavailable

Step-by-Step Diagnosis¶

Step 1: Check Write Type and Consistency Level¶

-- What was the query?
-- Check consistency level requirement
CONSISTENCY;

-- For QUORUM with RF=3, need 2 replicas
-- For LOCAL_QUORUM with RF=3 in DC, need 2 in that DC

Step 2: Check Cluster Status¶

# Are all replicas up?
nodetool status

# Check which nodes own the partition
nodetool getendpoints my_keyspace my_table <partition_key>

Step 3: Check Node Health¶

# On coordinator and replicas:

# Thread pools
nodetool tpstats

# Dropped messages (critical)
nodetool tpstats | grep -E "Dropped|MUTATION"

# GC activity
nodetool gcstats

# Recent errors
tail -100 /var/log/cassandra/system.log | grep -i error

Step 4: Check Resource Utilization¶

# CPU
top -b -n 1 | head -20

# Memory
free -h

# Disk I/O
iostat -xz 1 5

# Network
sar -n DEV 1 5

Step 5: Review Metrics¶

# Write latency
nodetool proxyhistograms

# Table-level write stats
nodetool tablestats my_keyspace.table | grep -i write

Resolution Strategies¶

Immediate Actions¶

Lower consistency level (temporary):

-- If eventual consistency is acceptable
CONSISTENCY LOCAL_ONE;
INSERT INTO table ...;

Increase timeout (application):

// Java driver
SimpleStatement stmt = SimpleStatement.builder("INSERT ...")
    .setTimeout(Duration.ofSeconds(30))
    .build();

Add retry logic (application):

RetryPolicy retry = new RetryPolicy() {
    @Override
    public RetryDecision onWriteTimeout(...) {
        if (writeType == WriteType.SIMPLE && retryCount < 3) {
            return RetryDecision.retry(cl);
        }
        return RetryDecision.rethrow();
    }
};

Long-term Fixes¶

Scale cluster:

# Add nodes to distribute load
# Monitor: nodetool status during expansion

Tune cassandra.yaml:

# Increase write timeout
write_request_timeout_in_ms: 10000  # default: 2000

# Increase concurrent writes (if CPU available)
concurrent_writes: 64  # default: 32

# Optimize commit log
commitlog_sync: periodic
commitlog_sync_period_in_ms: 10000

Review data model:
Check for hot partitions
Split large partitions
Reduce write amplification
Optimize hardware:
Use NVMe for commit log
Increase network bandwidth
Add RAM for more memtable space

Prevention¶

Best Practices¶

# Appropriate timeouts
write_request_timeout_in_ms: 5000  # 5 seconds

# Adequate thread pool
concurrent_writes: 32  # Based on disk count

# Proper sizing
# memtable_heap_space_in_mb: adequate for workload

Monitoring¶

# Alert on:
- Write latency p99 > 1 second
- Dropped MUTATION messages > 0
- MutationStage pending > 100
- Disk I/O await > 10ms

Application Design¶

// Set appropriate timeouts
DriverConfigLoader loader = DriverConfigLoader.programmaticBuilder()
    .withDuration(DefaultDriverOption.REQUEST_TIMEOUT, Duration.ofSeconds(10))
    .build();

// Use async writes for high throughput
session.executeAsync(statement);

// Implement proper error handling
try {
    session.execute(statement);
} catch (WriteTimeoutException e) {
    // Log and handle appropriately
    // Consider: retry, queue for later, alert
}

ReadTimeoutException - Read timeout troubleshooting

Next Steps¶

Thread Pool Tuning - Optimize thread pools
Troubleshooting Guide - Additional troubleshooting guidance
Monitoring Guide - Set up alerts

WriteTimeoutException Troubleshooting¶

Symptoms¶

Error Message Examples¶

Common Causes¶

1. Overloaded Nodes¶

2. Network Issues¶

3. Replica Node Down or Unresponsive¶

4. Disk I/O Bottleneck¶

5. Large Writes or Batches¶

6. GC Pauses¶

Write Types¶

Step-by-Step Diagnosis¶

Step 1: Check Write Type and Consistency Level¶

Step 2: Check Cluster Status¶

Step 3: Check Node Health¶

Step 4: Check Resource Utilization¶

Step 5: Review Metrics¶

Resolution Strategies¶

Immediate Actions¶

Long-term Fixes¶

Prevention¶

Best Practices¶

Monitoring¶

Application Design¶

Related Issues¶

Next Steps¶