Large Partition Issues¶
Large partitions are a common cause of performance problems in Cassandra. They cause slow reads, OOM during compaction, and uneven data distribution.
Symptoms¶
- Slow reads on specific partition keys
- OOM errors during compaction
- "Compacting large partition" warnings in logs
- Uneven disk usage across nodes
- Read timeouts on specific queries
- High GC pressure when accessing certain data
Diagnosis¶
Step 1: Check for Large Partition Warnings¶
grep -i "large partition\|compacting large\|Writing large partition" /var/log/cassandra/system.log | tail -20
Example warning:
WARN Writing large partition my_keyspace/my_table:key123 (150.2 MiB) to sstable
Step 2: Check Table Statistics¶
nodetool tablestats my_keyspace.my_table
Key metrics:
- Compacted partition maximum bytes: Largest partition
- Compacted partition mean bytes: Average partition size
Target: Keep partitions under 100 MB.
Step 3: Identify Specific Large Partitions¶
# Use sstablepartitions tool (Cassandra 4.0+)
sstablepartitions /var/lib/cassandra/data/my_keyspace/my_table-*/nb-*-big-Data.db --min-size 50MB
Or query with tracing:
TRACING ON;
SELECT * FROM my_table WHERE partition_key = 'suspect_key' LIMIT 1;
TRACING OFF;
Step 4: Check Partition Size Configuration¶
grep -i "partition_size\|compaction_large" /etc/cassandra/cassandra.yaml
Step 5: Monitor During Access¶
# Watch for GC during large partition access
nodetool gcstats
# Watch heap usage
nodetool info | grep Heap
Resolution¶
Immediate: Handle OOM During Compaction¶
Temporarily skip large partition:
# Not recommended for production, but for emergency
# Reduce compaction throughput
nodetool setcompactionthroughput 16
Increase heap (temporary):
This is a workaround, not a fix:
# In jvm.options
-Xmx16G # Increase temporarily
Short-term: Adjust Compaction Settings¶
# cassandra.yaml - increase limits for large partitions
compaction_large_partition_warning_threshold_mb: 100
Long-term: Fix Data Model¶
Problem Pattern 1: Unbounded partition growth
-- Bad: All events for a user in one partition
CREATE TABLE user_events (
user_id uuid,
event_time timestamp,
data text,
PRIMARY KEY (user_id, event_time)
);
-- Partition grows forever
Solution: Add time bucketing
-- Good: Events bucketed by day
CREATE TABLE user_events (
user_id uuid,
day date,
event_time timestamp,
data text,
PRIMARY KEY ((user_id, day), event_time)
);
-- Partition limited to one day of events
Problem Pattern 2: Wide rows with many columns
-- Bad: Thousands of columns per row
CREATE TABLE sensor_data (
sensor_id uuid,
metric_name text,
value double,
PRIMARY KEY (sensor_id, metric_name)
);
-- Can have millions of metrics per sensor
Solution: Add bucketing or limit scope
-- Good: Bucket by time period
CREATE TABLE sensor_data (
sensor_id uuid,
hour timestamp,
metric_name text,
value double,
PRIMARY KEY ((sensor_id, hour), metric_name)
);
Problem Pattern 3: Collection columns
-- Bad: Large collections
CREATE TABLE users (
user_id uuid PRIMARY KEY,
followers set<uuid> -- Can grow to millions
);
Solution: Use separate table
-- Good: Separate relationship table
CREATE TABLE user_followers (
user_id uuid,
follower_id uuid,
followed_at timestamp,
PRIMARY KEY (user_id, follower_id)
);
Data Migration Strategy¶
To fix existing large partitions:
-- 1. Create new table with better model
CREATE TABLE user_events_v2 (...);
-- 2. Migrate data with bucketing
-- Use Spark, application code, or COPY command
-- 3. Update application to use new table
-- 4. Drop old table after verification
DROP TABLE user_events;
Recovery¶
Verify Partition Sizes¶
# After data model fix, check new partition sizes
nodetool tablestats my_keyspace.my_table_v2 | grep "partition"
Monitor for Recurrence¶
Set up alerting: - Alert on "Writing large partition" log entries - Alert on max partition size > 50MB - Monitor specific problematic partition keys
Partition Size Guidelines¶
| Size | Status | Action |
|---|---|---|
| < 10 MB | Good | No action needed |
| 10-50 MB | Warning | Monitor, plan for growth |
| 50-100 MB | Problem | Redesign data model |
| > 100 MB | Critical | Immediate remediation required |
Estimating Partition Size¶
Partition size ≈ (number of rows) × (average row size)
Row size ≈ sum of column sizes + clustering key size + 23 bytes overhead
Design Targets¶
| Metric | Target |
|---|---|
| Partition size | < 100 MB |
| Rows per partition | < 100,000 |
| Cells per partition | < 100,000 |
Prevention¶
- Design for bounded partitions - Always include time or other limiting factor
- Avoid unbounded collections - Use separate tables instead
- Monitor partition sizes - Alert before they become critical
- Test with realistic data - Load test with expected data volumes
- Review data model changes - Check for partition size impact
Related Commands¶
| Command | Purpose |
|---|---|
nodetool tablestats |
Check partition size metrics |
sstablepartitions |
Find large partitions in SSTables |
nodetool gcstats |
Monitor GC during access |
Related Documentation¶
- Data Modeling - Partition design best practices
- GC Pause Issues - GC problems from large partitions
- Compaction Management - Compaction and large partitions