# Cassandra Alerting Configuration
Effective alerting enables proactive response to cluster issues before they impact application performance.
## Alert Categories
| Category | Priority | Response Time | Examples |
|---|---|---|---|
| Critical | P1 | Immediate (< 5 min) | Node down, disk full, dropped messages |
| Warning | P2 | Soon (< 1 hour) | High latency, compaction backlog |
| Info | P3 | Next business day | Certificate expiring, minor threshold breach |
## Critical Alerts
### Node Down
A node becomes unreachable, reducing cluster capacity and fault tolerance.
| Parameter | Value |
|---|---|
| Metric | Node status via gossip or agent heartbeat |
| Condition | Unreachable for > 1 minute |
| Severity | Critical |
Response:
1. Verify node status with `nodetool status` from another node
2. Check if process is running on the affected node
3. Review system logs for crash indicators
4. Check network connectivity
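For reference, a minimal detection sketch in Python, assuming `nodetool` is on the PATH of the host running the check; it flags any node that gossip reports as down:
```python
"""Minimal sketch: flag unreachable nodes by parsing `nodetool status` output."""
import subprocess

# Two-letter codes printed by nodetool: U(p)/D(own) followed by N(ormal)/L(eaving)/J(oining)/M(oving)
NODE_STATES = {"UN", "UL", "UJ", "UM", "DN", "DL", "DJ", "DM"}

def down_nodes() -> list[str]:
    """Return the addresses of nodes reported as Down by gossip."""
    output = subprocess.run(["nodetool", "status"], capture_output=True, text=True, check=True).stdout
    down = []
    for line in output.splitlines():
        tokens = line.split()
        if tokens and tokens[0] in NODE_STATES and tokens[0].startswith("D"):
            down.append(tokens[1])  # second column is the node address
    return down

if __name__ == "__main__":
    unreachable = down_nodes()
    if unreachable:
        # Critical: page immediately (P1, < 5 minute response)
        print(f"CRITICAL: nodes down: {', '.join(unreachable)}")
```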
### High Heap Usage
Sustained high heap usage indicates memory pressure and increases GC pause risk.
| Parameter | Value |
|---|---|
| Metric | java.lang:type=Memory/HeapMemoryUsage |
| Condition | > 85% for 5 minutes |
| Severity | Critical |
Response:
1. Check GC logs for long pauses
2. Review `nodetool tpstats` for blocked stages
3. Consider reducing concurrent operations
4. Evaluate heap size configuration
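A minimal check sketch, assuming the standard `Heap Memory (MB) : used / max` line in `nodetool info` output (the same figures are exposed over JMX through the java.lang:type=Memory HeapMemoryUsage attribute); a real alert should also require the condition to persist for the five minutes stated above:
```python
"""Minimal sketch: compute heap utilisation from `nodetool info`."""
import subprocess

def heap_percent() -> float:
    output = subprocess.run(["nodetool", "info"], capture_output=True, text=True, check=True).stdout
    for line in output.splitlines():
        if line.strip().startswith("Heap Memory (MB)"):
            used, maximum = line.split(":", 1)[1].split("/")
            return float(used) / float(maximum) * 100.0
    raise RuntimeError("Heap Memory line not found in nodetool info output")

if __name__ == "__main__":
    pct = heap_percent()
    if pct > 85:    # critical threshold from the table above
        print(f"CRITICAL: heap at {pct:.1f}%")
    elif pct > 70:  # warning threshold
        print(f"WARNING: heap at {pct:.1f}%")
```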
### Disk Full
A full data or commit log volume prevents memtable flushes and commit log writes, causing the node to reject writes or shut down depending on the configured failure policies.
| Parameter | Value |
|---|---|
| Metric | Filesystem usage on data directories |
| Condition | > 85% used |
| Severity | Critical |
Response:
1. Identify the largest tables with `nodetool tablestats`
2. Run `nodetool cleanup` if nodes were recently added (it removes data this node no longer owns)
3. Check for compaction backlog
4. Add capacity or remove data
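A minimal sketch of the underlying check using Python's standard library; the data directory path is the default package location and should be replaced with the data_file_directories from cassandra.yaml:
```python
"""Minimal sketch: check data-directory usage against the 85% threshold."""
import shutil

DATA_DIRECTORIES = ["/var/lib/cassandra/data"]  # assumption: default package layout
THRESHOLD_PCT = 85

for path in DATA_DIRECTORIES:
    usage = shutil.disk_usage(path)
    pct = usage.used / usage.total * 100.0
    if pct > THRESHOLD_PCT:
        print(f"CRITICAL: {path} at {pct:.1f}% used")
```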
### Dropped Messages
Dropped messages indicate that a node could not process internal messages within the configured timeouts, so the corresponding reads or writes were discarded.
| Parameter | Value |
|---|---|
| Metric | org.apache.cassandra.metrics:type=DroppedMessage |
| Condition | Any dropped messages |
| Severity | Critical |
Response:
1. Identify which message types are dropping via `nodetool tpstats`
2. Check for overloaded thread pools
3. Review client request patterns
4. Evaluate cluster capacity
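A minimal sketch that surfaces non-zero dropped counters by parsing `nodetool tpstats`; note that these counters are cumulative since the node started, so a production alert should fire on their rate of increase rather than the absolute value:
```python
"""Minimal sketch: report non-zero dropped message counters from `nodetool tpstats`.

Output layout varies slightly between Cassandra versions; this assumes a section headed
"Message type ... Dropped" with the type in the first column and the count in the second.
"""
import subprocess

def dropped_messages() -> dict[str, int]:
    output = subprocess.run(["nodetool", "tpstats"], capture_output=True, text=True, check=True).stdout
    dropped = {}
    in_section = False
    for line in output.splitlines():
        if "Message type" in line and "Dropped" in line:
            in_section = True
            continue
        if in_section:
            tokens = line.split()
            if len(tokens) >= 2 and tokens[1].isdigit() and int(tokens[1]) > 0:
                dropped[tokens[0]] = int(tokens[1])
    return dropped

if __name__ == "__main__":
    for message_type, count in dropped_messages().items():
        print(f"CRITICAL: {count} {message_type} messages dropped")
```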
## Warning Alerts
### High Read Latency
Elevated read latency impacts application response times.
| Parameter | Value |
|---|---|
| Metric | ClientRequest.Read.Latency.99thPercentile |
| Condition | > 100ms for 10 minutes |
| Severity | Warning |
Response:
1. Check `nodetool proxyhistograms` for coordinator vs local latency
2. Review `nodetool tablestats` for problematic tables
3. Check pending compactions
4. Evaluate read patterns and data model
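A minimal sketch of reading this metric directly over JMX, assuming a Jolokia agent on port 8778 (not part of a default install; AxonOps collects the same metric through its own agent):
```python
"""Minimal sketch: read the p99 client read latency over JMX via a Jolokia agent."""
import json
import urllib.request

JOLOKIA = "http://localhost:8778/jolokia/read"
MBEAN = "org.apache.cassandra.metrics:type=ClientRequest,scope=Read,name=Latency"

def read_latency_p99_ms() -> float:
    url = f"{JOLOKIA}/{MBEAN}/99thPercentile"
    with urllib.request.urlopen(url) as response:
        payload = json.load(response)
    return payload["value"] / 1000.0  # timer attributes are reported in microseconds

if __name__ == "__main__":
    p99 = read_latency_p99_ms()
    if p99 > 100:  # warning threshold above; require 10 minutes of persistence in practice
        print(f"WARNING: read latency p99 at {p99:.1f} ms")
```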
### High Write Latency
Elevated write latency may indicate commit log or memtable pressure.
| Parameter | Value |
|---|---|
| Metric | ClientRequest.Write.Latency.99thPercentile |
| Condition | > 50ms for 10 minutes |
| Severity | Warning |
Response:
1. Check commit log directory disk I/O
2. Review memtable flush frequency
3. Verify the commit log is on a separate disk from the data directories
4. Check for concurrent compaction activity
### Compaction Backlog
Pending compactions indicate write volume exceeds compaction throughput.
| Parameter | Value |
|---|---|
| Metric | org.apache.cassandra.metrics:type=Compaction,name=PendingTasks |
| Condition | > 30 for 15 minutes |
| Severity | Warning |
Response:
1. Review compaction throughput settings
2. Check disk I/O utilization
3. Evaluate compaction strategy for workload
4. Consider increasing `concurrent_compactors`
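A minimal sketch that reads the pending count from `nodetool compactionstats`; the same value is available over JMX as the PendingTasks metric shown above:
```python
"""Minimal sketch: extract the pending compaction count from `nodetool compactionstats`."""
import re
import subprocess

def pending_compactions() -> int:
    output = subprocess.run(["nodetool", "compactionstats"], capture_output=True, text=True, check=True).stdout
    match = re.search(r"pending tasks:\s*(\d+)", output)
    if match is None:
        raise RuntimeError("pending tasks line not found")
    return int(match.group(1))

if __name__ == "__main__":
    pending = pending_compactions()
    if pending > 30:  # warning threshold from the table above (sustained for 15 minutes)
        print(f"WARNING: {pending} pending compactions")
```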
### Heap Usage Warning
Early warning of approaching memory pressure.
| Parameter | Value |
|---|---|
| Metric | java.lang:type=Memory/HeapMemoryUsage |
| Condition | > 70% for 10 minutes |
| Severity | Warning |
Response:
1. Monitor GC activity trends
2. Review recent workload changes
3. Check for large partition access patterns
4. Prepare for potential heap tuning
## Alert Thresholds Summary
| Metric | Warning | Critical | JMX Path |
|---|---|---|---|
| Heap Usage | > 70% | > 85% | java.lang:type=Memory |
| Disk Usage | > 60% | > 80% | OS-level metric |
| Read Latency p99 | > 50ms | > 200ms | ClientRequest.Read.Latency |
| Write Latency p99 | > 20ms | > 100ms | ClientRequest.Write.Latency |
| Pending Compactions | > 20 | > 50 | Compaction.PendingTasks |
| Dropped Messages | > 0 | > 10/s | DroppedMessage.*.Dropped |
| GC Pause | > 500ms | > 1s | java.lang:type=GarbageCollector |
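As an illustration, the summary table can be expressed as data with a two-level evaluation; the metric keys and sample reading below are illustrative:
```python
"""Minimal sketch: the summary table above as data, with warning/critical evaluation."""
THRESHOLDS = {
    # metric: (warning, critical) -- units follow the table above
    "heap_pct": (70, 85),
    "disk_pct": (60, 80),
    "read_latency_p99_ms": (50, 200),
    "write_latency_p99_ms": (20, 100),
    "pending_compactions": (20, 50),
    "gc_pause_ms": (500, 1000),
    # Dropped messages are better handled as a separate, rate-based check.
}

def evaluate(metric: str, value: float) -> str | None:
    """Return 'critical', 'warning', or None for a single sample."""
    warning, critical = THRESHOLDS[metric]
    if value >= critical:
        return "critical"
    if value >= warning:
        return "warning"
    return None

# Example: a heap reading of 78% breaches the warning level but not the critical one.
assert evaluate("heap_pct", 78) == "warning"
```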
## AxonOps Alert Configuration
### Creating Alerts
Alerts are configured in the AxonOps dashboard under Settings > Alerts.
1. Navigate to Settings > Alerts > Create Alert
2. Select the metric to monitor
3. Define threshold conditions
4. Set evaluation interval and duration
5. Configure notification channels
### Alert Rule Parameters
| Parameter | Description |
|---|---|
| Metric | The JMX metric or derived value to evaluate |
| Operator | Comparison operator (>, <, =, !=) |
| Threshold | Value that triggers the alert |
| Duration | Time the condition must persist |
| Severity | Critical, Warning, or Info |
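For illustration only (this is not the AxonOps rule format), the parameters map naturally onto a small data structure:
```python
"""Minimal sketch: the rule parameters above modelled as a dataclass."""
import operator
from dataclasses import dataclass

# Comparison operators supported by the table above.
OPERATORS = {">": operator.gt, "<": operator.lt, "=": operator.eq, "!=": operator.ne}

@dataclass
class AlertRule:
    metric: str             # JMX metric or derived value to evaluate
    op: str                 # one of >, <, =, !=
    threshold: float        # value that triggers the alert
    duration_minutes: int   # how long the condition must persist
    severity: str           # Critical, Warning, or Info

    def breached(self, value: float) -> bool:
        """True when a single sample crosses the threshold (persistence is tracked elsewhere)."""
        return OPERATORS[self.op](value, self.threshold)

# Example instance (values mirror the High Read Latency example in the next section):
high_read_latency = AlertRule("cassandra.client_request.read.latency.p99", ">", 100, 10, "Warning")
```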
### Example: High Latency Alert
Name: High Read Latency
Metric: cassandra.client_request.read.latency.p99
Condition: > 100
Duration: 10 minutes
Severity: Warning
### Notification Channels
AxonOps supports multiple notification integrations:
| Channel | Use Case |
|---|---|
| Email | General notifications |
| Slack | Team collaboration |
| PagerDuty | On-call escalation |
| Webhook | Custom integrations |
| OpsGenie | Incident management |
Configure channels under Settings > Integrations.
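For testing a channel end to end, a minimal sketch that posts to a Slack incoming webhook (the URL is a placeholder; AxonOps sends these notifications itself once the integration is configured):
```python
"""Minimal sketch: push an alert message to a Slack incoming webhook."""
import json
import urllib.request

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

def notify_slack(text: str) -> None:
    request = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=json.dumps({"text": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        response.read()

notify_slack("WARNING: read latency p99 above 100 ms for 10 minutes")
```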
## Alert Routing
### Severity-Based Routing
| Severity | Notification |
|---|---|
| Critical | PagerDuty + Slack |
| Warning | Slack only |
| Info | Email digest |
### Time-Based Routing
Configure different notification behavior for business hours vs off-hours:
- Business hours: Warning alerts to Slack
- Off-hours: Only critical alerts to PagerDuty
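A minimal sketch of the combined routing logic; the channel names and business-hours window are assumptions, and AxonOps applies equivalent routing through its integration settings rather than user code:
```python
"""Minimal sketch: severity-based routing with business-hours awareness."""
from datetime import datetime

def route(severity: str, now: datetime | None = None) -> list[str]:
    now = now or datetime.now()
    business_hours = now.weekday() < 5 and 9 <= now.hour < 17  # assumption: Mon-Fri, 09:00-17:00
    if severity == "Critical":
        return ["pagerduty", "slack"]
    if severity == "Warning":
        return ["slack"] if business_hours else []  # off-hours: only critical alerts page
    return ["email-digest"]
```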
## Runbook Integration
Each alert should reference a runbook with:
- Description: What the alert means
- Impact: How this affects the cluster/application
- Diagnosis: Steps to investigate
- Resolution: Actions to resolve
- Escalation: When and how to escalate
## Best Practices
### Threshold Tuning
- Establish baselines during normal operation
- Set warning thresholds at 2x baseline
- Set critical thresholds at 4x baseline or operational limits
- Review and adjust quarterly
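A minimal sketch of deriving thresholds from a measured baseline using the 2x/4x multipliers above, clamped against a hard operational limit:
```python
"""Minimal sketch: baseline-derived warning/critical thresholds."""
def thresholds_from_baseline(baseline: float, hard_limit: float | None = None) -> tuple[float, float]:
    warning = baseline * 2
    critical = baseline * 4
    if hard_limit is not None:
        critical = min(critical, hard_limit)  # never exceed the operational limit
    return warning, critical

# Example: a 20 ms p99 read-latency baseline gives 40 ms warning / 80 ms critical.
print(thresholds_from_baseline(20.0))
```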
### Reducing Alert Fatigue
- Require duration before alerting (avoid transient spikes)
- Group related alerts
- Use warning alerts for early indicators
- Reserve critical for immediate action required
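A minimal sketch of the "require duration before alerting" idea: fire only after N consecutive breaching samples, so a single transient spike never pages anyone:
```python
"""Minimal sketch: suppress alerts until a condition persists for N consecutive checks."""
from collections import deque

class PersistentCondition:
    def __init__(self, required_samples: int):
        self.samples = deque(maxlen=required_samples)

    def update(self, breached: bool) -> bool:
        """Record one evaluation; return True only when every recent sample breached."""
        self.samples.append(breached)
        return len(self.samples) == self.samples.maxlen and all(self.samples)

# With 1-minute checks, 10 samples approximates the "10 minutes" durations used above.
check = PersistentCondition(required_samples=10)
```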
### Alert Coverage
Minimum alert coverage for production clusters:
- [ ] Node availability
- [ ] Heap usage (warning and critical)
- [ ] Disk usage (warning and critical)
- [ ] Read/write latency
- [ ] Dropped messages
- [ ] Pending compactions
- [ ] GC pause duration
## Next Steps
- Key Metrics - Metrics reference
- Logging - Log analysis and configuration