# Cassandra Alerting Configuration
Effective alerting enables proactive response to cluster issues before they impact application performance.
## Alert Categories
| Category | Priority | Response Time | Examples |
|---|---|---|---|
| Critical | P1 | Immediate (< 5 min) | Node down, disk full, dropped messages |
| Warning | P2 | Soon (< 1 hour) | High latency, compaction backlog |
| Info | P3 | Next business day | Certificate expiring, minor threshold breach |
## Critical Alerts
### Node Down
A node becomes unreachable, reducing cluster capacity and fault tolerance.
| Parameter | Value |
|---|---|
| Metric | Node status via gossip or agent heartbeat |
| Condition | Unreachable for > 1 minute |
| Severity | Critical |
Response:
1. Verify node status with `nodetool status` from another node
2. Check if process is running on the affected node
3. Review system logs for crash indicators
4. Check network connectivity
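For reference, a minimal detection sketch in Python, assuming `nodetool` is on the PATH of the host running the check; it flags any node that gossip reports as down:
```python
"""Minimal sketch: flag unreachable nodes by parsing `nodetool status` output."""
import subprocess

# Two-letter codes printed by nodetool: U(p)/D(own) followed by N(ormal)/L(eaving)/J(oining)/M(oving)
NODE_STATES = {"UN", "UL", "UJ", "UM", "DN", "DL", "DJ", "DM"}

def down_nodes() -> list[str]:
    """Return the addresses of nodes reported as Down by gossip."""
    output = subprocess.run(["nodetool", "status"], capture_output=True, text=True, check=True).stdout
    down = []
    for line in output.splitlines():
        tokens = line.split()
        if tokens and tokens[0] in NODE_STATES and tokens[0].startswith("D"):
            down.append(tokens[1])  # second column is the node address
    return down

if __name__ == "__main__":
    unreachable = down_nodes()
    if unreachable:
        # Critical: page immediately (P1, < 5 minute response)
        print(f"CRITICAL: nodes down: {', '.join(unreachable)}")
```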
### High Heap Usage
Sustained high heap usage indicates memory pressure and increases GC pause risk.
| Parameter | Value |
|---|---|
| Metric | java.lang:type=Memory/HeapMemoryUsage |
| Condition | > 85% for 5 minutes |
| Severity | Critical |
Response:
1. Check GC logs for long pauses
2. Review `nodetool tpstats` for blocked stages
3. Consider reducing concurrent operations
4. Evaluate heap size configuration
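A minimal check sketch, assuming the standard `Heap Memory (MB) : used / max` line in `nodetool info` output (the same figures are exposed over JMX through the java.lang:type=Memory HeapMemoryUsage attribute); a real alert should also require the condition to persist for the five minutes stated above:
```python
"""Minimal sketch: compute heap utilisation from `nodetool info`."""
import subprocess

def heap_percent() -> float:
    output = subprocess.run(["nodetool", "info"], capture_output=True, text=True, check=True).stdout
    for line in output.splitlines():
        if line.strip().startswith("Heap Memory (MB)"):
            used, maximum = line.split(":", 1)[1].split("/")
            return float(used) / float(maximum) * 100.0
    raise RuntimeError("Heap Memory line not found in nodetool info output")

if __name__ == "__main__":
    pct = heap_percent()
    if pct > 85:    # critical threshold from the table above
        print(f"CRITICAL: heap at {pct:.1f}%")
    elif pct > 70:  # warning threshold
        print(f"WARNING: heap at {pct:.1f}%")
```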
### Disk Full
A full data or commit log volume prevents memtable flushes and commit log writes, causing the node to reject writes or shut down depending on the configured failure policies.
| Parameter | Value |
|---|---|
| Metric | Filesystem usage on data directories |
| Condition | > 85% used |
| Severity | Critical |
Response:
1. Identify the largest tables with `nodetool tablestats`
2. Run `nodetool cleanup` if nodes were recently added (it removes data this node no longer owns)
3. Check for compaction backlog
4. Add capacity or remove data
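A minimal sketch of the underlying check using Python's standard library; the data directory path is the default package location and should be replaced with the data_file_directories from cassandra.yaml:
```python
"""Minimal sketch: check data-directory usage against the 85% threshold."""
import shutil

DATA_DIRECTORIES = ["/var/lib/cassandra/data"]  # assumption: default package layout
THRESHOLD_PCT = 85

for path in DATA_DIRECTORIES:
    usage = shutil.disk_usage(path)
    pct = usage.used / usage.total * 100.0
    if pct > THRESHOLD_PCT:
        print(f"CRITICAL: {path} at {pct:.1f}% used")
```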
### Dropped Messages
Dropped messages indicate that a node could not process internal messages within the configured timeouts, so the corresponding reads or writes were discarded.
| Parameter | Value |
|---|---|
| Metric | org.apache.cassandra.metrics:type=DroppedMessage |
| Condition | Any dropped messages |
| Severity | Critical |
Response:
1. Identify which message types are dropping via `nodetool tpstats`
2. Check for overloaded thread pools
3. Review client request patterns
4. Evaluate cluster capacity
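A minimal sketch that surfaces non-zero dropped counters by parsing `nodetool tpstats`; note that these counters are cumulative since the node started, so a production alert should fire on their rate of increase rather than the absolute value:
```python
"""Minimal sketch: report non-zero dropped message counters from `nodetool tpstats`.

Output layout varies slightly between Cassandra versions; this assumes a section headed
"Message type ... Dropped" with the type in the first column and the count in the second.
"""
import subprocess

def dropped_messages() -> dict[str, int]:
    output = subprocess.run(["nodetool", "tpstats"], capture_output=True, text=True, check=True).stdout
    dropped = {}
    in_section = False
    for line in output.splitlines():
        if "Message type" in line and "Dropped" in line:
            in_section = True
            continue
        if in_section:
            tokens = line.split()
            if len(tokens) >= 2 and tokens[1].isdigit() and int(tokens[1]) > 0:
                dropped[tokens[0]] = int(tokens[1])
    return dropped

if __name__ == "__main__":
    for message_type, count in dropped_messages().items():
        print(f"CRITICAL: {count} {message_type} messages dropped")
```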
## Warning Alerts
### High Read Latency
Elevated read latency impacts application response times.
| Parameter | Value |
|---|---|
| Metric | ClientRequest.Read.Latency.99thPercentile |
| Condition | > 100ms for 10 minutes |
| Severity | Warning |
Response:
1. Check `nodetool proxyhistograms` for coordinator vs local latency
2. Review `nodetool tablestats` for problematic tables
3. Check pending compactions
4. Evaluate read patterns and data model
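A minimal sketch of reading this metric directly over JMX, assuming a Jolokia agent on port 8778 (not part of a default install; AxonOps collects the same metric through its own agent):
```python
"""Minimal sketch: read the p99 client read latency over JMX via a Jolokia agent."""
import json
import urllib.request

JOLOKIA = "http://localhost:8778/jolokia/read"
MBEAN = "org.apache.cassandra.metrics:type=ClientRequest,scope=Read,name=Latency"

def read_latency_p99_ms() -> float:
    url = f"{JOLOKIA}/{MBEAN}/99thPercentile"
    with urllib.request.urlopen(url) as response:
        payload = json.load(response)
    return payload["value"] / 1000.0  # timer attributes are reported in microseconds

if __name__ == "__main__":
    p99 = read_latency_p99_ms()
    if p99 > 100:  # warning threshold above; require 10 minutes of persistence in practice
        print(f"WARNING: read latency p99 at {p99:.1f} ms")
```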
### High Write Latency
Elevated write latency may indicate commit log or memtable pressure.
| Parameter | Value |
|---|---|
| Metric | ClientRequest.Write.Latency.99thPercentile |
| Condition | > 50ms for 10 minutes |
| Severity | Warning |
Response:
1. Check commit log directory disk I/O
2. Review memtable flush frequency
3. Verify the commit log is on a separate disk from the data directories
4. Check for concurrent compaction activity
### Compaction Backlog
Pending compactions indicate write volume exceeds compaction throughput.
| Parameter | Value |
|---|---|
| Metric | org.apache.cassandra.metrics:type=Compaction,name=PendingTasks |
| Condition | > 30 for 15 minutes |
| Severity | Warning |
Response:
1. Review compaction throughput settings
2. Check disk I/O utilization
3. Evaluate compaction strategy for workload
4. Consider increasing `concurrent_compactors`
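A minimal sketch that reads the pending count from `nodetool compactionstats`; the same value is available over JMX as the PendingTasks metric shown above:
```python
"""Minimal sketch: extract the pending compaction count from `nodetool compactionstats`."""
import re
import subprocess

def pending_compactions() -> int:
    output = subprocess.run(["nodetool", "compactionstats"], capture_output=True, text=True, check=True).stdout
    match = re.search(r"pending tasks:\s*(\d+)", output)
    if match is None:
        raise RuntimeError("pending tasks line not found")
    return int(match.group(1))

if __name__ == "__main__":
    pending = pending_compactions()
    if pending > 30:  # warning threshold from the table above (sustained for 15 minutes)
        print(f"WARNING: {pending} pending compactions")
```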
### Heap Usage Warning
Early warning of approaching memory pressure.
| Parameter | Value |
|---|---|
| Metric | java.lang:type=Memory/HeapMemoryUsage |
| Condition | > 70% for 10 minutes |
| Severity | Warning |
Response:
1. Monitor GC activity trends
2. Review recent workload changes
3. Check for large partition access patterns
4. Prepare for potential heap tuning
## Alert Thresholds Summary
| Metric | Warning | Critical | JMX Path |
|---|---|---|---|
| Heap Usage | > 70% | > 85% | java.lang:type=Memory |
| Disk Usage | > 60% | > 80% | OS-level metric |
| Read Latency p99 | > 50ms | > 200ms | ClientRequest.Read.Latency |
| Write Latency p99 | > 20ms | > 100ms | ClientRequest.Write.Latency |
| Pending Compactions | > 20 | > 50 | Compaction.PendingTasks |
| Dropped Messages | > 0 | > 10/s | DroppedMessage.*.Dropped |
| GC Pause | > 500ms | > 1s | java.lang:type=GarbageCollector |
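As an illustration, the summary table can be expressed as data with a two-level evaluation; the metric keys and sample reading below are illustrative:
```python
"""Minimal sketch: the summary table above as data, with warning/critical evaluation."""
THRESHOLDS = {
    # metric: (warning, critical) -- units follow the table above
    "heap_pct": (70, 85),
    "disk_pct": (60, 80),
    "read_latency_p99_ms": (50, 200),
    "write_latency_p99_ms": (20, 100),
    "pending_compactions": (20, 50),
    "gc_pause_ms": (500, 1000),
    # Dropped messages are better handled as a separate, rate-based check.
}

def evaluate(metric: str, value: float) -> str | None:
    """Return 'critical', 'warning', or None for a single sample."""
    warning, critical = THRESHOLDS[metric]
    if value >= critical:
        return "critical"
    if value >= warning:
        return "warning"
    return None

# Example: a heap reading of 78% breaches the warning level but not the critical one.
assert evaluate("heap_pct", 78) == "warning"
```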
## AxonOps Alert Configuration
### Creating Alerts
Alerts are configured in the AxonOps dashboard under Settings > Alerts.
1. Navigate to Settings > Alerts > Create Alert
2. Select the metric to monitor
3. Define threshold conditions
4. Set evaluation interval and duration
5. Configure notification channels
### Alert Rule Parameters
| Parameter | Description |
|---|---|
| Metric | The JMX metric or derived value to evaluate |
| Operator | Comparison operator (>, <, =, !=) |
| Threshold | Value that triggers the alert |
| Duration | Time the condition must persist |
| Severity | Critical, Warning, or Info |
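For illustration only (this is not the AxonOps rule format), the parameters map naturally onto a small data structure:
```python
"""Minimal sketch: the rule parameters above modelled as a dataclass."""
import operator
from dataclasses import dataclass

# Comparison operators supported by the table above.
OPERATORS = {">": operator.gt, "<": operator.lt, "=": operator.eq, "!=": operator.ne}

@dataclass
class AlertRule:
    metric: str             # JMX metric or derived value to evaluate
    op: str                 # one of >, <, =, !=
    threshold: float        # value that triggers the alert
    duration_minutes: int   # how long the condition must persist
    severity: str           # Critical, Warning, or Info

    def breached(self, value: float) -> bool:
        """True when a single sample crosses the threshold (persistence is tracked elsewhere)."""
        return OPERATORS[self.op](value, self.threshold)

# Example instance (values mirror the High Read Latency example in the next section):
high_read_latency = AlertRule("cassandra.client_request.read.latency.p99", ">", 100, 10, "Warning")
```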
### Example: High Latency Alert
Name: High Read Latency
Metric: cassandra.client_request.read.latency.p99
Condition: > 100
Duration: 10 minutes
Severity: Warning
### Notification Channels
AxonOps supports multiple notification integrations:
| Channel | Use Case |
|---|---|
| Email | General notifications |
| Slack | Team collaboration |
| PagerDuty | On-call escalation |
| Webhook | Custom integrations |
| OpsGenie | Incident management |
Configure channels under Settings > Integrations.
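For testing a channel end to end, a minimal sketch that posts to a Slack incoming webhook (the URL is a placeholder; AxonOps sends these notifications itself once the integration is configured):
```python
"""Minimal sketch: push an alert message to a Slack incoming webhook."""
import json
import urllib.request

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

def notify_slack(text: str) -> None:
    request = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=json.dumps({"text": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        response.read()

notify_slack("WARNING: read latency p99 above 100 ms for 10 minutes")
```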
## Alert Routing
### Severity-Based Routing
| Severity | Notification |
|---|---|
| Critical | PagerDuty + Slack |
| Warning | Slack only |
| Info | Email digest |
### Time-Based Routing
Configure different notification behavior for business hours vs off-hours:
- Business hours: Warning alerts to Slack
- Off-hours: Only critical alerts to PagerDuty
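A minimal sketch of the combined routing logic; the channel names and business-hours window are assumptions, and AxonOps applies equivalent routing through its integration settings rather than user code:
```python
"""Minimal sketch: severity-based routing with business-hours awareness."""
from datetime import datetime

def route(severity: str, now: datetime | None = None) -> list[str]:
    now = now or datetime.now()
    business_hours = now.weekday() < 5 and 9 <= now.hour < 17  # assumption: Mon-Fri, 09:00-17:00
    if severity == "Critical":
        return ["pagerduty", "slack"]
    if severity == "Warning":
        return ["slack"] if business_hours else []  # off-hours: only critical alerts page
    return ["email-digest"]
```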
## Runbook Integration
Each alert should reference a runbook with:
- Description: What the alert means
- Impact: How this affects the cluster/application
- Diagnosis: Steps to investigate
- Resolution: Actions to resolve
- Escalation: When and how to escalate
## Best Practices
### Threshold Tuning
- Establish baselines during normal operation
- Set warning thresholds at 2x baseline
- Set critical thresholds at 4x baseline or operational limits
- Review and adjust quarterly
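A minimal sketch of deriving thresholds from a measured baseline using the 2x/4x multipliers above, clamped against a hard operational limit:
```python
"""Minimal sketch: baseline-derived warning/critical thresholds."""
def thresholds_from_baseline(baseline: float, hard_limit: float | None = None) -> tuple[float, float]:
    warning = baseline * 2
    critical = baseline * 4
    if hard_limit is not None:
        critical = min(critical, hard_limit)  # never exceed the operational limit
    return warning, critical

# Example: a 20 ms p99 read-latency baseline gives 40 ms warning / 80 ms critical.
print(thresholds_from_baseline(20.0))
```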
### Reducing Alert Fatigue
- Require duration before alerting (avoid transient spikes)
- Group related alerts
- Use warning alerts for early indicators
- Reserve critical for immediate action required
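A minimal sketch of the "require duration before alerting" idea: fire only after N consecutive breaching samples, so a single transient spike never pages anyone:
```python
"""Minimal sketch: suppress alerts until a condition persists for N consecutive checks."""
from collections import deque

class PersistentCondition:
    def __init__(self, required_samples: int):
        self.samples = deque(maxlen=required_samples)

    def update(self, breached: bool) -> bool:
        """Record one evaluation; return True only when every recent sample breached."""
        self.samples.append(breached)
        return len(self.samples) == self.samples.maxlen and all(self.samples)

# With 1-minute checks, 10 samples approximates the "10 minutes" durations used above.
check = PersistentCondition(required_samples=10)
```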
### Alert Coverage
Minimum alert coverage for production clusters:
- [ ] Node availability
- [ ] Heap usage (warning and critical)
- [ ] Disk usage (warning and critical)
- [ ] Read/write latency
- [ ] Dropped messages
- [ ] Pending compactions
- [ ] GC pause duration
## Next Steps
- Key Metrics - Metrics reference
- Logging - Log analysis and configuration