AxonOps Dropped Messages Dashboard Metrics Mapping¶

This document maps the metrics used in the AxonOps Dropped Messages dashboard.

Dashboard Overview¶

The Dropped Messages dashboard monitors when Cassandra drops messages due to overload or timeout conditions. Dropped messages are a critical indicator of cluster health and performance issues. When Cassandra cannot process messages within configured timeouts, it drops them to prevent system overload.

Metrics Mapping¶

Dropped Message Metrics¶

Dashboard Metric	Description	Attributes
`cas_DroppedMessage_Dropped`	Count of dropped messages by type	`scope` (message type), `function` (Count), `axonfunction` (rate), `dc`, `rack`, `host_id`

Message Types (Scopes)¶

Data Operation Messages¶

Scope	Description	Common Causes
`MUTATION`	Write operations (INSERT, UPDATE, DELETE)	Write overload, slow disks, GC pauses
`COUNTER_MUTATION`	Counter column updates	Similar to MUTATION but for counter operations
`HINT`	Hinted handoff messages	Node recovery backlog, network issues
`READ`	Read operations (SELECT)	Read overload, large partitions, slow queries
`RANGE_SLICE`	Range queries (token ranges)	Large range scans, inefficient queries
`PAGED_RANGE`	Paginated range queries	Similar to RANGE_SLICE but with pagination

Repair and Maintenance Messages¶

Scope	Description	Common Causes
`READ_REPAIR`	Read repair operations	Inconsistent data, repair overload
`BATCH_STORE`	Batch log writes	Batch operation overload
`BATCH_REMOVE`	Batch log cleanup	Batch completion backlog

Internal Messages¶

Scope	Description	Common Causes
`REQUEST_RESPONSE`	Inter-node response messages	Network latency, coordinator overload
`_TRACE`	Tracing messages	Heavy tracing load

Query Examples¶

Dropped Mutations per Second¶

cas_DroppedMessage_Dropped{axonfunction='rate',function='Count',scope='MUTATION',dc=~'$dc',rack=~'$rack',host_id=~'$host_id'}

Dropped Hints per Second¶

cas_DroppedMessage_Dropped{axonfunction='rate',function='Count',scope='HINT',dc=~'$dc',rack=~'$rack',host_id=~'$host_id'}

Dropped Reads per Second¶

cas_DroppedMessage_Dropped{axonfunction='rate',function='Count',scope='READ',dc=~'$dc',rack=~'$rack',host_id=~'$host_id'}

Total Count Queries (not rate)¶

// Counter Mutations
cas_DroppedMessage_Dropped{function='Count',scope='COUNTER_MUTATION',dc=~'$dc',rack=~'$rack',host_id=~'$host_id'}

// Paged Range
cas_DroppedMessage_Dropped{function='Count',scope='PAGED_RANGE',dc=~'$dc',rack=~'$rack',host_id=~'$host_id'}

Panel Organization¶

Dropped Messages Section¶

Row 1: - Dropped Mutation per secs - Write operation drops

Dropped Hints per secs - Hinted handoff drops
Dropped Read per secs - Read operation drops

Row 2: - Dropped Counter Mutation - Counter operation drops (total count)

Dropped Read Repair per secs - Read repair drops
Dropped Paged Range - Paginated range query drops (total count)

Row 3: - Dropped Batch Store - Batch log write drops (total count)

Dropped Batch Remove - Batch log cleanup drops (total count)
Dropped Request Response - Inter-node response drops (total count)

Row 4: - Dropped Range Slice - Range query drops (total count)

Dropped Trace - Tracing message drops (total count)

Filters¶

data center (dc) - Filter by data center
rack - Filter by rack
node (host_id) - Filter by specific node
groupBy - Dynamic grouping (dc, rack, host_id, keyspace)

Understanding Dropped Messages¶

Why Messages Are Dropped¶

Timeout: Message exceeds configured timeout
Queue Full: Internal queue reaches capacity
Overload: Node cannot keep up with request rate
Resource Constraints: Memory, CPU, or I/O limitations

Message Type Timeouts (Default)¶

MUTATION: 5000ms (write_request_timeout_in_ms)
READ: 5000ms (read_request_timeout_in_ms)
RANGE_SLICE: 10000ms (range_request_timeout_in_ms)
COUNTER_MUTATION: 5000ms (counter_write_request_timeout_in_ms)
REQUEST_RESPONSE: 10000ms (request_timeout_in_ms)

Impact of Dropped Messages¶

Dropped Mutations:

Write failures at consistency level
Potential data loss if hints also dropped
Client receives timeout exceptions

Dropped Reads:

Read timeouts for clients
Incomplete query results
Application errors

Dropped Hints:

Delayed consistency
Requires repair to fix
Indicates replica communication issues

Dropped Read Repairs:

Inconsistencies persist longer
Manual repair may be needed
Background repair falling behind

Troubleshooting Guide¶

High Dropped Mutations¶

Check disk I/O performance
Monitor GC pauses
Review write load distribution
Consider increasing timeout
Check for large batches

High Dropped Reads¶

Look for large partitions
Check read patterns
Monitor CPU usage
Review query efficiency
Consider read timeout increase

High Dropped Hints¶

Check node availability
Monitor network health
Review hint storage capacity
Check for overloaded nodes
Consider hint delivery throttling

General Recommendations¶

Zero Tolerance: Aim for zero dropped messages
Early Warning: Any drops indicate problems
Root Cause: Always investigate underlying cause
Capacity Planning: Drops often indicate need for scaling

Units and Display¶

Rate Metrics: messages per second (short)
Count Metrics: absolute count (short)
Legend Format: $dc - $host_id

Best Practices¶

Monitor Continuously:

Set alerts for any dropped messages
Track trends over time
Correlate with other metrics

Investigate Immediately:

Dropped messages indicate serious issues
Check system resources
Review recent changes

Preventive Measures:

Proper capacity planning
Regular performance tuning
Appropriate timeout configuration
Load testing before production

Notes¶

Some panels show rate (axonfunction='rate'), others show total count
Rate metrics are more useful for real-time monitoring
Total counts help understand historical impact
The _TRACE scope has underscore prefix in the actual metric