AxonOps Entropy Dashboard Metrics Mapping¶

This document maps the metrics used in the AxonOps Entropy dashboard.

Dashboard Overview¶

The Entropy dashboard (also known as Anti-Entropy) monitors Cassandra's data consistency mechanisms including hinted handoff, read repairs, and repair operations. These features ensure eventual consistency across the cluster by detecting and fixing data inconsistencies.

Metrics Mapping¶

Hints Metrics¶

Dashboard Metric	Description	Attributes
`cas_Storage_TotalHints`	Total number of hints created	`axonfunction` (rate), `dc`, `rack`, `host_id`
`cas_Storage_TotalHintsInProgress`	Currently active hints being delivered	`dc`, `rack`, `host_id`

Read Repair Metrics¶

Dashboard Metric	Description	Attributes
`cas_ReadRepair_Attempted`	Read repair attempts	`function` (Count), `axonfunction` (rate), `dc`, `rack`, `host_id`
`cas_ReadRepair_RepairedBackground`	Background read repairs completed	`function` (Count), `axonfunction` (rate), `dc`, `rack`, `host_id`
`cas_ReadRepair_RepairedBlocking`	Blocking read repairs completed	`function` (Count), `axonfunction` (rate), `dc`, `rack`, `host_id`

Coordinator Error Metrics¶

Dashboard Metric	Description	Attributes
`cas_ClientRequest_Timeouts`	Request timeouts at coordinator	`scope` (Read/Write), `function` (Count), `axonfunction` (rate), `dc`, `rack`, `host_id`
`cas_ClientRequest_Unavailables`	Unavailable exceptions at coordinator	`scope` (Read/Write), `function` (Count), `axonfunction` (rate), `dc`, `rack`, `host_id`

Thread Pool Metrics¶

Dashboard Metric	Description	Attributes
`cas_ThreadPools_request`	Request thread pool statistics	`scope` (pool name), `key` (CompletedTasks), `axonfunction` (rate), `dc`, `rack`, `host_id`
`cas_ThreadPools_internal`	Internal thread pool statistics	`scope` (pool name), `key` (CompletedTasks), `axonfunction` (rate), `dc`, `rack`, `host_id`

Query Examples¶

Hints Section¶

// Total Hints Created Rate
cas_Storage_TotalHints{axonfunction='rate',dc=~'$dc',rack=~'$rack',host_id=~'$host_id'}

// Hints Currently In Progress
cas_Storage_TotalHintsInProgress{dc=~'$dc',rack=~'$rack',host_id=~'$host_id'}

Read Repairs Section¶

// Attempted Per Second
sum(cas_ReadRepair_Attempted{axonfunction='rate',function='Count',dc=~'$dc',rack=~'$rack',host_id=~'$host_id'}) by ($groupBy)

// Repaired Background
sum(cas_ReadRepair_RepairedBackground{axonfunction='rate',function='Count',dc=~'$dc',rack=~'$rack',host_id=~'$host_id'}) by ($groupBy)

// Repaired Blocking
sum(cas_ReadRepair_RepairedBlocking{axonfunction='rate',function='Count',dc=~'$dc',rack=~'$rack',host_id=~'$host_id'}) by ($groupBy)

Coordinator Request Errors Section¶

// Read Timeouts
sum(cas_ClientRequest_Timeouts{axonfunction='rate',function='Count',scope='Read',dc=~'$dc',rack=~'$rack',host_id=~'$host_id'}) by ($groupBy)

// Read Unavailables
sum(cas_ClientRequest_Unavailables{axonfunction='rate',function='Count',scope='Read',dc=~'$dc',rack=~'$rack',host_id=~'$host_id'}) by ($groupBy)

// Write Timeouts
sum(cas_ClientRequest_Timeouts{axonfunction='rate',scope='Write',function='Count',dc=~'$dc',rack=~'$rack',host_id=~'$host_id'}) by ($groupBy)

// Write Unavailables
sum(cas_ClientRequest_Unavailables{axonfunction='rate',scope='Write',function='Count',dc=~'$dc',rack=~'$rack',host_id=~'$host_id'}) by ($groupBy)

Thread Pools Section¶

// Request Thread Pool Distribution (Pie Chart)
sum(cas_ThreadPools_request{axonfunction='rate',key='CompletedTasks',dc=~'$dc',rack=~'$rack',host_id=~'$host_id'}) by (scope)

// Internal Thread Pool Distribution (Pie Chart)
sum(cas_ThreadPools_internal{axonfunction='rate',key='CompletedTasks',dc=~'$dc',rack=~'$rack',host_id=~'$host_id'}) by (scope)

// Anti-Entropy Stage Tasks
sum(cas_ThreadPools_internal{axonfunction='rate',scope='AntiEntropyStage',key='CompletedTasks',dc=~'$dc',rack=~'$rack',host_id=~'$host_id'}) by ($groupBy)

// Read Repair Stage Tasks
sum(cas_ThreadPools_internal{axonfunction='rate',scope='ReadRepairStage',key='CompletedTasks',dc=~'$dc',rack=~'$rack',host_id=~'$host_id'}) by ($groupBy)

// Hints Dispatcher Tasks
sum(cas_ThreadPools_internal{axonfunction='rate',scope='HintsDispatcher',key='CompletedTasks',dc=~'$dc',rack=~'$rack',host_id=~'$host_id'}) by ($groupBy)

Panel Organization¶

Hints Section¶

Total Hints Created By Each Node - Rate of hint creation
Total Hints In Progress - Active hint delivery count

Read Repairs Section¶

Attempted Per Second - Read repair attempt rate
Repaired Background - Background repairs completed
Repaired Blocked - Blocking repairs completed

Coordinator Requests Errors Section¶

Read Timeouts Per Second - Read operation timeout rate
Read Unavailables Per Second - Read unavailable exception rate
Write Timeouts Per Second - Write operation timeout rate
Write Unavailables Per Second - Write unavailable exception rate

Thread Pools Section¶

ThreadPools Request - Request thread pool activity distribution
ThreadPools Internal - Internal thread pool activity distribution
Completed Tasks per sec - Anti Entropy Stage - Repair coordination tasks
Completed Tasks per sec - Read Repair Stage - Read repair execution tasks
Completed Tasks per sec - Hinted Handoff - Hint delivery tasks

Events Section¶

Starting Repair Events - Repair start event frequency
Streaming Events - Data streaming event frequency

Filters¶

data center (dc) - Filter by data center
rack - Filter by rack
node (host_id) - Filter by specific node
groupBy - Dynamic grouping (dc, rack, host_id, keyspace)

Understanding Anti-Entropy Mechanisms¶

Hinted Handoff¶

Purpose: Temporary storage of writes when replicas are unavailable
TotalHints: Accumulating counter of all hints created
HintsInProgress: Current active hint deliveries
Impact: High hint rates indicate replica availability issues

Read Repair¶

Attempted: All read repair attempts
RepairedBackground: Asynchronous repairs (non-blocking)
RepairedBlocking: Synchronous repairs (blocks read response)

Types:

Background: Happens after read completes
Blocking: Happens before read response

Coordinator Errors¶

Timeouts: Request exceeded configured timeout
Unavailables: Not enough replicas available

Causes:

Node overload
Network issues
Insufficient replicas

Thread Pools¶

AntiEntropyStage: Handles repair coordination
ReadRepairStage: Executes read repairs
HintsDispatcher: Delivers hints to recovered nodes

Best Practices¶

Hints Monitoring¶

Zero Hints Ideal: Indicates all replicas available
Growing Hints: Sign of persistent replica issues
High In-Progress: May indicate slow hint delivery

Actions:

Check node health
Review network connectivity
Monitor hint storage capacity

Read Repair Monitoring¶

Background vs Blocking:

High blocking repairs impact read latency
Background repairs are preferred

High Attempt Rate:

Indicates data inconsistency
May need full repair

Success Rate:

Compare attempted vs repaired
Low success indicates issues

Error Monitoring¶

Zero Tolerance:

Any timeouts/unavailables are concerning
Investigate root cause immediately

Read vs Write:

Different implications
Write unavailables risk data loss

Correlation:

Check with dropped messages
Monitor system resources

Thread Pool Health¶

Balanced Distribution:

Even work across pools
No single pool dominating

Anti-Entropy Activity:

Spikes during repairs
Should be low normally

Hints Dispatcher:

Activity indicates recovery
Should complete eventually

Troubleshooting Guide¶

High Hint Rate¶

Check node status
Review network connectivity
Monitor disk space for hints
Consider max_hint_window_in_ms setting

High Read Repair Rate¶

Run nodetool repair
Check consistency levels
Review replication factor
Monitor for flapping nodes

Timeout/Unavailable Errors¶

Check system resources
Review timeout settings
Monitor GC activity
Check request patterns

Thread Pool Congestion¶

Monitor pending tasks
Check blocked tasks
Review pool sizing
Consider capacity expansion

Units and Display¶

Rates: operations per second (short)
Counts: absolute numbers (short)

Legend Format:

Aggregated: $groupBy
Node-specific: $dc - $host_id
Thread pools: $scope

Notes¶

Events use message filtering for repair and streaming activities
Thread pool metrics use key='CompletedTasks' for rate calculations
The dashboard name "Entropy" refers to anti-entropy (consistency) mechanisms
All rate metrics use axonfunction='rate' for per-second calculations