AxonOps Entropy Dashboard Metrics Mapping¶
This document maps the metrics used in the AxonOps Entropy dashboard.
Dashboard Overview¶
The Entropy dashboard (also known as Anti-Entropy) monitors Cassandra's data consistency mechanisms including hinted handoff, read repairs, and repair operations. These features ensure eventual consistency across the cluster by detecting and fixing data inconsistencies.
Metrics Mapping¶
Hints Metrics¶
| Dashboard Metric | Description | Attributes |
|---|---|---|
| `cas_Storage_TotalHints` | Total number of hints created | axonfunction (rate), dc, rack, host_id |
| `cas_Storage_TotalHintsInProgress` | Currently active hints being delivered | dc, rack, host_id |
Read Repair Metrics¶
| Dashboard Metric | Description | Attributes |
|---|---|---|
| `cas_ReadRepair_Attempted` | Read repair attempts | function (Count), axonfunction (rate), dc, rack, host_id |
| `cas_ReadRepair_RepairedBackground` | Background read repairs completed | function (Count), axonfunction (rate), dc, rack, host_id |
| `cas_ReadRepair_RepairedBlocking` | Blocking read repairs completed | function (Count), axonfunction (rate), dc, rack, host_id |
Coordinator Error Metrics¶
| Dashboard Metric | Description | Attributes |
|---|---|---|
| `cas_ClientRequest_Timeouts` | Request timeouts at coordinator | scope (Read/Write), function (Count), axonfunction (rate), dc, rack, host_id |
| `cas_ClientRequest_Unavailables` | Unavailable exceptions at coordinator | scope (Read/Write), function (Count), axonfunction (rate), dc, rack, host_id |
Thread Pool Metrics¶
| Dashboard Metric | Description | Attributes |
|---|---|---|
| `cas_ThreadPools_request` | Request thread pool statistics | scope (pool name), key (CompletedTasks), axonfunction (rate), dc, rack, host_id |
| `cas_ThreadPools_internal` | Internal thread pool statistics | scope (pool name), key (CompletedTasks), axonfunction (rate), dc, rack, host_id |
Query Examples¶
Hints Section¶
// Total Hints Created Rate
cas_Storage_TotalHints{axonfunction='rate',dc=~'$dc',rack=~'$rack',host_id=~'$host_id'}
// Hints Currently In Progress
cas_Storage_TotalHintsInProgress{dc=~'$dc',rack=~'$rack',host_id=~'$host_id'}
Read Repairs Section¶
// Attempted Per Second
sum(cas_ReadRepair_Attempted{axonfunction='rate',function='Count',dc=~'$dc',rack=~'$rack',host_id=~'$host_id'}) by ($groupBy)
// Repaired Background
sum(cas_ReadRepair_RepairedBackground{axonfunction='rate',function='Count',dc=~'$dc',rack=~'$rack',host_id=~'$host_id'}) by ($groupBy)
// Repaired Blocking
sum(cas_ReadRepair_RepairedBlocking{axonfunction='rate',function='Count',dc=~'$dc',rack=~'$rack',host_id=~'$host_id'}) by ($groupBy)
Coordinator Request Errors Section¶
// Read Timeouts
sum(cas_ClientRequest_Timeouts{axonfunction='rate',function='Count',scope='Read',dc=~'$dc',rack=~'$rack',host_id=~'$host_id'}) by ($groupBy)
// Read Unavailables
sum(cas_ClientRequest_Unavailables{axonfunction='rate',function='Count',scope='Read',dc=~'$dc',rack=~'$rack',host_id=~'$host_id'}) by ($groupBy)
// Write Timeouts
sum(cas_ClientRequest_Timeouts{axonfunction='rate',scope='Write',function='Count',dc=~'$dc',rack=~'$rack',host_id=~'$host_id'}) by ($groupBy)
// Write Unavailables
sum(cas_ClientRequest_Unavailables{axonfunction='rate',scope='Write',function='Count',dc=~'$dc',rack=~'$rack',host_id=~'$host_id'}) by ($groupBy)
Thread Pools Section¶
// Request Thread Pool Distribution (Pie Chart)
sum(cas_ThreadPools_request{axonfunction='rate',key='CompletedTasks',dc=~'$dc',rack=~'$rack',host_id=~'$host_id'}) by (scope)
// Internal Thread Pool Distribution (Pie Chart)
sum(cas_ThreadPools_internal{axonfunction='rate',key='CompletedTasks',dc=~'$dc',rack=~'$rack',host_id=~'$host_id'}) by (scope)
// Anti-Entropy Stage Tasks
sum(cas_ThreadPools_internal{axonfunction='rate',scope='AntiEntropyStage',key='CompletedTasks',dc=~'$dc',rack=~'$rack',host_id=~'$host_id'}) by ($groupBy)
// Read Repair Stage Tasks
sum(cas_ThreadPools_internal{axonfunction='rate',scope='ReadRepairStage',key='CompletedTasks',dc=~'$dc',rack=~'$rack',host_id=~'$host_id'}) by ($groupBy)
// Hints Dispatcher Tasks
sum(cas_ThreadPools_internal{axonfunction='rate',scope='HintsDispatcher',key='CompletedTasks',dc=~'$dc',rack=~'$rack',host_id=~'$host_id'}) by ($groupBy)
Panel Organization¶
Hints Section¶
- Total Hints Created By Each Node - Rate of hint creation
- Total Hints In Progress - Active hint delivery count
Read Repairs Section¶
- Attempted Per Second - Read repair attempt rate
- Repaired Background - Background repairs completed
- Repaired Blocked - Blocking repairs completed
Coordinator Requests Errors Section¶
- Read Timeouts Per Second - Read operation timeout rate
- Read Unavailables Per Second - Read unavailable exception rate
- Write Timeouts Per Second - Write operation timeout rate
- Write Unavailables Per Second - Write unavailable exception rate
Thread Pools Section¶
- ThreadPools Request - Request thread pool activity distribution
- ThreadPools Internal - Internal thread pool activity distribution
- Completed Tasks per sec - Anti Entropy Stage - Repair coordination tasks
- Completed Tasks per sec - Read Repair Stage - Read repair execution tasks
- Completed Tasks per sec - Hinted Handoff - Hint delivery tasks
Events Section¶
- Starting Repair Events - Repair start event frequency
- Streaming Events - Data streaming event frequency
Filters¶
- data center (`dc`) - Filter by data center
- rack - Filter by rack
- node (`host_id`) - Filter by specific node
- groupBy - Dynamic grouping (dc, rack, host_id, keyspace)
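
To make the role of these filters concrete, the sketch below fills the dashboard variables into one of the queries from this page. It is a minimal Python illustration, assuming plain textual substitution and regex alternation for multi-select values. How AxonOps itself expands `$dc`, `$rack`, `$host_id`, and `$groupBy` is not described here, and the `render` helper is hypothetical.

```python
from string import Template

# Query text copied from the Read Repairs section of this page.
QUERY = Template(
    "sum(cas_ReadRepair_Attempted{axonfunction='rate',function='Count',"
    "dc=~'$dc',rack=~'$rack',host_id=~'$host_id'}) by ($groupBy)"
)


def render(query, dc=(), rack=(), host_id=(), group_by="dc"):
    """Fill the dashboard variables into a query template.

    Assumption: a multi-value selection becomes a regex alternation and an
    empty selection matches everything ('.*').
    """
    def as_regex(values):
        return "|".join(values) if values else ".*"

    return query.substitute(
        dc=as_regex(dc),
        rack=as_regex(rack),
        host_id=as_regex(host_id),
        groupBy=group_by,
    )


rendered = render(QUERY, dc=["dc1", "dc2"], group_by="rack")
print(rendered)
```

With two data centers selected and grouping by rack, the rendered query matches `dc=~'dc1|dc2'`, leaves rack and node unfiltered, and aggregates `by (rack)`.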
Understanding Anti-Entropy Mechanisms¶
Hinted Handoff¶
- Purpose: Temporary storage of writes when replicas are unavailable
- TotalHints: Accumulating counter of all hints created
- HintsInProgress: Current active hint deliveries
- Impact: High hint rates indicate replica availability issues
Read Repair¶
- Attempted: All read repair attempts
- RepairedBackground: Asynchronous repairs (non-blocking)
- RepairedBlocking: Synchronous repairs (blocks read response)
Types:
- Background: Happens after read completes
- Blocking: Happens before read response
Coordinator Errors¶
- Timeouts: Request exceeded configured timeout
- Unavailables: Not enough replicas available
Causes:
- Node overload
- Network issues
- Insufficient replicas
Thread Pools¶
- AntiEntropyStage: Handles repair coordination
- ReadRepairStage: Executes read repairs
- HintsDispatcher: Delivers hints to recovered nodes
Best Practices¶
Hints Monitoring¶
- Zero Hints Ideal: Indicates all replicas available
- Growing Hints: Sign of persistent replica issues
- High In-Progress: May indicate slow hint delivery
Actions:
- Check node health
- Review network connectivity
- Monitor hint storage capacity
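
The "Growing Hints" signal can be checked mechanically. The following is a minimal Python sketch, assuming you have exported a series of hint-creation rate samples (hints per second) for one node. The function name, the run length, and the threshold are illustrative choices, not AxonOps settings.

```python
def hints_persistently_growing(rates, min_consecutive=3, threshold=0.0):
    """Return True if the hint creation rate stays above `threshold`
    for at least `min_consecutive` consecutive samples."""
    run = 0
    for rate in rates:
        # Extend the current run, or reset it when the rate drops back.
        run = run + 1 if rate > threshold else 0
        if run >= min_consecutive:
            return True
    return False


# A one-sample blip is ignored; a sustained run is flagged.
print(hints_persistently_growing([0.0, 2.5, 0.0, 0.0]))       # False
print(hints_persistently_growing([0.0, 1.2, 3.4, 2.8, 0.0]))  # True
```

Requiring several consecutive samples avoids reacting to a single short replica outage, which hinted handoff is designed to absorb.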
Read Repair Monitoring¶
Background vs Blocking:
- High blocking repairs impact read latency
- Background repairs are preferred
High Attempt Rate:
- Indicates data inconsistency
- May need full repair
Success Rate:
- Compare attempted vs repaired
- Low success indicates issues
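
The attempted-versus-repaired comparison can be expressed as a simple ratio. This is a minimal Python sketch, assuming the three rates were read from the Read Repairs panels over the same interval. The helper name is hypothetical, and how to interpret a given ratio depends on your workload and Cassandra version.

```python
def repaired_ratio(attempted, repaired_background, repaired_blocking):
    """Ratio of repaired (background + blocking) to attempted read repairs.

    All three arguments are per-second rates over the same interval.
    Returns None when nothing was attempted, so callers can skip the point.
    """
    if attempted <= 0:
        return None
    return (repaired_background + repaired_blocking) / attempted


print(repaired_ratio(40.0, 6.0, 4.0))  # 0.25
print(repaired_ratio(0.0, 0.0, 0.0))   # None
```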
Error Monitoring¶
Zero Tolerance:
- Any timeouts/unavailables are concerning
- Investigate root cause immediately
Read vs Write:
- Different implications
- Write unavailables mean the coordinator rejected the write; the data is lost unless the client retries
Correlation:
- Check with dropped messages
- Monitor system resources
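
The zero-tolerance rule can be sketched as a filter over the four coordinator error rates. This is a minimal Python illustration, assuming the rates have already been read from the Coordinator Requests Errors panels. The dictionary layout and function name are hypothetical.

```python
def nonzero_error_rates(rates):
    """Return the (scope, kind) pairs whose rate is above zero, worst first."""
    flagged = {key: value for key, value in rates.items() if value > 0}
    return sorted(flagged, key=flagged.get, reverse=True)


# Example rates in errors per second, keyed by (scope, error kind).
sample = {
    ("Read", "Timeouts"): 0.0,
    ("Read", "Unavailables"): 0.0,
    ("Write", "Timeouts"): 0.4,
    ("Write", "Unavailables"): 1.5,
}
print(nonzero_error_rates(sample))
# [('Write', 'Unavailables'), ('Write', 'Timeouts')]
```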
Thread Pool Health¶
Balanced Distribution:
- Even work across pools
- No single pool dominating
Anti-Entropy Activity:
- Spikes during repairs
- Should be low normally
Hints Dispatcher:
- Activity indicates recovery
- Should complete eventually
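
The "no single pool dominating" check mirrors what the pie charts show: each pool's share of the total completed-task rate. This is a minimal Python sketch, assuming per-pool rates keyed by the `scope` label. The 0.5 cutoff and both function names are illustrative, not recommended values.

```python
def pool_shares(completed_rates):
    """Map each pool name to its share of the total completed-task rate."""
    total = sum(completed_rates.values())
    if total == 0:
        return {name: 0.0 for name in completed_rates}
    return {name: rate / total for name, rate in completed_rates.items()}


def dominating_pools(completed_rates, max_share=0.5):
    """Return pools whose share exceeds `max_share` (an illustrative cutoff)."""
    shares = pool_shares(completed_rates)
    return [name for name, share in shares.items() if share > max_share]


rates = {"AntiEntropyStage": 5.0, "ReadRepairStage": 5.0, "HintsDispatcher": 30.0}
print(pool_shares(rates)["HintsDispatcher"])  # 0.75
print(dominating_pools(rates))                # ['HintsDispatcher']
```

A dominating pool is not necessarily a problem. For example, a busy HintsDispatcher is expected while a recovered node catches up.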
Troubleshooting Guide¶
High Hint Rate¶
- Check node status
- Review network connectivity
- Monitor disk space for hints
- Consider the max_hint_window_in_ms setting (named max_hint_window in Cassandra 4.1 and later)
High Read Repair Rate¶
- Run nodetool repair
- Check consistency levels
- Review replication factor
- Monitor for flapping nodes
Timeout/Unavailable Errors¶
- Check system resources
- Review timeout settings
- Monitor GC activity
- Check request patterns
Thread Pool Congestion¶
- Monitor pending tasks
- Check blocked tasks
- Review pool sizing
- Consider capacity expansion
Units and Display¶
- Rates: operations per second (short)
- Counts: absolute numbers (short)
Legend Format:
- Aggregated: `$groupBy`
- Node-specific: `$dc - $host_id`
- Thread pools: `$scope`
Notes¶
- Events use message filtering for repair and streaming activities
- Thread pool metrics use `key='CompletedTasks'` for rate calculations
- The dashboard name "Entropy" refers to anti-entropy (consistency) mechanisms
- All rate metrics use `axonfunction='rate'` for per-second calculations