nodetool failuredetector¶
Displays the failure detector information for the cluster.
Synopsis¶
nodetool [connection_options] failuredetector
Description¶
nodetool failuredetector displays information about Cassandra's failure detector, which monitors the health of nodes in the cluster using the Phi Accrual Failure Detector algorithm. The output shows how the node you query perceives the health and connectivity of the other nodes.
The failure detector uses gossip heartbeats to calculate a "phi" value representing the likelihood that a node has failed. When phi exceeds the configured threshold, the node is marked as down.
Examples¶
Basic Usage¶
nodetool failuredetector
Output¶
Sample Output¶
Endpoint Phi
192.168.1.101 0.0034521
192.168.1.102 0.0028934
192.168.1.103 0.0041256
192.168.1.104 5.2341567
Interpreting Phi Values¶
| Phi Value | Interpretation |
|---|---|
| 0 - 0.5 | Very healthy, recent heartbeat |
| 0.5 - 5 | Healthy, normal range |
| 5 - 8 | Elevated, possible issues |
| > 8 | Likely down (default threshold) |
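The bands in the table can be applied mechanically to each row of output. The following is a minimal sketch, not part of nodetool: `classify_phi` is a hypothetical helper, it assumes the two-column `Endpoint Phi` layout shown in the sample output, and it treats each band's lower bound as inclusive, since the table leaves the boundaries at 0.5 and 5 ambiguous.

```shell
# classify_phi (hypothetical helper): reads failuredetector-style rows on stdin
# and appends the interpretation band from the table above.
# Assumes one header line followed by "<endpoint> <phi>" rows.
classify_phi() {
    awk 'NR > 1 && NF >= 2 {
        phi = $2 + 0                      # force a numeric comparison
        if      (phi < 0.5) band = "very healthy"
        else if (phi < 5)   band = "healthy"
        else if (phi <= 8)  band = "elevated"
        else                band = "likely down"
        printf "%s %s %s\n", $1, $2, band
    }'
}

# Demonstration on two rows of the sample output.
classify_phi <<'EOF'
Endpoint Phi
192.168.1.101 0.0034521
192.168.1.104 5.2341567
EOF
```

Against a live node, the here-document would be replaced by a pipe: `nodetool failuredetector | classify_phi`.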
Failure Detection Algorithm¶
Phi Accrual Failure Detector¶
How Phi is Calculated:
1. Each node sends periodic heartbeats via gossip
2. Receiving nodes track heartbeat arrival times
3. Statistical analysis calculates expected arrival time
4. Phi = -log10(P(heartbeat will still arrive))
5. Higher phi = higher probability of failure
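To make step 4 concrete, the sketch below evaluates the formula under one simplifying assumption that is not stated above: heartbeat intervals are exponentially distributed, so the probability that a heartbeat is still to come after `t` seconds of silence is `exp(-t / mean)`. `phi_estimate` is a hypothetical helper for illustration only and should not be read as Cassandra's implementation.

```shell
# phi_estimate MEAN_INTERVAL_SECONDS ELAPSED_SECONDS (hypothetical helper)
# Assumption: exponentially distributed heartbeat intervals, so
# P(heartbeat still arrives after t) = exp(-t / mean).
phi_estimate() {
    awk -v mean="$1" -v t="$2" 'BEGIN {
        p_later = exp(-t / mean)                    # chance the heartbeat is merely late
        printf "%.4f\n", -log(p_later) / log(10)    # phi = -log10(p_later)
    }'
}

phi_estimate 1 2     # 2 s of silence with a 1 s mean interval
phi_estimate 1 20    # 20 s of silence with the same mean
```

Under this assumption, 2 seconds of silence yields a phi of about 0.87, while 20 seconds yields about 8.69. Phi grows linearly with the time since the last heartbeat, scaled by the mean interval the receiving node has observed.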
Default Threshold¶
# cassandra.yaml
phi_convict_threshold: 8
A node is marked DOWN when phi exceeds this threshold.
Use Cases¶
Diagnose Cluster Health¶
# Check all nodes' phi values
nodetool failuredetector
# Identify nodes with elevated phi
nodetool failuredetector | awk 'NR > 1 && $2 > 1 {print}'
Network Issue Investigation¶
When experiencing intermittent connectivity:
# Monitor phi values over time
watch -n 5 'nodetool failuredetector'
Pre-Maintenance Check¶
Before cluster operations:
# Ensure all nodes are healthy
nodetool failuredetector
# All phi values should be low
Monitoring Script¶
#!/bin/bash
# monitor_failure_detector.sh
THRESHOLD=5.0
echo "=== Failure Detector Check ==="
echo ""
# Get failure detector info
nodetool failuredetector | tail -n +2 | while read -r endpoint phi; do
    # Compare phi to threshold (bc prints 1 when the comparison is true)
    elevated=$(echo "$phi > $THRESHOLD" | bc -l)
    if [ "$elevated" -eq 1 ]; then
        echo "WARNING: $endpoint has elevated phi: $phi"
    else
        echo "OK: $endpoint phi=$phi"
    fi
done
Cluster-Wide Check¶
#!/bin/bash
# cluster_failure_detector.sh
echo "=== Cluster Failure Detector Status ==="
# Get list of node IPs from local nodetool status
nodes=$(nodetool status | grep -E '^(UN|DN)' | awk '{print $2}')
for node in $nodes; do
    echo ""
    echo "=== From perspective of $node ==="
    # Run the fallback locally so it also fires when ssh itself fails
    ssh "$node" "nodetool failuredetector 2>/dev/null" || echo "Cannot connect to $node"
done
Troubleshooting¶
High Phi Values¶
If a node shows consistently high phi:
# Check network connectivity
ping <node_ip>
# Check if node is under load
ssh <node_ip> "nodetool tpstats"
# Check for GC issues
ssh <node_ip> "nodetool gcstats"
Fluctuating Phi Values¶
Phi values that swing between low and high readings typically point to network instability:
# Check for network issues
traceroute <node_ip>
# Monitor over time
for i in {1..60}; do
    echo "$(date): $(nodetool failuredetector | grep <node_ip>)"
    sleep 10
done
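To put numbers on how much phi is fluctuating, samples collected over time can be summarised. This is a minimal sketch, assuming the samples have been saved as plain `<endpoint> <phi>` rows (the layout of the sample output, without a date prefix). `phi_stats` is a hypothetical helper, not a nodetool feature.

```shell
# phi_stats (hypothetical helper): summarises phi samples read from stdin,
# one "<endpoint> <phi>" row per line. Rows whose second field is not a
# plain decimal number (for example a header line) are ignored.
phi_stats() {
    awk 'NF >= 2 && $2 ~ /^[0-9]+(\.[0-9]+)?$/ {
        phi = $2 + 0
        if (n == 0 || phi < min) min = phi
        if (n == 0 || phi > max) max = phi
        sum += phi
        n++
    }
    END {
        if (n == 0) { print "no samples"; exit 1 }
        printf "samples=%d min=%.4f max=%.4f mean=%.4f\n", n, min, max, sum / n
    }'
}

# Demonstration with three made-up samples for one endpoint.
phi_stats <<'EOF'
192.168.1.104 0.5
192.168.1.104 4.5
192.168.1.104 1.0
EOF
```

A wide gap between `min` and `max` over a short window suggests the kind of instability described here, whereas a uniformly high mean points to a persistent problem.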
Node Incorrectly Marked Down¶
If a healthy node is marked down:
# Check phi threshold
grep phi_convict_threshold /etc/cassandra/cassandra.yaml
# Consider adjusting if network is high-latency
# Higher threshold = more tolerant of delays
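A plain `grep` also matches a commented-out setting, which can be misleading. The sketch below, built around a hypothetical `phi_threshold` helper, prints only an active setting and falls back to 8, the default given in this document, when none is found. It assumes a simple top-level `key: value` line and is not a general YAML parser.

```shell
# phi_threshold FILE (hypothetical helper): prints the active
# phi_convict_threshold from a cassandra.yaml-style file, or 8 when the
# setting is absent or commented out.
phi_threshold() {
    awk -F':' '
        /^[ \t]*phi_convict_threshold[ \t]*:/ {
            value = $2
            sub(/#.*/, "", value)        # drop a trailing comment
            gsub(/[ \t]/, "", value)     # drop surrounding whitespace
            if (value != "") found = value
        }
        END {
            if (found == "") found = 8   # default stated in this document
            print found
        }
    ' "$1"
}
```

For example, `phi_threshold /etc/cassandra/cassandra.yaml` prints the value in effect for a package install that keeps its configuration at that path.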
Configuration¶
Phi Threshold¶
# cassandra.yaml
phi_convict_threshold: 8 # Default
# For high-latency networks, consider increasing:
# phi_convict_threshold: 12
Affecting Factors¶
| Factor | Effect on Phi |
|---|---|
| Network latency | Higher latency → higher phi |
| GC pauses | Long GC → spikes in phi |
| CPU load | High load → delayed heartbeats |
| Network packet loss | Missing heartbeats → elevated phi |
Best Practices¶
Failure Detector Guidelines
- Regular monitoring - Include in health checks
- Baseline values - Know normal phi ranges for your cluster
- Alert on elevated phi - Before nodes are marked down
- Investigate spikes - Don't ignore temporary elevations
- Tune threshold - Adjust for network characteristics
Healthy Cluster Indicators
- All phi values < 1.0
- Values stable over time
- No sudden spikes
- Symmetric across nodes (A sees B same as B sees A)
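The "all phi values < 1.0" indicator lends itself to an automated gate, for instance before maintenance. The following is a minimal sketch: `check_phi` is a hypothetical helper, the limit of 1.0 comes from the list above, and the input is assumed to follow the two-column layout of the sample output.

```shell
# check_phi [LIMIT] (hypothetical helper): reads failuredetector-style output
# on stdin, reports every endpoint whose phi is at or above LIMIT
# (default 1.0), and returns non-zero if any such endpoint was found.
check_phi() {
    awk -v limit="${1:-1.0}" '
        NR > 1 && NF >= 2 && $2 + 0 >= limit + 0 {
            printf "ELEVATED: %s phi=%s\n", $1, $2
            bad = 1
        }
        END { exit (bad ? 1 : 0) }
    '
}

# Demonstration on two rows of the sample output.
if check_phi 1.0 <<'EOF'
Endpoint Phi
192.168.1.101 0.0034521
192.168.1.104 5.2341567
EOF
then
    echo "All phi values below limit"
else
    echo "Investigate before proceeding"
fi
```

Returning an exit status rather than only printing text lets the check slot into cron jobs or monitoring agents, for example `nodetool failuredetector | check_phi 1.0`.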
Related Commands¶
| Command | Relationship |
|---|---|
| gossipinfo | Detailed gossip state |
| status | Cluster status overview |
| info | Node information |
| netstats | Network statistics |