Gossip Failures¶
Gossip is Cassandra's peer-to-peer protocol for sharing cluster state. Gossip failures cause nodes to lose visibility of each other, leading to cluster partitions and availability issues.
Symptoms¶
- Nodes showing as DOWN (DN) in nodetool status despite the Cassandra process running
- nodetool gossipinfo shows stale or missing entries
- "Unable to gossip with" errors in logs
- New nodes failing to join cluster
- Inconsistent cluster views across nodes
- Schema disagreement
Diagnosis¶
Step 1: Check Node Status¶
# On multiple nodes - compare output
nodetool status
Different views from different nodes indicate gossip issues.
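Comparing views by eye becomes error-prone beyond a few nodes. As a minimal sketch, the hypothetical helper below (status_fingerprint is not part of nodetool) reduces nodetool status output to sorted address/state pairs so that views from different nodes can be diffed directly. It assumes the usual layout in which each node row starts with a two-letter state code (UN, DN, and so on) followed by the address.

```shell
# Hypothetical helper: reads `nodetool status` output on stdin and prints
# one "address state" line per node, sorted, ignoring load and ownership.
# Assumes node rows begin with a two-letter code such as UN or DN.
status_fingerprint() {
    awk '$1 ~ /^[UD][NLJM]$/ { print $2, $1 }' | sort
}

# Example: compare the views of two nodes (host names are placeholders).
# ssh node1 nodetool status | status_fingerprint > /tmp/node1.view
# ssh node2 nodetool status | status_fingerprint > /tmp/node2.view
# diff /tmp/node1.view /tmp/node2.view && echo "views agree"
```

Load and ownership columns are dropped on purpose, since they legitimately differ between snapshots and would hide the state differences that matter here.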
Step 2: Check Gossip State¶
nodetool gossipinfo
Key fields to check:
- STATUS: Should be NORMAL for healthy nodes
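- HEARTBEAT: A counter that should keep increasing between successive runs; a value that stops advancing means the node is no longer gossiping
- GENERATION: Epoch timestamp set when the node last started; it changes only on restart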
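On a large cluster the full gossipinfo output is long, so a per-endpoint summary of STATUS can help. The sketch below uses a hypothetical helper name (gossip_status_summary) and assumes that endpoint lines are unindented, that attribute lines are indented, and that the STATUS value may or may not be preceded by a numeric version field. Verify these assumptions against the output of the Cassandra version in use.

```shell
# Hypothetical helper: reads `nodetool gossipinfo` output on stdin and
# prints "<endpoint> <status>" for each endpoint.
# Handles both "STATUS:NORMAL,..." and "STATUS:<version>:NORMAL,...".
gossip_status_summary() {
    awk '
        /^[^ \t]/ { endpoint = $0; next }     # unindented line: new endpoint
        /^[ \t]*STATUS:/ {
            value = $0
            sub(/^[ \t]*STATUS:/, "", value)  # drop the key
            sub(/^[0-9]+:/, "", value)        # drop an optional version field
            sub(/,.*/, "", value)             # keep only the state name
            print endpoint, value
        }
    '
}

# Example:
# nodetool gossipinfo | gossip_status_summary
```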
Step 3: Check Gossip Service¶
nodetool statusgossip
The command should print running. Any other result means gossip is disabled on this node.
Step 4: Check Network Connectivity¶
# Internode traffic, including gossip, uses port 7000 (or 7001 for SSL)
for node in node1 node2 node3; do
    nc -zv "$node" 7000 && echo "$node gossip: OK" || echo "$node gossip: FAILED"
done
# Also check the CQL native transport port (client traffic, not gossip)
for node in node1 node2 node3; do
    nc -zv "$node" 9042 && echo "$node CQL: OK" || echo "$node CQL: FAILED"
done
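When a firewall silently drops packets, a connection attempt can hang rather than fail, and nc is not installed everywhere. As an alternative sketch, the hypothetical port_open helper below bounds each attempt with a timeout. It assumes bash was built with /dev/tcp support and that the coreutils timeout command is available.

```shell
# Hypothetical helper: succeeds if a TCP connection to host:port opens
# within the timeout (default 3 seconds). Relies on bash's /dev/tcp
# redirection and the coreutils `timeout` command; neither is guaranteed
# on every system.
port_open() {
    host=$1
    port=$2
    wait_secs=${3:-3}
    timeout "$wait_secs" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Example: check the gossip port on each node (host names are placeholders).
# for node in node1 node2 node3; do
#     port_open "$node" 7000 && echo "$node gossip: OK" || echo "$node gossip: FAILED"
# done
```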
Step 5: Check Logs¶
grep -i "gossip\|cannot reach\|connection refused" /var/log/cassandra/system.log | tail -50
Step 6: Check Seeds Configuration¶
grep seeds /etc/cassandra/cassandra.yaml
Ensure the seeds are reachable and that every node lists the same seeds.
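Seed lists that differ only in order or spacing are easy to misread as inconsistent. A minimal sketch of a normalizer follows; normalized_seeds is a hypothetical helper, and it assumes the single-line seeds entry used with SimpleSeedProvider.

```shell
# Hypothetical helper: reads cassandra.yaml on stdin and prints its seed
# list as one sorted, comma-separated line. Commented-out entries are
# ignored because the pattern only accepts whitespace and "-" before "seeds:".
normalized_seeds() {
    sed -n 's/^[[:space:]]*-\{0,1\}[[:space:]]*seeds:[[:space:]]*"\{0,1\}\([^"#]*\)"\{0,1\}.*/\1/p' |
        tr ',' '\n' |
        sed 's/^[[:space:]]*//; s/[[:space:]]*$//' |
        grep -v '^$' |
        sort |
        paste -sd, -
}

# Example: print the normalized list from each node (host names are placeholders).
# for node in node1 node2 node3; do
#     printf '%s: ' "$node"
#     ssh "$node" cat /etc/cassandra/cassandra.yaml | normalized_seeds
# done
```

Every node should print the same line; any difference points at the node whose cassandra.yaml needs correcting.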
Resolution¶
Case 1: Network Issues¶
Problem: Firewall blocking gossip port
# Check firewall
sudo iptables -L -n | grep 7000
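# Open gossip port (-A appends to the chain; if an earlier rule already drops or rejects this traffic, insert with -I instead)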
sudo iptables -A INPUT -p tcp --dport 7000 -j ACCEPT
sudo iptables -A INPUT -p tcp --dport 7001 -j ACCEPT # SSL
Problem: DNS resolution failing
# Test DNS
nslookup node1.example.com
# Use IP addresses in cassandra.yaml if DNS unreliable
listen_address: 192.168.1.10
Case 2: Gossip Disabled¶
# Check status
nodetool statusgossip
# Enable if disabled
nodetool enablegossip
Case 3: Stale Gossip State¶
Problem: Node has outdated view of cluster
# Restart gossip without restarting the process (peers see this node as down until gossip is re-enabled)
nodetool disablegossip
sleep 5
nodetool enablegossip
Case 4: Corrupted Gossip State¶
Problem: Node persistently has wrong cluster view
# Restart the affected node (drain first to flush memtables and stop accepting connections)
nodetool drain
sudo systemctl restart cassandra
Case 5: Zombie Node¶
Problem: Removed node still appearing in gossip
# Check for zombie entries (some versions print a version number between STATUS and the state, so allow for it)
nodetool gossipinfo | grep -E -B5 'STATUS:([0-9]+:)?(LEFT|removed)'
# Force remove if necessary (use carefully)
nodetool assassinate <zombie-node-ip>
Assassinate Warning
nodetool assassinate should only be used for nodes that have been properly decommissioned or are permanently dead. Using it on a live node can cause data loss.
Case 6: Seed Node Issues¶
Problem: All seed nodes unreachable
- Verify at least one seed is running and reachable
- Ensure seeds are listed consistently across all nodes
- Seeds should never include all nodes - typically 2-3 per datacenter
# cassandra.yaml - good seed configuration
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "node1,node2" # 2-3 seeds per DC
Recovery¶
Verify Gossip Health¶
#!/bin/bash
# All nodes should report the same status (nodetool runs on each node over SSH)
for node in node1 node2 node3; do
    echo "=== $node ==="
    ssh "$node" "nodetool status" | head -10
done
# Heartbeats should keep advancing between runs (run locally; -i because key casing may differ by version)
nodetool gossipinfo | grep -i heartbeat
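A single gossipinfo snapshot cannot show whether a heartbeat is advancing; two snapshots taken some seconds apart can. The sketch below compares them with a hypothetical helper (stalled_heartbeats). It assumes unindented endpoint lines followed by indented heartbeat:<n> lines and matches the key case-insensitively.

```shell
# Hypothetical helper: takes two files holding `nodetool gossipinfo`
# snapshots (older first) and prints every endpoint whose heartbeat did
# not increase between them.
stalled_heartbeats() {
    awk '
        /^[^ \t]/ { endpoint = $0; next }     # unindented line: new endpoint
        tolower($0) ~ /^[ \t]*heartbeat:/ {
            count = $0
            sub(/^.*:/, "", count)            # keep the number after the last colon
            if (FNR == NR) {
                before[endpoint] = count      # first file: remember the value
            } else if ((endpoint in before) && count + 0 <= before[endpoint] + 0) {
                print endpoint                # second file: value did not grow
            }
        }
    ' "$1" "$2"
}

# Example: take two snapshots 30 seconds apart.
# nodetool gossipinfo > /tmp/gossip.1; sleep 30; nodetool gossipinfo > /tmp/gossip.2
# stalled_heartbeats /tmp/gossip.1 /tmp/gossip.2
```

A node that restarts between the two snapshots gets a new generation and a reset heartbeat, so it may be listed even though it is healthy; check GENERATION before acting on the result.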
Verify Schema Agreement¶
nodetool describecluster | grep -A 20 "Schema versions"
All nodes should report the same schema version, so a healthy cluster lists exactly one version.
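For scripted checks, the number of distinct versions can be counted rather than read. This is a sketch only: schema_version_count is a hypothetical helper, and it assumes each schema version is printed as an indented "<uuid>: [addresses]" line.

```shell
# Hypothetical helper: reads `nodetool describecluster` output on stdin
# and prints the number of distinct schema versions. UNREACHABLE entries
# are not UUIDs and are therefore not counted.
schema_version_count() {
    grep -cE '^[[:space:]]*[0-9a-f]{8}-([0-9a-f]{4}-){3}[0-9a-f]{12}:' || true
}

# Example:
# [ "$(nodetool describecluster | schema_version_count)" -eq 1 ] || echo "schema disagreement"
```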
Common Causes¶
| Cause | Symptom | Fix |
|---|---|---|
| Firewall | Connection refused or timed out | Open ports 7000/7001 |
| DNS issues | Cannot resolve hostname | Use IPs or fix DNS |
| Network partition | Partial cluster visibility | Fix network routing |
| Clock skew | Gossip timestamp errors | Sync with NTP |
| Bad seed config | Nodes can't find cluster | Fix seeds list |
| Resource exhaustion | Gossip timeouts | Add resources |
Gossip Port Reference¶
| Port | Purpose | Protocol |
|---|---|---|
| 7000 | Internode communication, including gossip | TCP |
| 7001 | Internode communication, including gossip (SSL) | TCP |
| 7199 | JMX monitoring | TCP |
| 9042 | CQL native transport | TCP |
Prevention¶
- Monitor gossip health - Alert on nodes marked DOWN
- Use stable networking - Avoid network configurations that cause partitions
- Sync clocks - Use NTP across all nodes
- Consistent configuration - Same seeds on all nodes
- Firewall rules - Ensure gossip ports are always open
- Multiple seeds - 2-3 per datacenter for redundancy
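The first item, alerting on nodes marked DOWN, can be sketched as a small check suitable for cron or a monitoring agent. The helper name report_down_nodes and the alert command are placeholders, and the parsing assumes that rows for down nodes in nodetool status start with a code such as DN.

```shell
# Hypothetical helper: reads `nodetool status` output on stdin, prints the
# address of every node reported as down, and returns non-zero if any
# were found.
report_down_nodes() {
    down=$(awk '$1 ~ /^D[NLJM]$/ { print $2 }')
    if [ -n "$down" ]; then
        printf '%s\n' "$down"
        return 1
    fi
    return 0
}

# Example check; replace the echo with a real alerting command.
# nodetool status | report_down_nodes || echo "ALERT: nodes down" >&2
```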
Related Commands¶
| Command | Purpose |
|---|---|
| nodetool status | Cluster overview |
| nodetool gossipinfo | Detailed gossip state |
| nodetool statusgossip | Check if gossip is running |
| nodetool enablegossip | Enable gossip |
| nodetool disablegossip | Disable gossip |
| nodetool assassinate | Remove dead node from gossip |
Related Documentation¶
- Schema Disagreement - Schema issues often caused by gossip problems
- Configuration - Configuration including network setup