Gossip Failures¶

Gossip is Cassandra's peer-to-peer protocol for sharing cluster state. Gossip failures cause nodes to lose visibility of each other, leading to cluster partitions and availability issues.

Symptoms¶

Nodes showing as DOWN (DN) in nodetool status despite being running
nodetool gossipinfo shows stale or missing entries
"Unable to gossip with" errors in logs
New nodes failing to join cluster
Inconsistent cluster views across nodes
Schema disagreement

Diagnosis¶

Step 1: Check Node Status¶

# On multiple nodes - compare output
nodetool status

Different views from different nodes indicate gossip issues.

Step 2: Check Gossip State¶

nodetool gossipinfo

Key fields to check: - STATUS: Should be NORMAL for healthy nodes - HEARTBEAT: Should be recent (incrementing) - GENERATION: Timestamp of last restart

Step 3: Check Gossip Service¶

nodetool statusgossip

Should return running. If not running, gossip is disabled.

Step 4: Check Network Connectivity¶

# Gossip uses port 7000 (or 7001 for SSL)
for node in node1 node2 node3; do
    nc -zv $node 7000 && echo "$node gossip: OK" || echo "$node gossip: FAILED"
done

# Check internode communication
for node in node1 node2 node3; do
    nc -zv $node 7000
    nc -zv $node 9042
done

Step 5: Check Logs¶

grep -i "gossip\|cannot reach\|connection refused" /var/log/cassandra/system.log | tail -50

Step 6: Check Seeds Configuration¶

grep seeds /etc/cassandra/cassandra.yaml

Ensure seeds are reachable and consistent across cluster.

Resolution¶

Case 1: Network Issues¶

Problem: Firewall blocking gossip port

# Check firewall
sudo iptables -L -n | grep 7000

# Open gossip port
sudo iptables -A INPUT -p tcp --dport 7000 -j ACCEPT
sudo iptables -A INPUT -p tcp --dport 7001 -j ACCEPT  # SSL

Problem: DNS resolution failing

# Test DNS
nslookup node1.example.com

# Use IP addresses in cassandra.yaml if DNS unreliable
listen_address: 192.168.1.10

Case 2: Gossip Disabled¶

# Check status
nodetool statusgossip

# Enable if disabled
nodetool enablegossip

Case 3: Stale Gossip State¶

Problem: Node has outdated view of cluster

# Restart gossip (non-disruptive)
nodetool disablegossip
sleep 5
nodetool enablegossip

Case 4: Corrupted Gossip State¶

Problem: Node persistently has wrong cluster view

# Rolling restart of affected node
nodetool drain
sudo systemctl restart cassandra

Case 5: Zombie Node¶

Problem: Removed node still appearing in gossip

# Check for zombie entries
nodetool gossipinfo | grep -B5 "STATUS:LEFT\|STATUS:removed"

# Force remove if necessary (use carefully)
nodetool assassinate <zombie-node-ip>

Assassinate Warning

nodetool assassinate should only be used for nodes that have been properly decommissioned or are permanently dead. Using it on a live node can cause data loss.

Case 6: Seed Node Issues¶

Problem: All seed nodes unreachable

Verify at least one seed is running and reachable
Ensure seeds are listed consistently across all nodes
Seeds should never include all nodes - typically 2-3 per datacenter

# cassandra.yaml - good seed configuration
seed_provider:
  - class_name: org.apache.cassandra.locator.SimpleSeedProvider
    parameters:
      - seeds: "node1,node2"  # 2-3 seeds per DC

Recovery¶

Verify Gossip Health¶

#!/bin/bash
# All nodes should see the same status (using SSH for local execution)
for node in node1 node2 node3; do
    echo "=== $node ==="
    ssh "$node" "nodetool status" | head -10
done

# Gossip should show recent heartbeats (run locally)
nodetool gossipinfo | grep HEARTBEAT

Verify Schema Agreement¶

nodetool describecluster | grep -A 20 "Schema versions"

All nodes should show the same schema version.

Common Causes¶

Cause	Symptom	Fix
Firewall	Connection refused	Open ports 7000/7001
DNS issues	Cannot resolve hostname	Use IPs or fix DNS
Network partition	Partial cluster visibility	Fix network routing
Clock skew	Gossip timestamp errors	Sync with NTP
Bad seed config	Nodes can't find cluster	Fix seeds list
Resource exhaustion	Gossip timeouts	Add resources

Gossip Port Reference¶

Port	Purpose	Protocol
7000	Internode gossip	TCP
7001	Internode gossip (SSL)	TCP
7199	JMX monitoring	TCP
9042	CQL native transport	TCP

Prevention¶

Monitor gossip health - Alert on nodes marked DOWN
Use stable networking - Avoid network configurations that cause partitions
Sync clocks - Use NTP across all nodes
Consistent configuration - Same seeds on all nodes
Firewall rules - Ensure gossip ports are always open
Multiple seeds - 2-3 per datacenter for redundancy

Command	Purpose
`nodetool status`	Cluster overview
`nodetool gossipinfo`	Detailed gossip state
`nodetool statusgossip`	Check if gossip is running
`nodetool enablegossip`	Enable gossip
`nodetool disablegossip`	Disable gossip
`nodetool assassinate`	Remove dead node from gossip

Schema Disagreement - Schema issues often caused by gossip problems
Configuration - Configuration including network setup