[CASSANDRA-10052] Misleading down-node push notifications when rpc_address is shared - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Normal
Resolution: Fixed
Fix Version/s: 2.1.10, 2.2.2, 3.0.0 rc2
Component/s: Legacy/CQL
Labels: None
Severity: Normal
Impacts: Clients

Description

When a node goes down, the other nodes learn of it through gossip.

I do see the corresponding log line, emitted by markDead in Gossiper.java:

private void markDead(InetAddress addr, EndpointState localState)
{
    if (logger.isTraceEnabled())
        logger.trace("marking as down {}", addr);
    localState.markDead();
    liveEndpoints.remove(addr);
    unreachableEndpoints.put(addr, System.nanoTime());
    logger.info("InetAddress {} is now DOWN", addr);
    for (IEndpointStateChangeSubscriber subscriber : subscribers)
        subscriber.onDead(addr, localState);
    if (logger.isTraceEnabled())
        logger.trace("Notified " + subscribers);
}

The Cassandra system log then shows: "InetAddress 192.168.101.1 is now DOWN".

At that point, on all the other nodes, the client side (Java driver) reports "Cannot connect to any host, scheduling retry in 1000 milliseconds". The clients do eventually reconnect, but some queries fail in the meantime.

It seems to me that when the server pushes the nodeDown event, it calls getRpcAddress(endpoint) and therefore sends localhost as the address in the nodeDown event.

The relevant code is in org.apache.cassandra.transport.Server (Server.java):

public void onDown(InetAddress endpoint)
{
    server.connectionTracker.send(Event.StatusChange.nodeDown(getRpcAddress(endpoint), server.socket.getPort()));
}

getRpcAddress returns localhost for every endpoint when cassandra.yaml sets rpc_address to localhost (which, incidentally, is the default).
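
To make the suspected failure mode concrete, here is a minimal, self-contained sketch. It is not Cassandra code: the class SharedRpcAddressSketch, the helper ip, the lookup rpcAddressFor, the guard isMisleading, and the 192.168.101.2 address used for the surviving node are all hypothetical names and values made up for illustration. It only models the idea that, if every node advertises the same rpc_address, a nodeDown event about a remote node carries an address the client cannot tell apart from the node it is connected to. The guard shown is one possible check, not necessarily the fix that was committed for this issue.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.HashMap;
import java.util.Map;

public class SharedRpcAddressSketch {

    // Helper: build an IPv4 address from literal octets without any DNS lookup.
    static InetAddress ip(int a, int b, int c, int d) {
        try {
            return InetAddress.getByAddress(new byte[] {(byte) a, (byte) b, (byte) c, (byte) d});
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Hypothetical stand-in for the server-side lookup: the rpc address a node
    // advertises, falling back to the endpoint address when none is known.
    static InetAddress rpcAddressFor(InetAddress endpoint, Map<InetAddress, InetAddress> advertised) {
        InetAddress rpc = advertised.get(endpoint);
        return rpc != null ? rpc : endpoint;
    }

    // Hypothetical guard: an event about a remote endpoint is misleading when the
    // address it would carry equals the rpc address of the node sending the event.
    static boolean isMisleading(InetAddress endpoint, InetAddress localEndpoint,
                                Map<InetAddress, InetAddress> advertised) {
        if (endpoint.equals(localEndpoint)) {
            return false; // an event about ourselves is not ambiguous
        }
        return rpcAddressFor(endpoint, advertised).equals(rpcAddressFor(localEndpoint, advertised));
    }

    public static void main(String[] args) {
        InetAddress local = ip(192, 168, 101, 2);   // node that stays up (assumed address)
        InetAddress down = ip(192, 168, 101, 1);    // node reported DOWN in the log above
        InetAddress loopback = ip(127, 0, 0, 1);

        // Every node configured with rpc_address: localhost.
        Map<InetAddress, InetAddress> shared = new HashMap<>();
        shared.put(local, loopback);
        shared.put(down, loopback);

        // Every node configured with its own routable rpc_address.
        Map<InetAddress, InetAddress> distinct = new HashMap<>();
        distinct.put(local, local);
        distinct.put(down, down);

        System.out.println("address sent with shared rpc_address: " + rpcAddressFor(down, shared));
        System.out.println("misleading (shared):   " + isMisleading(down, local, shared));
        System.out.println("misleading (distinct): " + isMisleading(down, local, distinct));
    }
}
```

With the shared configuration the event about 192.168.101.1 would carry the loopback address, which is also how the client reaches the node that is still up, so a driver acting on the event could mark its only healthy connection as down.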