Details
Improvement
Status: Resolved
Normal
Resolution: Fixed
None
Operability
Normal
All
None
Description
Description
In one of our deployments, after a host replacement, a subset of nodes still saw the replacement node as JOINING, with a failure to gossip, even though the rest of the cluster saw it as NORMAL. This was traced to a DNS lookup failure on those nodes during an interim state, which caused an exception to be thrown, so the gossip state never transitioned.
Rather than throwing an exception, which implicitly requires operators to bounce the node, we should suppress an UnknownHostException raised while checking whether a node is replacing the same host address and ID.
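A minimal sketch of the proposed behaviour, not the actual patch. The class name `ReplacementCheck`, the `Resolver` hook, the method signature, and the choice to return `false` when the lookup fails are all assumptions made for illustration; the real check lives in the project's own code and may differ.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.UUID;

// Hypothetical helper; names and signature are illustrative only.
final class ReplacementCheck {

    /** Hypothetical lookup hook so the sketch can be exercised without real DNS. */
    interface Resolver {
        InetAddress resolve(String host) throws UnknownHostException;
    }

    /**
     * Returns true when the node being replaced resolves to the same address
     * as the endpoint and both host IDs match. A DNS failure is logged and
     * treated as "not the same host" (an assumption) instead of propagating,
     * so the caller can keep processing gossip state.
     */
    static boolean isReplacingSameHostAddressAndId(String replaceAddress,
                                                   InetAddress endpoint,
                                                   UUID localHostId,
                                                   UUID endpointHostId,
                                                   Resolver resolver) {
        try {
            InetAddress replaced = resolver.resolve(replaceAddress);
            return replaced.equals(endpoint) && localHostId.equals(endpointHostId);
        } catch (UnknownHostException e) {
            // Suppress the failure rather than aborting the state transition.
            System.err.println("Could not resolve " + replaceAddress + ": " + e.getMessage());
            return false;
        }
    }
}
```

In production code the resolver would typically be `InetAddress::getByName`. Whether a failed lookup should count as "not the same host" or be retried later is a design choice that this sketch does not settle.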