[CASSANDRA-13993] Add optional startup delay to wait until peers are ready - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Low
Resolution: Fixed
Fix Version/s: 4.0-alpha1, 4.0
Component/s: Local/Startup and Shutdown
Labels: None

Description

When bouncing a node in a large cluster, it can take a while for the node to recognize the rest of the cluster as available. This is especially true when using TLS on internode messaging connections. Because connecting to the rest of the cluster is asynchronous from the rest of the startup process, the bouncing node (and any clients connected to it) may see a series of Unavailable or Timeout exceptions until the node is 'warmed up'.

There are two aspects that drive a node's ability to successfully communicate with a peer after a bounce:

- Marking the peer as 'alive' (state that is held in gossip). This affects Unavailable exceptions.
- Having both outbound and inbound connections open and ready to each peer. This affects Timeout exceptions.

Details of each of these mechanisms are described in the comments below.

This ticket proposes adding an optional, configurable mechanism to delay opening the client native protocol port until some percentage of the peers in the cluster is marked alive and has open connections in both directions. Thus, while we potentially slow down startup (by delaying the opening of the client port), we reduce the chance that queries made by clients hit transient Unavailable/Timeout exceptions.
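
The proposed wait can be illustrated with a minimal sketch. This is not the patch from this ticket; the class `StartupPeerWait`, the `PeerState` holder, and the parameters `targetPercent`, `timeoutMillis`, and `pollMillis` are all hypothetical names chosen for illustration. It assumes the caller can supply a snapshot of per-peer state (alive in gossip, outbound connection open, inbound connection open) and polls until the target percentage of peers is ready or a timeout expires.

```java
import java.util.Collection;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical sketch: block startup until enough peers are alive and connected.
final class StartupPeerWait {

    // Hypothetical per-peer view combining the two aspects described above.
    static final class PeerState {
        final boolean alive;        // marked alive in gossip
        final boolean outboundOpen; // outbound connection established
        final boolean inboundOpen;  // inbound connection established

        PeerState(boolean alive, boolean outboundOpen, boolean inboundOpen) {
            this.alive = alive;
            this.outboundOpen = outboundOpen;
            this.inboundOpen = inboundOpen;
        }

        boolean ready() {
            return alive && outboundOpen && inboundOpen;
        }
    }

    // True when at least targetPercent of the peers are ready.
    static boolean thresholdMet(Collection<PeerState> peers, int targetPercent) {
        if (peers.isEmpty()) {
            return true; // no peers to wait for
        }
        long ready = peers.stream().filter(PeerState::ready).count();
        return ready * 100 >= (long) targetPercent * peers.size();
    }

    // Polls the snapshot until the threshold is met or the timeout expires.
    // Returns true if the threshold was met, false on timeout or interruption.
    static boolean awaitPeers(Supplier<Collection<PeerState>> snapshot,
                              int targetPercent,
                              long timeoutMillis,
                              long pollMillis) {
        if (targetPercent <= 0) {
            return true; // assumption: a non-positive target disables the wait
        }
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (true) {
            if (thresholdMet(snapshot.get(), targetPercent)) {
                return true;
            }
            long remainingNanos = deadline - System.nanoTime();
            if (remainingNanos <= 0) {
                return false;
            }
            try {
                Thread.sleep(Math.min(pollMillis, TimeUnit.NANOSECONDS.toMillis(remainingNanos) + 1));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

In this sketch the caller would open the client native protocol port once `awaitPeers` returns, whether or not the threshold was met, so that a timeout bounds the startup delay rather than blocking the node indefinitely.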

Issue Links

is related to
CASSANDRA-18968 StartupClusterConnectivityChecker fails on upgrade from 3.X (Resolved)
CASSANDRA-14001 Gossip after node restart can take a long time to converge about "down" nodes in large clusters (Resolved)

relates to
CASSANDRA-14297 Startup checker should wait for count rather than percentage (Resolved)

People

Assignee: Jason Brown

Reporter: Jason Brown

Authors: Jason Brown

Reviewers: Ariel Weisberg

Votes: 0

Watchers: 14

Dates

Created: 05/Nov/17 14:15

Updated: 27/Oct/23 16:47

Resolved: 26/Feb/18 14:48