[CASSANDRA-3294] a node whose TCP connection is not up should be considered down for the purpose of reads and writes - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Improvement
Status: Resolved
Priority: Normal
Resolution: Duplicate
Fix Version/s: None
Component/s: None
Labels:
None

Description

Cassandra fails to handle the most simple of cases intelligently - a process gets killed and the TCP connection dies. I cannot see a good reason to wait for a bunch of RPC timeouts and thousands of hung requests to realize that we shouldn't be sending messages to a node when the only possible means of communication is confirmed down. This is why one has to "disablegossip and wait for a while" to restar a node on a busy cluster (especially without ~~CASSANDRA-2540~~ but that only helps under certain circumstances).

A more generalized approach where by one e.g. weights in the number of currently outstanding RPC requests to a node, would likely take care of this case as well. But until such a thing exists and works well, it seems prudent to have the very common and controlled form of "failure" be handled better.

Are there difficulties I'm not seeing?

I can see that one may want to distinguish between considering something "really down" (and e.g. fail a repair because it's down) from what I'm talking about, so maybe there are different concepts (say one is "currently unreachable" rather than "down") being conflated. But in the specific case of sending reads/writes to a node we know we cannot talk to, it seems unnecessarily detrimental.

Attachments

Activity

People

Assignee:: Peter Schuller

Reporter:: Peter Schuller

Authors:: Peter Schuller

Votes:: 0 Vote for this issue

Watchers:: 6 Start watching this issue

Dates

Created:: 02/Oct/11 12:04

Updated:: 16/Apr/19 09:32

Resolved:: 09/Jul/13 16:15