[CASSANDRA-3273] FailureDetector can take a very long time to mark a host down - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Normal
Resolution: Fixed
Fix Version/s: 0.8.7, 1.0.0
Component/s: None
Labels: None
Severity: Normal

Description

There are two ways to trigger this:

1. Bring a node up very briefly in a mixed-version cluster and then terminate it.
2. Bring a node up, terminate it for a very long time, then bring it back up and take it down again.

In the first case, the versioning logic can record a very short arrival interval, because it requires a reconnect and that reconnect can happen very quickly. This is easily solved by rejecting any interval below a reasonable bound, for instance the gossiper interval.
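A minimal sketch of that idea, written in Python for brevity rather than taken from the attached patch. The `ArrivalWindow` class, its method names, the window size, and the 1000 ms bound are all assumptions made for illustration; the real bound would come from the gossiper's configured interval.

```python
from collections import deque

# Assumed gossip round length in milliseconds (illustrative, not read from any config).
GOSSIP_INTERVAL_MS = 1000


class ArrivalWindow:
    """Hypothetical bounded window of heartbeat inter-arrival times, in ms."""

    def __init__(self, size=1000, min_interval_ms=GOSSIP_INTERVAL_MS):
        self.intervals = deque(maxlen=size)  # oldest samples fall off automatically
        self.min_interval_ms = min_interval_ms
        self.last_ms = None

    def add(self, now_ms):
        """Record a heartbeat arrival; return True if its interval was kept."""
        previous, self.last_ms = self.last_ms, now_ms
        if previous is None:
            return False  # first heartbeat, so there is no interval yet
        interval = now_ms - previous
        if interval < self.min_interval_ms:
            return False  # shorter than a gossip round: reject it
        self.intervals.append(interval)
        return True

    def mean(self):
        """Mean of the kept intervals, or 0.0 when the window is empty."""
        if not self.intervals:
            return 0.0
        return sum(self.intervals) / len(self.intervals)
```

With this sketch, a reconnect that arrives 5 ms after the previous heartbeat is dropped, so it cannot drag the mean toward zero. Whether the bound should be the full gossip interval or some fraction of it (to tolerate jitter) is a tuning decision this example does not settle.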

The second case is harder to solve. An extremely large interval is recorded, namely the time the node was left dead the first time. This skews the mean of the intervals, so marking the node down the second time takes much longer than it should.
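To see how much a single outlier can matter, here is a rough Python calculation. It assumes an exponential-model suspicion level of the form phi(t) = t / (mean * ln 10), a conviction threshold of 8, and a 1000-sample window of 1000 ms heartbeats; these values and the function names are assumptions chosen for illustration, not figures taken from this issue.

```python
import math

# Assumed conviction threshold (illustrative value).
PHI_CONVICT_THRESHOLD = 8


def mean_interval(intervals):
    """Arithmetic mean of the recorded inter-arrival times (ms)."""
    return sum(intervals) / len(intervals)


def time_to_convict_ms(intervals, threshold=PHI_CONVICT_THRESHOLD):
    """Silence needed before phi reaches the threshold.

    Assumes phi(t) = t / (mean * ln 10), so solving phi(t) = threshold
    gives t = threshold * ln(10) * mean.
    """
    return threshold * math.log(10) * mean_interval(intervals)


# 1000 regular heartbeats, one second apart.
healthy = [1000] * 1000
# Same window, but one sample is a one-hour outage recorded as an interval.
skewed = [1000] * 999 + [3_600_000]

for name, window in (("healthy", healthy), ("skewed", skewed)):
    print(name, mean_interval(window), round(time_to_convict_ms(window) / 1000, 1))
```

Under these assumptions the one-hour sample raises the mean from 1000 ms to 4599 ms, and the time to conviction grows by the same factor, since it is proportional to the mean.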

Attachments

3273.txt
29/Sep/11 21:04
5 kB
Brandon Williams

People

Assignee: Brandon Williams

Reporter: Brandon Williams

Authors: Brandon Williams

Reviewers: paul cannon

Votes: 0

Watchers: 0

Dates

Created: 29/Sep/11 04:21

Updated: 16/Apr/19 09:32

Resolved: 03/Oct/11 20:37