[SLING-5285] more aggressive self-check for heartbeat timeout - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: Discovery Impl 1.2.0
Fix Version/s: Discovery Impl 1.2.2
Component/s: Extensions
Labels: None

Description

SLING-5195 introduced a self-check that monitors whether the HeartbeatHandler is storing heartbeats regularly. This check exists because there are several reasons why it might not be: for example, the HeartbeatHandler could be blocked by another long-running commit happening locally, or by thread-pool exhaustion, or by something else entirely.

The check raised an alarm when the time since the last heartbeat was greater than heartbeatTimeout. This, however, is not sufficient; the comparison should be much more aggressive. It should compare against heartbeatTimeout minus 2 times heartbeatInterval to leave enough safety margin. Why 2 times: 1 time is the very minimum, because this background check only runs every heartbeatInterval, so in the worst case it could run just heartbeatInterval seconds before the timeout hits and still be too late by a fraction. The factor of 2 therefore adds a safety margin of only 1 heartbeatInterval.
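The stricter comparison can be sketched roughly as follows. This is a minimal illustration only, not the actual Discovery Impl code: the class name `HeartbeatSelfCheck`, the method names, the use of seconds as the unit, and the clamping of the threshold at zero are all assumptions made for the example.

```java
// Hypothetical sketch: names and units are illustrative and are not taken
// from the Sling Discovery Impl sources.
final class HeartbeatSelfCheck {

    private HeartbeatSelfCheck() {
    }

    /**
     * Threshold the self-check compares against:
     * heartbeatTimeout - 2 * heartbeatInterval.
     * Clamping at 0 is an assumption of this sketch, so that a
     * misconfigured timeout never yields a negative threshold.
     */
    static long alarmThresholdSeconds(long heartbeatTimeoutSeconds,
                                      long heartbeatIntervalSeconds) {
        return Math.max(0L, heartbeatTimeoutSeconds - 2L * heartbeatIntervalSeconds);
    }

    /**
     * Returns true when the time since the last stored heartbeat is
     * greater than the (more aggressive) threshold.
     */
    static boolean isHeartbeatOverdue(long secondsSinceLastHeartbeat,
                                      long heartbeatTimeoutSeconds,
                                      long heartbeatIntervalSeconds) {
        return secondsSinceLastHeartbeat
                > alarmThresholdSeconds(heartbeatTimeoutSeconds, heartbeatIntervalSeconds);
    }
}
```

With an assumed heartbeatTimeout of 120 seconds and heartbeatInterval of 30 seconds, the alarm would trigger once more than 60 seconds have passed since the last heartbeat, instead of waiting for the full 120 seconds.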

Note: this also means that heartbeatTimeout should be configured to at least 4 to 5 times heartbeatInterval.
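A configuration sanity check along these lines could look like the sketch below. The class `HeartbeatConfigCheck`, its method, and the choice of 4 (the lower end of the "4 to 5 times" recommendation) as the minimum ratio are assumptions for illustration; nothing here is part of the Discovery Impl API.

```java
// Hypothetical helper: not part of Sling Discovery Impl.
final class HeartbeatConfigCheck {

    // Assumed minimum ratio, taken from the lower end of the
    // "4 to 5 times" recommendation.
    static final long MIN_TIMEOUT_TO_INTERVAL_RATIO = 4L;

    private HeartbeatConfigCheck() {
    }

    /**
     * Returns true when heartbeatTimeout is at least 4 times
     * heartbeatInterval. A non-positive interval is treated as invalid.
     */
    static boolean hasSufficientTimeout(long heartbeatTimeoutSeconds,
                                        long heartbeatIntervalSeconds) {
        if (heartbeatIntervalSeconds <= 0) {
            return false;
        }
        return heartbeatTimeoutSeconds
                >= MIN_TIMEOUT_TO_INTERVAL_RATIO * heartbeatIntervalSeconds;
    }
}
```

At a ratio of exactly 4, the threshold heartbeatTimeout minus 2 times heartbeatInterval works out to 2 times heartbeatInterval.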

Issue Links

is related to

SLING-5284 use dedicate thread instead of scheduler (Closed)

People

Assignee: Stefan Egli

Reporter: Stefan Egli

Votes: 0

Watchers: 1

Dates

Created: 09/Nov/15 16:13

Updated: 16/Nov/15 15:00

Resolved: 09/Nov/15 17:14