  1. Sling
  2. SLING-5285

more aggressive self-check for heartbeat timeout


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: Discovery Impl 1.2.0
    • Fix Version/s: Discovery Impl 1.2.2
    • Component/s: Extensions
    • Labels: None

    Description

      SLING-5195 introduced a self-check that monitors whether the HeartbeatHandler is storing heartbeats regularly. This is necessary because there are several reasons why it might not be, e.g. the HeartbeatHandler could be blocked by another long-running commit happening locally, by thread-pool exhaustion, or by something else entirely.

      The check raised an alarm when the time since the last heartbeat exceeded heartbeatTimeout. That is not sufficient; the comparison should be much more aggressive. It should compare against heartbeatTimeout minus 2 times heartbeatInterval to leave enough safety margin. Subtracting 1 heartbeatInterval is the very minimum: the background check only runs once every heartbeatInterval, so in the worst case it could run just about one heartbeatInterval before the timeout hits and still be too late by a fraction. Subtracting 2 therefore adds a safety margin of only 1 heartbeatInterval.
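
      The comparison described above can be sketched roughly as follows. This is a minimal illustration, not the actual Discovery Impl code: the class name HeartbeatSelfCheck, its method names, and the use of millisecond timestamps are all assumptions made for the example.

```java
/** Hypothetical sketch of the more aggressive self-check; not the Discovery Impl code. */
final class HeartbeatSelfCheck {
    private final long heartbeatIntervalMillis;
    private final long heartbeatTimeoutMillis;

    HeartbeatSelfCheck(long heartbeatIntervalMillis, long heartbeatTimeoutMillis) {
        this.heartbeatIntervalMillis = heartbeatIntervalMillis;
        this.heartbeatTimeoutMillis = heartbeatTimeoutMillis;
    }

    /** Threshold proposed in this issue: heartbeatTimeout minus 2 times heartbeatInterval. */
    long alarmThresholdMillis() {
        return heartbeatTimeoutMillis - 2 * heartbeatIntervalMillis;
    }

    /** Returns true when the time since the last stored heartbeat exceeds the threshold. */
    boolean isAlarm(long lastHeartbeatMillis, long nowMillis) {
        long sinceLastHeartbeat = nowMillis - lastHeartbeatMillis;
        return sinceLastHeartbeat > alarmThresholdMillis();
    }
}
```

      With illustrative values of heartbeatInterval = 30s and heartbeatTimeout = 120s, this sketch alarms once more than 60s have passed since the last heartbeat, whereas a comparison against the plain heartbeatTimeout would only alarm after 120s.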

      Note: this also means that heartbeatTimeout should be configured to at least 4-5 times the heartbeatInterval.
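
      A configuration sanity check along these lines might look like the sketch below. The class HeartbeatConfigCheck, its method names, and the choice of 4 as the lower bound are assumptions for illustration only; the issue does not say whether Discovery Impl enforces such a rule.

```java
/** Hypothetical configuration check; names and the ratio of 4 are assumptions. */
final class HeartbeatConfigCheck {
    /** Lower end of the "at least 4-5 times" guideline from this issue. */
    static final long MIN_RATIO = 4;

    /** True when heartbeatTimeout is at least MIN_RATIO times heartbeatInterval. */
    static boolean isTimeoutLargeEnough(long heartbeatIntervalSeconds, long heartbeatTimeoutSeconds) {
        return heartbeatIntervalSeconds > 0
                && heartbeatTimeoutSeconds >= MIN_RATIO * heartbeatIntervalSeconds;
    }

    /** Whole heartbeat intervals that fit before heartbeatTimeout - 2 * heartbeatInterval is reached. */
    static long intervalsBeforeAlarm(long heartbeatIntervalSeconds, long heartbeatTimeoutSeconds) {
        long thresholdSeconds = heartbeatTimeoutSeconds - 2 * heartbeatIntervalSeconds;
        return thresholdSeconds / heartbeatIntervalSeconds;
    }
}
```

      One plausible reading of the guideline: with heartbeatTimeout at 4 times heartbeatInterval, the alarm threshold sits at 2 intervals, so a single late heartbeat does not immediately trip the check. With a ratio of only 2, the threshold would be 0 and the check would alarm constantly.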


            People

              Assignee: Stefan Egli (stefanegli)
              Reporter: Stefan Egli (stefanegli)
