  1. Jackrabbit Oak
  2. OAK-10281

Introduce recoveryDelay to ClusterNodeInfo.isRecoveryNeeded

Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.62.0
    • Component/s: documentmk
    • Labels: None

    Description

      Oak instances periodically update their leases to signal to their peers in the cluster that they are still alive. A lease that has timed out is therefore taken as an indication that the corresponding Oak instance has crashed (without releasing the lease). It is also assumed that the crashed Oak instance performs no further write operations after the lease timeout: had it still been alive, it would have updated its lease, which it did not.

      As already reported elsewhere (e.g. OAK-10254), there is a case where writes do happen after the lease timeout (aka "late writes"): a writing thread could get past the lease check, a stop-the-world pause (e.g. a long JVM GC) could then halt the thread for longer than the lease timeout (e.g. 2 min), and upon resuming that thread could still send its write operation to the DocumentStore.
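
The late-write window described above can be sketched with plain timestamps. This is an illustration only, not Oak code: the class `LateWriteSketch`, the method `isLateWrite`, and all millisecond values are hypothetical names and numbers chosen for the example.

```java
public class LateWriteSketch {

    /**
     * Hypothetical helper (not part of Oak): returns true if a write that
     * passed the lease check at checkTimeMillis, and was then paused for
     * pauseMillis, reaches the store at or after the lease end.
     */
    static boolean isLateWrite(long leaseEndMillis, long checkTimeMillis, long pauseMillis) {
        // The thread only proceeds if the lease was still valid at check time.
        boolean passedLeaseCheck = checkTimeMillis < leaseEndMillis;
        // The write is sent only after the pause (e.g. a stop-the-world GC) ends.
        long writeTimeMillis = checkTimeMillis + pauseMillis;
        return passedLeaseCheck && writeTimeMillis >= leaseEndMillis;
    }

    public static void main(String[] args) {
        long leaseEnd = 120_000L; // assumed: lease valid for 2 min, starting at t=0

        // Short pause: the write lands well within the lease.
        System.out.println(isLateWrite(leaseEnd, 10_000L, 50L));

        // Pause longer than the lease timeout: the write lands after the lease ended.
        System.out.println(isLateWrite(leaseEnd, 10_000L, 180_000L));
    }
}
```

The second call models the problematic case: the lease check succeeded, yet the write arrives after peers may already consider the instance crashed.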

      One way to mitigate this late-write risk is to delay the recovery, i.e. to wait for some time (e.g. 10 min) after a lease failure before doing the LastRevRecovery. That includes putting the state of the clusterNode back to inactive.

      This ticket is about introducing such a recoveryDelay config parameter.
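
A minimal sketch of how such a delay might enter the recovery decision, assuming the check compares the current time against the lease end plus the configured delay. This is not the actual `ClusterNodeInfo.isRecoveryNeeded` implementation; the class `RecoveryDelaySketch`, its method signature, and the parameter names are hypothetical, and the clusterNode state handling is not modeled.

```java
public class RecoveryDelaySketch {

    /**
     * Hypothetical check (not Oak's real signature): recovery is only
     * considered needed once the lease has been expired for at least
     * recoveryDelayMillis. A delay of 0 gives the undelayed behaviour.
     */
    static boolean isRecoveryNeeded(long leaseEndMillis, long nowMillis, long recoveryDelayMillis) {
        return nowMillis >= leaseEndMillis + recoveryDelayMillis;
    }

    public static void main(String[] args) {
        long leaseEnd = 120_000L;      // assumed: lease ended at t = 2 min
        long recoveryDelay = 600_000L; // assumed: 10 min delay, as in the example above

        // 1 min after the lease ended: still inside the delay window.
        System.out.println(isRecoveryNeeded(leaseEnd, 180_000L, recoveryDelay));

        // 10 min after the lease ended: the delay has elapsed.
        System.out.println(isRecoveryNeeded(leaseEnd, 720_000L, recoveryDelay));

        // Without a delay, recovery would start right at lease expiry.
        System.out.println(isRecoveryNeeded(leaseEnd, 120_000L, 0L));
    }
}
```

With a delay in place, a write that was stalled for less than the delay still reaches the DocumentStore before recovery begins, which is the mitigation the ticket describes.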

            People

              Assignee: Stefan Egli (stefanegli)
              Reporter: Stefan Egli (stefanegli)