  Lucene - Core / LUCENE-10638

PrimaryNode close waits for replicas to close, but there is no guarantee they ever will

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 9.2
    • Fix Version/s: None
    • Component/s: modules/replicator
    • Labels: None
    • Lucene Fields: New

    Description

      We run Lucene Replicator to replicate a single primary to many replicas. In production, we have experienced downtime due to PrimaryNode.close never finishing.

      For some reason we have not yet identified (possibly incorrect exception handling, a hung replica, or a reference counting bug), the primary's CopyState ref count never reaches 0, so close hangs forever. We should of course fix the underlying bug that prevents CopyState from being released correctly, but in the meantime it is quite harmful for PrimaryNode to block on a condition that may never be met. Operational problems can cause the same hang even without a bug, for example a replica that hangs forever.

      PrimaryNode.close should offer a way to avoid this situation. One possibility is a timeout: give replicas a configurable amount of time to close cleanly, and otherwise proceed with closing anyway.
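
      To make the timeout idea concrete, here is a minimal, self-contained sketch of a bounded wait on a reference count. It is not Lucene code and not a proposed patch: the class BoundedCloseSketch and its methods acquire, release and awaitAllReleased are hypothetical names, and the counter only stands in for whatever tracks outstanding CopyState references in the primary.

```java
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical stand-in for the primary's tracking of outstanding copies.
 * This is an illustration only; it does not use or mirror Lucene's PrimaryNode.
 */
final class BoundedCloseSketch {
  // Number of copies handed out to replicas and not yet released; guarded by "this".
  private int pendingCopies;

  /** Called when a replica starts copying (assumed hook). */
  synchronized void acquire() {
    pendingCopies++;
  }

  /** Called when a replica finishes or fails (assumed hook). */
  synchronized void release() {
    pendingCopies--;
    notifyAll(); // wake up a closer that may be waiting
  }

  /**
   * Waits until every copy has been released or the timeout elapses.
   *
   * @return true if all copies were released in time, false if the caller
   *     should go ahead and close anyway (timeout or interruption)
   */
  synchronized boolean awaitAllReleased(long timeout, TimeUnit unit) {
    final long deadline = System.nanoTime() + unit.toNanos(timeout);
    while (pendingCopies > 0) {
      long remainingNanos = deadline - System.nanoTime();
      if (remainingNanos <= 0) {
        return false; // replicas did not finish in time
      }
      try {
        // wait(0) would block forever, so always wait at least 1 ms.
        wait(Math.max(1L, TimeUnit.NANOSECONDS.toMillis(remainingNanos)));
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        return false;
      }
    }
    return true;
  }
}
```

      With something like this, close could call awaitAllReleased with the configured timeout, log a warning when it returns false, and then release its resources regardless. A timeout of zero would amount to the "close immediately" behavior.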

      In our case, all replicas must already handle failures of the primary (e.g. a crash), so closing immediately is no more harmful than the situations we must handle anyway. One could argue that replicas must generally expect that a primary can disappear at any time for any reason, in which case waiting for replicas to close may be unnecessary in the first place.

      If we can build consensus around the right approach for a fix, and committers don't have time to implement it themselves, I am happy to assemble a PR.

          People

            Assignee: Unassigned
            Reporter: Steven Schlansker (stevenschlansker)
            Votes: 0
            Watchers: 1
