Lucene - Core / LUCENE-9337

CMS might fail to pick up pending merges when maxMergeCount changes while merges are running

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: master (9.0), 8.6
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      We found a test hanging in IW#forceMerge on Elastic's CI, in an innocent-looking test:

      14:52:06    [junit4]   2>         at java.base@11.0.2/java.lang.Object.wait(Native Method)
      14:52:06    [junit4]   2>         at app//org.apache.lucene.index.IndexWriter.doWait(IndexWriter.java:4722)
      14:52:06    [junit4]   2>         at app//org.apache.lucene.index.IndexWriter.forceMerge(IndexWriter.java:2034)
      14:52:06    [junit4]   2>         at app//org.apache.lucene.index.IndexWriter.forceMerge(IndexWriter.java:1960)
      14:52:06    [junit4]   2>         at app//org.apache.lucene.index.RandomIndexWriter.forceMerge(RandomIndexWriter.java:500)
      14:52:06    [junit4]   2>         at app//org.apache.lucene.index.BaseDocValuesFormatTestCase.doTestNumericsVsStoredFields(BaseDocValuesFormatTestCase.java:1301)
      14:52:06    [junit4]   2>         at app//org.apache.lucene.index.BaseDocValuesFormatTestCase.doTestNumericsVsStoredFields(BaseDocValuesFormatTestCase.java:1258)
      14:52:06    [junit4]   2>         at app//org.apache.lucene.index.BaseDocValuesFormatTestCase.testZeroOrMin(BaseDocValuesFormatTestCase.java:2423)
      

      After spending quite some time trying to reproduce this without any luck, I reviewed all the code involved again to understand possible threading issues. What I found is that if maxMergeCount is changed on CMS while merges are running, and a forceMerge is kicked off at the same time the running merges return, we might fail to pick up the final pending merges, which causes the forceMerge to hang. I was able to build a test case that is very likely to fail on every run without the fix. While I don't think this is a critical bug in terms of how likely it is to happen in practice, if it does happen it's basically a deadlock unless the IW sees some other change that kicks off a merge.

      Let me walk through the issue. Let's say we have 1 pending merge and 2 merge threads running on CMS. The forceMerge is already waiting for merges to finish. Once the first merge thread finishes, we check whether we need to stall it, but since it's a merge thread we return immediately and don't pick up another merge.
      Now the second running merge thread checks the same condition while the first one is still finishing up. The first thread releases the CMS lock before it has actually updated the internal data structures, so the calculation in the stall method of how many threads are running is off. This causes the second thread to also step out of the maybeStall method without picking up the pending merge.
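
The interleaving described above can be replayed with a small, single-threaded toy model. This is only a sketch under stated assumptions: the class, fields, and methods below are hypothetical stand-ins and not the real ConcurrentMergeScheduler code, and the starting values (maxMergeCount of 2, two running merge threads, one pending merge) are picked purely for illustration. The second ordering, where each thread's check and deregistration happen as one unit, is included only to show that the outcome depends on the ordering; it is not meant to describe the committed fix.

```java
/** Toy model of the missed-pickup interleaving; NOT the real CMS implementation. */
class MergePickupRaceSketch {
    // Hypothetical stand-ins for scheduler state (assumed values, see simulate()).
    static int maxMergeCount;
    static int runningMergeThreads;
    static int pendingMerges;

    /** A finishing merge thread checks whether it may start another merge. */
    static void checkForMoreWork() {
        if (pendingMerges > 0 && runningMergeThreads >= maxMergeCount) {
            // Too many merges appear to be running. A merge thread is never
            // stalled, so it simply returns without picking up the pending merge.
            return;
        }
        if (pendingMerges > 0) {
            pendingMerges--;        // pick up the pending merge
            runningMergeThreads++;  // a new merge thread now runs it
        }
    }

    /** The finishing merge thread removes itself from the running set. */
    static void deregister() {
        runningMergeThreads--;
    }

    /** Replays one ordering and returns how many merges are left pending. */
    static int simulate(boolean checkAndDeregisterTogether) {
        maxMergeCount = 2;
        runningMergeThreads = 2;
        pendingMerges = 1;
        if (checkAndDeregisterTogether) {
            checkForMoreWork(); deregister(); // thread A, as one unit
            checkForMoreWork(); deregister(); // thread B sees the updated count
        } else {
            checkForMoreWork();               // thread A: sees 2 running, returns
            checkForMoreWork();               // thread B: A not yet deregistered, returns
            deregister();                     // thread A
            deregister();                     // thread B
        }
        return pendingMerges;
    }

    public static void main(String[] args) {
        System.out.println("racy ordering, merges left pending: " + simulate(false));
        System.out.println("check+deregister together, merges left pending: " + simulate(true));
    }
}
```

In the racy ordering the model ends with one merge still pending and no merge thread left to run it, which corresponds to the state in which forceMerge keeps waiting.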

            People

            • Assignee:
              Simon Willnauer (simonw)
              Reporter:
              Simon Willnauer (simonw)
            • Votes:
              0
              Watchers:
              3

                Time Tracking

                • Original Estimate: Not Specified
                • Remaining Estimate: 0h
                • Time Spent: 50m