  1. Lucene - Core
  2. LUCENE-9337

CMS might fail to pick up pending merges when maxMergeCount changes while merges are running

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 9.0, 8.6
    • Component/s: None
    • Labels: None
    • Lucene Fields: New, Patch Available

    Description

      We found a test hanging in IW#forceMerge on Elastic's CI, in an innocent-looking test:

      14:52:06    [junit4]   2>         at java.base@11.0.2/java.lang.Object.wait(Native Method)
      14:52:06    [junit4]   2>         at app//org.apache.lucene.index.IndexWriter.doWait(IndexWriter.java:4722)
      14:52:06    [junit4]   2>         at app//org.apache.lucene.index.IndexWriter.forceMerge(IndexWriter.java:2034)
      14:52:06    [junit4]   2>         at app//org.apache.lucene.index.IndexWriter.forceMerge(IndexWriter.java:1960)
      14:52:06    [junit4]   2>         at app//org.apache.lucene.index.RandomIndexWriter.forceMerge(RandomIndexWriter.java:500)
      14:52:06    [junit4]   2>         at app//org.apache.lucene.index.BaseDocValuesFormatTestCase.doTestNumericsVsStoredFields(BaseDocValuesFormatTestCase.java:1301)
      14:52:06    [junit4]   2>         at app//org.apache.lucene.index.BaseDocValuesFormatTestCase.doTestNumericsVsStoredFields(BaseDocValuesFormatTestCase.java:1258)
      14:52:06    [junit4]   2>         at app//org.apache.lucene.index.BaseDocValuesFormatTestCase.testZeroOrMin(BaseDocValuesFormatTestCase.java:2423)
      

      After spending quite some time trying to reproduce this without any luck, I reviewed all the involved code again to understand possible threading issues. What I found is that if maxMergeCount is changed on CMS while merges are running, and the forceMerge is kicked off at the same time the running merges return, we might fail to pick up the final pending merges, which causes the forceMerge to hang. I was able to build a test case that is very likely to fail on every run without the fix. While I don't think this is a critical bug given how unlikely it is to happen in practice, if it does happen it's basically a deadlock unless the IW sees some other change that kicks off a merge.

      Let me walk through the issue. Let's say we have 1 pending merge and 2 merge threads running on CMS. The forceMerge is already waiting for merges to finish. Once the first merge thread finishes, we check whether we need to stall it, but since it's a merge thread we return right away and don't pick up another merge.
      Now the second running merge thread checks the same condition while the first one is still finishing up. The first thread releases the CMS lock before it can actually update the internal data structures, so the calculation in the stall method of how many threads are running is off. This causes the second thread to also step out of the maybeStall method without picking up the pending merge.
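
The interleaving described above can be illustrated with a small, single-threaded model. This is only a sketch and not Lucene's actual code: the class MergeRaceModel and its methods tryPickUpPending, deregister and simulate are hypothetical names, the pending merge is modeled as a simple counter, and maxMergeCount is assumed to be 2 after the change. It replays the two orderings of "check the stall condition" and "remove the thread from the bookkeeping" step by step.

```java
import java.util.HashSet;
import java.util.Set;

/** Simplified, single-threaded model of the interleaving; NOT Lucene's actual code. */
class MergeRaceModel {
    private final int maxMergeCount;   // assumed value after the change
    private int pendingMerges;         // merges registered but not yet picked up
    private final Set<String> mergeThreads = new HashSet<>();

    MergeRaceModel(int maxMergeCount, int pendingMerges, String... running) {
        this.maxMergeCount = maxMergeCount;
        this.pendingMerges = pendingMerges;
        for (String name : running) {
            mergeThreads.add(name);
        }
    }

    /** Models the stall check: a merge thread is never stalled, it just returns false. */
    boolean maybeStall(String caller) {
        if (pendingMerges > 0 && mergeThreads.size() >= maxMergeCount) {
            if (mergeThreads.contains(caller)) {
                return false; // step out instead of blocking a merge thread
            }
            throw new IllegalStateException("a non-merge thread would block here");
        }
        return true;
    }

    /** Step 1 of finishing a merge: look for follow-up work while still registered. */
    void tryPickUpPending(String caller) {
        if (maybeStall(caller) && pendingMerges > 0) {
            pendingMerges--; // model: the pending merge gets picked up
        }
    }

    /** Step 2 of finishing a merge: remove the thread from the bookkeeping. */
    void deregister(String caller) {
        mergeThreads.remove(caller);
    }

    /** Returns the number of pending merges left after both merge threads finished. */
    static int simulate(boolean interleaved) {
        MergeRaceModel cms = new MergeRaceModel(2, 1, "merge-1", "merge-2");
        if (interleaved) {
            // Both threads run the check before either one is deregistered.
            cms.tryPickUpPending("merge-1");
            cms.tryPickUpPending("merge-2");
            cms.deregister("merge-1");
            cms.deregister("merge-2");
        } else {
            // The first thread is fully done before the second one checks.
            cms.tryPickUpPending("merge-1");
            cms.deregister("merge-1");
            cms.tryPickUpPending("merge-2");
            cms.deregister("merge-2");
        }
        return cms.pendingMerges;
    }

    public static void main(String[] args) {
        System.out.println("serialized:  pending = " + simulate(false));
        System.out.println("interleaved: pending = " + simulate(true));
    }
}
```

In the serialized ordering the second thread sees only one registered merge thread and picks up the pending merge. In the interleaved ordering both threads see two registered threads, both step out, and the pending merge is left behind with nobody running, which in this model corresponds to the forceMerge waiting forever.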

          People

            Assignee: simonw Simon Willnauer
            Reporter: simonw Simon Willnauer

              Time Tracking

                Estimated: Not Specified
                Remaining: 0h
                Logged: 50m