Lucene - Core
LUCENE-6214

IW deadlocks if commit and reopen happen concurrently while an exception is hit

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 5.0
    • Fix Version/s: 4.10.4, 5.0, 5.1, 6.0
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

      Description

      I just hit this while working on an elasticsearch test using a Lucene 5.1 snapshot (5.1.0-snapshot-1654549). The test throws random exceptions via MockDirWrapper and deadlocks; jstack says:

      Found one Java-level deadlock:
      =============================
      "elasticsearch[node_2][refresh][T#2]":
        waiting to lock monitor 0x00007fe51314c098 (object 0x00000007018ee8d8, a java.lang.Object),
        which is held by "elasticsearch[node_2][generic][T#1]"
      "elasticsearch[node_2][generic][T#1]":
        waiting to lock monitor 0x00007fe512d74b68 (object 0x00000007018ee8e8, a java.lang.Object),
        which is held by "elasticsearch[node_2][refresh][T#2]"
      
      Java stack information for the threads listed above:
      ===================================================
      "elasticsearch[node_2][refresh][T#2]":
      	at org.apache.lucene.index.IndexWriter.tragicEvent(IndexWriter.java:4441)
      	- waiting to lock <0x00000007018ee8d8> (a java.lang.Object)
      	at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:436)
      	- locked <0x00000007018ee8e8> (a java.lang.Object)
      	at org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:281)
      	at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:256)
      	at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:246)
      	at org.apache.lucene.index.FilterDirectoryReader.doOpenIfChanged(FilterDirectoryReader.java:104)
      	at org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:123)
      	at org.apache.lucene.search.SearcherManager.refreshIfNeeded(SearcherManager.java:137)
      	at org.apache.lucene.search.SearcherManager.refreshIfNeeded(SearcherManager.java:58)
      	at org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:176)
      	at org.apache.lucene.search.ReferenceManager.maybeRefreshBlocking(ReferenceManager.java:253)
      	at org.elasticsearch.index.engine.internal.InternalEngine.refresh(InternalEngine.java:703)
      	at org.elasticsearch.index.shard.IndexShard.refresh(IndexShard.java:500)
      	at org.elasticsearch.index.shard.IndexShard$EngineRefresher$1.run(IndexShard.java:954)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
      	at java.lang.Thread.run(Thread.java:745)
      "elasticsearch[node_2][generic][T#1]":
      	at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:2730)
      	- waiting to lock <0x00000007018ee8e8> (a java.lang.Object)
      	- locked <0x00000007018ee8d8> (a java.lang.Object)
      	at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:2888)
      	- locked <0x00000007018ee8d8> (a java.lang.Object)
      	at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:2855)
      	at org.elasticsearch.index.engine.internal.InternalEngine.commitIndexWriter(InternalEngine.java:722)
      	at org.elasticsearch.index.engine.internal.InternalEngine.flush(InternalEngine.java:800)
      	at org.elasticsearch.index.engine.internal.InternalEngine$RecoveryCounter.endRecovery(InternalEngine.java:1520)
      	at org.elasticsearch.index.engine.internal.InternalEngine$RecoveryCounter.close(InternalEngine.java:1533)
      	at org.elasticsearch.common.lease.Releasables.close(Releasables.java:45)
      	at org.elasticsearch.common.lease.Releasables.closeWhileHandlingException(Releasables.java:70)
      	at org.elasticsearch.common.lease.Releasables.closeWhileHandlingException(Releasables.java:75)
      	at org.elasticsearch.index.engine.internal.InternalEngine.recover(InternalEngine.java:1048)
      	at org.elasticsearch.index.shard.IndexShard.recover(IndexShard.java:635)
      	at org.elasticsearch.indices.recovery.RecoverySource.recover(RecoverySource.java:120)
      	at org.elasticsearch.indices.recovery.RecoverySource.access$200(RecoverySource.java:48)
      	at org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:141)
      	at org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:127)
      	at org.elasticsearch.transport.local.LocalTransport$2.doRun(LocalTransport.java:287)
      	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
      	at java.lang.Thread.run(Thread.java:745)
      
      Found 1 deadlock.
      
      
      Attachments

      1. LUCENE-6214.patch (10 kB) - Michael McCandless
      2. LUCENE-6214.patch (3 kB) - Michael McCandless

        Activity

        Michael McCandless added a comment -

        Here's a simple test that seems to reproduce the deadlock.

        The deadlock happens because when IW.getReader hits a tragedy, it's still holding the fullFlushLock when it calls tragicEvent.

        deleteAll does so as well.

        I'll try to fix next...

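        To make the lock inversion concrete, here is a minimal standalone sketch. It is not IndexWriter code; the fields fullFlushLock and commitLock are hypothetical stand-ins for the two monitors visible in the jstack output above. One thread takes the flush lock and then needs the commit-side monitor during its tragedy handling, while the other thread takes the same two monitors in the opposite order, which is exactly the cycle jstack reports.

        // Hypothetical sketch of the inverted lock order, not Lucene source.
        public class TragicDeadlockSketch {
            private final Object fullFlushLock = new Object(); // held across a full flush (getReader, prepareCommit)
            private final Object commitLock = new Object();    // held across commit and tragic-event handling

            // Reopen path: getReader() holds fullFlushLock while flushing; if the
            // flush hits a tragic exception, the tragedy handler then needs commitLock.
            void reopenWithTragedy() {
                synchronized (fullFlushLock) {
                    // ... flush fails with an unrecoverable (tragic) exception ...
                    synchronized (commitLock) {
                        // roll back and close the writer
                    }
                }
            }

            // Commit path: commitInternal() holds commitLock first, then
            // prepareCommitInternal() needs fullFlushLock -- the opposite order.
            void commit() {
                synchronized (commitLock) {
                    synchronized (fullFlushLock) {
                        // flush segments and prepare the commit point
                    }
                }
            }

            public static void main(String[] args) {
                TragicDeadlockSketch w = new TragicDeadlockSketch();
                new Thread(w::reopenWithTragedy).start(); // refresh thread
                new Thread(w::commit).start();            // commit thread: can deadlock with the first
            }
        }
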
        Michael McCandless added a comment -

        Patch w/ test + fix. I reorganized the hairy try/finally clauses in those two methods so that when tragedy strikes, we are not holding the fullFlushLock.

        The new test also uncovered a missing ensureOpen in ReaderPool.get, which was allowing some NRT segment readers to open in one thread even once IndexWriter was closed by tragedy in another. It's possible this new test will uncover more such cases once Jenkins is running it...

        I also added an assert in DW.finishFullFlush to confirm the calling thread is in fact holding the full flush lock.

        I think it's ready: I ran distributed beasting of the Lucene tests for a few iterations and no new failures were uncovered...

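        The general shape of such a fix, as a hedged sketch (illustrative only, not the actual patch; flushAndOpenReader and onTragicEvent are hypothetical helpers): remember the tragic exception while fullFlushLock is held, finish the flush, and only act on the tragedy after the synchronized block has been exited. The assert mentioned above is shown inside the finally block.

        // Illustrative sketch of handling a tragic exception outside the flush lock.
        class TragicHandlingSketch {
            private final Object fullFlushLock = new Object();

            Object getReaderSketch() {
                Throwable tragedy = null;
                Object reader = null;
                synchronized (fullFlushLock) {
                    try {
                        reader = flushAndOpenReader();     // hypothetical helper: flush + open NRT reader
                    } catch (Throwable t) {
                        tragedy = t;                       // remember it, but don't handle it while locked
                    } finally {
                        // analogous to the new assert in DW.finishFullFlush:
                        // the caller must still hold the full flush lock here
                        assert Thread.holdsLock(fullFlushLock);
                    }
                }
                if (tragedy != null) {
                    onTragicEvent(tragedy);                // safe: fullFlushLock is no longer held
                }
                return reader;
            }

            private Object flushAndOpenReader() { return new Object(); }   // stand-in
            private void onTragicEvent(Throwable t) { /* roll back, close the writer */ }
        }
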
        Simon Willnauer added a comment -

        LGTM - I was also chasing a pending-file issue with that reopen trace, which would be explained by the missing ensureOpen!! Good stuff.

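        The missing ensureOpen both comments refer to is, in shape, a guard like the following standalone sketch (not the Lucene ReaderPool source; the class and method names here are illustrative): once the writer has been closed, e.g. by a tragic event in another thread, the pool refuses to hand out readers instead of silently opening new ones.

        // Illustrative sketch of an ensureOpen-style guard on a reader pool.
        class ReaderPoolSketch {
            private volatile boolean closed;               // set when tragedy closes the writer

            void close() {
                closed = true;
            }

            private void ensureOpen() {
                if (closed) {
                    // Lucene throws AlreadyClosedException in this situation
                    throw new IllegalStateException("this IndexWriter is closed");
                }
            }

            synchronized Object get(String segmentName) {
                ensureOpen();                              // the kind of check the fix adds
                return new Object();                       // stand-in for a pooled segment reader
            }
        }
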
        Michael McCandless added a comment -

        I think we should fix this for 5.0.0.

        Robert Muir added a comment -

        Can you please fix this in the test:

        try {
          dir.close();
        } catch (RuntimeException re) {
          // Ignore "cannot close because of open file handles" because tragedy can leave open file handles
          throw re;
        }
        

        I don't think it's correct; a leak in this case would be a bug. That's the whole point of closing: to release resources. I applied your patch, replaced it with just dir.close(), and beasted the test 200 times with no issues.

        Michael McCandless added a comment -

        Woops, yes, I'll definitely remove that ... it was a holdover from when I was "struggling". Thanks for beasting.

        ASF subversion and git services added a comment -

        Commit 1656361 from Michael McCandless in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1656361 ]

        LUCENE-6214: don't deadlock when tragedy strikes during getReader and another thread is committing

        ASF subversion and git services added a comment -

        Commit 1656362 from Michael McCandless in branch 'dev/branches/lucene_solr_5_0'
        [ https://svn.apache.org/r1656362 ]

        LUCENE-6214: don't deadlock when tragedy strikes during getReader and another thread is committing

        ASF subversion and git services added a comment -

        Commit 1656366 from Michael McCandless in branch 'dev/trunk'
        [ https://svn.apache.org/r1656366 ]

        LUCENE-6214: don't deadlock when tragedy strikes during getReader and another thread is committing

        Anshum Gupta added a comment -

        Bulk close after 5.0 release.

        Michael McCandless added a comment -

        Reopen for 4.10.4

        ASF subversion and git services added a comment -

        Commit 1662195 from Michael McCandless in branch 'dev/branches/lucene_solr_4_10'
        [ https://svn.apache.org/r1662195 ]

        LUCENE-6214: don't deadlock when tragedy strikes during getReader and another thread is committing

        Michael McCandless added a comment -

        Bulk close for 4.10.4 release


          People

          • Assignee: Michael McCandless
          • Reporter: Simon Willnauer
          • Votes: 0
          • Watchers: 4
