Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 4.3
    • Fix Version/s: 4.4, 4.3.1, 6.0
    • Component/s: None
    • Labels:
      None
    • Environment:

      OpenJDK 64-Bit Server VM (23.7-b01 mixed mode)
      Linux Ubuntu Server 12.04 LTS 64-Bit

    • Lucene Fields:
      New

      Description

      Hi all,

      We have an obvious deadlock between a "MaybeRefreshIndexJob" thread
      calling ReferenceManager.maybeRefresh(ReferenceManager.java:204) and a
      "RebuildIndexJob" thread calling
      IndexWriter.deleteAll(IndexWriter.java:2065).

      Lucene wants to flush in the "MaybeRefreshIndexJob" thread trying to intrinsically lock the IndexWriter instance at DocumentsWriterPerThread.java:563 before notifyAll()ing the flush.

      Simultaneously the "RebuildIndexJob" thread who already intrinsically locked the IndexWriter instance at IndexWriter#deleteAll wait()s at DocumentsWriterFlushControl.java:245 for the flush forever causing a deadlock.

      "MaybeRefreshIndexJob Thread - 2" daemon prio=10 tid=0x00007f8fe4006000 nid=0x1ac2 waiting for monitor entry [0x00007f8fa7bf7000]
         java.lang.Thread.State: BLOCKED (on object monitor)
      	at org.apache.lucene.index.IndexWriter.useCompoundFile(IndexWriter.java:2223)
      	- waiting to lock <0x00000000f1c00438> (a org.apache.lucene.index.IndexWriter)
      	at org.apache.lucene.index.DocumentsWriterPerThread.sealFlushedSegment(DocumentsWriterPerThread.java:563)
      	at org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:533)
      	at org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:422)
      	at org.apache.lucene.index.DocumentsWriter.flushAllThreads(DocumentsWriter.java:559)
      	at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:365)
      	- locked <0x00000000f1c007d0> (a java.lang.Object)
      	at org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:270)
      	at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:245)
      	at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:235)
      	at org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:170)
      	at org.apache.lucene.search.SearcherManager.refreshIfNeeded(SearcherManager.java:118)
      	at org.apache.lucene.search.SearcherManager.refreshIfNeeded(SearcherManager.java:58)
      	at org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:155)
      	at org.apache.lucene.search.ReferenceManager.maybeRefresh(ReferenceManager.java:204)
      	at jobs.MaybeRefreshIndexJob.timeout(MaybeRefreshIndexJob.java:47)
      
      "RebuildIndexJob Thread - 1" prio=10 tid=0x00007f903000a000 nid=0x1a38 in Object.wait() [0x00007f9037dd6000]
         java.lang.Thread.State: WAITING (on object monitor)
      	at java.lang.Object.wait(Native Method)
      	- waiting on <0x00000000f1c0c240> (a org.apache.lucene.index.DocumentsWriterFlushControl)
      	at java.lang.Object.wait(Object.java:503)
      	at org.apache.lucene.index.DocumentsWriterFlushControl.waitForFlush(DocumentsWriterFlushControl.java:245)
      	- locked <0x00000000f1c0c240> (a org.apache.lucene.index.DocumentsWriterFlushControl)
      	at org.apache.lucene.index.DocumentsWriter.abort(DocumentsWriter.java:235)
      	- locked <0x00000000f1c05370> (a org.apache.lucene.index.DocumentsWriter)
      	at org.apache.lucene.index.IndexWriter.deleteAll(IndexWriter.java:2065)
      	- locked <0x00000000f1c00438> (a org.apache.lucene.index.IndexWriter)
      	at jobs.RebuildIndexJob.buildIndex(RebuildIndexJob.java:102)
      
      1. LUCENE-5002_test.patch
        3 kB
        Simon Willnauer
      2. LUCENE-5002.patch
        13 kB
        Simon Willnauer

        Activity

        Hide
        Sergiusz Urbaniak added a comment -

        Implementation note: we (obviously) use the same IndexWriter instance across all threads.

        Show
        Sergiusz Urbaniak added a comment - Implementation note: we (obviously) use the same IndexWriter instance across all threads.
        Hide
        Simon Willnauer added a comment -

        here is a patch that has a test that hangs. Pretty straight forward though. Yet, the problem is that we are locking on the index writer in DWPT. Or on the other hand there are too many synch blocks in IW to make it safe to call into IW from DWPT.

        I need to look into that more closely to figure out how to fix that.

        Show
        Simon Willnauer added a comment - here is a patch that has a test that hangs. Pretty straight forward though. Yet, the problem is that we are locking on the index writer in DWPT. Or on the other hand there are too many synch blocks in IW to make it safe to call into IW from DWPT. I need to look into that more closely to figure out how to fix that.
        Hide
        Uwe Schindler added a comment -

        Yet, the problem is that we are locking on the index writer in DWPT.

        My personal horror scenario! The worst thing you can do is to also externally synchronize on IW. This also causes deadlocks. We should maybe open an issue to fix the synchronization in IW and make it simplier, especially with using ju.concurrent.Lock implementations.

        Show
        Uwe Schindler added a comment - Yet, the problem is that we are locking on the index writer in DWPT. My personal horror scenario! The worst thing you can do is to also externally synchronize on IW. This also causes deadlocks. We should maybe open an issue to fix the synchronization in IW and make it simplier, especially with using ju.concurrent.Lock implementations.
        Hide
        Sergiusz Urbaniak added a comment -

        Hi,

        Thanks for the quick feedback! As long as the sync issues on IW are unresolved we declare IW instances as not thread-safe for our development and synchronize access to it externally (of course as mentioned in the docs not on the IW instance itself).

        Show
        Sergiusz Urbaniak added a comment - Hi, Thanks for the quick feedback! As long as the sync issues on IW are unresolved we declare IW instances as not thread-safe for our development and synchronize access to it externally (of course as mentioned in the docs not on the IW instance itself).
        Hide
        Michael McCandless added a comment -

        I think we should address this for 4.3.1?

        Show
        Michael McCandless added a comment - I think we should address this for 4.3.1?
        Hide
        Simon Willnauer added a comment -

        Ok so I tried to make this work for an entire day and bottom line is that once I move the DocumentsWriter#abort() out of the sync block my test still fails all over the place. Yet, it's not hanging but concurrent access to IW while IW#deleteAll() is called is entirely broken IMO. I don't even know where to start, here is a small wrapup of the failures I saw:

        • asserts are tripped in global field map since we clear and concurrently index (remember indexing is non-blocking)
        • concurrent commits fail with fiel not found exception (even if we fully sync) seems like some state in IW is not cleared
        • updatePendingMerges fails with FNF when merges are updated concurrently.

        To begin with I doubt that the semantics of IW#deleteAll() are correct today if you are accessing the IW concurrently. I mean we basically dropping everything and don't maintain any happens before relationship here at all, delete all files that are not referenced in any seg info wipe all the global field infos etc. We should address this properly.

        I agree that we have to fix this until 4.3.1!

        Yet, Serguiuz do you see any FileNotFoundExceptions or anything when you concurrently call deleteAll? I mean this seems entirely broken to me at this point. I suggest you to use deleteQuery(new MatchAllDocsQuery()) for now and not lock globally.

        simon

        Show
        Simon Willnauer added a comment - Ok so I tried to make this work for an entire day and bottom line is that once I move the DocumentsWriter#abort() out of the sync block my test still fails all over the place. Yet, it's not hanging but concurrent access to IW while IW#deleteAll() is called is entirely broken IMO. I don't even know where to start, here is a small wrapup of the failures I saw: asserts are tripped in global field map since we clear and concurrently index (remember indexing is non-blocking) concurrent commits fail with fiel not found exception (even if we fully sync) seems like some state in IW is not cleared updatePendingMerges fails with FNF when merges are updated concurrently. To begin with I doubt that the semantics of IW#deleteAll() are correct today if you are accessing the IW concurrently. I mean we basically dropping everything and don't maintain any happens before relationship here at all, delete all files that are not referenced in any seg info wipe all the global field infos etc. We should address this properly. I agree that we have to fix this until 4.3.1! Yet, Serguiuz do you see any FileNotFoundExceptions or anything when you concurrently call deleteAll? I mean this seems entirely broken to me at this point. I suggest you to use deleteQuery(new MatchAllDocsQuery()) for now and not lock globally. simon
        Hide
        Simon Willnauer added a comment -

        here is a patch that fixes this issue and adds some asserts that make sure we don't wait while holding the IW lock. Yet, this is a pretty drastic step in the patch since I need to stop the world for this to be an operation that works correct in a concurrent world. I think it's ok for us do to a stop the world here but we really need to beast this patch.

        Show
        Simon Willnauer added a comment - here is a patch that fixes this issue and adds some asserts that make sure we don't wait while holding the IW lock. Yet, this is a pretty drastic step in the patch since I need to stop the world for this to be an operation that works correct in a concurrent world. I think it's ok for us do to a stop the world here but we really need to beast this patch.
        Hide
        Michael McCandless added a comment -

        One (app level) workaround here is to not call deleteAll in one thread, while other threads are still indexing.

        But we still have to fix the deadlock. I think a simple fix in deleteAll is to move the sync down after docWriter.abort(), so we don't hold IW's intrinsic lock while calling docWriter.abort(). I tried this, but it leads to an AssertionError:

        Caused by: java.lang.AssertionError
        	at __randomizedtesting.SeedInfo.seed([60429B79D72112B5]:0)
        	at org.apache.lucene.index.FieldInfos$Builder.addOrUpdateInternal(FieldInfos.java:284)
        	at org.apache.lucene.index.FieldInfos$Builder.addOrUpdate(FieldInfos.java:266)
        	at org.apache.lucene.index.DocFieldProcessor.processDocument(DocFieldProcessor.java:211)
        	at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:256)
        	at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:376)
        	at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1508)
        	at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1183)
        	at org.apache.lucene.index.RandomIndexWriter.addDocument(RandomIndexWriter.java:152)
        	at org.apache.lucene.index.RandomIndexWriter.addDocument(RandomIndexWriter.java:114)
        	at org.apache.lucene.index.TestIndexWriterDelete$1.run(TestIndexWriterDelete.java:331)
        

        Separately we really need to overhaul / simplify IW/DW/BD's locking!

        Show
        Michael McCandless added a comment - One (app level) workaround here is to not call deleteAll in one thread, while other threads are still indexing. But we still have to fix the deadlock. I think a simple fix in deleteAll is to move the sync down after docWriter.abort(), so we don't hold IW's intrinsic lock while calling docWriter.abort(). I tried this, but it leads to an AssertionError: Caused by: java.lang.AssertionError at __randomizedtesting.SeedInfo.seed([60429B79D72112B5]:0) at org.apache.lucene.index.FieldInfos$Builder.addOrUpdateInternal(FieldInfos.java:284) at org.apache.lucene.index.FieldInfos$Builder.addOrUpdate(FieldInfos.java:266) at org.apache.lucene.index.DocFieldProcessor.processDocument(DocFieldProcessor.java:211) at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:256) at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:376) at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1508) at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1183) at org.apache.lucene.index.RandomIndexWriter.addDocument(RandomIndexWriter.java:152) at org.apache.lucene.index.RandomIndexWriter.addDocument(RandomIndexWriter.java:114) at org.apache.lucene.index.TestIndexWriterDelete$1.run(TestIndexWriterDelete.java:331) Separately we really need to overhaul / simplify IW/DW/BD's locking!
        Hide
        Michael McCandless added a comment -

        The patch is terrifying looking yet seems necessary, and on beasting seems to work (but: we have to either not use docValues, or prevent Lucene3x codec).

        I think stop-the-world is perfectly fine here: it's not like apps are calling deleteAll 1000s of times per second.

        I even think it'd be fine if concurrent indexing threads hit strange exceptions, and we document that you should not use other methods while deleteAll is invoked in one thread, as long as we can guarantee this never leads to index corruption. This is just like you can call IW.close from one thread while other threads are still indexing but those other threads can hit strange exceptions.

        I'll open a new issue to somehow simplify IW's sync... it's a mess now.

        Show
        Michael McCandless added a comment - The patch is terrifying looking yet seems necessary, and on beasting seems to work (but: we have to either not use docValues, or prevent Lucene3x codec). I think stop-the-world is perfectly fine here: it's not like apps are calling deleteAll 1000s of times per second. I even think it'd be fine if concurrent indexing threads hit strange exceptions, and we document that you should not use other methods while deleteAll is invoked in one thread, as long as we can guarantee this never leads to index corruption. This is just like you can call IW.close from one thread while other threads are still indexing but those other threads can hit strange exceptions. I'll open a new issue to somehow simplify IW's sync... it's a mess now.
        Hide
        Michael McCandless added a comment -

        OK I opened LUCENE-5006.

        Show
        Michael McCandless added a comment - OK I opened LUCENE-5006 .
        Hide
        Sergiusz Urbaniak added a comment -

        Simon,

        No FileNotFoundExceptions what so ever. The stack trace above is "complete" except the crappy ejb stack forrest which is not relevant.

        Again thanks for the quick reaction, we'll use deleteQuery(new MatchAllDocsQuery()) instead and omit the global lock.

        Show
        Sergiusz Urbaniak added a comment - Simon, No FileNotFoundExceptions what so ever. The stack trace above is "complete" except the crappy ejb stack forrest which is not relevant. Again thanks for the quick reaction, we'll use deleteQuery(new MatchAllDocsQuery()) instead and omit the global lock.
        Hide
        Simon Willnauer added a comment -

        Mike, I want to commit this patch and let is bake in a bit on trunk and 4x, any objections? I will remove the DV use in 4x on trunk this is not a problem.

        Show
        Simon Willnauer added a comment - Mike, I want to commit this patch and let is bake in a bit on trunk and 4x, any objections? I will remove the DV use in 4x on trunk this is not a problem.
        Hide
        Michael McCandless added a comment -

        +1 to commit

        Show
        Michael McCandless added a comment - +1 to commit
        Hide
        Shalin Shekhar Mangar added a comment -

        Bulk closing after 4.3.1 release

        Show
        Shalin Shekhar Mangar added a comment - Bulk closing after 4.3.1 release

          People

          • Assignee:
            Simon Willnauer
            Reporter:
            Sergiusz Urbaniak
          • Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development