Lucene - Core
  1. Lucene - Core
  2. LUCENE-4245

IndexWriter#close(true) should either not be interruptible or should abort background merge threads before returning

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0-BETA, 6.0
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Currently, there is no safe way to close a directory after closing the writer without causing an exception in a merge thread if the #close call is interrupted.

      1. LUCENE-4245.patch
        5 kB
        Uwe Schindler
      2. LUCENE-4245.patch
        3 kB
        Uwe Schindler
      3. LUCENE-4245.patch
        4 kB
        Uwe Schindler
      4. LUCENE-4245.patch
        2 kB
        Uwe Schindler
      5. LUCENE-4245.patch
        1 kB
        Uwe Schindler

        Activity

        Hide
        Mark Miller added a comment -

        Another option is another param to force actually waiting even in the face of an interrupt? Really though - I think if we are going to respond to the interrupt in close, we should handle it in a way that assures background threads are interrupted first.

        Show
        Mark Miller added a comment - Another option is another param to force actually waiting even in the face of an interrupt? Really though - I think if we are going to respond to the interrupt in close, we should handle it in a way that assures background threads are interrupted first.
        Hide
        Uwe Schindler added a comment -

        Here my idea of a patch, there is still a crazy test missing enforcing an interrupt to the merge thread:

        • It changes the finishMerges() in closeInternal to a loop that repeats until the call succeeded and did not throw ThreadInterruptedException.
        • After the first one was interrupted, all later invocations will pass waitForMerges = false, so CMS shuts down ASAP (but could still be interrupted, so we have to do this in loop).
        • After loop has finished, we rethrow the first ThreadInterruptedException & close the merge scheduler for safety.
        Show
        Uwe Schindler added a comment - Here my idea of a patch, there is still a crazy test missing enforcing an interrupt to the merge thread: It changes the finishMerges() in closeInternal to a loop that repeats until the call succeeded and did not throw ThreadInterruptedException. After the first one was interrupted, all later invocations will pass waitForMerges = false, so CMS shuts down ASAP (but could still be interrupted, so we have to do this in loop). After loop has finished, we rethrow the first ThreadInterruptedException & close the merge scheduler for safety.
        Hide
        Uwe Schindler added a comment -

        Unfortunately more methods in CMS can throw ThreadInterruptedException, so handle all cases and try hardest to clean up all running merge threads on close.

        This example shows why it is better to let InterruptedException be handled as checked Exception. With the original InterruptedException declared everywhere and not converted to a RuntimeException, the need for cleanup in close() [which is not allowed to throw InterrupotedException because Closeable interface] is obvious and we are enforced to do it.

        We should rethink about InterruptedException!

        Show
        Uwe Schindler added a comment - Unfortunately more methods in CMS can throw ThreadInterruptedException, so handle all cases and try hardest to clean up all running merge threads on close. This example shows why it is better to let InterruptedException be handled as checked Exception. With the original InterruptedException declared everywhere and not converted to a RuntimeException, the need for cleanup in close() [which is not allowed to throw InterrupotedException because Closeable interface] is obvious and we are enforced to do it. We should rethink about InterruptedException!
        Hide
        Michael McCandless added a comment -

        If IndexWriter.close throws ThreadInterruptedException (because "you" had interrupted it) then you should consider the IndexWriter not closed, and either IW.rollback it, or IW.close it again?

        Show
        Michael McCandless added a comment - If IndexWriter.close throws ThreadInterruptedException (because "you" had interrupted it) then you should consider the IndexWriter not closed, and either IW.rollback it, or IW.close it again?
        Hide
        Mark Miller added a comment -

        You don't always have control over these interrupts - ie its not 'you' that interrupts it - in my case it's jetty on shutdown. Most interrupts to Lucene are probably a result of an external agent like a web container.

        I don't think close should be interruptible - but waiting for merge threads to finish perhaps should be - you should still get a closed IW though. You don't know where you interrupted it and having to call close or rollback again is just confusing. It would be hard to even know if you did manage to interrupt it - the interrupt has to come from another thread - so you don't know if it happened before the close, after the close, etc. You don't know the result of the close or the interrupt or whether you have to make another call to close or rollback.

        In my opinion, close itself should not be an interruptible call - the interrupt should just be about canceling a wait on background merge threads at most (but they should still be stopped first)

        Show
        Mark Miller added a comment - You don't always have control over these interrupts - ie its not 'you' that interrupts it - in my case it's jetty on shutdown. Most interrupts to Lucene are probably a result of an external agent like a web container. I don't think close should be interruptible - but waiting for merge threads to finish perhaps should be - you should still get a closed IW though. You don't know where you interrupted it and having to call close or rollback again is just confusing. It would be hard to even know if you did manage to interrupt it - the interrupt has to come from another thread - so you don't know if it happened before the close, after the close, etc. You don't know the result of the close or the interrupt or whether you have to make another call to close or rollback. In my opinion, close itself should not be an interruptible call - the interrupt should just be about canceling a wait on background merge threads at most (but they should still be stopped first)
        Hide
        Uwe Schindler added a comment -

        I think my patch does exactly that. If you call close() and one interrupts it, you still get a closed reader (and the Exception).

        Another fix in Solr code would be do make the part that calls IndexWriter.close() be a while loop as long as a ThreadInterruptedException is catched.

        Show
        Uwe Schindler added a comment - I think my patch does exactly that. If you call close() and one interrupts it, you still get a closed reader (and the Exception). Another fix in Solr code would be do make the part that calls IndexWriter.close() be a while loop as long as a ThreadInterruptedException is catched.
        Hide
        Mark Miller added a comment -

        that calls IndexWriter.close() be a while loop as long as a ThreadInterruptedException is catched.

        I could do that, but I think it means close is kind of 'trappy' - at least it should be a checked exception so people know they should always loop on close in case of an interrupt...though that seems odd to me.

        Why would you interrupt close? Surely not so you can keep using the Writer? Surely not because it takes to long?

        The only thing I can see interrupting is waiting for the background merge threads. Interrupts generally happen when something is trying to shutdown. Wouldn't you only interrupt if close was taking too long? Your still going to close your directory right after. It's not like you are going to interrupt a close so you can continue using the Writer? That seems like a crazy use case. At most, it would seem to mean, close faster if you can - not, abort, I still want to use you! ABORT ABORT! I've got a document to add!!

        Show
        Mark Miller added a comment - that calls IndexWriter.close() be a while loop as long as a ThreadInterruptedException is catched. I could do that, but I think it means close is kind of 'trappy' - at least it should be a checked exception so people know they should always loop on close in case of an interrupt...though that seems odd to me. Why would you interrupt close? Surely not so you can keep using the Writer? Surely not because it takes to long? The only thing I can see interrupting is waiting for the background merge threads. Interrupts generally happen when something is trying to shutdown. Wouldn't you only interrupt if close was taking too long? Your still going to close your directory right after. It's not like you are going to interrupt a close so you can continue using the Writer? That seems like a crazy use case. At most, it would seem to mean, close faster if you can - not, abort, I still want to use you! ABORT ABORT! I've got a document to add!!
        Hide
        Robert Muir added a comment -

        The trap is that close() does slow things at all

        close() should be equivalent to rollback()

        Show
        Robert Muir added a comment - The trap is that close() does slow things at all close() should be equivalent to rollback()
        Hide
        Robert Muir added a comment -

        I'm gonna spin off an issue for this. Its so insanely ridiculous that close() commits, waits for merges, or does any of what it does.

        Now is a great time to fix it: people have to change their code to upgrade to 4.0.

        Show
        Robert Muir added a comment - I'm gonna spin off an issue for this. Its so insanely ridiculous that close() commits, waits for merges, or does any of what it does. Now is a great time to fix it: people have to change their code to upgrade to 4.0.
        Hide
        Shai Erera added a comment -

        I'm gonna spin off an issue for this. Its so insanely ridiculous that close() commits, waits for merges, or does any of what it does.

        Why is that different from OutputStream.close() calling flush()? Would you like to call flush() yourself everytime before you close()?

        Show
        Shai Erera added a comment - I'm gonna spin off an issue for this. Its so insanely ridiculous that close() commits, waits for merges, or does any of what it does. Why is that different from OutputStream.close() calling flush()? Would you like to call flush() yourself everytime before you close()?
        Hide
        Uwe Schindler added a comment -

        What is the problem with the attached patch?

        Why is that different from OutputStream.close() calling flush()? Would you like to call flush() yourself everytime before you close()?

        I agree with Shai, I dont see any reason to change this logic. If we remove the wait for merge/commit shit in IW.close(), we still have to wait for the MergeScheduler to shutdown. As this one is in another thread, you have to join()/wait() so its always interruptible. Thats the main problem with CMS, shutting it down is hard.

        My attached patch does three steps to shutdown, if we remove the commit, it would just be one step less, but the very last step mergeScheduler.close() [see patch] would stay alive, and this one unfortunately can be interrupted (if it is CMS). Can we do 2 steps? First commit something like my patch and later think about committing/flushing/... or not?

        Show
        Uwe Schindler added a comment - What is the problem with the attached patch? Why is that different from OutputStream.close() calling flush()? Would you like to call flush() yourself everytime before you close()? I agree with Shai, I dont see any reason to change this logic. If we remove the wait for merge/commit shit in IW.close(), we still have to wait for the MergeScheduler to shutdown. As this one is in another thread, you have to join()/wait() so its always interruptible. Thats the main problem with CMS, shutting it down is hard. My attached patch does three steps to shutdown, if we remove the commit, it would just be one step less, but the very last step mergeScheduler.close() [see patch] would stay alive, and this one unfortunately can be interrupted (if it is CMS). Can we do 2 steps? First commit something like my patch and later think about committing/flushing/... or not?
        Hide
        Michael McCandless added a comment -

        I don't think close should be interruptible

        I don't think that's really an option: close first flushes which eg
        can hit interrupt in the IO (which causes IW to drop that one
        segment).

        but waiting for merge threads to finish perhaps should be

        I'm not sure: I think it's a little too much magic to interpret an
        interrupt to mean abort running merges. I think the app should turn
        around and call close(false), call rollback(), call commit() then
        rollback(), etc.: we don't know the "intention" of the interrupt.

        Show
        Michael McCandless added a comment - I don't think close should be interruptible I don't think that's really an option: close first flushes which eg can hit interrupt in the IO (which causes IW to drop that one segment). but waiting for merge threads to finish perhaps should be I'm not sure: I think it's a little too much magic to interpret an interrupt to mean abort running merges. I think the app should turn around and call close(false), call rollback(), call commit() then rollback(), etc.: we don't know the "intention" of the interrupt.
        Hide
        Robert Muir added a comment -

        This is so broken for something transactional to have a close like this.

        Show
        Robert Muir added a comment - This is so broken for something transactional to have a close like this.
        Hide
        Mark Miller added a comment -

        What is the problem with the attached patch?

        Nothing in my opinion - +1 from me.

        I think Robert's issue is another JIRA that will have it's own discussion.

        But I guess Mike is not on board with the current patch?

        I guess I don't see it as magic. An interrupt in java means please stop what you are doing. It's up to the app to determine what to stop and what not to stop and how long to take. I don't think that's interpreting an interrupt - it just is what it is. It's loose as goose.

        When I say that close should not be interruptible, I don't mean that interrupts should not kill IO - I mean that the IW should still close and not leave threads running. I mean that once you call IW#close that should be the end game. It should then be safe to call Dir#close.

        If that means waiting for all merge threads to finish even though you have been interrupted - that is fine. It's up to the app how to respond to an interrupt. I'd prefer merges where aborted - usually if you are going to listen to an interrupt at all, you try to do it in a timely manner.

        Show
        Mark Miller added a comment - What is the problem with the attached patch? Nothing in my opinion - +1 from me. I think Robert's issue is another JIRA that will have it's own discussion. But I guess Mike is not on board with the current patch? I guess I don't see it as magic. An interrupt in java means please stop what you are doing. It's up to the app to determine what to stop and what not to stop and how long to take. I don't think that's interpreting an interrupt - it just is what it is. It's loose as goose. When I say that close should not be interruptible, I don't mean that interrupts should not kill IO - I mean that the IW should still close and not leave threads running. I mean that once you call IW#close that should be the end game. It should then be safe to call Dir#close. If that means waiting for all merge threads to finish even though you have been interrupted - that is fine. It's up to the app how to respond to an interrupt. I'd prefer merges where aborted - usually if you are going to listen to an interrupt at all, you try to do it in a timely manner.
        Hide
        Uwe Schindler added a comment - - edited

        If that means waiting for all merge threads to finish even though you have been interrupted - that is fine. It's up to the app how to respond to an interrupt. I'd prefer merges where aborted - usually if you are going to listen to an interrupt at all, you try to do it in a timely manner.

        That's what the patch is doing. If you interrupt close(), it will recover the wait() and try a second time on the mergeScheduler with no longer waiting for merges to finish. That's all it does. But as waiting in the mergeScheduler.close() call contains Thread.join() [for CMS] its also interruptible, so this call must be repeated until no exceptions occur anymore. It is the same like closeSafely, it also tries to close everything ignoring Exceptions.

        Show
        Uwe Schindler added a comment - - edited If that means waiting for all merge threads to finish even though you have been interrupted - that is fine. It's up to the app how to respond to an interrupt. I'd prefer merges where aborted - usually if you are going to listen to an interrupt at all, you try to do it in a timely manner. That's what the patch is doing. If you interrupt close(), it will recover the wait() and try a second time on the mergeScheduler with no longer waiting for merges to finish. That's all it does. But as waiting in the mergeScheduler.close() call contains Thread.join() [for CMS] its also interruptible, so this call must be repeated until no exceptions occur anymore. It is the same like closeSafely, it also tries to close everything ignoring Exceptions.
        Hide
        Uwe Schindler added a comment -

        Much more easier patch:

        • In preparation for Roberts work on LUCENE-4246, I made MergeScheduler and its close() implement Closeable, so we can later also close the scheduler using IOUtils.
        • MergeScheduler, especially CMS's close() are not interruptible (which makes no sense). Close waits until all threads are dead. The fix was easy, just ignore InterruptedException in close() and loop.
        • The IndexWriter changes can be removed later once Robert finishes. Waiting for pending merges and all that stuff is done in this patch, but easy removeable. It just makes those call uninteruptible, too by falling back to be as quick as possible when interrupted. The MergeScheduler.close() call at end should be moved to a final finally block using IOUtils (later in LUCENE-4246).

        I think this is ready to commit. I will think about a good test.

        Show
        Uwe Schindler added a comment - Much more easier patch: In preparation for Roberts work on LUCENE-4246 , I made MergeScheduler and its close() implement Closeable, so we can later also close the scheduler using IOUtils. MergeScheduler, especially CMS's close() are not interruptible (which makes no sense). Close waits until all threads are dead. The fix was easy, just ignore InterruptedException in close() and loop. The IndexWriter changes can be removed later once Robert finishes. Waiting for pending merges and all that stuff is done in this patch, but easy removeable. It just makes those call uninteruptible, too by falling back to be as quick as possible when interrupted. The MergeScheduler.close() call at end should be moved to a final finally block using IOUtils (later in LUCENE-4246 ). I think this is ready to commit. I will think about a good test.
        Hide
        Uwe Schindler added a comment -

        Improved patch, much simplier. close() is no longer interruptible (throws no exception ThreadInterruptedException), if interrupted it just cancels unfinished merges and finishes earlier.

        I applied this patch and now on my slow-IO windows system no Solr tests are failing anymore, except ZK ones, which happens seldem here. With unpatched trunk, I get no passing test run at all. It seems that on slow IO systems, Jetty more often interrupts IndexWriter while closing.

        I will commit this if nobody objects tomorrow morning.

        Show
        Uwe Schindler added a comment - Improved patch, much simplier. close() is no longer interruptible (throws no exception ThreadInterruptedException), if interrupted it just cancels unfinished merges and finishes earlier. I applied this patch and now on my slow-IO windows system no Solr tests are failing anymore, except ZK ones, which happens seldem here. With unpatched trunk, I get no passing test run at all. It seems that on slow IO systems, Jetty more often interrupts IndexWriter while closing. I will commit this if nobody objects tomorrow morning.
        Hide
        Uwe Schindler added a comment -

        Patch that preserves Thread interrupt status and handles thread interrupts in non-interruptible methods according to Sun/Oracle handbook.

        This problem exists in more places at Lucene also test failures caused by this.

        There is a reason why InterruptedException is checked. We convert it to unchecked by wrapping with ThreadInterruptedException. But thats the wrong way to do. InterruptedException is checked, because on interrupting such a method that explicitely declares InterruptedException, you have to take care, otherwoise you hide interrupts. Simon already has a patch to cleanup (not yet committed, also addressing this issue).

        If you have a methods that is defined to not be interruptible, like Closeable#close() [interface dictates!], you are not allowe d to interrupt. The correct way to handle this is to save interrupt status on entering method, then call the interruptible things like wait(), join() but repeat them on a new interrupt (so make then non-interruptible). And finally reset the interrupt status to the previous one including new interrupts (see examples in patch).

        If we would not have hidden the InterruptedException in IndexWriter's close() - he issue we have here would not be there.

        One example also violating the close()-not-interruptible is e.g. NRTManagerReopenThread. If you call close() and interrupt that, the thread leaks because it no longer waits for it to finish. We have similar leaks in TestDocumentsWriterStallControl. This should be fixed by nuking the RuntimeExSubclass and handle all interrupts correctly (it's easy, you have to know how).

        I will open issue once this issue is solved.

        Show
        Uwe Schindler added a comment - Patch that preserves Thread interrupt status and handles thread interrupts in non-interruptible methods according to Sun/Oracle handbook. This problem exists in more places at Lucene also test failures caused by this. There is a reason why InterruptedException is checked . We convert it to unchecked by wrapping with ThreadInterruptedException. But thats the wrong way to do. InterruptedException is checked, because on interrupting such a method that explicitely declares InterruptedException, you have to take care, otherwoise you hide interrupts. Simon already has a patch to cleanup (not yet committed, also addressing this issue). If you have a methods that is defined to not be interruptible, like Closeable#close() [interface dictates!] , you are not allowe d to interrupt. The correct way to handle this is to save interrupt status on entering method, then call the interruptible things like wait(), join() but repeat them on a new interrupt (so make then non-interruptible). And finally reset the interrupt status to the previous one including new interrupts (see examples in patch). If we would not have hidden the InterruptedException in IndexWriter's close() - he issue we have here would not be there. One example also violating the close()-not-interruptible is e.g. NRTManagerReopenThread. If you call close() and interrupt that, the thread leaks because it no longer waits for it to finish. We have similar leaks in TestDocumentsWriterStallControl. This should be fixed by nuking the RuntimeExSubclass and handle all interrupts correctly (it's easy, you have to know how). I will open issue once this issue is solved.
        Hide
        Mark Miller added a comment -

        Great - thank you!

        Show
        Mark Miller added a comment - Great - thank you!
        Hide
        Uwe Schindler added a comment -

        Committed trunk revision: 1364896
        Committed 4.x revision: 1364898

        Show
        Uwe Schindler added a comment - Committed trunk revision: 1364896 Committed 4.x revision: 1364898
        Hide
        Uwe Schindler added a comment -

        I committed an additional fix to gracefully handle failed flushes by shutting down CMS (trunk r1364903, 4.x r1364904)

        Show
        Uwe Schindler added a comment - I committed an additional fix to gracefully handle failed flushes by shutting down CMS (trunk r1364903, 4.x r1364904)
        Hide
        Uwe Schindler added a comment -

        I committed another improvement to use IOUtils in a finally block to really cleanup the mergePolicy and mergeScheduler, no matter what happens: trunk r1364931, 4x r1364932

        Show
        Uwe Schindler added a comment - I committed another improvement to use IOUtils in a finally block to really cleanup the mergePolicy and mergeScheduler, no matter what happens: trunk r1364931, 4x r1364932
        Hide
        Michael McCandless added a comment -

        It still makes me nervous having suprising side effects (killing
        merges) from Thread.interrupt(). It reminds me how NIOFSDir and
        MMapDir close the file if they are interrupted during IO.

        But I guess we already have an unavoidable surprising side effect (if
        interrupt arrives during flush then that whole segment is discarded),
        and I guess it makes sense to kill all merges on interrupt since the
        intention is most likely to abort any unecessary things for close.

        So I'm OK with what was committed. Separately the cleanups are great!
        Thanks Uwe.

        Show
        Michael McCandless added a comment - It still makes me nervous having suprising side effects (killing merges) from Thread.interrupt(). It reminds me how NIOFSDir and MMapDir close the file if they are interrupted during IO. But I guess we already have an unavoidable surprising side effect (if interrupt arrives during flush then that whole segment is discarded), and I guess it makes sense to kill all merges on interrupt since the intention is most likely to abort any unecessary things for close. So I'm OK with what was committed. Separately the cleanups are great! Thanks Uwe.

          People

          • Assignee:
            Uwe Schindler
            Reporter:
            Mark Miller
          • Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development