Lucene - Core
  1. Lucene - Core
  2. LUCENE-5987

Make indexwriter a mere mortal when exceptions strike

    Details

    • Type: Task Task
    • Status: Open
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 5.0, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      IndexWriter's exception handling is overly complicated. Every method in general reads like this:

      try {
        try {
          try { 
           ...
           // lock order: COMPLICATED
           synchronized(this or that) {
           }
           ...
         } finally {
           if (!success5) {
             deleter.deleteThisFileOrThat();
           }
          ...
        }
      }
      

      Part of the problem is it acts like its an invincible superhero, e.g. can take a disk full on merge or flush to the face and just keep on trucking, and you can somehow fix the root cause and then just go about making commits on the same instance.

      But we have a hard enough time ensuring exceptions dont do the wrong thing (e.g. cause corruption), and I don't think we really test this crazy behavior anywhere: e.g. making commits AFTER hitting disk full and so on.

      It would probably be simpler if when such things happen, IW just considered them "tragic" just like OOM and rolled itself back, instead of doing all kinds of really scary stuff to try to "keep itself healthy" (like the little dance it plays with IFD in mergeMiddle manually deleting CFS files).

      Besides, without something like a WAL, Indexwriter isn't really fit to be a superhero anyway: it can't prevent you from losing data in such situations. It just doesn't have the right tools for the job.

      edit: just to be clear I am referring to abort (low level exception during flush) and exceptions during merge. For simple non-aborting cases like analyzer errors, of course we can deal with this. We already made great progress on turning a lot of BS exceptions that would cause aborts into non-aborting ones recently.

      1. LUCENE-5987.patch
        64 kB
        Michael McCandless
      2. LUCENE-5987.patch
        43 kB
        Robert Muir
      3. LUCENE-5987.patch
        43 kB
        Michael McCandless

        Activity

        Hide
        Michael McCandless added a comment -

        +1, let's not be heroic here: IW's exc handling is terrifying.

        Show
        Michael McCandless added a comment - +1, let's not be heroic here: IW's exc handling is terrifying.
        Hide
        Michael McCandless added a comment -

        I'll try to tackle this (make abort a tragedy): it's a ridiculous situation, today, that IW can throw an exception and it turns out that exception silently just deleted tons of previously indexed documents and when you close/commit the IW, they are gone.

        Show
        Michael McCandless added a comment - I'll try to tackle this (make abort a tragedy): it's a ridiculous situation, today, that IW can throw an exception and it turns out that exception silently just deleted tons of previously indexed documents and when you close/commit the IW, they are gone.
        Hide
        Michael McCandless added a comment -

        Patch. I added new AbortIndexWriterException; it is package private,
        and is only used to communicate internally in IndexWriter (other
        approaches I tried kept leading to deadlock). I removed the
        "andReset" from DocumentsWriterPerThread.checkAndResetHasAborted: I
        think it's dangerous to ever set hasAborted back to false. It's like
        unplugging your carbon monoxide detector because it's making too much
        noise.

        This is an important behavior change.

        First off, an aborting exception, which is an exception that strikes
        at a "bad" time (e.g. when appending to stored fields or term vector
        files) such that the entire segment is now unusable, will now
        forcefully close the IndexWriter. This is so you know you lost any
        uncommitted data.

        Second off, when this happens, I don't think IndexWriter should try to
        be so crazy about deleting every last unref'd file when disaster
        strikes: if your house is burning down, you don't worry about washing
        the dirty dishes.

        This change made a number of tests angry (IW suddenly closing, and
        also leaving unref'd files), and I did distributed beasting to try to
        ferret them out, but I expect we'll have a long tail of Jenkins
        failures after committing this.

        Show
        Michael McCandless added a comment - Patch. I added new AbortIndexWriterException; it is package private, and is only used to communicate internally in IndexWriter (other approaches I tried kept leading to deadlock). I removed the "andReset" from DocumentsWriterPerThread.checkAndResetHasAborted: I think it's dangerous to ever set hasAborted back to false. It's like unplugging your carbon monoxide detector because it's making too much noise. This is an important behavior change. First off, an aborting exception, which is an exception that strikes at a "bad" time (e.g. when appending to stored fields or term vector files) such that the entire segment is now unusable, will now forcefully close the IndexWriter. This is so you know you lost any uncommitted data. Second off, when this happens, I don't think IndexWriter should try to be so crazy about deleting every last unref'd file when disaster strikes: if your house is burning down, you don't worry about washing the dirty dishes. This change made a number of tests angry (IW suddenly closing, and also leaving unref'd files), and I did distributed beasting to try to ferret them out, but I expect we'll have a long tail of Jenkins failures after committing this.
        Hide
        Robert Muir added a comment -

        awesome!

        Can we shorten the name of the exception? I don't like AbortIndexWriterException for an internal one, i think its too complicated. Can it either be AbortingException or IndexWriter.AbortingException? Can we use try-multi-catch instead of java 1.6-style? I will take a deeper look after coffee.

        Show
        Robert Muir added a comment - awesome! Can we shorten the name of the exception? I don't like AbortIndexWriterException for an internal one, i think its too complicated. Can it either be AbortingException or IndexWriter.AbortingException? Can we use try-multi-catch instead of java 1.6-style? I will take a deeper look after coffee.
        Hide
        Robert Muir added a comment -

        Here is patch with my suggestions. I didn't like having two tragicEvent methods on IndexWriter, to me thats confusing. so the single one just unboxes AbortingException and there is one codepath to worry about there, and for each case.

        I am still confused about the logic in various places in DocumentsWriter (I didnt try to tackle this here). It seems wierd we are both catching exception/handlign stuff in finally block but then "asking" the dwpt if it has aborted. Is there a cleaner way?

        I am happy AbortingException is a checked one, its a good use here. Its fine if it nests in IndexWriter.java to be clear too. But its already a terminology used in the codebase so i think its the right name.

        Show
        Robert Muir added a comment - Here is patch with my suggestions. I didn't like having two tragicEvent methods on IndexWriter, to me thats confusing. so the single one just unboxes AbortingException and there is one codepath to worry about there, and for each case. I am still confused about the logic in various places in DocumentsWriter (I didnt try to tackle this here). It seems wierd we are both catching exception/handlign stuff in finally block but then "asking" the dwpt if it has aborted. Is there a cleaner way? I am happy AbortingException is a checked one, its a good use here. Its fine if it nests in IndexWriter.java to be clear too. But its already a terminology used in the codebase so i think its the right name.
        Hide
        Robert Muir added a comment -

        What is happening with TestIndexWriterAbort? I only added that recently to test we were doing the right thing
        so that abort() could be removed from the codec API (LUCENE-6082).

        But with this patch, the unreferenced files check is disabled (and should be?), so the patch doesnt offer anything
        that TestIndexWriterExceptions2 isn't already checking.

        What is happening with TestIndexWriterDelete.testErrorInDocsWriterAdd(), it still causes aborting exceptions
        and checks that unreferenced files are removed. Why is it passing?

        Show
        Robert Muir added a comment - What is happening with TestIndexWriterAbort? I only added that recently to test we were doing the right thing so that abort() could be removed from the codec API ( LUCENE-6082 ). But with this patch, the unreferenced files check is disabled (and should be?), so the patch doesnt offer anything that TestIndexWriterExceptions2 isn't already checking. What is happening with TestIndexWriterDelete.testErrorInDocsWriterAdd(), it still causes aborting exceptions and checks that unreferenced files are removed. Why is it passing?
        Hide
        Michael McCandless added a comment -

        New patch, iterating from Rob's last patch.

        I cleaned up the exc handling somewhat, and fixed the places that "want to abort" to simply throw AbortingException instead of settings confusing booleans which are checked in finally clauses up the stack.

        Tests seem to be passing under some distributed beasting ...

        Show
        Michael McCandless added a comment - New patch, iterating from Rob's last patch. I cleaned up the exc handling somewhat, and fixed the places that "want to abort" to simply throw AbortingException instead of settings confusing booleans which are checked in finally clauses up the stack. Tests seem to be passing under some distributed beasting ...
        Hide
        Robert Muir added a comment -

        Overall looks good. I have not nitpicked, but we could stare at it for days and maybe not see bugs. I would rather us get it in jenkins sooner and iterate on adding more testing around this to really harden indexwriter for good, and not feel so scared in the future when trying to do cleanups.

        What is IgnoreAlreadyClosedExceptionConcurrentMergeScheduler? This is a little awkward and I can't figure out why we need it, or if we could do something else better instead.

        Show
        Robert Muir added a comment - Overall looks good. I have not nitpicked, but we could stare at it for days and maybe not see bugs. I would rather us get it in jenkins sooner and iterate on adding more testing around this to really harden indexwriter for good, and not feel so scared in the future when trying to do cleanups. What is IgnoreAlreadyClosedExceptionConcurrentMergeScheduler? This is a little awkward and I can't figure out why we need it, or if we could do something else better instead.
        Hide
        Michael McCandless added a comment -

        What is IgnoreAlreadyClosedExceptionConcurrentMergeScheduler?

        It's for tests that use CMS and also abort the IW, since CMS can hit ACE in such cases and it's "ok".

        We will likely need to use it in more tests...

        Show
        Michael McCandless added a comment - What is IgnoreAlreadyClosedExceptionConcurrentMergeScheduler? It's for tests that use CMS and also abort the IW, since CMS can hit ACE in such cases and it's "ok". We will likely need to use it in more tests...
        Hide
        ASF subversion and git services added a comment -

        Commit 1643432 from Michael McCandless in branch 'dev/trunk'
        [ https://svn.apache.org/r1643432 ]

        LUCENE-5987: IndexWriter forcefully closes itself on hitting aborting exception

        Show
        ASF subversion and git services added a comment - Commit 1643432 from Michael McCandless in branch 'dev/trunk' [ https://svn.apache.org/r1643432 ] LUCENE-5987 : IndexWriter forcefully closes itself on hitting aborting exception
        Hide
        ASF subversion and git services added a comment -

        Commit 1643466 from Michael McCandless in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1643466 ]

        LUCENE-5987: IndexWriter forcefully closes itself on hitting aborting exception

        Show
        ASF subversion and git services added a comment - Commit 1643466 from Michael McCandless in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1643466 ] LUCENE-5987 : IndexWriter forcefully closes itself on hitting aborting exception
        Hide
        ASF subversion and git services added a comment -

        Commit 1643661 from Michael McCandless in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1643661 ]

        LUCENE-5987: remove dead code

        Show
        ASF subversion and git services added a comment - Commit 1643661 from Michael McCandless in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1643661 ] LUCENE-5987 : remove dead code
        Hide
        ASF subversion and git services added a comment -

        Commit 1644072 from Michael McCandless in branch 'dev/trunk'
        [ https://svn.apache.org/r1644072 ]

        LUCENE-5987: don't double-wrap AbortingException

        Show
        ASF subversion and git services added a comment - Commit 1644072 from Michael McCandless in branch 'dev/trunk' [ https://svn.apache.org/r1644072 ] LUCENE-5987 : don't double-wrap AbortingException
        Hide
        ASF subversion and git services added a comment -

        Commit 1644075 from Michael McCandless in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1644075 ]

        LUCENE-5987: don't double-wrap AbortingException

        Show
        ASF subversion and git services added a comment - Commit 1644075 from Michael McCandless in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1644075 ] LUCENE-5987 : don't double-wrap AbortingException

          People

          • Assignee:
            Michael McCandless
            Reporter:
            Robert Muir
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:

              Development