Lucene - Core
  1. Lucene - Core
  2. LUCENE-5958

OOM or exceptions during checkpoint make IndexWriter have a bad day

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Blocker Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.10.1, 5.0, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      During finishCommit(), we run checkpoint after we wrote the commit to disk, but if things go wrong here (e.g. IOError when IFD deletes a pending file, OOM), then everything will go wrong (we won't even properly incref things, and may end out deleting wrong files if the user calls rollback, leaving a corrupt index).

      1. LUCENE-5958.patch
        8 kB
        Michael McCandless
      2. LUCENE-5958.patch
        29 kB
        Robert Muir

        Activity

        Hide
        Robert Muir added a comment -

        here is a 4.x seed:

        ant test -Dtestcase=TestIndexWriterOutOfMemory -Dtests.method=testBasics -Dtests.seed=EEE122F7078D3AF2 -Dtests.slow=true -Dtests.locale=ro -Dtests.timezone=Asia/Ho_Chi_Minh -Dtests.file.encoding=ISO-8859-1

        in this case, we get an exception during IFD.deleteFile (one that isnt an IOException), and it happens during deletePendingFiles...

        Show
        Robert Muir added a comment - here is a 4.x seed: ant test -Dtestcase=TestIndexWriterOutOfMemory -Dtests.method=testBasics -Dtests.seed=EEE122F7078D3AF2 -Dtests.slow=true -Dtests.locale=ro -Dtests.timezone=Asia/Ho_Chi_Minh -Dtests.file.encoding=ISO-8859-1 in this case, we get an exception during IFD.deleteFile (one that isnt an IOException), and it happens during deletePendingFiles...
        Hide
        Robert Muir added a comment -

        The thing is, the commit has already gone down, we cant throw any exception to the user here, because there is "no going back".

        The only good solution i have, is similar to 'hitOOM' (maybe we can generalize that?) to hold exceptions like this, and deliver them in the future instead. In such a case (hitOOM=true or whatever we rename it to), we should definitely never delete any files for any reason. Leave it to the next writer.

        Show
        Robert Muir added a comment - The thing is, the commit has already gone down, we cant throw any exception to the user here, because there is "no going back". The only good solution i have, is similar to 'hitOOM' (maybe we can generalize that?) to hold exceptions like this, and deliver them in the future instead. In such a case (hitOOM=true or whatever we rename it to), we should definitely never delete any files for any reason. Leave it to the next writer.
        Hide
        Robert Muir added a comment -

        I think when things go wrong like this, its ideal for the user to get them asap (e.g. next addDocument, not commit or whatever). So I think we should close the writer (like closedbyinterruptexception in java when state gets messed up), so the next ensureOpen fails.

        Instead of a hitOOM-type boolean, I think we should save any exception/error that causes this, and ensureOpen passes it to ACE as the cause. In the general case it will be null.

        Show
        Robert Muir added a comment - I think when things go wrong like this, its ideal for the user to get them asap (e.g. next addDocument, not commit or whatever). So I think we should close the writer (like closedbyinterruptexception in java when state gets messed up), so the next ensureOpen fails. Instead of a hitOOM-type boolean, I think we should save any exception/error that causes this, and ensureOpen passes it to ACE as the cause. In the general case it will be null.
        Hide
        ASF subversion and git services added a comment -

        Commit 1625538 from Robert Muir in branch 'dev/trunk'
        [ https://svn.apache.org/r1625538 ]

        LUCENE-5958: add more efficient test

        Show
        ASF subversion and git services added a comment - Commit 1625538 from Robert Muir in branch 'dev/trunk' [ https://svn.apache.org/r1625538 ] LUCENE-5958 : add more efficient test
        Hide
        ASF subversion and git services added a comment -

        Commit 1625543 from Robert Muir in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1625543 ]

        LUCENE-5958: add more efficient test

        Show
        ASF subversion and git services added a comment - Commit 1625543 from Robert Muir in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1625543 ] LUCENE-5958 : add more efficient test
        Hide
        Michael McCandless added a comment -

        +1 for a "closed by OOM" approach, and saving the exc that "caused" the unexpected closing.

        Show
        Michael McCandless added a comment - +1 for a "closed by OOM" approach, and saving the exc that "caused" the unexpected closing.
        Hide
        Robert Muir added a comment -

        Here's a first stab: when we hit 'unrecoverable exception' like this, we record the exception (instead of hitOOM), close ourselves. In such a "screwed" state IndexFileDeleter also refuses to delete any files so we wont corrupt anything.

        This is better on the user i think, because previously they had to do this OOM handling themselves, but now IW will throw ACE (with root cause of why it had to do this) on any future operation.

        Show
        Robert Muir added a comment - Here's a first stab: when we hit 'unrecoverable exception' like this, we record the exception (instead of hitOOM), close ourselves. In such a "screwed" state IndexFileDeleter also refuses to delete any files so we wont corrupt anything. This is better on the user i think, because previously they had to do this OOM handling themselves, but now IW will throw ACE (with root cause of why it had to do this) on any future operation.
        Hide
        Robert Muir added a comment -

        By the way, this patch still relays the exception in finishCOmmit (after we have actually committed!). It just prevents index corruption.

        I think its confusing to relay the exception too? So i think we should suppress it in that case... otherwise the user probably thinks their commit failed when it succeeded...

        Show
        Robert Muir added a comment - By the way, this patch still relays the exception in finishCOmmit (after we have actually committed!). It just prevents index corruption. I think its confusing to relay the exception too? So i think we should suppress it in that case... otherwise the user probably thinks their commit failed when it succeeded...
        Hide
        Michael McCandless added a comment -

        Patch looks great, I love the "really bad day" comment.

        In IFD.ensureOpen, we currently check for writer == null case, but your new writer.tragedy check doesn't check for that ... but then I don't think it's possible for writer to be null: it's final, and we pass to ctor, and IFD is only ever created by IW, and IW passes "this" to it, so I think just remove the null check?

        Show
        Michael McCandless added a comment - Patch looks great, I love the "really bad day" comment. In IFD.ensureOpen, we currently check for writer == null case, but your new writer.tragedy check doesn't check for that ... but then I don't think it's possible for writer to be null: it's final, and we pass to ctor, and IFD is only ever created by IW, and IW passes "this" to it, so I think just remove the null check?
        Hide
        Robert Muir added a comment -

        Thanks Mike. I will remove that useless null check. I'll also add a comment about why ensureOpen "isnt enough" (because we pass false when we are closing, but we dont want to delete files if we are closing because of a disaster)

        Show
        Robert Muir added a comment - Thanks Mike. I will remove that useless null check. I'll also add a comment about why ensureOpen "isnt enough" (because we pass false when we are closing, but we dont want to delete files if we are closing because of a disaster)
        Hide
        ASF subversion and git services added a comment -

        Commit 1625853 from Robert Muir in branch 'dev/trunk'
        [ https://svn.apache.org/r1625853 ]

        LUCENE-5958: OOM or exceptions during checkpoint make IndexWriter have a bad day

        Show
        ASF subversion and git services added a comment - Commit 1625853 from Robert Muir in branch 'dev/trunk' [ https://svn.apache.org/r1625853 ] LUCENE-5958 : OOM or exceptions during checkpoint make IndexWriter have a bad day
        Hide
        ASF subversion and git services added a comment -

        Commit 1625859 from Robert Muir in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1625859 ]

        LUCENE-5958: OOM or exceptions during checkpoint make IndexWriter have a bad day

        Show
        ASF subversion and git services added a comment - Commit 1625859 from Robert Muir in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1625859 ] LUCENE-5958 : OOM or exceptions during checkpoint make IndexWriter have a bad day
        Hide
        Robert Muir added a comment -

        reopen for 4.10.1

        Show
        Robert Muir added a comment - reopen for 4.10.1
        Hide
        ASF subversion and git services added a comment -

        Commit 1626189 from Robert Muir in branch 'dev/branches/lucene_solr_4_10'
        [ https://svn.apache.org/r1626189 ]

        LUCENE-5958: OOM or exceptions during checkpoint make IndexWriter have a bad day

        Show
        ASF subversion and git services added a comment - Commit 1626189 from Robert Muir in branch 'dev/branches/lucene_solr_4_10' [ https://svn.apache.org/r1626189 ] LUCENE-5958 : OOM or exceptions during checkpoint make IndexWriter have a bad day
        Hide
        ASF subversion and git services added a comment -

        Commit 1626363 from Robert Muir in branch 'dev/trunk'
        [ https://svn.apache.org/r1626363 ]

        LUCENE-5958: add logic to merge exc handling as well

        Show
        ASF subversion and git services added a comment - Commit 1626363 from Robert Muir in branch 'dev/trunk' [ https://svn.apache.org/r1626363 ] LUCENE-5958 : add logic to merge exc handling as well
        Hide
        ASF subversion and git services added a comment -

        Commit 1626366 from Robert Muir in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1626366 ]

        LUCENE-5958: add logic to merge exc handling as well

        Show
        ASF subversion and git services added a comment - Commit 1626366 from Robert Muir in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1626366 ] LUCENE-5958 : add logic to merge exc handling as well
        Hide
        ASF subversion and git services added a comment -

        Commit 1626368 from Robert Muir in branch 'dev/branches/lucene_solr_4_10'
        [ https://svn.apache.org/r1626368 ]

        LUCENE-5958: add logic to merge exc handling as well

        Show
        ASF subversion and git services added a comment - Commit 1626368 from Robert Muir in branch 'dev/branches/lucene_solr_4_10' [ https://svn.apache.org/r1626368 ] LUCENE-5958 : add logic to merge exc handling as well
        Hide
        ASF subversion and git services added a comment -

        Commit 1626750 from Robert Muir in branch 'dev/trunk'
        [ https://svn.apache.org/r1626750 ]

        LUCENE-5958: handle illegalstateexception 'checks' by IW as well

        Show
        ASF subversion and git services added a comment - Commit 1626750 from Robert Muir in branch 'dev/trunk' [ https://svn.apache.org/r1626750 ] LUCENE-5958 : handle illegalstateexception 'checks' by IW as well
        Hide
        ASF subversion and git services added a comment -

        Commit 1626751 from Robert Muir in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1626751 ]

        LUCENE-5958: handle illegalstateexception 'checks' by IW as well

        Show
        ASF subversion and git services added a comment - Commit 1626751 from Robert Muir in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1626751 ] LUCENE-5958 : handle illegalstateexception 'checks' by IW as well
        Hide
        ASF subversion and git services added a comment -

        Commit 1626752 from Robert Muir in branch 'dev/branches/lucene_solr_4_10'
        [ https://svn.apache.org/r1626752 ]

        LUCENE-5958: handle illegalstateexception 'checks' by IW as well

        Show
        ASF subversion and git services added a comment - Commit 1626752 from Robert Muir in branch 'dev/branches/lucene_solr_4_10' [ https://svn.apache.org/r1626752 ] LUCENE-5958 : handle illegalstateexception 'checks' by IW as well
        Hide
        Michael McCandless added a comment -

        The changes caused a risk of deadlock in IW, which we hit in recent nightly test failure.

        Here's a patch to fix that; the issue is we cannot be sync'd on IW when dealing with a tragedy ... there were two places in IW where we were doing that. I added an assert that we don't hold the lock, and fixed those two places.

        I tried to simplify finishCommit as much as I could, and get the infoStream logging out of the tragedy handling as much as I could ...

        Show
        Michael McCandless added a comment - The changes caused a risk of deadlock in IW, which we hit in recent nightly test failure. Here's a patch to fix that; the issue is we cannot be sync'd on IW when dealing with a tragedy ... there were two places in IW where we were doing that. I added an assert that we don't hold the lock, and fixed those two places. I tried to simplify finishCommit as much as I could, and get the infoStream logging out of the tragedy handling as much as I could ...
        Hide
        Michael McCandless added a comment -

        I upgraded to blocker: we need to fix this before releasing 4.10.1.

        Show
        Michael McCandless added a comment - I upgraded to blocker: we need to fix this before releasing 4.10.1.
        Hide
        Robert Muir added a comment -

        perfect, i reviewed all the places calling this in the patch.

        patch looks great, thank you.

        Show
        Robert Muir added a comment - perfect, i reviewed all the places calling this in the patch. patch looks great, thank you.
        Hide
        ASF subversion and git services added a comment -

        Commit 1627003 from Michael McCandless in branch 'dev/trunk'
        [ https://svn.apache.org/r1627003 ]

        LUCENE-5958: fix IW deadlock

        Show
        ASF subversion and git services added a comment - Commit 1627003 from Michael McCandless in branch 'dev/trunk' [ https://svn.apache.org/r1627003 ] LUCENE-5958 : fix IW deadlock
        Hide
        ASF subversion and git services added a comment -

        Commit 1627005 from Michael McCandless in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1627005 ]

        LUCENE-5958: fix IW deadlock

        Show
        ASF subversion and git services added a comment - Commit 1627005 from Michael McCandless in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1627005 ] LUCENE-5958 : fix IW deadlock
        Hide
        ASF subversion and git services added a comment -

        Commit 1627009 from Michael McCandless in branch 'dev/branches/lucene_solr_4_10'
        [ https://svn.apache.org/r1627009 ]

        LUCENE-5958: fix IW deadlock

        Show
        ASF subversion and git services added a comment - Commit 1627009 from Michael McCandless in branch 'dev/branches/lucene_solr_4_10' [ https://svn.apache.org/r1627009 ] LUCENE-5958 : fix IW deadlock
        Hide
        Michael McCandless added a comment -

        Bulk close for Lucene/Solr 4.10.1 release

        Show
        Michael McCandless added a comment - Bulk close for Lucene/Solr 4.10.1 release

          People

          • Assignee:
            Unassigned
            Reporter:
            Robert Muir
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development