Lucene - Core
  1. Lucene - Core
  2. LUCENE-5263

Deletes may be silently lost if an IOException is hit and later not hit (e.g., disk fills up and then frees up)

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.5.1, 4.6, 5.0
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      This case is tricky to handle, yet I think realistic: disk fills up
      temporarily, causes an exception in writeLiveDocs, and then the app
      keeps using the IW instance.

      Meanwhile disk later frees up again, IW is closed "successfully". In
      certain cases, we can silently lose deletes in this case.

      I had already committed
      TestIndexWriterDeletes.testNoLostDeletesOnDiskFull, and Jenkins seems
      happy with it so far, but when I added fangs to the test (cutover to
      RandomIndexWriter from IndexWriter, allow IOE during getReader, add
      randomness to when exc is thrown, etc.), it uncovered some real/nasty
      bugs:

      • ReaderPool.dropAll was suppressing any exception it hit, because
        if (priorE != null)

        should instead be

        if (priorE == null)
      • After a merge, we have to write deletes before committing the
        segment, because an exception when writing deletes means we need
        to abort the merge
      • Several places that were directly calling deleter.checkpoint must
        also increment the changeCount else on close IW thinks there are
        no changes and doesn't write a new segments file.
      • closeInternal was dropping pooled readers after writing the
        segments file, which would lose deletes still buffered due to a
        previous exc.
      1. LUCENE-5263.patch
        16 kB
        Michael McCandless
      2. LUCENE-5263.patch
        33 kB
        Michael McCandless

        Activity

        Hide
        Michael McCandless added a comment -

        Patch, I think it's ready.

        Show
        Michael McCandless added a comment - Patch, I think it's ready.
        Hide
        Shai Erera added a comment -

        These are nice catches . Few comments:

        • This is not only about disk-full, but about any transient IOE (e.g. ran out of file handles temporarily) that occur. So maybe change the issue's description (and also the CHANGES entry text)?
        • In TestIndexWriterReader I see this:
               r1.close();
          +    assertTrue(r2.isCurrent());
               writer.close();
               assertTrue(r2.isCurrent());
          

          Did you intend to check r2.isCurrent before and after writer.close?

        • You think maybe we should move FakeIOE to MDW so that other tests can use it? I use it in two other places already.
        • In ReaderPool.release: why don't you call writer.checkpoint()? It calls both deleter.checkpoint() and increments changed++. Did you want to avoid it also notifying sis.changed()? If so, maybe drop a comment why we don't do it?
          • If you choose to keep the code like that, there's a place in release() which calls deleter then increments changeCount, where in all other places the order is reverse. Maybe change it to be consistent? I don't know if it's an issue that changeCount isn't incremented, if e.g. deleter.checkpoint throws an ex?
        • Separately, I think it will be a useful utility to have a method like Utils.throwEx(Throwable t) which does the "if instanceof" logic (checking for IOE, RuntimeE, Error etc.).

        Otherwise looks good!

        Show
        Shai Erera added a comment - These are nice catches . Few comments: This is not only about disk-full, but about any transient IOE (e.g. ran out of file handles temporarily) that occur. So maybe change the issue's description (and also the CHANGES entry text)? In TestIndexWriterReader I see this: r1.close(); + assertTrue(r2.isCurrent()); writer.close(); assertTrue(r2.isCurrent()); Did you intend to check r2.isCurrent before and after writer.close? You think maybe we should move FakeIOE to MDW so that other tests can use it? I use it in two other places already. In ReaderPool.release: why don't you call writer.checkpoint()? It calls both deleter.checkpoint() and increments changed++. Did you want to avoid it also notifying sis.changed()? If so, maybe drop a comment why we don't do it? If you choose to keep the code like that, there's a place in release() which calls deleter then increments changeCount, where in all other places the order is reverse. Maybe change it to be consistent? I don't know if it's an issue that changeCount isn't incremented, if e.g. deleter.checkpoint throws an ex? Separately, I think it will be a useful utility to have a method like Utils.throwEx(Throwable t) which does the "if instanceof" logic (checking for IOE, RuntimeE, Error etc.). Otherwise looks good!
        Hide
        Shai Erera added a comment -

        One other comment – I think you can move the newIWC call to inside the 'if (w == null)'.

        Show
        Shai Erera added a comment - One other comment – I think you can move the newIWC call to inside the 'if (w == null)'.
        Hide
        Shai Erera added a comment -

        I ported this test to verify NDV updates aren't lost under such conditions, and I implemented the TODO about sub-classing CMS. Perhaps you'll want to include that in your patch as well:

        final MergeScheduler ms = iwc.getMergeScheduler();
        if (ms instanceof ConcurrentMergeScheduler) {
          final ConcurrentMergeScheduler suppressFakeIOE = new ConcurrentMergeScheduler() {
            @Override
            protected void handleMergeException(Throwable exc) {
              // suppress FakeIOExceptions
              if (!(exc instanceof FakeIOException)) {
                super.handleMergeException(exc);
              }
            }
          };
          final ConcurrentMergeScheduler cms = (ConcurrentMergeScheduler) ms;
          suppressFakeIOE.setMaxMergesAndThreads(cms.getMaxMergeCount(), cms.getMaxThreadCount());
          suppressFakeIOE.setMergeThreadPriority(cms.getMergeThreadPriority());
          iwc.setMergeScheduler(suppressFakeIOE);
          iwc.setMergeScheduler(suppressFakeIOE);
        }
        
        Show
        Shai Erera added a comment - I ported this test to verify NDV updates aren't lost under such conditions, and I implemented the TODO about sub-classing CMS. Perhaps you'll want to include that in your patch as well: final MergeScheduler ms = iwc.getMergeScheduler(); if (ms instanceof ConcurrentMergeScheduler) { final ConcurrentMergeScheduler suppressFakeIOE = new ConcurrentMergeScheduler() { @Override protected void handleMergeException(Throwable exc) { // suppress FakeIOExceptions if (!(exc instanceof FakeIOException)) { super .handleMergeException(exc); } } }; final ConcurrentMergeScheduler cms = (ConcurrentMergeScheduler) ms; suppressFakeIOE.setMaxMergesAndThreads(cms.getMaxMergeCount(), cms.getMaxThreadCount()); suppressFakeIOE.setMergeThreadPriority(cms.getMergeThreadPriority()); iwc.setMergeScheduler(suppressFakeIOE); iwc.setMergeScheduler(suppressFakeIOE); }
        Hide
        Michael McCandless added a comment -

        Thanks Shai, great feedback; I folded it all in.

        This is not only about disk-full, but about any transient IOE (e.g. ran out of file handles temporarily) that occur. So maybe change the issue's description (and also the CHANGES entry text)?

        I'll fix the issue description, and I added a CHANGES.txt entry.

        In TestIndexWriterReader I see this:

        Yes, this is intentional (it was helpful debugging to see exactly
        where I broke r2.isCurrent()).

        You think maybe we should move FakeIOE to MDW so that other tests can use it?

        I moved it, and shared from TestIndexWriterReader.java and
        TestIndexWriterDelete.java.

        In ReaderPool.release: why don't you call writer.checkpoint()?

        Right, I needed to increment changeCount (so that the writer writes a
        new segments file), plus checkpoint with IFD (so it knows about the
        new .del file) but NOT increment SIS.version (because there was in
        fact no "actual" change to the index contents, just moving state from
        RAM to disk). I put a comment explaining this, and also factored out
        a new method.

        Separately, I think it will be a useful utility to have a method like Utils.throwEx(Throwable t)

        Good idea! I factored that out.

        I think you can move the newIWC call to inside the 'if (w == null)'.

        OK I fixed that.

        I implemented the TODO about sub-classing CMS. Perhaps you'll want to include that in your patch as well:

        I folded this in too ... thanks!

        Also, separately, I managed to re-attach the fang I previously removed
        from testThreadInterruptDeadlock/testTwoThreadsInterruptDeadlock, by
        fixing IW.ReaderPool.dropAll to throw any exceptions it hits when save
        is true, rather than suppressing them and dropping all readers from
        the pool. I think there was another bug lurking there, because the
        pool would lose changes previously if an exc was hit writing deletes
        on close, and the app then called close again, and it succeeded.

        I also fixed a few tests to clean up their static fields ... they were
        sometimes angry during distributed beasting!

        Show
        Michael McCandless added a comment - Thanks Shai, great feedback; I folded it all in. This is not only about disk-full, but about any transient IOE (e.g. ran out of file handles temporarily) that occur. So maybe change the issue's description (and also the CHANGES entry text)? I'll fix the issue description, and I added a CHANGES.txt entry. In TestIndexWriterReader I see this: Yes, this is intentional (it was helpful debugging to see exactly where I broke r2.isCurrent()). You think maybe we should move FakeIOE to MDW so that other tests can use it? I moved it, and shared from TestIndexWriterReader.java and TestIndexWriterDelete.java. In ReaderPool.release: why don't you call writer.checkpoint()? Right, I needed to increment changeCount (so that the writer writes a new segments file), plus checkpoint with IFD (so it knows about the new .del file) but NOT increment SIS.version (because there was in fact no "actual" change to the index contents, just moving state from RAM to disk). I put a comment explaining this, and also factored out a new method. Separately, I think it will be a useful utility to have a method like Utils.throwEx(Throwable t) Good idea! I factored that out. I think you can move the newIWC call to inside the 'if (w == null)'. OK I fixed that. I implemented the TODO about sub-classing CMS. Perhaps you'll want to include that in your patch as well: I folded this in too ... thanks! Also, separately, I managed to re-attach the fang I previously removed from testThreadInterruptDeadlock/testTwoThreadsInterruptDeadlock, by fixing IW.ReaderPool.dropAll to throw any exceptions it hits when save is true, rather than suppressing them and dropping all readers from the pool. I think there was another bug lurking there, because the pool would lose changes previously if an exc was hit writing deletes on close, and the app then called close again, and it succeeded. I also fixed a few tests to clean up their static fields ... they were sometimes angry during distributed beasting!
        Hide
        Shai Erera added a comment -

        Patch looks great.

        • I like IOUtils.reThrow - takes away a lot of code .
        • There's a typo I think in FakeIOException: @code _u_IOException
        • You call iwc.setMergeScheduler() twice in a raw – I believe it's an error from my example above.

        I will separately look at adding field updates to the test by randomly delete/update/both (I'll do it under LUCENE-5189).

        +1 to commit.

        Show
        Shai Erera added a comment - Patch looks great. I like IOUtils.reThrow - takes away a lot of code . There's a typo I think in FakeIOException: @code _u_IOException You call iwc.setMergeScheduler() twice in a raw – I believe it's an error from my example above. I will separately look at adding field updates to the test by randomly delete/update/both (I'll do it under LUCENE-5189 ). +1 to commit.
        Hide
        ASF subversion and git services added a comment -

        Commit 1530414 from Michael McCandless in branch 'dev/trunk'
        [ https://svn.apache.org/r1530414 ]

        LUCENE-5263: fix cases where transient IOExc (e.g. due to disk full, file descriptor exhaustion) in IndexWriter could lead to silently losing deletions

        Show
        ASF subversion and git services added a comment - Commit 1530414 from Michael McCandless in branch 'dev/trunk' [ https://svn.apache.org/r1530414 ] LUCENE-5263 : fix cases where transient IOExc (e.g. due to disk full, file descriptor exhaustion) in IndexWriter could lead to silently losing deletions
        Hide
        ASF subversion and git services added a comment -

        Commit 1530416 from Michael McCandless in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1530416 ]

        LUCENE-5263: fix cases where transient IOExc (e.g. due to disk full, file descriptor exhaustion) in IndexWriter could lead to silently losing deletions

        Show
        ASF subversion and git services added a comment - Commit 1530416 from Michael McCandless in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1530416 ] LUCENE-5263 : fix cases where transient IOExc (e.g. due to disk full, file descriptor exhaustion) in IndexWriter could lead to silently losing deletions
        Hide
        ASF subversion and git services added a comment -

        Commit 1530741 from Michael McCandless in branch 'dev/trunk'
        [ https://svn.apache.org/r1530741 ]

        LUCENE-5263: remove extra deleter.checkpoint

        Show
        ASF subversion and git services added a comment - Commit 1530741 from Michael McCandless in branch 'dev/trunk' [ https://svn.apache.org/r1530741 ] LUCENE-5263 : remove extra deleter.checkpoint
        Hide
        ASF subversion and git services added a comment -

        Commit 1530845 from Robert Muir in branch 'dev/branches/lucene_solr_4_5'
        [ https://svn.apache.org/r1530845 ]

        LUCENE-4998, LUCENE-5242, LUCENE-5254, LUCENE-5262, LUCENE-5263, LUCENE-5264: svn merge -c 1522723 -c 1525896 -c 1529136 -c 1529141 -c 1530063 -c 1530416 -c 1530657

        Show
        ASF subversion and git services added a comment - Commit 1530845 from Robert Muir in branch 'dev/branches/lucene_solr_4_5' [ https://svn.apache.org/r1530845 ] LUCENE-4998 , LUCENE-5242 , LUCENE-5254 , LUCENE-5262 , LUCENE-5263 , LUCENE-5264 : svn merge -c 1522723 -c 1525896 -c 1529136 -c 1529141 -c 1530063 -c 1530416 -c 1530657
        Hide
        ASF subversion and git services added a comment -

        Commit 1531836 from Michael McCandless in branch 'dev/trunk'
        [ https://svn.apache.org/r1531836 ]

        LUCENE-5263: if merge hits exc when releasing merged reader then drop changes and remove it from the pool, since the merge will be aborted

        Show
        ASF subversion and git services added a comment - Commit 1531836 from Michael McCandless in branch 'dev/trunk' [ https://svn.apache.org/r1531836 ] LUCENE-5263 : if merge hits exc when releasing merged reader then drop changes and remove it from the pool, since the merge will be aborted
        Hide
        ASF subversion and git services added a comment -

        Commit 1531837 from Michael McCandless in branch 'dev/trunk'
        [ https://svn.apache.org/r1531837 ]

        LUCENE-5263: fix comment

        Show
        ASF subversion and git services added a comment - Commit 1531837 from Michael McCandless in branch 'dev/trunk' [ https://svn.apache.org/r1531837 ] LUCENE-5263 : fix comment
        Hide
        ASF subversion and git services added a comment -

        Commit 1531838 from Michael McCandless in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1531838 ]

        LUCENE-5263: if merge hits exc when releasing merged reader then drop changes and remove it from the pool, since the merge will be aborted

        Show
        ASF subversion and git services added a comment - Commit 1531838 from Michael McCandless in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1531838 ] LUCENE-5263 : if merge hits exc when releasing merged reader then drop changes and remove it from the pool, since the merge will be aborted
        Hide
        ASF subversion and git services added a comment -

        Commit 1531839 from Michael McCandless in branch 'dev/branches/lucene_solr_4_5'
        [ https://svn.apache.org/r1531839 ]

        LUCENE-5263: if merge hits exc when releasing merged reader then drop changes and remove it from the pool, since the merge will be aborted

        Show
        ASF subversion and git services added a comment - Commit 1531839 from Michael McCandless in branch 'dev/branches/lucene_solr_4_5' [ https://svn.apache.org/r1531839 ] LUCENE-5263 : if merge hits exc when releasing merged reader then drop changes and remove it from the pool, since the merge will be aborted

          People

          • Assignee:
            Michael McCandless
            Reporter:
            Michael McCandless
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development