Lucene - Core
  1. Lucene - Core
  2. LUCENE-5574

NRT Reader close can wipe index it doesn't own

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Critical Critical
    • Resolution: Fixed
    • Affects Version/s: 4.7.1, 4.8, Trunk
    • Fix Version/s: 4.7.2, 4.8, Trunk
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Today NRT Readers try to clean up unused files via their IW reference when they are closed. Yet, if the index writer is already closed another index could have been created on the same directory which can create the same files as the IW before. For the NRT Reader those files are not referenced and it will simply wipe them away. If you use this in a replication scenario where directories are reused this can simply wipe your index away or in combination with the FSync issue LUCENE-5570 create 0-byte files. I have a test that reproduces this issue

      1. LUCENE-5574.patch
        7 kB
        Michael McCandless
      2. LUCENE-5574.patch
        34 kB
        Michael McCandless
      3. LUCENE-5574.patch
        22 kB
        Michael McCandless
      4. LUCENE-5574.patch
        4 kB
        Simon Willnauer

        Issue Links

          Activity

          Hide
          Simon Willnauer added a comment -

          here is a patch that shows the issue and applies a fix. Yet there are still a couple of test taht fail because they now run into this issue... I will try to fix them too

          Show
          Simon Willnauer added a comment - here is a patch that shows the issue and applies a fix. Yet there are still a couple of test taht fail because they now run into this issue... I will try to fix them too
          Hide
          Michael McCandless added a comment -

          God this is really awful; we should have checked that the IW is not closed when we did LUCENE-5434. I think we should also ensureOpen() in deletePendingFiles?

          Show
          Michael McCandless added a comment - God this is really awful; we should have checked that the IW is not closed when we did LUCENE-5434 . I think we should also ensureOpen() in deletePendingFiles?
          Hide
          Michael McCandless added a comment -

          In fact, I wonder if we can get lower level protection here, e.g. fix IFD so that it only attempts file deletion if its "owning" IW is still open. I'll try this ...

          Show
          Michael McCandless added a comment - In fact, I wonder if we can get lower level protection here, e.g. fix IFD so that it only attempts file deletion if its "owning" IW is still open. I'll try this ...
          Hide
          Simon Willnauer added a comment -

          I agree - I tried to make this patch as simple as possible I think we should prevent IFD from doing anything if IW is closed. It should barf if IW is closed!

          Show
          Simon Willnauer added a comment - I agree - I tried to make this patch as simple as possible I think we should prevent IFD from doing anything if IW is closed. It should barf if IW is closed!
          Hide
          Michael McCandless added a comment -

          Patch, starting from Simon's, but also checking in IW.deletePendingFiles and down in IndexFileDeleter.deleteFile (where it actually does the delete).

          Core tests pass (once anyway), but lots of tests still need fixing, to either close the IW after any NRT readers, or to fix MDW to not care about extra files; there will be a long tail of random failures until we fix all tests... I'll do some distributed beasting.

          Show
          Michael McCandless added a comment - Patch, starting from Simon's, but also checking in IW.deletePendingFiles and down in IndexFileDeleter.deleteFile (where it actually does the delete). Core tests pass (once anyway), but lots of tests still need fixing, to either close the IW after any NRT readers, or to fix MDW to not care about extra files; there will be a long tail of random failures until we fix all tests... I'll do some distributed beasting.
          Hide
          Michael McCandless added a comment -

          Another iteration, fixing more tests to close their writers after closing all readers.

          Distributed beasting has gone through all Lucene core + modules tests 35 times with no failures ... I'll leave it running.

          Show
          Michael McCandless added a comment - Another iteration, fixing more tests to close their writers after closing all readers. Distributed beasting has gone through all Lucene core + modules tests 35 times with no failures ... I'll leave it running.
          Hide
          Shai Erera added a comment -

          I think this is a good change – it makes IndexReader more read-only in nature. I.e. previously it could modify the index, then we made it read-only in that aspect. Now we make it sort of read-only such that it cannot modify the Directory. So I wonder, if we viewed IndexReader as a "user" with read-only permissions on the index Directory, we wouldn't even attempt deleting unused files by it, right? So maybe we should just do that – never call IFD from an nrt-reader.close() chain? Then, if you have a write open, it will eventually delete those files, otherwise the files are there until you open a new writer?

          As for MDW, what if on close() it attempted to open a writer w/ OpenMode.OPEN, then close it, before it verifies there are no leftover files? Then the unused files will get deleted, and we'd still have a check in MDW that we didn't leave any writer/reader open, or leaking files?

          Show
          Shai Erera added a comment - I think this is a good change – it makes IndexReader more read-only in nature. I.e. previously it could modify the index, then we made it read-only in that aspect. Now we make it sort of read-only such that it cannot modify the Directory. So I wonder, if we viewed IndexReader as a "user" with read-only permissions on the index Directory, we wouldn't even attempt deleting unused files by it, right? So maybe we should just do that – never call IFD from an nrt-reader.close() chain? Then, if you have a write open, it will eventually delete those files, otherwise the files are there until you open a new writer? As for MDW, what if on close() it attempted to open a writer w/ OpenMode.OPEN, then close it, before it verifies there are no leftover files? Then the unused files will get deleted, and we'd still have a check in MDW that we didn't leave any writer/reader open, or leaking files?
          Hide
          Simon Willnauer added a comment -

          So maybe we should just do that – never call IFD from an nrt-reader.close() chain?

          well I guess we have to since we decrement the reference. but I agree I wonder if the writer.deletePendingFiles(); call is necessary?

          As for MDW, what if on close() it attempted to open a writer w/ OpenMode.OPEN, then close it, before it verifies there are no leftover files? Then the unused files will get deleted, and we'd still have a check in MDW that we didn't leave any writer/reader open, or leaking files?

          on the one hand I think this would be odd but on the other our current behavior requires you to close everything in order which might not be absolutely necessary and might actually lead to bug if we enforce calling inorder so I think that might actually be a good thing what you suggest? Mike WDYT

          Show
          Simon Willnauer added a comment - So maybe we should just do that – never call IFD from an nrt-reader.close() chain? well I guess we have to since we decrement the reference. but I agree I wonder if the writer.deletePendingFiles(); call is necessary? As for MDW, what if on close() it attempted to open a writer w/ OpenMode.OPEN, then close it, before it verifies there are no leftover files? Then the unused files will get deleted, and we'd still have a check in MDW that we didn't leave any writer/reader open, or leaking files? on the one hand I think this would be odd but on the other our current behavior requires you to close everything in order which might not be absolutely necessary and might actually lead to bug if we enforce calling inorder so I think that might actually be a good thing what you suggest? Mike WDYT
          Hide
          Shai Erera added a comment -

          I looked at MDW.close() and looks like it already does that – it opens IW and close it, then diff the files before and after the close – so that idea is not new . That way it protects against e.g. IFD bugs (or bugs elsewhere where you don't decRef() a file or something).

          But perhaps we can make MDW lenient by default (assertNoUnreferencedFilesOnClose=false) and turn it on in tests where we'd like to catch IFD bugs? Then in the majority of tests we can code "normally", but in the few tests that need to make sure IW is bullet-proof, we make sure to close things in order?

          Another idea I had is to use the newly added checksums to make sure that the file we're about to delete has the checksum that we think it should have. Of course, if you re-index the exact same set of documents in the exact same order, this is still a false positive, but I don't know how common that is. But then, I'm not sure if it's ok to rely on such logic, and perhaps the simplest thing we could do is treat the Directory read-only by IndexReader instances.

          Show
          Shai Erera added a comment - I looked at MDW.close() and looks like it already does that – it opens IW and close it, then diff the files before and after the close – so that idea is not new . That way it protects against e.g. IFD bugs (or bugs elsewhere where you don't decRef() a file or something). But perhaps we can make MDW lenient by default (assertNoUnreferencedFilesOnClose=false) and turn it on in tests where we'd like to catch IFD bugs? Then in the majority of tests we can code "normally", but in the few tests that need to make sure IW is bullet-proof, we make sure to close things in order? Another idea I had is to use the newly added checksums to make sure that the file we're about to delete has the checksum that we think it should have. Of course, if you re-index the exact same set of documents in the exact same order, this is still a false positive, but I don't know how common that is. But then, I'm not sure if it's ok to rely on such logic, and perhaps the simplest thing we could do is treat the Directory read-only by IndexReader instances.
          Hide
          Michael McCandless added a comment -

          I think we should remove IW.deletePendingFiles from StandardDirectoryReader.close: the less the IR does with deleting files on close the better!

          But we cannot remove the incRef/decRef unless we want to revert LUCENE-5434, ie stop protecting the files referenced by opened NRT readers from deletion. So leave that for now.

          On MDW, I think we could do a middle ground: it should only complain about unref'd files that nobody had previously tried to delete. I think this will 1) let us leave this checking on by default, and 2) should still catch bugs in IFD where it caused IW not to attempt deletion of files that it should have.

          Show
          Michael McCandless added a comment - I think we should remove IW.deletePendingFiles from StandardDirectoryReader.close: the less the IR does with deleting files on close the better! But we cannot remove the incRef/decRef unless we want to revert LUCENE-5434 , ie stop protecting the files referenced by opened NRT readers from deletion. So leave that for now. On MDW, I think we could do a middle ground: it should only complain about unref'd files that nobody had previously tried to delete. I think this will 1) let us leave this checking on by default, and 2) should still catch bugs in IFD where it caused IW not to attempt deletion of files that it should have.
          Hide
          Robert Muir added a comment -

          I don't think we should do a complicated bugfix.

          I think we should fix the bug straight up in a simple way that won't introduce other bugs: add the missing ensureOpen, disable the unref'd files check.

          Leave such complicated refactorings to the future and do them slowly. I would hate to see a bugfix release go out with a fix for this that just causes other bugs because we cant just be reasonable aout it.

          Show
          Robert Muir added a comment - I don't think we should do a complicated bugfix. I think we should fix the bug straight up in a simple way that won't introduce other bugs: add the missing ensureOpen, disable the unref'd files check. Leave such complicated refactorings to the future and do them slowly. I would hate to see a bugfix release go out with a fix for this that just causes other bugs because we cant just be reasonable aout it.
          Hide
          Michael McCandless added a comment -

          OK, I agree: let's do what's simplest/minimal here to fix the bug, and
          just disable MDW's checking of unref'd files.

          This new patch does that, and only one (test-the-tester) test needed
          fixing.

          I still also removed calling deletePendingFiles when closing the NRT
          reader: I think it's just too dangerous. And I added ensureOpen where
          IFD actually deletes files...

          Finally I made a test case, but I could not provoke corruption unless
          I did a "hard delete" of the files in the index. Ie, because IW is
          write-once, even if you close and open a new one with OpenMode.CREATE,
          the old writer won't delete files created by the new writer, I think.

          I think it's ready.

          Show
          Michael McCandless added a comment - OK, I agree: let's do what's simplest/minimal here to fix the bug, and just disable MDW's checking of unref'd files. This new patch does that, and only one (test-the-tester) test needed fixing. I still also removed calling deletePendingFiles when closing the NRT reader: I think it's just too dangerous. And I added ensureOpen where IFD actually deletes files... Finally I made a test case, but I could not provoke corruption unless I did a "hard delete" of the files in the index. Ie, because IW is write-once, even if you close and open a new one with OpenMode.CREATE, the old writer won't delete files created by the new writer, I think. I think it's ready.
          Hide
          Robert Muir added a comment -

          Thanks, this looks great. We should open a followup issue to do "more complicated stuff" so we can re-enable the unref'd files check.

          Show
          Robert Muir added a comment - Thanks, this looks great. We should open a followup issue to do "more complicated stuff" so we can re-enable the unref'd files check.
          Hide
          Michael McCandless added a comment -

          OK I opened LUCENE-5576 to re-enable unref'd checking in MDW. I'll commit this soon.

          Show
          Michael McCandless added a comment - OK I opened LUCENE-5576 to re-enable unref'd checking in MDW. I'll commit this soon.
          Hide
          Shai Erera added a comment -

          +1, looks good. Maybe just clarify why it's OK to ignore the AlreadyClosedEx when you call decRefDeleter?

          Show
          Shai Erera added a comment - +1, looks good. Maybe just clarify why it's OK to ignore the AlreadyClosedEx when you call decRefDeleter?
          Hide
          Michael McCandless added a comment -

          Good point, I'll add a comment explaining why it's safe to ignore the ACE.

          Show
          Michael McCandless added a comment - Good point, I'll add a comment explaining why it's safe to ignore the ACE.
          Hide
          ASF subversion and git services added a comment -

          Commit 1584840 from mikemccand@apache.org in branch 'dev/trunk'
          [ https://svn.apache.org/r1584840 ]

          LUCENE-5574: when closing an NRT reader, do not delete files if the original writer has since been closed

          Show
          ASF subversion and git services added a comment - Commit 1584840 from mikemccand@apache.org in branch 'dev/trunk' [ https://svn.apache.org/r1584840 ] LUCENE-5574 : when closing an NRT reader, do not delete files if the original writer has since been closed
          Hide
          ASF subversion and git services added a comment -

          Commit 1584842 from mikemccand@apache.org in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1584842 ]

          LUCENE-5574: when closing an NRT reader, do not delete files if the original writer has since been closed

          Show
          ASF subversion and git services added a comment - Commit 1584842 from mikemccand@apache.org in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1584842 ] LUCENE-5574 : when closing an NRT reader, do not delete files if the original writer has since been closed
          Hide
          ASF subversion and git services added a comment -

          Commit 1584846 from mikemccand@apache.org in branch 'dev/branches/lucene_solr_4_7'
          [ https://svn.apache.org/r1584846 ]

          LUCENE-5574: when closing an NRT reader, do not delete files if the original writer has since been closed

          Show
          ASF subversion and git services added a comment - Commit 1584846 from mikemccand@apache.org in branch 'dev/branches/lucene_solr_4_7' [ https://svn.apache.org/r1584846 ] LUCENE-5574 : when closing an NRT reader, do not delete files if the original writer has since been closed
          Hide
          Uwe Schindler added a comment - - edited

          Hi Mike,

          the test does not work on windows with a FSDirectory subclass. This is because Windows does not allow to write to a file thats already open. So we must disable this test for Windows filesystems:

          assumeFalse(Constants.WINDOWS && dir instanceof FSDirectory);
          

          That is the error I get:

          C:\Users\Uwe Schindler\Projects\lucene\trunk-lusolr2\lucene\core>ant test -Dtestcase=TestIndexWriter -Dtestmethod=testClosingNRTReaderDoesNotCorruptYourIndex -Dtests.directory=SimpleFSDirectory
          
             [junit4] <JUnit4> says kaixo! Master seed: 46D71A0205CD3C16
             [junit4] Your default console's encoding may not display certain unicode glyphs: windows-1252
             [junit4] Executing 1 suite with 1 JVM.
             [junit4] 
             [junit4] Started J0 PID(10652@VEGA).
             [junit4] Suite: org.apache.lucene.index.TestIndexWriter
             [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestIndexWriter -Dtests.method=testClosingNRTReaderDoesNotCorruptYourIndex -Dtests.seed=46D71A0205CD3C16 -Dtests.directory=SimpleFSDirectory -Dtests.locale=pt_BR -Dtests.timezone=Portugal -Dtests.file.encoding=Cp1252
             [junit4] ERROR   0.63s | TestIndexWriter.testClosingNRTReaderDoesNotCorruptYourIndex <<<
             [junit4]    > Throwable #1: java.io.IOException: Cannot delete C:\Users\Uwe Schindler\Projects\lucene\trunk-lusolr2\lucene\build\core\test\J0\index5311553115tmp\_0.doc
             [junit4]    > 	at __randomizedtesting.SeedInfo.seed([46D71A0205CD3C16:31060EF37DD6C629]:0)
             [junit4]    > 	at org.apache.lucene.store.FSDirectory.deleteFile(FSDirectory.java:256)
             [junit4]    > 	at org.apache.lucene.store.MockDirectoryWrapper.deleteFile(MockDirectoryWrapper.java:456)
             [junit4]    > 	at org.apache.lucene.store.MockDirectoryWrapper.deleteFile(MockDirectoryWrapper.java:409)
             [junit4]    > 	at org.apache.lucene.index.TestIndexWriter.testClosingNRTReaderDoesNotCorruptYourIndex(TestIndexWriter.java:2403)
             [junit4]    > 	at java.lang.Thread.run(Thread.java:724)
             [junit4]   2> NOTE: test params are: codec=Asserting, sim=RandomSimilarityProvider(queryNorm=true,coord=no): {a=IB LL-L3(800.0)}, locale=pt_BR, timezone=Portugal
             [junit4]   2> NOTE: Windows 7 6.1 amd64/Oracle Corporation 1.7.0_25 (64-bit)/cpus=8,threads=1,free=133047752,total=156237824
             [junit4]   2> NOTE: All tests run in this JVM: [TestIndexWriter]
             [junit4] Completed in 1.11s, 1 test, 1 error <<< FAILURES!
             [junit4] 
             [junit4] 
             [junit4] Tests with failures:
             [junit4]   - org.apache.lucene.index.TestIndexWriter.testClosingNRTReaderDoesNotCorruptYourIndex
             [junit4] 
             [junit4] 
             [junit4] JVM J0:     0.67 ..     2.61 =     1.94s
             [junit4] Execution time total: 2.62 sec.
             [junit4] Tests summary: 1 suite, 1 test, 1 error
          

          In general, I don't think the whole problem exists on Windows! The file system protects you from doing this crazy stuff like nuking the index while a NRT reader has it open. Its also impossible to do stuff like replication while a Reader is open.

          Simon's system called El*s* should in any case create a new index in a new directory if it needs to recover...

          Show
          Uwe Schindler added a comment - - edited Hi Mike, the test does not work on windows with a FSDirectory subclass. This is because Windows does not allow to write to a file thats already open. So we must disable this test for Windows filesystems: assumeFalse(Constants.WINDOWS && dir instanceof FSDirectory); That is the error I get: C:\Users\Uwe Schindler\Projects\lucene\trunk-lusolr2\lucene\core>ant test -Dtestcase=TestIndexWriter -Dtestmethod=testClosingNRTReaderDoesNotCorruptYourIndex -Dtests.directory=SimpleFSDirectory [junit4] <JUnit4> says kaixo! Master seed: 46D71A0205CD3C16 [junit4] Your default console's encoding may not display certain unicode glyphs: windows-1252 [junit4] Executing 1 suite with 1 JVM. [junit4] [junit4] Started J0 PID(10652@VEGA). [junit4] Suite: org.apache.lucene.index.TestIndexWriter [junit4] 2> NOTE: reproduce with: ant test -Dtestcase=TestIndexWriter -Dtests.method=testClosingNRTReaderDoesNotCorruptYourIndex -Dtests.seed=46D71A0205CD3C16 -Dtests.directory=SimpleFSDirectory -Dtests.locale=pt_BR -Dtests.timezone=Portugal -Dtests.file.encoding=Cp1252 [junit4] ERROR 0.63s | TestIndexWriter.testClosingNRTReaderDoesNotCorruptYourIndex <<< [junit4] > Throwable #1: java.io.IOException: Cannot delete C:\Users\Uwe Schindler\Projects\lucene\trunk-lusolr2\lucene\build\core\test\J0\index5311553115tmp\_0.doc [junit4] > at __randomizedtesting.SeedInfo.seed([46D71A0205CD3C16:31060EF37DD6C629]:0) [junit4] > at org.apache.lucene.store.FSDirectory.deleteFile(FSDirectory.java:256) [junit4] > at org.apache.lucene.store.MockDirectoryWrapper.deleteFile(MockDirectoryWrapper.java:456) [junit4] > at org.apache.lucene.store.MockDirectoryWrapper.deleteFile(MockDirectoryWrapper.java:409) [junit4] > at org.apache.lucene.index.TestIndexWriter.testClosingNRTReaderDoesNotCorruptYourIndex(TestIndexWriter.java:2403) [junit4] > at java.lang.Thread.run(Thread.java:724) [junit4] 2> NOTE: test params are: codec=Asserting, sim=RandomSimilarityProvider(queryNorm=true,coord=no): {a=IB LL-L3(800.0)}, locale=pt_BR, timezone=Portugal [junit4] 2> NOTE: Windows 7 6.1 amd64/Oracle Corporation 1.7.0_25 (64-bit)/cpus=8,threads=1,free=133047752,total=156237824 [junit4] 2> NOTE: All tests run in this JVM: [TestIndexWriter] [junit4] Completed in 1.11s, 1 test, 1 error <<< FAILURES! [junit4] [junit4] [junit4] Tests with failures: [junit4] - org.apache.lucene.index.TestIndexWriter.testClosingNRTReaderDoesNotCorruptYourIndex [junit4] [junit4] [junit4] JVM J0: 0.67 .. 2.61 = 1.94s [junit4] Execution time total: 2.62 sec. [junit4] Tests summary: 1 suite, 1 test, 1 error In general, I don't think the whole problem exists on Windows! The file system protects you from doing this crazy stuff like nuking the index while a NRT reader has it open. Its also impossible to do stuff like replication while a Reader is open. Simon's system called El*s* should in any case create a new index in a new directory if it needs to recover...
          Hide
          Michael McCandless added a comment -

          Argh... right, I cannot just "setNoDeleteOpenFile(false)" on Windows!

          But, yes, I think this corruption case won't happen on Windows.

          I'll disable the test on Windows w/ FSDir like you suggested...

          Show
          Michael McCandless added a comment - Argh... right, I cannot just "setNoDeleteOpenFile(false)" on Windows! But, yes, I think this corruption case won't happen on Windows. I'll disable the test on Windows w/ FSDir like you suggested...
          Hide
          Uwe Schindler added a comment -

          I'll disable the test on Windows w/ FSDir like you suggested...

          Thats not so easy because of crazy wrapping we do with MockDirs, NRTCaching and others. So I think you can only disable on WINDOWS. I tried to go the path up to the root directory, but that does only work if the dir extends FilterDirectory.

          Show
          Uwe Schindler added a comment - I'll disable the test on Windows w/ FSDir like you suggested... Thats not so easy because of crazy wrapping we do with MockDirs, NRTCaching and others. So I think you can only disable on WINDOWS. I tried to go the path up to the root directory, but that does only work if the dir extends FilterDirectory.
          Hide
          ASF subversion and git services added a comment -

          Commit 1584868 from mikemccand@apache.org in branch 'dev/trunk'
          [ https://svn.apache.org/r1584868 ]

          LUCENE-5574: disable this test on Windows

          Show
          ASF subversion and git services added a comment - Commit 1584868 from mikemccand@apache.org in branch 'dev/trunk' [ https://svn.apache.org/r1584868 ] LUCENE-5574 : disable this test on Windows
          Hide
          ASF subversion and git services added a comment -

          Commit 1584869 from mikemccand@apache.org in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1584869 ]

          LUCENE-5574: disable this test on Windows

          Show
          ASF subversion and git services added a comment - Commit 1584869 from mikemccand@apache.org in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1584869 ] LUCENE-5574 : disable this test on Windows
          Hide
          ASF subversion and git services added a comment -

          Commit 1584870 from mikemccand@apache.org in branch 'dev/branches/lucene_solr_4_7'
          [ https://svn.apache.org/r1584870 ]

          LUCENE-5574: disable this test on Windows

          Show
          ASF subversion and git services added a comment - Commit 1584870 from mikemccand@apache.org in branch 'dev/branches/lucene_solr_4_7' [ https://svn.apache.org/r1584870 ] LUCENE-5574 : disable this test on Windows
          Hide
          Uwe Schindler added a comment -

          Mike fixed the Windows bug.

          Show
          Uwe Schindler added a comment - Mike fixed the Windows bug.
          Hide
          Uwe Schindler added a comment -

          Close issue after release of 4.8.0

          Show
          Uwe Schindler added a comment - Close issue after release of 4.8.0

            People

            • Assignee:
              Unassigned
              Reporter:
              Simon Willnauer
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development