Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.4
    • Component/s: core/index
    • Labels: None
    • Environment: Windows Server 2003, Standard Edition, Sun Hotspot Java 1.5

    • Lucene Fields: New

      Description

      When indexing a large number of documents, upon a hard power failure (e.g. pulling the power cord), the index seems to get corrupted. We start a Java application as a Windows Service and feed it documents. In some cases (after an index size of 1.7GB, with 30-40 index segment .cfs files), the following is observed.

      The 'segments' file contains only zeros. Its size is 265 bytes - all bytes are zeros.
      The 'deleted' file also contains only zeros. Its size is 85 bytes - all bytes are zeros.

      Before corruption, the segments file and deleted file appear to be correct. After this corruption, the index is corrupted and lost.

      This is a problem observed in Lucene 1.4.3. We are not able to upgrade our customer deployments to 1.9 or a later version, but would be happy to back-port a patch, if the patch is small enough and if this problem is already solved.

      1. FSyncPerfTest.java
        6 kB
        Doron Cohen
      2. LUCENE-1044.patch
        6 kB
        Michael McCandless
      3. LUCENE-1044.take2.patch
        7 kB
        Michael McCandless
      4. LUCENE-1044.take3.patch
        16 kB
        Michael McCandless
      5. LUCENE-1044.take4.patch
        7 kB
        Michael McCandless
      6. LUCENE-1044.take5.patch
        86 kB
        Michael McCandless
      7. LUCENE-1044.take6.patch
        194 kB
        Michael McCandless
      8. LUCENE-1044.take7.patch
        205 kB
        Michael McCandless
      9. LUCENE-1044.take8.patch
        206 kB
        Michael McCandless

        Activity

        Yeliz Eseryel added a comment -

        Thanks Michael!

        Michael McCandless added a comment -

        Yes, this is committed and available as of 2.4.0.

        Yeliz Eseryel added a comment -

        I had been following this thread. Just curious if the patch was committed.

        Michael McCandless added a comment -

        Attached new rev of the patch. The only changes were to add caveats in the javadocs about IO devices that ignore fsync, and to update the patch to apply cleanly on current trunk.

        I plan to commit in a day or two.

        Michael McCandless added a comment -

        OK I updated the patch:

        • Deprecate all IW ctors that take the autoCommit param, and update javadocs stating that autoCommit will be hardwired to false starting in 3.0
        • Default maxSyncPause to 10 seconds when Constants.WINDOWS; else, to 0

        I'll wait until end of week to commit!

        Michael McCandless added a comment -

        > deprecate autoCommit=true entirely

        +1 This sounds like a good plan.

        OK I'll work out a new patch with this approach.

        > Are your performance numbers above with autoCommit true or false?

        They were all with autoCommit=true.

        > Also, why not only sleep if Constants.WINDOWS?

        Good, I'll take that approach!

        Doug Cutting added a comment -

        > deprecate autoCommit=true entirely

        +1 This sounds like a good plan.

        Are your performance numbers above with autoCommit true or false?

        Also, why not only sleep if Constants.WINDOWS?

        Michael McCandless added a comment -

        On thinking through the above costs of committing, I now think we
        should deprecate autoCommit=true entirely, making autoCommit=false the
        only choice in 3.0.

        With that change, when you use an IndexWriter, its changes are never
        visible to a reader until you call commit() or close(). I think this
        is how KinoSearch and Ferret work, for example.

        Here are some reasons:

        • Commit has now become a costly event, because sync() is costly,
          and is forcing us to use this "syncPause" logic (hack) to game the
          OS, which really is ugly, dependent on OS/IO particulars, etc.
        • Since we make no guarantee on when a commit specifically happens,
          and this fix in particular will reduce its frequency from "every
          flush" to "every merge", autoCommit=true really is not that useful
          for applications (i.e., they will have to call commit() on their own
          anyway if they need to rely on its frequency).
        • It's always possible to build an autocommit layer above
          IndexWriter by calling commit on your own schedule, to tradeoff
          performance for commit frequency (but not vice versa).
        • Not autocommitting by default opens up some good future
          optimizations on merging since we don't have to flush real
          segments to disk until commit. One simple example is we could
          skip building CFS files as we flush, and only merge & build CFS on
          commit/close.

        What do people think?

        If we do this, I would right now deprecate all ctors that take
        autoCommit and add a comment explaining that in 3.0 autoCommit is wired
        to "false". I would leave the "syncPause" logic in there for now,
        because it's such a sizable performance gain on Windows, but deprecate
        it, stating that it will be removed when we switch to
        autoCommit=false in 3.0.

        Michael McCandless added a comment -

        New rev of this patch. All tests pass. I think it's ready to
        commit, but I'll wait a few days for comments.

        This patch has a small change to the segments_N file: it adds a
        checksum to the end. I added ChecksumIndexInput/Output that wrap an
        existing IndexInput/Output for this. This is used to verify the file
        is "intact" before trusting its contents when opening the index. We
        need this to guard against the machine crashing after we've written
        segments_N and before we've succeeded in syncing it.
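
        The idea, as a minimal self-contained sketch: append a CRC32 of the
        contents as the file's final bytes, and refuse to trust the file if the
        stored and recomputed checksums disagree. This is illustrative only;
        Lucene's real ChecksumIndexInput/Output wrap IndexInput/IndexOutput and
        the segments_N format differs.

        import java.io.*;
        import java.util.zip.CRC32;

        // Hypothetical sketch of the checksum-at-the-end idea, not Lucene's API:
        // write the contents, then append a CRC32; on read, recompute and compare.
        class ChecksummedFile {
          static void write(File path, byte[] contents) throws IOException {
            CRC32 crc = new CRC32();
            crc.update(contents, 0, contents.length);
            DataOutputStream out = new DataOutputStream(new FileOutputStream(path));
            try {
              out.write(contents);
              out.writeLong(crc.getValue());   // checksum is the last 8 bytes
            } finally {
              out.close();
            }
          }

          static byte[] readVerified(File path) throws IOException {
            DataInputStream in = new DataInputStream(new FileInputStream(path));
            try {
              byte[] contents = new byte[(int) path.length() - 8];
              in.readFully(contents);
              long stored = in.readLong();
              CRC32 crc = new CRC32();
              crc.update(contents, 0, contents.length);
              if (crc.getValue() != stored)
                throw new IOException("checksum mismatch: " + path + " not intact");
              return contents;
            } finally {
              in.close();
            }
          }
        }

        Verification like this is what lets a reader reject a segments_N that
        was written but never synced before a crash.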

        Unfortunately, in testing performance, I still see a sizable (~30-50%)
        performance hit to indexing throughput on Windows computers (XP Pro
        laptop & Win 2003 Server R64 computer). It seems that calling sync
        was causing IO in other threads (i.e., flushing a new segment) to
        drastically slow down. Note that this is only when autoCommit=true; if
        it's false then performance is only slightly worse (because we only
        sync when closing the writer).

        So I tried sleeping after writing and before syncing. I sleep based
        on the number of bytes written, for up to 10 seconds, and amazingly,
        this greatly reduces the performance loss on the Windows computers and
        doesn't hurt performance on Linux/OS X computers.

        I think this must be because calling sync immediately forces the OS to
        write dirty buffers to disk "in a rush" (severely impacting IO writes
        from other threads), whereas if you wait first, you let the OS
        schedule those writes on its own, at good times (maybe when IO system
        is "relatively" idle).
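
        A minimal sketch of this "sleep, then sync" idea; the class name and
        pacing constant are assumptions for illustration, not the patch's
        actual code:

        import java.io.IOException;
        import java.io.RandomAccessFile;

        // Illustrative: pause in proportion to the bytes written (capped), so the
        // OS can flush dirty buffers on its own schedule, then fsync.
        class SyncPause {
          static final long MAX_PAUSE_MSEC = 10000;   // assumed 10 sec cap

          static void pauseThenSync(RandomAccessFile file, long bytesWritten)
              throws IOException {
            long pause = Math.min(MAX_PAUSE_MSEC, bytesWritten / 10240);  // assumed pacing
            try {
              if (pause > 0)
                Thread.sleep(pause);   // by now the OS has likely written most buffers
            } catch (InterruptedException ie) {
              Thread.currentThread().interrupt();
            }
            file.getFD().sync();       // the sync is then mostly a no-op
          }
        }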

        It's disappointing to have to "game" the OS to gain back this
        performance. I wish Java had a "waitUntilSync'd" to do the same
        things as fsync, but without "rushing" the OS.

        On Linux 2.6.22 on a RAID5 array I still see a net performance cost of
        ~12%, sleeping or no sleeping. On Mac OS X it's ~3% loss.

        Other fixes:

        • DirectoryIndexReader's doCommit now also syncs
        • Improved logic on when we must sync-before-CFS: it's not necessary
          if the just-merged segments are not referenced by the last commit
          point (ie if they were all flushed during this writer session)
        • Created SegmentInfos.commit() method, which writes and then syncs
          the next segments_N file
        • Simplified sync() logic now that merge threads are stopped before
          writer is closed
        • Changed CMS.newMergeThread to name its threads
        • More test cases
        • Various other small fixes

        Here are the test details. I index the first 200K Wikipedia docs with
        this alg:

        analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
        doc.maker=org.apache.lucene.benchmark.byTask.feeds.LineDocMaker
        docs.file=/Volumes/External/lucene/wiki.txt
        doc.stored = true
        doc.term.vector = true
        doc.term.vector.offsets = true
        doc.term.vector.positions = true

        doc.maker.forever = false
        directory=FSDirectory

        { "BuildIndex"
        CreateIndex

        { "AddDocs" AddDoc > : 200000 CloseIndex }

        RepSumByPref BuildIndex

        Win2003 R64, JVM 1.6.0_03
        trunk: 523 sec
        patch: 547 sec (5% slower)

        Win XP Pro, laptop hard drive, JVM 1.4.2_15-b02
        trunk: 1237 sec
        patch: 1278 sec (3% slower)

        Linux ReiserFS on 6 drive RAID 5 array, JVM 1.5.0_08
        trunk: 483 sec
        patch: 539 sec (12% slower)

        Mac OS X 10.4 4-drive RAID 0 array, JVM 1.5.0_13
        trunk: 268 sec
        patch: 276 sec (3% slower)

        Michael McCandless added a comment -

        This is still in progress. It's clearly a serious bug since it's something out of your control that can easily cause index corruption.

        The sync call was removed because the simple approach is far too costly on some IO systems. The new approach (sync only on committing a merge) has more reasonable performance, but is not quite done yet.

        Andrew Zhang added a comment -

        Hi, any progress on this issue?

        I found sync call was removed from the source code. Is there an alternative to solve this problem? Thanks a lot!

        Michael McCandless added a comment -

        Initial patch attached:

        • Created new commit() method; deprecated public flush() method
        • Changed IndexWriter to not write segments_N when flushing, only
          when syncing (added new private sync() for this). The current
          "policy" is to sync only after merges are committed. When
          autoCommit=false we do not sync until close() or commit() is
          called
        • Added MockRAMDirectory.crash() to simulate a machine crash. It
          keeps track of unsynced files, and then in crash() it goes and
          corrupts any unsynced files rather aggressively.
        • Added a new unit test, TestCrash, to crash the MockRAMDirectory at
          various interesting times & make sure we can still load the
          resulting index.
        • Added new Directory.sync() method. In FSDirectory.sync, if I hit
          an IOException when opening or sync'ing, I retry (currently after
          waiting 5 msec, and retrying up to 5 times; see the sketch after
          this list). If it still fails after that, the original exception is
          thrown and the new segments_N will not be written (and, the previous
          commit will also not be deleted).
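
        A sketch of the retry loop referenced in the last bullet (illustrative;
        FSDirectory's actual code differs):

        import java.io.IOException;
        import java.io.RandomAccessFile;

        // Illustrative sketch of the retry-on-IOException logic described above:
        // open, sync, close; on failure wait 5 msec and retry, up to 5 times.
        class RetrySync {
          static void sync(String path) throws IOException {
            IOException firstExc = null;
            for (int retry = 0; retry < 5; retry++) {
              try {
                RandomAccessFile file = new RandomAccessFile(path, "rw");
                try {
                  file.getFD().sync();
                  return;                        // success
                } finally {
                  file.close();
                }
              } catch (IOException e) {
                if (firstExc == null)
                  firstExc = e;                  // remember the original exception
                try {
                  Thread.sleep(5);               // wait 5 msec before retrying
                } catch (InterruptedException ie) {
                  Thread.currentThread().interrupt();
                }
              }
            }
            throw firstExc;                      // all retries failed
          }
        }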

        All tests now pass, but there is still a lot to do, e.g. at least:

        • Javadocs
        • Refactor syncing code so DirectoryIndexReader.doCommit can use it
          as well.
        • Change format of segments_N to include a hash of its contents, at
          the end. I think this is now necessary in case we crash after
          writing segments_N but before we can sync it, to ensure that
          whoever next opens the reader can detect corruption in this
          segments_N file.
        Doron Cohen added a comment -

        I think we're walking on thin ice if we do that...

        Oh, I skimmed too fast that part of the discussion in the
        dev list. I agree with "thin ice" now.

        Michael McCandless added a comment -

        I think this would work too?

        FileInputStream fis = new FileInputStream(path);
        fis.getFD().sync();
        fis.close();
        

        This was suggested & debated on the java-dev list. But, the man page
        for "fsync" on Linux lists this as one of the errors:

        ERRORS
               EBADF  fd is not a valid file descriptor open for writing.
        

        And Yonik found at least one JVM implementation (I think Harmony) that
        simply skipped the sync if the descriptor was not open for write.

        I think we're walking on thin ice if we do that...

        Doron Cohen added a comment -

        > Though ... I am also a bit concerned about opening files for writing
        > that we had already previously closed. It arguably makes Lucene "not
        > quite" write-once.

        I think this would work too?

        FileInputStream fis = new FileInputStream(path);
        fis.getFD().sync();
        fis.close();
        
        Michael McCandless added a comment -

        I've moved this issue to 2.4. I think it's too risky to rush it in
        just before 2.3 is released vs committing just after 2.3 and
        giving it more time on the trunk.

        But, I think for 2.3 we should revert the optional "doSync" argument
        to FSDirectory: I believe the performance impact of syncing is low enough
        with the approach we're now taking, so I don't think we should make it
        so trivial to turn it off. I've added a sync() method to Directory,
        so if someone really wants to prevent syncing they will be able to
        subclass FSDirectory and make that method a noop.

        Doug Cutting added a comment -

        > I think this means when we do a "soft commit" we should not in fact
        > write a new segments_N file (as we do today).

        +1 As long as we commit periodically when autoCommit=true I don't think we're breaking any previously advertised contract.

        Michael McCandless added a comment -

        From java-dev, Robert Engels wrote:

        My reading of the Unix specification shows it should work (the _commit under Windows is less clear, and since Windows is not inode based, there may be different issues).

        http://www.opengroup.org/onlinepubs/007908799/xsh/fsync.html

        OK thanks Robert.

        I think very likely this approach (let's call it "sync after close")
        will work. The _commit docs (for WIN32) also seem to indicate that
        the file referenced by the descriptor is fully flushed (as we want):

        http://msdn2.microsoft.com/en-us/library/17618685

        Also at least PostgreSQL and Berkeley DB "trust" _commit as the
        equivalent of fsync (though I have no idea if they use it the same way
        we want to).

        Though ... I am also a bit concerned about opening files for writing
        that we had already previously closed. It arguably makes Lucene "not
        quite" write-once. And, we may need a retry loop on syncing because
        on Windows, various tools might wake up and peek into a file right
        after we close them, possibly interfering w/ our reopening/syncing.

        I think the alternative ("sync before close") is something like:

        • Add a new method IndexOutput.close(boolean doSync)
        • When a merge finishes, it must close all of its files with
          doSync=true; and write the new segments_N with doSync=true.
        • To implement commit() ... I think we'd have to force a merge of
          all written segments that were not sync'd. And on closing the
          writer we'd call commit(). This is obviously non-ideal because
          you can get very different sized level 1 segments out. Although
          the cost would be contained since it's only up to mergeFactor
          level 0 segments that we will merge.

        OK ... I'm leaning towards sticking with "sync after close", so I'll
        keep coding up this approach for now.

        Michael McCandless added a comment -

        Another nuance here is ... say we do a "soft commit" (write a new
        segment & segments_N but do not sync the files), and, the machine
        crashes. This is fine because there will always be an earlier commit
        point (segments_M) that was a "hard commit" (sync was done).

        Then, machine comes back up and we open a reader. The reader sees
        both segments_M (the hard commit) and segments_N (the soft commit) and
        chooses segments_N because it's more recent.

        We have retry logic in SegmentInfos to fallback to segments_M if we
        hit an IOException on opening the index described by segments_N.

        But, the problem is: the extent of the "corruption" caused by the
        crash could be somewhat subtle. EG a given file might be the right
        length, but, filled w/ zeroes. This is a problem because we may not
        then hit an IOException while opening the reader, but only later hit
        some exception while searching.

        I think this means when we do a "soft commit" we should not in fact
        write a new segments_N file (as we do today). When we do a "hard
        commit" we should first sync all files except the new segments_N file,
        then write the segments_N file, then sync it.
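
        As a sketch, assuming the per-file Directory.sync(name) method this
        issue adds, and with writeSegmentsN as a hypothetical stand-in for
        serializing the SegmentInfos:

        import java.io.IOException;
        import java.util.List;
        import org.apache.lucene.store.Directory;

        // Illustrative sketch of the "hard commit" ordering described above:
        // 1) sync every file the commit references, except segments_N itself;
        // 2) only then write segments_N; 3) sync segments_N last. A crash before
        // step 3 must leave a segments_N that readers can detect as not intact.
        class HardCommitSketch {
          void hardCommit(Directory dir, List fileNames, String segmentsN)
              throws IOException {
            for (int i = 0; i < fileNames.size(); i++)
              dir.sync((String) fileNames.get(i));   // step 1: referenced files
            writeSegmentsN(dir, segmentsN);          // step 2: hypothetical helper
            dir.sync(segmentsN);                     // step 3: the commit point
          }

          void writeSegmentsN(Directory dir, String name) throws IOException {
            // hypothetical: serialize the SegmentInfos to a new segments_N file
          }
        }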

        The thing is, while we have been (and want to continue to be) vague
        about exactly when a "commit" takes place as you add docs to
        IndexWriter, users have presumably gotten used to every flush (when
        autoCommit=true) committing a new segments_N file that an IndexReader
        can then see. So, this change (do not write segments_N file except
        for a hard commit) will break that behavior. Maybe, with the addition
        of the explicit commit() method, this is OK?

        Michael McCandless added a comment -

        > You could just queue the file names for sync, close them, and then have the background thread open, sync and close them. The close could trigger the OS to sync things faster in the background. Then the open/sync/close could mostly be a no-op. Might be worth a try.

        I am taking this approach now, but one nagging question I have is: do
        we know with some certainty that re-opening a file and then sync'ing
        it in fact syncs all writes that were ever done to this file in this
        JVM, even with previously opened and now closed descriptors? Vs., e.g.,
        only sync'ing any new writes done with that particular descriptor?

        In code:

        RandomAccessFile file = new RandomAccessFile(path, "rw");
        // ... do many writes to file ...
        file.close();
        new RandomAccessFile(path, "rw").getFD().sync();
        

        Are we pretty sure that all of the "many writes" will in fact be
        sync'd by that sync call, on all OSs?

        I haven't been able to find convincing evidence one way or another. I
        did run a timing test comparing overall time if you sync with the same
        descriptor you used for writing vs closing it, opening a new one, and
        syncing with that one, and on Linux at least it seems both approaches
        seem to be syncing because the total elapsed time is roughly the
        same.

        Robert do you know?

        I sure hope the answer is yes ... because if not, the alternative is
        we must sync() before closing the original descriptor, which makes
        things less flexible because eg we cannot cleanly implement
        IndexWriter.commit().

        Doug Cutting added a comment -

        > I think deprecating flush(), renaming it to commit()

        +1 That's clearer, since flushes are internal optimizations, while commits are important events to clients.

        Michael McCandless added a comment -

        > When autoCommit is true, then we should periodically commit automatically. When autoCommit is false, then nothing should be committed until the IndexWriter is closed. The ambiguous case is flush(). I think the reason for exposing flush() was to permit folks to commit without closing, so I think flush() should commit too, but we could add a separate commit() method that flushes and commits.

        I think deprecating flush(), renaming it to commit(), and clarifying
        the semantics to mean that commit() flushes pending docs/deletes,
        commits a new segments_N, syncs all files referenced by this commit,
        and blocks until the sync is complete, would make sense? And,
        commit() would in fact commit even when autoCommit is false (flush()
        doesn't commit now when autoCommit=false, which is indeed confusing).

        > Perhaps the semantics of autoCommit=true should be altered so that it commits less than every flush. Is that what you were proposing? If so, then I think it's a good solution. Prior to 2.2 the commit semantics were poorly defined. Folks were encouraged to close() their IndexWriter to persist changes, and that's about all we said. 2.2's docs say that things are committed at every flush, but there was no sync, so I don't think changing this could break any applications.

        > So I'm +1 for changing autoCommit=true to sync less than every flush, e.g., only after merges. I'd also argue that we should be vague in the documentation about precisely when autoCommit=true commits. If someone needs to know exactly when things are committed then they should be encouraged to explicitly flush(), not to rely on autoCommit.

        OK, I will test the "sync only when committing a merge" approach for
        performance. Hopefully a foreground sync() is fine given that with
        ConcurrentMergeScheduler that's already in a background thread. This
        would be a nice simplification.

        And I agree we should be vague about, and users should never rely on,
        precisely when Lucene has really committed (sync'd) the changes to
        disk. I'll fix the javadocs.

        Michael McCandless added a comment -

        I modified the CFS sync case to NOT bother syncing the files that go
        into the CFS. I also turned off syncing of segments.gen. I also
        tested on a Windows Server 2003 box.

        New patch attached (still a hack, just to test performance!) and new
        results. All tests are with the "sync every commit" policy:

        IO System | CFS sync | CFS nosync | CFS % slower | non-CFS sync | non-CFS nosync | non-CFS % slower
        2 drive RAID0, Windows 2003 Server R2 Enterprise x64 | 250 | 244 | 2.6% | 241 | 241 | 0.1%
        ReiserFS 6-drive RAID5 array, Linux (2.6.22.1) | 186 | 166 | 11.9% | 145 | 142 | 2.0%
        EXT3 single internal drive, Linux (2.6.22.1) | 160 | 158 | 0.9% | 142 | 135 | 4.8%
        4 drive RAID0 array, Mac Pro (10.4 Tiger) | 152 | 155 | -2.4% | 149 | 147 | 1.3%
        Win XP Pro laptop, single drive | 408 | 398 | 2.6% | 343 | 346 | -1.1%
        Mac Pro single external drive | 211 | 209 | 1.0% | 167 | 149 | 12.4%
        Doug Cutting added a comment -

        > But must every "automatic buffer flush" by IndexWriter really be a
        "permanent commit"?

        When autoCommit is true, then we should periodically commit automatically. When autoCommit is false, then nothing should be committed until the IndexWriter is closed. The ambiguous case is flush(). I think the reason for exposing flush() was to permit folks to commit without closing, so I think flush() should commit too, but we could add a separate commit() method that flushes and commits.

        > People who upgrade will suddenly get much worse performance.

        Yes, that would be bad. Perhaps the semantics of autoCommit=true should be altered so that it commits less than every flush. Is that what you were proposing? If so, then I think it's a good solution. Prior to 2.2 the commit semantics were poorly defined. Folks were encouraged to close() their IndexWriter to persist changes, and that's about all we said. 2.2's docs say that things are committed at every flush, but there was no sync, so I don't think changing this could break any applications.

        So I'm +1 for changing autoCommit=true to sync less than every flush, e.g., only after merges. I'd also argue that we should be vague in the documentation about precisely when autoCommit=true commits. If someone needs to know exactly when things are committed then they should be encouraged to explicitly flush(), not to rely on autoCommit.

        Michael McCandless added a comment -

        > I'm confused. The semantics of commit should be that all changes prior are made permanent, and no subsequent changes are permanent until the next commit. So syncs, if any, should map 1:1 to commits, no? Folks can make indexing faster by committing/syncing less often.

        But must every "automatic buffer flush" by IndexWriter really be a
        "permanent commit"? I do agree that when you close an IndexWriter, we
        should do a "permanent commit" (and block until it's done).

        Even if we use that policy, the BG sync thread can still fall behind
        such that the last few/many flushes are still in-process of being made
        permanent (eg I see this happening while a merge is running). In fact
        I'll have to block further flushes if syncing falls "too far" behind,
        by some metric. So, we already won't have any "guarantee" on when a
        given flush actually becomes permanent even if we adopt this policy.

        I think "merge finished" should be made a "permanent commit" because
        otherwise we are tying up potentially a lot of disk space,
        temporarily. But for a flush there's only a tiny amount of space (the
        old segments_N files) being tied up.

        Maybe we could make some flushes permanent but not all, depending on
        how far behind the sync thread is. E.g. if you do a flush, but the
        sync thread is still trying to make the last flush permanent, don't
        force the new flush to be permanent?

        In general, I think the longer we can wait after flushing before
        forcing the OS to make those writes "permanent", the better the
        chances that the OS has in fact already sync'd those files anyway, and
        so the sync cost should be lower. So maybe we could make every flush
        permanent, but wait a little while before doing so?

        Regardless of what policy we choose here (which commits must be made
        "permanent", and, when) I think the approach requires that
        IndexFileDeleter query the Directory so that it's only allowed to
        delete older commit points once a newer commit point has successfully
        become permanent.

        I also worry about those applications that are accidentally flushing
        too often now. Say your app now sets maxBufferedDocs=100. Right now,
        that gives you poor performance but not disastrous, but I fear if we
        do the "every commit is permanent" policy then performance could
        easily become disastrous. People who upgrade will suddenly get much
        worse performance.

        Doug Cutting added a comment -

        > How about if we don't sync every single commit point?

        I'm confused. The semantics of commit should be that all changes prior are made permanent, and no subsequent changes are permanent until the next commit. So syncs, if any, should map 1:1 to commits, no? Folks can make indexing faster by committing/syncing less often.

        Michael McCandless added a comment -

        How about if we don't sync every single commit point?

        I think on a crash what's important when you come back up is 1) index
        is consistent and 2) you have not lost that many docs from your index.
        Losing the last N (up to mergeFactor) flushes might be acceptable?

        E.g. we could force a full sync only when we commit the merge, before we
        remove the merged segments. This would mean on a crash that you're
        "guaranteed" to have the last successfully committed & sync'd merge to
        fall back to, and possibly a newer commit point if the OS had sync'd
        those files on its own?

        That would be a big simplification because I think we could just do
        the sync() in the foreground since ConcurrentMergeScheduler is already
        using BG threads to do merges.

        This would also mean we cannot delete the commit points that were not
        sync'd. So the first 10 flushes would result in 10 segments_N files.
        But then when the merge of these segments completes, and the result is
        sync'd, those files could all be deleted.

        Plus we would have to fix retry logic on loading the segments file to
        try more than just the 2 most recent commit points but that's a pretty
        minor change.

        I think it should mean better performance, because the longer you wait
        to call sync() presumably the more likely it is a no-op if the OS has
        already sync'd the file.

        Michael McCandless added a comment -

        Woops, the last line in the table above is wrong (it's a copy of the line before it). I'll re-run the test.

        Michael McCandless added a comment -

        OK I did a simplistic patch (attached) whereby FSDirectory has a
        background thread that re-opens, syncs, and closes those files that
        Lucene has written. (I'm using a modified version of the class from
        Doron's test).

        This patch is nowhere near ready to commit; I just coded up enough so
        we could get a rough measure of performance cost of syncing. EG we
        must prevent deletion of a commit point until a future commit point is
        fully sync'd to stable storage; we must also take care not to sync a
        file that has been deleted before we sync'd it; don't sync until the
        end when running with autoCommit=false; merges if run by
        ConcurrentMergeScheduler should [maybe] sync in the foreground; maybe
        forcefully throttle back updates if syncing is falling too far behind;
        etc.

        I ran the same alg as the tests above (index first 150K docs of
        Wikipedia). I ran CFS and non-CFS, each with sync and nosync (4 tests),
        for each IO system. Time is the fastest of 2 runs:

        IO System | CFS sync | CFS nosync | CFS % slower | non-CFS sync | non-CFS nosync | non-CFS % slower
        ReiserFS 6-drive RAID5 array, Linux (2.6.22.1) | 188 | 157 | 19.7% | 143 | 147 | -2.7%
        EXT3 single internal drive, Linux (2.6.22.1) | 173 | 157 | 10.2% | 136 | 132 | 3.0%
        4 drive RAID0 array, Mac Pro (10.4 Tiger) | 153 | 152 | 0.7% | 150 | 149 | 0.7%
        Win XP Pro laptop, single drive | 463 | 352 | 31.5% | 343 | 335 | 2.4%
        Mac Pro single external drive | 463 | 352 | 31.5% | 343 | 335 | 2.4%

        The good news is, the non-CFS case shows very little cost when we do
        BG sync'ing!

        The bad news is, the CFS case still shows a high cost. However, by
        not sync'ing the files that go into the CFS (and also not committing a
        new segments_N file until after the CFS is written) I expect that cost
        to go way down.

        One caveat: I'm using a 8 MB RAM buffer for all of these tests. As
        Yonik pointed out, if you have a smaller buffer, or, you add just a
        few docs and then close your writer, the sync cost as a percentage of net
        indexing time will be quite a bit higher.

        Doug Cutting added a comment -

        > I found out however that delaying the syncs (but intending to sync) also
        > means keeping the file handles open [...]

        Not necessarily. You could just queue the file names for sync, close them, and then have the background thread open, sync and close them. The close could trigger the OS to sync things faster in the background. Then the open/sync/close could mostly be a no-op. Might be worth a try.
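
        (A minimal sketch of the queue-then-reopen idea described here; the
        class name and structure are hypothetical, not the actual patch:)

            import java.io.IOException;
            import java.io.RandomAccessFile;
            import java.util.concurrent.BlockingQueue;
            import java.util.concurrent.LinkedBlockingQueue;

            // Writers close their files normally and enqueue the path; this
            // daemon thread re-opens each file, fsyncs it, and closes it.
            class BackgroundSyncer extends Thread {
              private final BlockingQueue<String> pending =
                new LinkedBlockingQueue<String>();

              BackgroundSyncer() { setDaemon(true); }

              void enqueue(String path) { pending.add(path); }

              public void run() {
                while (true) {
                  try {
                    String path = pending.take();  // block until a file is queued
                    RandomAccessFile f = new RandomAccessFile(path, "rw");
                    try {
                      f.getFD().sync();  // force the bytes to stable storage
                    } finally {
                      f.close();
                    }
                  } catch (InterruptedException ie) {
                    return;
                  } catch (IOException ioe) {
                    // a real version would retry, or surface this at commit time
                  }
                }
              }
            }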

        Doron Cohen added a comment - - edited

        Attached FSyncPerfTest.java is the standalone (non Lucene) perf test that I used.

        Doron Cohen added a comment -

        With some artificial CPU activity added to the test program:

        num files  num chars per file  No Sync  Sync At End  Background Sync  Immediate Sync
              100               10000     6690        11516            10706           11216
              100               10000     7200        11006            10575           10846
             1000                1000     8002        48570            48479           51825
             1000                1000     7801        43142            43693           43342
            10000                 100    16303       152730           326810          207939
            10000                 100    17805       156375           160040          165398

        Doron Cohen added a comment -

        I'll look into the separate thread to sync/close files in the
        background next...

        I was wondering if delaying the sync to the actual commit point would
        run faster than a background thread. I thought it would, because the
        background thread, though it does not block the current thread from
        continuing with indexing, forces the sync now rather than letting the
        IO subsystem write the data out on its own schedule. I was also hoping
        that by doing them later, some of the syncs would become no-ops, and
        hence faster. I found out, however, that delaying the syncs (but
        intending to sync) also means keeping the file handles open, and
        therefore this is not a practical approach. Still, it was interesting
        to compare.

        So... my small test sequentially writes M characters to N files and
        either does not sync (just close), or syncs in one of three ways:
        (1) at the end, (2) immediately, (3) in a background thread.
        The results (in millis) on my Windows XP were:

        num files  num chars per file  No Sync  Sync At End  Background Sync  Immediate Sync
              100               10000      631         5778             5729            5828
              100               10000      581         4486             4117            4687
             1000                1000     1612        38996            34900           35852
             1000                1000     1432        37153            35051           37263
            10000                 100    10335       154262           162103          174251
            10000                 100    11276       147752           159480          222450

        Each configuration ran twice and there are fluctuations,
        but it is obvious (as Mike noticed) that no-sync is much faster
        than sync. In fact, in my test no-sync is at least 10 times faster
        than any sync approach, while in Mike's test, which uses
        Lucene, the penalty is smaller. The difference might be because
        in my test there is no CPU work involved, just IO.

        Comparing "immediate" to "background", it is not clearly worth it
        to add a background thread (unless Mike's test proves otherwise...)

        Michael McCandless added a comment -

        Woops, OK I will put it back ...

        Michael Busch added a comment -

        I think changing the only constructor in FSDirectory.FSIndexOutput is
        an API change. I have a class that extends FSIndexOutput and it
        doesn't compile anymore after switching to the 2.3-dev jar.

        I think we should put this ctor back:

        public FSIndexOutput(File path) throws IOException {
          this(path, DEFAULT_DO_SYNC);
        }

        Michael McCandless added a comment -

        OK, I tested calling command-line "sync", after writing each segments
        file. It's in fact even slower than fsync on each file for these 3
        cases:

        Linux (2.6.22.1), reiserfs 6 drive RAID5 array 93% slower
        sync - 330.74
        nosync - 171.24

        Linux (2.6.22.1), ext3 single drive 60% slower
        sync - 242.02
        nosync - 150.91

        Mac Pro (10.4 Tiger), 4 drive RAID0 array 28% slower
        sync - 204.77
        nosync - 159.90

        I'll look into the separate thread to sync/close files in the
        background next...

        Michael McCandless added a comment -

        > Perhaps for the short-term, but long-term it would be better to find a solution that's both reliable and doesn't have such a big performance impact.

        Agreed. I will default doSync back to false, for now.

        > We really don't need to sync until we commit. It would be interesting to know how much it slows things to do that. As a quick hack we could try running the 'sync' command line program at each commit.

        I will test this as a hack first just to see how performance compares
        to the current approach.

        > If performance looks good, then we might look into implementing this in pure Java, changing FSDirectory.close() to queue FileDescriptors, add a background thread that syncs queued files, and add a Directory.sync() method that blocks until the queue is empty.

        Will do!

        Doug Cutting added a comment -

        > Maybe we should leave the default as false for now?

        Perhaps for the short-term, but long-term it would be better to find a solution that's both reliable and doesn't have such a big performance impact.

        We really don't need to sync until we commit. It would be interesting to know how much it slows things to do that. As a quick hack we could try running the 'sync' command line program at each commit. If performance looks good, then we might look into implementing this in pure Java, changing FSDirectory.close() to queue FileDescriptors, add a background thread that syncs queued files, and add a Directory.sync() method that blocks until the queue is empty.

        Michael McCandless added a comment -

        OK I ran sync/nosync tests across various platforms/IO system. In
        each case I ran the test once with doSync=true and once with
        doSync=false, using this alg:

        analyzer=org.apache.lucene.analysis.SimpleAnalyzer
        doc.maker=org.apache.lucene.benchmark.byTask.feeds.LineDocMaker
        docs.file=/lucene/wikifull.txt

        doc.maker.forever=false
        ram.flush.mb = 8
        max.buffered = 0
        directory = FSDirectory
        max.field.length = 2147483647
        doc.term.vector=false
        doc.stored=false
        work.dir = /tmp/lucene
        fsdirectory.dosync = false

        ResetSystemErase
        CreateIndex
        {AddDoc >: 150000
        CloseIndex

        RepSumByName

        Ie, time to index the first 150K docs from Wikipedia.

        Results for single hard drive:

        Mac mini (10.5 Leopard) single 4200 RPM "notebook" (2.5") drive – 2.3% slower:

        sync - 296.80 sec
        nosync - 290.06 sec

        Mac pro (10.4 Tiger), single external drive – 35.5% slower:

        sync - 259.61 sec
        nosync - 191.53 sec

        Win XP Pro laptop, single drive – 38.2% slower

        sync - 536.00 sec
        nosync - 387.90 sec

        Linux (2.6.22.1), ext3 single drive – 23% slower

        sync - 185.42 sec
        nosync - 150.56 sec

        Results for multiple hard drives (RAID arrays):

        Linux (2.6.22.1), reiserfs 6 drive RAID5 array – 49% slower (!!)

        sync - 239.32 sec
        nosync - 160.56 sec

        Mac Pro (10.4 Tiger), 4 drive RAID0 array – 1% faster

        sync - 157.26 sec
        nosync - 158.93 sec

        So at this point I'm torn...

        The performance cost of the simplest approach (sync() before close())
        is very high in many cases (not just laptop IO subsystems). The
        reiserfs result was rather shocking. Then, oddly, the cost is very
        low in other cases: the Mac Mini result I find amazing.

        It's frustrating to lose such performance "out of the box" for the
        presumably extremely rare event of OS/machine crash/power cut.

        Maybe we should leave the default as false for now?

        Doug Cutting added a comment -

        > Is a sync before every file close really needed [...] ?

        It might be nice if we could use the Linux sync() system call, instead of fsync(). Then we could call that only when the new segments file is moved into place rather than as each file is closed. We could exec the sync shell command when running on Unix, but I don't know whether there's an equivalent command for Windows, and it wouldn't be Java...
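
        (A one-line sketch of the exec-the-'sync'-command hack, Unix only and,
        as noted, not pure Java; per the measurements above, it turned out even
        slower than per-file fsync:)

            // Ask the OS to flush *all* dirty buffers to disk (Unix 'sync').
            Process p = Runtime.getRuntime().exec("sync");
            p.waitFor();  // throws InterruptedException; handle it in real code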

        Michael McCandless added a comment -

        > Was that compound or non-compound index format? I imagine
        > non-compound will take a bigger hit since each file will be
        > synchronized separately and in a serialized fashion.

        The test was with compound file.

        But, the close() on each component file that goes into the compound
        file also does a sync, so compound file would be a slightly bigger hit
        because it has one additional sync()?

        We can't safely remove the sync() on each component file before
        building the compound file because we currently do a commit of the new
        segments file before building the compound file.

        I guess we could revisit whether that commit (before building the
        compound file) is really necessary? I think it's there from when
        flushing & merging were the same thing, and you do want to do this
        when merging to save 1X extra peak on the disk usage, but now that
        flushing is separate from merging we could remove that intermediate
        commit?

        > I also imagine that the hit will be larger for a weaker disk
        > subsystem, and for usage patterns that continually add a few docs and
        > close?

        OK I'll run the same test, but once on a laptop and once over NFS to
        see what the cost is for those cases.

        Yes, continually adding docs & flushing/closing your writer will in
        theory be most affected here. I think for such apps performance is
        not usually top priority (indexing latency is)? Ie if you wanted
        performance you would batch up the added docs more? Anyway, for such
        cases users can turn off sync() if they want to risk it?

        > Is a sync before every file close really needed, or can some of them
        > be avoided when autocommit==false?

        It's somewhat tricky to safely remove sync() even when
        autoCommit=false, because you don't know at close() whether this file
        you are closing will be referenced (and not merged away) when the
        commit is finally done (when IndexWriter is closed).

        If there were a way to sync a file after having closed it (is there?)
        then we could go and sync() all new files we had created that are now
        referenced by the segments file we are writing.
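
        (One way to sync a file after it has been closed is to re-open it and
        fsync the new descriptor, which is what the background-sync experiment
        described above does. A minimal sketch, where "path" is a hypothetical
        variable naming the already-closed file:)

            // Sync a file that has already been closed: any descriptor
            // opened on the file can force its data to stable storage.
            RandomAccessFile f = new RandomAccessFile(path, "rw");
            try {
              f.getFD().sync();
            } finally {
              f.close();
            }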

        Also, I was thinking we could start simple (call sync() before every
        close()) and then with time, and if necessary, work out smarter ways
        to safely remove some of those sync()'s.

        > Also, the 'sync' should be optional. BerkeleyDB offers similar
        > functionality.

        It is optional: I added doSync boolean to
        FSDirectory.getDirectory(...).

        And, I agree: for cases where there is very low cost to regenerate the
        index, and you want absolute best performance, you can turn off
        syncing.

        robert engels added a comment -

        I agree. Just for a baseline, I think the test needs to be done on a single drive system.

        Also, the 'sync' should be optional. BerkeleyDB offers similar functionality.

        The reason being, if the index can be completely recreated from other sources, you might not want to pay the performance hit, instead recreate the index if corruption/hard failure occurs.

        Yonik Seeley added a comment -

        > This is on a quad core Mac OS X (Mac Pro) with a 4-drive RAID 0 IO
        > system. The baseline (non-sync) test took 19:54 and the sync test
        > took 20:21, which I think is a fairly minor slowdown.

        Was that compound or non-compound index format? I imagine non-compound will take a bigger hit since each file will be synchronized separately and in a serialized fashion. I also imagine that the hit will be larger for a weaker disk subsystem, and for usage patterns that continually add a few docs and close?

        Is a sync before every file close really needed, or can some of them be avoided when autocommit==false?

        Michael McCandless added a comment -

        I just committed this. Thanks Venkat!

        Michael McCandless added a comment -

        Attached another rev of the patch.

        I changed the default to "true": I think the small performance hit is
        worth the added safety.

        Also put a try/finally around the call to sync() to make sure we
        close even if we hit an exception during sync(), and improved the
        javadocs. I plan to commit in a day or two.

        Michael McCandless added a comment -

        Attached another rev of the patch, that adds "fsdirectory.dosync"
        boolean config option to contrib/benchmark.

        I ran a quick perf test of sync vs no sync. I indexed all of
        Wikipedia using this alg:

        analyzer=org.apache.lucene.analysis.SimpleAnalyzer

        # Feed that knows how to process the line file format:
        doc.maker=org.apache.lucene.benchmark.byTask.feeds.LineDocMaker

        docs.file=/lucene/wikifull.txt

        doc.maker.forever=false
        ram.flush.mb = 8
        max.buffered = 0
        directory = FSDirectory
        max.field.length = 2147483647
        doc.term.vector=false
        doc.stored=false
        fsdirectory.dosync = true

        ResetSystemErase
        CreateIndex
        {AddDoc >: *
        CloseIndex

        RepSumByName

        This is on a quad core Mac OS X (Mac Pro) with a 4-drive RAID 0 IO
        system. The baseline (non-sync) test took 19:54 and the sync test
        took 20:21, which I think is a fairly minor slowdown.

        I also tried opening the file descriptor with "rws", which I think is
        overkill for us (we don't need every IO operation to be sync'd), and it
        took 31:11, which is a major slowdown.

        Maybe we should actually make doSync=true the default? It seems like
        a small price to pay for the added safety. The option would still be
        there to turn off if people wanted to make the opposite tradeoff.

        Michael McCandless added a comment -

        Attached patch that adds optional "doSync" boolean to
        FSDirectory.getDirectory(...). It defaults to "false". When true, I
        call file.getFD().sync() just before file.close() in
        FSIndexOutput.close().
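
        (A simplified sketch of the described change, assuming "file" is the
        underlying RandomAccessFile; the later patch revision above also wraps
        the sync in a try/finally:)

            public void close() throws IOException {
              if (doSync) {
                file.getFD().sync();  // force OS buffers to the device before closing
              }
              file.close();
            }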

        However, I can't figure out how to also sync the directory. Does
        anyone know how to do this in Java?

        All tests pass if I default it to true or to false.

        Michael McCandless added a comment -

        This recent thread is also relevant here:

        http://www.gossamer-threads.com/lists/lucene/java-dev/39898

        Michael McCandless added a comment -

        See the healthy follow-on discussion here:

        http://www.gossamer-threads.com/lists/lucene/java-dev/54300

        I plan to add optional argument when calling FSDirectory.getDirectory() to ask all created FSIndexOutputs to always call sync() on the file descriptor before closing it.

        venkat rangan added a comment -

        Robert,
        Are your comments applicable to version 1.4.3? The behavior of all-zero 'segments' and 'deleted' files is very easily reproduced. Also, there is no leftover 'segments.new' after a power-cord yank.
        Thanks.

        robert engels added a comment -

        The last comment is not correct: there are many Java-based applications (and non-Java ones) that offer true transactional integrity.

        It usually involves a log file, and using sync to ensure data is written to disk.

        The Lucene structure allows for this VERY easily, as the 'segments' file controls everything.

        If all previous files are synced, then the 'segments.new' file is written and synced (with a marker/checksum), the old 'segments' is deleted, and 'segments.new' is renamed to 'segments', then it is trivial to ensure transactional integrity.

        Upon index open, check for 'segments.new': if it doesn't exist or does not have a valid checksum, delete all segment files not referenced by 'segments'; if it is valid, reattempt the rename. Then open the index.
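
        (A hedged sketch of that open-time recovery rule; hasValidChecksum and
        deleteFilesNotReferencedBy are hypothetical helpers, not Lucene APIs:)

            File segments = new File(indexDir, "segments");
            File segmentsNew = new File(indexDir, "segments.new");
            if (segmentsNew.exists() && hasValidChecksum(segmentsNew)) {
              // the commit was fully written but the rename was interrupted:
              // finish the rename and use the new commit
              segments.delete();
              segmentsNew.renameTo(segments);
            } else {
              // no pending commit, or a partial one: fall back to the old
              // 'segments' and discard anything it does not reference
              segmentsNew.delete();
              deleteFilesNotReferencedBy(segments);
            }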

        Hoss Man added a comment -

        first off: there have been numerous changes to the way Lucene writes to files (particularly relating to segment files, write locks, and fault tolerance) between 2.0 and 2.2 (not to mention differences between 1.4.3 and 2.0 that I may not be aware of) – so you may see many differences in behavior if you upgrade.

        second: to quote myself from a recent thread regarding lucene and "kill -9" ...

        http://www.nabble.com/Help-with-Lucene-Indexer-crash-recovery-tf4572570.html#a13068939

        : That said, it should never in fact cause index corruption, as far as I
        : know. Lucene is "semi-transactional": at any & all moments you should
        : be able to destroy the JVM and the index will be unharmed. I would
        : really like to get to the bottom of why this is not the case here.

        At any point you can shut down the JVM and the index will be unharmed, but
        "destroying" it with "kill -9" goes a little further than that.

        Lucene can't make that claim because the JVM can't even guarantee that
        bytes are written to physical disk when we close() an OutputStream – all
        it guarantees is that the bytes have been handed to the OS. When you "kill
        -9" a process the OS is free to make EVERYTHING about that process
        vanish without cleaning up after it ... I'm pretty sure even pending IO
        operations are fair game for disappearing.

        ...what's true for "kill -9" is true for yanking the power cord ... if the JVM isn't shut down cleanly, there is nothing Lucene or the JVM can do to guarantee that your index is in a consistent state.


          People

          • Assignee: Michael McCandless
          • Reporter: venkat rangan
          • Votes: 1
          • Watchers: 4
