Lucene - Core
  1. Lucene - Core
  2. LUCENE-3627

CorruptIndexException on indexing after a failure occurs after segments file creation but before any bytes are written

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Critical Critical
    • Resolution: Fixed
    • Affects Version/s: 3.5
    • Fix Version/s: 3.6, 4.0-ALPHA
    • Component/s: None
    • Labels:
      None
    • Environment:
    • Lucene Fields:
      New, Patch Available

      Description

      FSDirectory.createOutput(..) uses a RandomAccessFile to do its work. On my system the default FSDirectory.open(..) creates an NIOFSDirectory. If createOutput is called on a segments_* file and a crash occurs between RandomAccessFile creation (file system shows a segments_* file exists but has zero bytes) but before any bytes are written to the file, subsequent IndexWriters cannot proceed. The difficulty is that it does not know how to clear the empty segments_* file. None of the file deletions will happen on such a segment file because the opening bytes cannot not be read to determine format and version.

      An initial proposed patch file is attached below.

      1. TestCrashCausesCorruptIndex.java
        11 kB
        Ken McCracken
      2. LUCENE-3627.patch
        10 kB
        Michael McCandless
      3. LUCENE-3627_initial_proposal.txt
        14 kB
        Ken McCracken

        Activity

        Hide
        Ken McCracken added a comment -

        Drop this file in src/test/org/apache/lucene/store for the lucene 3.5.0 source release, and run your unit tests with junit4.

        Show
        Ken McCracken added a comment - Drop this file in src/test/org/apache/lucene/store for the lucene 3.5.0 source release, and run your unit tests with junit4.
        Hide
        Ken McCracken added a comment -

        I have been reviewing https://issues.apache.org/jira/browse/LUCENE-3255 which seems related in that the error is encountered on the same section of SegmentInfos.java.
        One way to fix this might be to change SegmentInfos.java as follows where int format = input.readInt(); to

        int format;
        try

        { format = input.readInt(); }

        catch (IOException ioe) {
        if (input.length() == 0) {
        try

        { input.close(); }

        finally

        { directory.deleteFile(segmentFileName); }

        return;
        }
        throw ioe;
        }

        however, there are unit tests that seem to verify that no file deletions are happening at this low a level.

        Show
        Ken McCracken added a comment - I have been reviewing https://issues.apache.org/jira/browse/LUCENE-3255 which seems related in that the error is encountered on the same section of SegmentInfos.java. One way to fix this might be to change SegmentInfos.java as follows where int format = input.readInt(); to int format; try { format = input.readInt(); } catch (IOException ioe) { if (input.length() == 0) { try { input.close(); } finally { directory.deleteFile(segmentFileName); } return; } throw ioe; } however, there are unit tests that seem to verify that no file deletions are happening at this low a level.
        Hide
        Ken McCracken added a comment - - edited

        Initial proposed patch. I'm not sure if this is the correct place and scope. But it does fix my test case.
        The test case and the proposed code change are attached.

        aaa:lucene kmccrack$ svn info
        Path: .
        URL: http://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_3_5/lucene
        Repository Root: http://svn.apache.org/repos/asf
        Repository UUID: 13f79535-47bb-0310-9956-ffa450edef68
        Revision: 1211687
        Node Kind: directory
        Schedule: normal
        Last Changed Author: sarowe
        Last Changed Rev: 1207561
        Last Changed Date: 2011-11-28 15:11:35 -0500 (Mon, 28 Nov 2011)

        Show
        Ken McCracken added a comment - - edited Initial proposed patch. I'm not sure if this is the correct place and scope. But it does fix my test case. The test case and the proposed code change are attached. aaa:lucene kmccrack$ svn info Path: . URL: http://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_3_5/lucene Repository Root: http://svn.apache.org/repos/asf Repository UUID: 13f79535-47bb-0310-9956-ffa450edef68 Revision: 1211687 Node Kind: directory Schedule: normal Last Changed Author: sarowe Last Changed Rev: 1207561 Last Changed Date: 2011-11-28 15:11:35 -0500 (Mon, 28 Nov 2011)
        Hide
        Michael McCandless added a comment -

        Thanks Ken!

        I attached a new patch, whittling back the test a bit (while still
        showing the bug), fixing whitespace, etc.

        I also removed the @author tags (we don't attach names to the source
        code...).

        I fixed the bug in two places. First, when writing the segments file,
        I moved the createOutput in SegmentInfos.java inside the try/finally
        that deletes the file if an exception occurs. This way we shouldn't
        create 0-byte segments_N anymore.

        Second, in IndexFileDeleter I ignore a given segments_N if it's 0
        bytes; I don't delete the file at that point because IFD's normal
        ref-counting will eventually remove it.

        Show
        Michael McCandless added a comment - Thanks Ken! I attached a new patch, whittling back the test a bit (while still showing the bug), fixing whitespace, etc. I also removed the @author tags (we don't attach names to the source code...). I fixed the bug in two places. First, when writing the segments file, I moved the createOutput in SegmentInfos.java inside the try/finally that deletes the file if an exception occurs. This way we shouldn't create 0-byte segments_N anymore. Second, in IndexFileDeleter I ignore a given segments_N if it's 0 bytes; I don't delete the file at that point because IFD's normal ref-counting will eventually remove it.
        Hide
        Shai Erera added a comment -

        Nice catch Ken !

        Patch looks good to me. Perhaps replace the msg "code will not get here" with "should have hit CrashException"? Otherwise, +1 to commit.

        Show
        Shai Erera added a comment - Nice catch Ken ! Patch looks good to me. Perhaps replace the msg "code will not get here" with "should have hit CrashException"? Otherwise, +1 to commit.
        Hide
        Shane Detsch added a comment -

        Would you ever have the case there only one "segements_N" file is corrupt 'i.e. 0 size'? If so how would another "segments_N" file be regenerated? Does this 0-sized "segments_N" file always happen on a newly generated "segmetns_N" and "segments_N-1" still exist and is considered active? After deletion of the 0-sized "segments_N" file is there a process to make sure that the currently active "segmetns_N" (not the 0-sized one) accurately represents the segments in the index?

        Show
        Shane Detsch added a comment - Would you ever have the case there only one "segements_N" file is corrupt 'i.e. 0 size'? If so how would another "segments_N" file be regenerated? Does this 0-sized "segments_N" file always happen on a newly generated "segmetns_N" and "segments_N-1" still exist and is considered active? After deletion of the 0-sized "segments_N" file is there a process to make sure that the currently active "segmetns_N" (not the 0-sized one) accurately represents the segments in the index?
        Hide
        Michael McCandless added a comment -

        Would you ever have the case there only one "segements_N" file is corrupt 'i.e. 0 size'?

        No, unless something terrible is going on, such as your IO system
        disregards fsync, you have hardware problems, or something external is
        mucking with the index files.

        When Lucene commits it writes the next segments_N, fsyncs it (and all
        index file it references) and then removes old commits; so the old
        commits are not removed until the new one is on durable storage.

        I think the conditions for this bug to occur are very rare, ie, lucene
        tried to commit, hit a transient IO problem in creating the file (so
        that a 0 byte file is created), your app caught this and called
        IW.close, this time the transient IO problem is gone and the close
        succeeds in writing another segments_N file. At that point you'd hit
        this bug when you next tried to open an IndexWriter on the index.

        The more common failure would be if the segments_N fails to write (0
        byte file) and then when you close the IW it also fails, ie, because
        something suddenly is wrong w/ your IO system. In this case your
        "latest" segments_N is unreadable, and Lucene handles that fine (falls
        back to segments_(N-1), which should be ok because it was fsync'd).

        Show
        Michael McCandless added a comment - Would you ever have the case there only one "segements_N" file is corrupt 'i.e. 0 size'? No, unless something terrible is going on, such as your IO system disregards fsync, you have hardware problems, or something external is mucking with the index files. When Lucene commits it writes the next segments_N, fsyncs it (and all index file it references) and then removes old commits; so the old commits are not removed until the new one is on durable storage. I think the conditions for this bug to occur are very rare, ie, lucene tried to commit, hit a transient IO problem in creating the file (so that a 0 byte file is created), your app caught this and called IW.close, this time the transient IO problem is gone and the close succeeds in writing another segments_N file. At that point you'd hit this bug when you next tried to open an IndexWriter on the index. The more common failure would be if the segments_N fails to write (0 byte file) and then when you close the IW it also fails, ie, because something suddenly is wrong w/ your IO system. In this case your "latest" segments_N is unreadable, and Lucene handles that fine (falls back to segments_(N-1), which should be ok because it was fsync'd).
        Hide
        Ken McCracken added a comment -

        Thank you Mike for cleaning my suggestions up. I am validating and reviewing currently...

        What is the svn revision branch this patch would get checked in to?

        Show
        Ken McCracken added a comment - Thank you Mike for cleaning my suggestions up. I am validating and reviewing currently... What is the svn revision branch this patch would get checked in to?
        Hide
        Michael McCandless added a comment -

        Thanks Ken!

        What is the svn revision branch this patch would get checked in to?

        Oh, I checked into 3.x (to be 3.6.0) and trunk... if you select the "All" tab here, under Activity, you can see the full SVN paths...

        Show
        Michael McCandless added a comment - Thanks Ken! What is the svn revision branch this patch would get checked in to? Oh, I checked into 3.x (to be 3.6.0) and trunk... if you select the "All" tab here, under Activity, you can see the full SVN paths...
        Hide
        Ken McCracken added a comment -

        @Mike: My initial validation confirms your suggested patch works for me. Thank you.

        Show
        Ken McCracken added a comment - @Mike: My initial validation confirms your suggested patch works for me. Thank you.
        Hide
        Michael McCandless added a comment -

        Awesome, thanks Ken!

        Show
        Michael McCandless added a comment - Awesome, thanks Ken!

          People

          • Assignee:
            Michael McCandless
            Reporter:
            Ken McCracken
          • Votes:
            2 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Time Tracking

              Estimated:
              Original Estimate - 48h
              48h
              Remaining:
              Remaining Estimate - 48h
              48h
              Logged:
              Time Spent - Not Specified
              Not Specified

                Development