Lucene - Core
LUCENE-5580

Always verify stored fields' checksum on merge

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.8
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      I have seen a couple of index corruptions over the last few months, and most of them affected stored fields. The explanation might simply be that, since stored fields usually account for most of the index size, they are more likely to be hit by a hardware or operating-system failure, but it might just as well be a sneaky bug on our side.

      Lucene recently added checksums to index files, and you can enable integrity verification upon merge, but this comes with a cost since you need to read all index files twice instead of once. If you are merging a very large segment and your merges are I/O-bound, this might be noticeable.

      I would like to implement integrity checks for stored fields on merges on the fly, so that the stored fields files need to be read only once.
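
To illustrate the idea of verifying integrity in the same pass that copies the data, here is a minimal sketch using only the JDK. It is not the patch attached to this issue and does not use Lucene classes: the class `SinglePassChecksumCopy`, its methods, and the file layout (payload followed by an 8-byte CRC32 footer) are assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

// Hypothetical example class, not part of Lucene.
class SinglePassChecksumCopy {

    /** Builds a "file": the payload followed by an assumed 8-byte CRC32 footer. */
    static byte[] withFooter(byte[] payload) {
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        return ByteBuffer.allocate(payload.length + 8)
                .put(payload)
                .putLong(crc.getValue())
                .array();
    }

    /**
     * Copies payloadLength bytes from in to out, updating a CRC32 as the bytes
     * go by, then compares the result with the footer. The source is read once.
     */
    static void copyAndVerify(InputStream in, long payloadLength, OutputStream out)
            throws IOException {
        CRC32 crc = new CRC32();
        DataInputStream data = new DataInputStream(in);
        byte[] buffer = new byte[4096];
        long remaining = payloadLength;
        while (remaining > 0) {
            int n = data.read(buffer, 0, (int) Math.min(buffer.length, remaining));
            if (n < 0) {
                throw new EOFException("payload truncated, " + remaining + " bytes missing");
            }
            crc.update(buffer, 0, n); // checksum the bytes being copied
            out.write(buffer, 0, n);
            remaining -= n;
        }
        long expected = data.readLong(); // footer written by withFooter (big-endian)
        if (expected != crc.getValue()) {
            throw new IOException("checksum mismatch: expected=" + expected
                    + " actual=" + crc.getValue());
        }
    }

    /** Convenience wrapper: true if the copy completes and the checksum matches. */
    static boolean isIntact(byte[] file) {
        try {
            copyAndVerify(new ByteArrayInputStream(file), file.length - 8,
                    new ByteArrayOutputStream());
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] file = withFooter(new byte[] {1, 2, 3, 4, 5});
        System.out.println("intact: " + isIntact(file));
        file[2] ^= 0x01; // simulate a flipped bit on disk
        System.out.println("after corruption: " + isIntact(file));
    }
}
```

The trade-off in this sketch is that corruption is only detected after the bytes have already been written to the destination, so the caller has to be prepared to discard the output when the check fails.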

          Activity

          Adrien Grand added a comment -

          Here is a patch that verifies checksums on stored fields when doing bulk merges.

          Michael McCandless added a comment -

          +1 to verify the checksum on the fly without reading the file twice, and the patch looks good.

          We could pull that anonymous BufferedChecksumIndexInput subclass out (e.g., ForwardOnlySeekingChecksum... or something) and CompressingTermVectors could do the same thing? Other non-bulk-copying components could also use it, e.g. I think when merging postings we read nearly the entire file already (no actual seeking)...

          We can do that in a separate issue.
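
A rough sketch of what such a forward-only seeking checksum input could look like, written against plain JDK streams rather than Lucene's `IndexInput` API. The class name `ForwardOnlyChecksumInput` and all of its methods are hypothetical, and this is not the anonymous subclass from the patch. The assumption is that a forward seek can be served by reading the skipped bytes, so they still feed the checksum, while a backward seek is rejected.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.zip.CRC32;

// Hypothetical example class, not part of Lucene.
class ForwardOnlyChecksumInput {
    private final InputStream in;
    private final CRC32 crc = new CRC32();
    private final byte[] skipBuffer = new byte[1024];
    private long position = 0;

    ForwardOnlyChecksumInput(InputStream in) {
        this.in = in;
    }

    /** Reads up to len bytes and folds them into the running checksum. */
    int read(byte[] dest, int off, int len) throws IOException {
        int n = in.read(dest, off, len);
        if (n > 0) {
            crc.update(dest, off, n);
            position += n;
        }
        return n;
    }

    /** Moves forward by reading (and checksumming) the bytes in between. */
    void seek(long target) throws IOException {
        if (target < position) {
            throw new IllegalStateException(
                    "cannot seek backwards: position=" + position + " target=" + target);
        }
        while (position < target) {
            int chunk = (int) Math.min(skipBuffer.length, target - position);
            if (read(skipBuffer, 0, chunk) < 0) {
                throw new EOFException("seek past end of input: target=" + target);
            }
        }
    }

    long checksum() {
        return crc.getValue();
    }

    /** Demo helper: seeks to each target in order, drains the rest, returns the CRC32. */
    static long checksumWithSeeks(byte[] data, long... targets) {
        ForwardOnlyChecksumInput input =
                new ForwardOnlyChecksumInput(new ByteArrayInputStream(data));
        try {
            for (long target : targets) {
                input.seek(target);
            }
            byte[] buffer = new byte[512];
            while (input.read(buffer, 0, buffer.length) >= 0) {
                // keep reading so the checksum covers the whole input
            }
            return input.checksum();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = new byte[5000];
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) i;
        }
        CRC32 whole = new CRC32();
        whole.update(data, 0, data.length);
        long viaSeeks = checksumWithSeeks(data, 100, 3000);
        System.out.println("matches full-file CRC32: " + (viaSeeks == whole.getValue()));
    }
}
```

Because every byte passes through `read`, the final checksum covers the whole input regardless of how the caller mixes reads and forward seeks, which is what would let a merge verify the file without a second pass.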

          Adrien Grand added a comment -

          I agree it would be nice to do this for more index formats. I think I'll open a new issue, since I would like to get at least this one into 4.8 and make sure it goes through enough Jenkins builds before the release.

          ASF subversion and git services added a comment -

          Commit 1585910 from jpountz@apache.org in branch 'dev/trunk'
          [ https://svn.apache.org/r1585910 ]

          LUCENE-5580: Always verify stored fields checksums on bulk merge.

          ASF subversion and git services added a comment -

          Commit 1585913 from jpountz@apache.org in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1585913 ]

          LUCENE-5580: Always verify stored fields checksums on bulk merge.

          Uwe Schindler added a comment -

          Close issue after release of 4.8.0


            People

            • Assignee:
              Adrien Grand
              Reporter:
              Adrien Grand
            • Votes:
              0
              Watchers:
              3
