Lucene - Core / LUCENE-7323

Compound file writing should verify checksum of its sub-files

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: master (7.0), 6.2
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

      Description

      For larger segments, there is a non-trivial window between when IW
      (IndexWriter) writes the sub-files and when it builds the CFS
      (compound file), during which the files can become corrupted (by an
      external process, a bad filesystem, hardware, etc.).

      Today we quietly build the CFS even if the sub-files are corrupted,
      but we can easily detect this, letting users catch corruption earlier
      (at write time instead of read time).
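
      The idea can be sketched in plain Java. This is a simplified model, not
      Lucene's code: it assumes a hypothetical sub-file layout of payload bytes
      followed by an 8-byte big-endian CRC32 footer, and the names
      SubFileChecksumSketch, withFooter and copyVerified are made up for
      illustration. Lucene's real footer format and CodecUtil methods differ.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Simplified model (assumption): a sub-file is payload bytes followed by an
// 8-byte big-endian CRC32 footer. This is NOT Lucene's actual file format.
class SubFileChecksumSketch {

  /** Builds a sub-file by appending the CRC32 of the payload as a footer. */
  public static byte[] withFooter(byte[] payload) {
    CRC32 crc = new CRC32();
    crc.update(payload, 0, payload.length);
    return ByteBuffer.allocate(payload.length + Long.BYTES)
        .put(payload)
        .putLong(crc.getValue())
        .array();
  }

  /**
   * Recomputes the checksum of the payload, compares it with the stored
   * footer, and only then copies the payload into the compound output.
   */
  public static void copyVerified(String name, byte[] subFile, OutputStream compound)
      throws IOException {
    if (subFile.length < Long.BYTES) {
      throw new IOException("sub-file too short to hold a footer: " + name);
    }
    int payloadLength = subFile.length - Long.BYTES;
    CRC32 crc = new CRC32();
    crc.update(subFile, 0, payloadLength);
    long stored = ByteBuffer.wrap(subFile, payloadLength, Long.BYTES).getLong();
    if (crc.getValue() != stored) {
      throw new IOException("checksum mismatch in sub-file " + name
          + ": stored=" + Long.toHexString(stored)
          + " computed=" + Long.toHexString(crc.getValue()));
    }
    compound.write(subFile, 0, payloadLength);
  }

  public static void main(String[] args) throws IOException {
    byte[] subFile = withFooter("doc values".getBytes(StandardCharsets.UTF_8));
    ByteArrayOutputStream compound = new ByteArrayOutputStream();
    copyVerified("_0.dvd", subFile, compound); // intact file: copied

    subFile[0] ^= 0x01; // simulate a single flipped bit after the file was written
    try {
      copyVerified("_0.dvd", subFile, compound);
    } catch (IOException e) {
      System.out.println("caught at write time: " + e.getMessage());
    }
  }
}
```

      In this sketch the check runs before any bytes are written, so a corrupt
      sub-file never reaches the compound output; a streaming implementation
      could instead checksum while copying and fail at the footer.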

      1. LUCENE-7323.patch
        57 kB
        Michael McCandless
      2. LUCENE-7323.patch
        48 kB
        Michael McCandless

          Activity

          mikemccand Michael McCandless added a comment -

          Patch, I think it's close. It fixes our default
          Lucene50CompoundFileFormat to verify the checksum of its sub-files
          when writing.

          I also had to close up external access to SimpleText's doc values and
          postings format, i.e., you must use them only via SimpleTextCodec,
          because these files (intentionally) don't write codec headers and
          footers so you can't put them into a "normal" CFS file (SimpleText has
          its own CFS that doesn't verify checksums).

          I also made CodecUtil.read/writeCRC package private: do they
          really need to be public?

          mikemccand Michael McCandless added a comment -

          Another iteration, also verifying the segment ID of all incoming sub-files is correct ... I think it's ready.
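
          A minimal sketch of the segment ID check, under stated assumptions:
          it models a sub-file whose first 16 bytes are the ID of the segment
          that wrote it, which is a simplification of Lucene's real index
          header. The class and method names (SegmentIdCheckSketch,
          verifySegmentId) are hypothetical and not part of the patch.

```java
import java.io.IOException;
import java.util.Arrays;

// Simplified model (assumption): the first 16 bytes of a sub-file hold the ID
// of the segment that wrote it. Lucene's real header has more fields.
class SegmentIdCheckSketch {

  static final int ID_LENGTH = 16;

  /** Throws if the sub-file does not carry the expected segment ID. */
  public static void verifySegmentId(String name, byte[] subFile, byte[] expectedId)
      throws IOException {
    if (expectedId.length != ID_LENGTH) {
      throw new IllegalArgumentException("segment ID must be " + ID_LENGTH + " bytes");
    }
    if (subFile.length < ID_LENGTH) {
      throw new IOException("sub-file too short to hold a segment ID: " + name);
    }
    byte[] actualId = Arrays.copyOfRange(subFile, 0, ID_LENGTH);
    if (!Arrays.equals(actualId, expectedId)) {
      throw new IOException("sub-file " + name + " belongs to a different segment");
    }
  }

  public static void main(String[] args) throws IOException {
    byte[] segmentId = new byte[ID_LENGTH];
    Arrays.fill(segmentId, (byte) 7);

    // A sub-file written by this segment: ID followed by some payload bytes.
    byte[] subFile = Arrays.copyOf(segmentId, ID_LENGTH + 4);
    verifySegmentId("_0.fdt", subFile, segmentId); // matches, no exception

    byte[] otherId = new byte[ID_LENGTH]; // all zeros: a different segment
    try {
      verifySegmentId("_0.fdt", subFile, otherId);
    } catch (IOException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```

          A check like this catches a sub-file that is internally consistent
          (its checksum passes) but was written for another segment.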

          rcmuir Robert Muir added a comment -

          Looks nice. I like the latest patch much better; I think it's better to push complexity into CodecUtil.

          jira-bot ASF subversion and git services added a comment -

          Commit 067fb25e4359ed8d5673e385976da7debc0e5b77 in lucene-solr's branch refs/heads/master from Mike McCandless
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=067fb25 ]

          LUCENE-7323: compound file writing now verifies checksum and segment ID for the incoming sub-files, to catch hardware issues or filesystem bugs earlier

          jira-bot ASF subversion and git services added a comment -

          Commit ae0adfc34dea21df86ab7ebf034f3dbd6714c541 in lucene-solr's branch refs/heads/branch_6x from Mike McCandless
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=ae0adfc ]

          LUCENE-7323: compound file writing now verifies checksum and segment ID for the incoming sub-files, to catch hardware issues or filesystem bugs earlier


            People

            • Assignee: Michael McCandless (mikemccand)
            • Reporter: Michael McCandless (mikemccand)