Lucene - Core / LUCENE-5842

Validate checksum footers for postings lists, docvalues, storedfields, termvectors on init

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.10, 6.0
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

      Description

      For small files (e.g. where we read in all the bytes anyway), we currently validate the checksum on reader init.

      But for larger files like .doc/.frq/.pos/.dvd/.fdt/.tvd we currently do nothing at all on init, as verifying the checksum against all the bytes would be too expensive.

      We should at least do this:

      // NOTE: data file is too costly to verify checksum against all the bytes on 
      // open, but for now we at least verify proper structure of the checksum 
      // footer: which looks for FOOTER_MAGIC + algorithmID. This is cheap 
      // and can detect some forms of corruption such as file truncation.
      CodecUtil.retrieveChecksum(data);
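
      To illustrate why a footer-only check is cheap, here is a minimal standalone sketch that does not use Lucene itself. It assumes a 16-byte big-endian footer laid out as magic (int), algorithm ID (int), checksum (long), with the magic being the bitwise complement of a codec magic value. The constants, the class name FooterStructureCheck, and the helper writeWithFooter are illustrative assumptions modeled on the comment above, not the patch's code.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.CRC32;

// Hypothetical stand-in for a footer structure check; not Lucene's implementation.
final class FooterStructureCheck {
    // Assumed values for illustration; verify against CodecUtil before relying on them.
    static final int CODEC_MAGIC = 0x3fd76c17;
    static final int FOOTER_MAGIC = ~CODEC_MAGIC;
    static final int ALGORITHM_ID = 0;   // assumed ID for the checksum algorithm
    static final int FOOTER_LENGTH = 16; // int magic + int algorithmID + long checksum

    /**
     * Reads only the last FOOTER_LENGTH bytes of the file, checks the magic and
     * algorithm ID, and returns the stored checksum. The file body is never read,
     * so the cost does not depend on the file size.
     */
    static long retrieveChecksum(Path file) throws IOException {
        try (RandomAccessFile in = new RandomAccessFile(file.toFile(), "r")) {
            long length = in.length();
            if (length < FOOTER_LENGTH) {
                throw new IOException("file too short to hold a footer: " + file);
            }
            in.seek(length - FOOTER_LENGTH);
            int magic = in.readInt(); // RandomAccessFile reads big-endian
            if (magic != FOOTER_MAGIC) {
                throw new IOException("footer magic mismatch (truncated file?): " + file);
            }
            int algorithmId = in.readInt();
            if (algorithmId != ALGORITHM_ID) {
                throw new IOException("unknown checksum algorithm " + algorithmId + ": " + file);
            }
            return in.readLong();
        }
    }

    /** Demo helper: writes payload followed by a footer in the assumed layout. */
    static void writeWithFooter(Path file, byte[] payload) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(payload.length + FOOTER_LENGTH); // big-endian by default
        buf.put(payload).putInt(FOOTER_MAGIC).putInt(ALGORITHM_ID);
        CRC32 crc = new CRC32();
        // Assumption: the checksum covers everything written before the checksum field.
        crc.update(buf.array(), 0, buf.position());
        buf.putLong(crc.getValue());
        Files.write(file, buf.array());
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("footer-demo", ".bin");
        writeWithFooter(file, new byte[] {10, 20, 30});
        System.out.println("stored checksum: " + Long.toHexString(retrieveChecksum(file)));
        Files.delete(file);
    }
}
```

      Truncating such a file by even a few bytes shifts the footer out of position, so the magic comparison fails without scanning the data. Corruption in the middle of the file is not detected by this check.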
      
      1. LUCENE-5842.patch (15 kB, Robert Muir)
      2. LUCENE-5842.patch (14 kB, Robert Muir)

        Activity

        jira-bot ASF subversion and git services added a comment -

        Commit 1612852 from Robert Muir in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1612852 ]

        LUCENE-5842: Validate checksum footers for postings lists/docvalues/storedfields/vectors on init

        jira-bot ASF subversion and git services added a comment -

        Commit 1612845 from Robert Muir in branch 'dev/trunk'
        [ https://svn.apache.org/r1612845 ]

        LUCENE-5842: Validate checksum footers for postings lists/docvalues/storedfields/vectors on init

        rcmuir Robert Muir added a comment -

        Updated patch: I previously missed adding the check for the IDPostingsFormat terms dict in sandbox/.

        rcmuir Robert Muir added a comment -

        By the way, as a followup, we can do even better and iterate a bit more:

        Today each file by itself can be 'correct' but you still have a corrupt index because the files are mismatched somehow (network replication, or some other bug).

        It might be worth thinking about reviving segmentinfo.attributes (that's cleanest, I think), or putting the checksums in the files map directly (which would be harder, as it requires every file to have a checksum). We could store each file's checksum there and, when we retrieve it here, validate it against that attribute. This would detect mismatched files.

        Ideally though we'd do this for the commit too (for deletes and dv updates).

        Anyway, just something to explore on another issue if we can do it without creating a mess. I don't like that we can't detect such mismatches today (except via very rudimentary checks like livedocs.length = maxdoc, etc.).
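
        As a rough illustration of the follow-up idea (not part of the patch), the sketch below records an expected checksum per file name and compares it with the checksum later read from that file's footer. The class SegmentChecksumRegistry, its methods, and the file name in the usage example are hypothetical; in Lucene the values would live in segment metadata rather than an in-memory map.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: per-segment record of the checksum each file is expected to carry.
final class SegmentChecksumRegistry {
    private final Map<String, Long> expected = new LinkedHashMap<>();

    /** Called at write time: remember the checksum written into the file's footer. */
    void record(String fileName, long checksum) {
        expected.put(fileName, checksum);
    }

    /**
     * Called at open time with the checksum read back from the footer.
     * Throws if the file is unknown or belongs to a different version of the segment.
     */
    void validate(String fileName, long footerChecksum) {
        Long want = expected.get(fileName);
        if (want == null) {
            throw new IllegalStateException("no checksum recorded for " + fileName);
        }
        if (want.longValue() != footerChecksum) {
            throw new IllegalStateException("checksum mismatch for " + fileName
                + ": expected " + Long.toHexString(want)
                + " but footer has " + Long.toHexString(footerChecksum));
        }
    }

    public static void main(String[] args) {
        SegmentChecksumRegistry registry = new SegmentChecksumRegistry();
        registry.record("_0.fdt", 0x1234abcdL); // made-up file name and checksum
        registry.validate("_0.fdt", 0x1234abcdL); // matches, no exception
        try {
            registry.validate("_0.fdt", 0x9999L); // file swapped in from another segment
        } catch (IllegalStateException e) {
            System.out.println("detected: " + e.getMessage());
        }
    }
}
```

        A file that is internally consistent but was copied from another segment would pass its own footer check and still fail this comparison, which is the mismatch case described above.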

        jpountz Adrien Grand added a comment -

        +1 to the patch

        mikemccand Michael McCandless added a comment -

        +1

        jpountz Adrien Grand added a comment -

        +1


          People

          • Assignee: Unassigned
          • Reporter: Robert Muir (rcmuir)
          • Votes: 0
          • Watchers: 3
