Lucene - Core / LUCENE-5842

Validate checksum footers for postings lists, docvalues, storedfields, termvectors on init

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.10, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      For small files (e.g. where we read in all the bytes anyway), we currently validate the checksum on reader init.

      But for larger files like .doc/.frq/.pos/.dvd/.fdt/.tvd we currently do nothing at all on init, as it would be too expensive.

      We should at least do this:

      // NOTE: data file is too costly to verify checksum against all the bytes on 
      // open, but for now we at least verify proper structure of the checksum 
      // footer: which looks for FOOTER_MAGIC + algorithmID. This is cheap 
      // and can detect some forms of corruption such as file truncation.
      CodecUtil.retrieveChecksum(data);
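
      To illustrate why a structure-only check is cheap yet still useful, here is a minimal, self-contained sketch that does not depend on Lucene. It assumes a 16-byte big-endian footer (FOOTER_MAGIC, algorithmID, CRC32 checksum); the class name FooterCheckSketch and its helper methods are hypothetical, and the real logic lives in CodecUtil.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

public class FooterCheckSketch {
    // Assumed constants, modelled on Lucene's CodecUtil; verify against the real source.
    static final int CODEC_MAGIC = 0x3fd76c17;
    static final int FOOTER_MAGIC = ~CODEC_MAGIC;
    static final int FOOTER_LENGTH = 16; // magic (4) + algorithmID (4) + checksum (8)

    /** Test helper: appends magic, algorithmID 0, and a CRC32 of all preceding bytes. */
    static byte[] withFooter(byte[] payload) {
        // ByteBuffer is big-endian by default.
        ByteBuffer buf = ByteBuffer.allocate(payload.length + FOOTER_LENGTH);
        buf.put(payload).putInt(FOOTER_MAGIC).putInt(0);
        CRC32 crc = new CRC32();
        crc.update(buf.array(), 0, buf.position());
        buf.putLong(crc.getValue());
        return buf.array();
    }

    /**
     * Reads only the last 16 bytes: checks the magic and algorithmID, then returns
     * the stored checksum. It never recomputes the checksum over the file body.
     */
    static long retrieveChecksum(byte[] file) throws IOException {
        if (file.length < FOOTER_LENGTH) {
            throw new IOException("file too short to hold a footer: " + file.length + " bytes");
        }
        ByteBuffer footer = ByteBuffer.wrap(file, file.length - FOOTER_LENGTH, FOOTER_LENGTH);
        int magic = footer.getInt();
        if (magic != FOOTER_MAGIC) {
            throw new IOException("footer magic mismatch: " + Integer.toHexString(magic));
        }
        int algorithmID = footer.getInt();
        if (algorithmID != 0) {
            throw new IOException("unknown checksum algorithmID: " + algorithmID);
        }
        return footer.getLong();
    }

    /** Convenience wrapper: true if the footer is structurally valid. */
    static boolean hasValidFooter(byte[] file) {
        try {
            retrieveChecksum(file);
            return true;
        } catch (IOException e) {
            return false;
        }
    }
}
```

      In this sketch a truncated file fails the check because the last 16 bytes no longer start with FOOTER_MAGIC, while a flipped bit in the body still passes. That matches the "some forms of corruption" wording above: catching body corruption requires the full checksum pass.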
      
      1. LUCENE-5842.patch
        15 kB
        Robert Muir
      2. LUCENE-5842.patch
        14 kB
        Robert Muir

        Activity

        Adrien Grand added a comment -

        +1

        Michael McCandless added a comment -

        +1

        Adrien Grand added a comment -

        +1 to the patch

        Robert Muir added a comment -

        By the way, as a followup, we can do even better and iterate a bit more:

        Today each file by itself can be 'correct' but you still have a corrupt index because the files are mismatched somehow (network replication, or some other bug).

        It might be worth thinking about reviving segmentinfo.attributes (that's cleanest, I think), or putting the checksums in the files map directly (harder, as it would require every file to have a checksum). We could store each file's checksum there and, when we retrieve it here, validate it against that attribute. This would detect mismatching.

        Ideally though we'd do this for the commit too (for deletes and dv updates).

        Anyway, just something to explore on another issue if we can do it without creating a mess. I don't like how we can't detect such mismatching today (except via very rudimentary checks like livedocs.length = maxdoc etc).
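
        A rough sketch of the mismatch detection proposed in this comment, assuming the per-file checksums were recorded somewhere at write time (for example in segment attributes). Everything here is hypothetical: the class SegmentChecksumSketch, the method findMismatches, and the use of plain maps keyed by file name are illustrative stand-ins, not an existing Lucene API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class SegmentChecksumSketch {
    /**
     * Compares the checksums recorded at write time against the checksums
     * retrieved from each file's footer at open time. Returns the names of
     * files whose footer checksum is missing or differs from the recorded one.
     */
    static List<String> findMismatches(Map<String, Long> recorded, Map<String, Long> retrieved) {
        List<String> mismatched = new ArrayList<>();
        for (Map.Entry<String, Long> entry : recorded.entrySet()) {
            Long actual = retrieved.get(entry.getKey());
            if (actual == null || !actual.equals(entry.getValue())) {
                mismatched.add(entry.getKey());
            }
        }
        return mismatched;
    }
}
```

        Since the footer checksum is already read on init by the check in this issue, a comparison like this would add no extra I/O per file. A file that is internally consistent but belongs to a different segment generation would then surface as a mismatch.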

        Robert Muir added a comment -

        Updated patch: I previously missed the check for the IDPostingsFormat terms dict in sandbox/.

        ASF subversion and git services added a comment -

        Commit 1612845 from Robert Muir in branch 'dev/trunk'
        [ https://svn.apache.org/r1612845 ]

        LUCENE-5842: Validate checksum footers for postings lists/docvalues/storedfields/vectors on init

        ASF subversion and git services added a comment -

        Commit 1612852 from Robert Muir in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1612852 ]

        LUCENE-5842: Validate checksum footers for postings lists/docvalues/storedfields/vectors on init


          People

          • Assignee:
            Unassigned
            Reporter:
            Robert Muir
          • Votes:
            0
            Watchers:
            3
