Lucene - Core
  1. Lucene - Core
  2. LUCENE-3878

CheckIndex should check deleted documents too

    Details

    • Type: Task Task
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: 4.0-ALPHA
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      In 4.0 livedocs are passed down to the enums, thus deleted docs are not so special.

      So I think checkindex should not pass the livedocs down to the enums when checking,
      it should pass livedocs=null and check all the postings. It already does this separately to
      collect stats i think to compare against the term/collection statistics? But we should
      just clean this up and only use one enum.

      For example LUCENE-3876 is a case where we were actually making a corrumpt index,
      (a position was negative) but because the document in question was deleted, CheckIndex
      didn't detect this.

      This could have caused problems if someone just passed null for livedocs (maybe they
      are doing something where its not so important to take deletions into account)

      1. LUCENE-3878.patch
        32 kB
        Robert Muir
      2. LUCENE-3878.patch
        34 kB
        Robert Muir
      3. LUCENE-3878.patch
        19 kB
        Michael McCandless

        Activity

        Hide
        Robert Muir added a comment -

        Not annoyed only confused (I think i dont totally understand your comment).

        So i was just waiting for your patch, then I would know for sure
        what you meant.

        Show
        Robert Muir added a comment - Not annoyed only confused (I think i dont totally understand your comment). So i was just waiting for your patch, then I would know for sure what you meant.
        Hide
        Uwe Schindler added a comment -

        Heavy bug fixing! That separation of liveDocs in CheckIndex is a very good idea.
        I hope you are not annoyed by my vLong nitpicking...

        Show
        Uwe Schindler added a comment - Heavy bug fixing! That separation of liveDocs in CheckIndex is a very good idea. I hope you are not annoyed by my vLong nitpicking...
        Hide
        Robert Muir added a comment -

        Thanks for the help! I like it much better with the shared method/testing also deleted docs.

        Show
        Robert Muir added a comment - Thanks for the help! I like it much better with the shared method/testing also deleted docs.
        Hide
        Michael McCandless added a comment -

        Patch, cutting over testTermVectors to checkFields, making it also check deleted docs.

        This found a bug! In Lucene40 codec's term vectors seekCeil... fixed.

        Show
        Michael McCandless added a comment - Patch, cutting over testTermVectors to checkFields, making it also check deleted docs. This found a bug! In Lucene40 codec's term vectors seekCeil... fixed.
        Hide
        Robert Muir added a comment -

        ok this was pretty easy as a start. I didnt add any extra assertions just moved the logic into checkFields(Fields).
        we should be able to refactor testVectors to use this too, just like norms and docValues both share the same logic.

        I will mark this issue resolved now.

        Show
        Robert Muir added a comment - ok this was pretty easy as a start. I didnt add any extra assertions just moved the logic into checkFields(Fields). we should be able to refactor testVectors to use this too, just like norms and docValues both share the same logic. I will mark this issue resolved now.
        Hide
        Robert Muir added a comment -

        I'm still nervous about the patch losing coverage for postings.

        The problem with the previous second-pass was we only did minimal checks with a null liveDocs.

        I think ideally we factor these checks into separate methods that take liveDocs, and return stats.
        if there are deletions we do the full check with the real liveDocs too, and assert stats <= rawStats

        It will be heavy but i think we can do it from this patch.

        Show
        Robert Muir added a comment - I'm still nervous about the patch losing coverage for postings. The problem with the previous second-pass was we only did minimal checks with a null liveDocs. I think ideally we factor these checks into separate methods that take liveDocs, and return stats. if there are deletions we do the full check with the real liveDocs too, and assert stats <= rawStats It will be heavy but i think we can do it from this patch.
        Hide
        Robert Muir added a comment -

        updated patch with separate liveDocs validation for each segment... checking the number of set bits etc.

        Show
        Robert Muir added a comment - updated patch with separate liveDocs validation for each segment... checking the number of set bits etc.
        Hide
        Robert Muir added a comment -

        patch for trunk.

        Show
        Robert Muir added a comment - patch for trunk.
        Hide
        Robert Muir added a comment -

        Actually if we are willing to add SegmentReader.rawTermPositions() to match SegmentReader.rawTermDocs()
        we could do this in 3.x too...

        Show
        Robert Muir added a comment - Actually if we are willing to add SegmentReader.rawTermPositions() to match SegmentReader.rawTermDocs() we could do this in 3.x too...
        Hide
        Michael McCandless added a comment -

        +1

        Show
        Michael McCandless added a comment - +1

          People

          • Assignee:
            Unassigned
            Reporter:
            Robert Muir
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development