Lucene - Core
  1. Lucene - Core
  2. LUCENE-3878

CheckIndex should check deleted documents too

    Details

    • Type: Task Task
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: 4.0-ALPHA
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      In 4.0 livedocs are passed down to the enums, thus deleted docs are not so special.

      So I think checkindex should not pass the livedocs down to the enums when checking,
      it should pass livedocs=null and check all the postings. It already does this separately to
      collect stats i think to compare against the term/collection statistics? But we should
      just clean this up and only use one enum.

      For example LUCENE-3876 is a case where we were actually making a corrumpt index,
      (a position was negative) but because the document in question was deleted, CheckIndex
      didn't detect this.

      This could have caused problems if someone just passed null for livedocs (maybe they
      are doing something where its not so important to take deletions into account)

      1. LUCENE-3878.patch
        32 kB
        Robert Muir
      2. LUCENE-3878.patch
        34 kB
        Robert Muir
      3. LUCENE-3878.patch
        19 kB
        Michael McCandless

        Activity

        Uwe Schindler made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Michael McCandless committed 1301939 (2 files)
        Reviews: none

        LUCENE-3878: fix CheckIndex.testTermVectors to use checkFields too; this found a bug in Lucene40's term vectors reader

        Hide
        Robert Muir added a comment -

        Not annoyed only confused (I think i dont totally understand your comment).

        So i was just waiting for your patch, then I would know for sure
        what you meant.

        Show
        Robert Muir added a comment - Not annoyed only confused (I think i dont totally understand your comment). So i was just waiting for your patch, then I would know for sure what you meant.
        Hide
        Uwe Schindler added a comment -

        Heavy bug fixing! That separation of liveDocs in CheckIndex is a very good idea.
        I hope you are not annoyed by my vLong nitpicking...

        Show
        Uwe Schindler added a comment - Heavy bug fixing! That separation of liveDocs in CheckIndex is a very good idea. I hope you are not annoyed by my vLong nitpicking...
        Hide
        Robert Muir added a comment -

        Thanks for the help! I like it much better with the shared method/testing also deleted docs.

        Show
        Robert Muir added a comment - Thanks for the help! I like it much better with the shared method/testing also deleted docs.
        Michael McCandless made changes -
        Attachment LUCENE-3878.patch [ 12518733 ]
        Hide
        Michael McCandless added a comment -

        Patch, cutting over testTermVectors to checkFields, making it also check deleted docs.

        This found a bug! In Lucene40 codec's term vectors seekCeil... fixed.

        Show
        Michael McCandless added a comment - Patch, cutting over testTermVectors to checkFields, making it also check deleted docs. This found a bug! In Lucene40 codec's term vectors seekCeil... fixed.
        Robert Muir made changes -
        Resolution Fixed [ 1 ]
        Status Open [ 1 ] Resolved [ 5 ]
        Hide
        Robert Muir added a comment -

        ok this was pretty easy as a start. I didnt add any extra assertions just moved the logic into checkFields(Fields).
        we should be able to refactor testVectors to use this too, just like norms and docValues both share the same logic.

        I will mark this issue resolved now.

        Show
        Robert Muir added a comment - ok this was pretty easy as a start. I didnt add any extra assertions just moved the logic into checkFields(Fields). we should be able to refactor testVectors to use this too, just like norms and docValues both share the same logic. I will mark this issue resolved now.
        rmuir committed 1301712 (1 file)
        Reviews: none

        LUCENE-3878: refactor testPostings into checkFields(Fields) and also run the test ignoring deletions. TODO: refactor testVectors to use this method too

        rmuir committed 1301684 (2 files)
        Reviews: none

        LUCENE-3878: visit the document even if its deleted, just don't count anything

        rmuir committed 1301683 (1 file)
        Reviews: none

        LUCENE-3878: visit the document even if its deleted, just don't count anything

        rmuir committed 1301650 (1 file)
        Reviews: none

        LUCENE-3878: add basic liveDocs checks to CheckIndex

        Hide
        Robert Muir added a comment -

        I'm still nervous about the patch losing coverage for postings.

        The problem with the previous second-pass was we only did minimal checks with a null liveDocs.

        I think ideally we factor these checks into separate methods that take liveDocs, and return stats.
        if there are deletions we do the full check with the real liveDocs too, and assert stats <= rawStats

        It will be heavy but i think we can do it from this patch.

        Show
        Robert Muir added a comment - I'm still nervous about the patch losing coverage for postings. The problem with the previous second-pass was we only did minimal checks with a null liveDocs. I think ideally we factor these checks into separate methods that take liveDocs, and return stats. if there are deletions we do the full check with the real liveDocs too, and assert stats <= rawStats It will be heavy but i think we can do it from this patch.
        Robert Muir made changes -
        Attachment LUCENE-3878.patch [ 12518694 ]
        Hide
        Robert Muir added a comment -

        updated patch with separate liveDocs validation for each segment... checking the number of set bits etc.

        Show
        Robert Muir added a comment - updated patch with separate liveDocs validation for each segment... checking the number of set bits etc.
        Robert Muir made changes -
        Field Original Value New Value
        Attachment LUCENE-3878.patch [ 12518688 ]
        Hide
        Robert Muir added a comment -

        patch for trunk.

        Show
        Robert Muir added a comment - patch for trunk.
        Hide
        Robert Muir added a comment -

        Actually if we are willing to add SegmentReader.rawTermPositions() to match SegmentReader.rawTermDocs()
        we could do this in 3.x too...

        Show
        Robert Muir added a comment - Actually if we are willing to add SegmentReader.rawTermPositions() to match SegmentReader.rawTermDocs() we could do this in 3.x too...
        Hide
        Michael McCandless added a comment -

        +1

        Show
        Michael McCandless added a comment - +1
        Robert Muir created issue -

          People

          • Assignee:
            Unassigned
            Reporter:
            Robert Muir
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development