Lucene - Core

CheckIndex's term stats should not include deleted docs

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.1, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      I was looking at the CheckIndex output on an index that has deletions, e.g.:

        4 of 30: name=_90 docCount=588408
          codec=Lucene41
          compound=false
          numFiles=14
          size (MB)=265.318
          diagnostics = {os=Linux, os.version=3.2.0-23-generic, mergeFactor=10, source=merge, lucene.version=5.0-SNAPSHOT, os.arch=amd64, mergeMaxNumSegments=-1, java.version=1.7.0_07, java.vendor=Oracle Corporation}
          has deletions [delGen=1]
          test: open reader.........OK [39351 deleted docs]
          test: fields..............OK [8 fields]
          test: field norms.........OK [2 fields]
          test: terms, freq, prox...OK [4910342 terms; 61319238 terms/docs pairs; 65597188 tokens]
          test (ignoring deletes): terms, freq, prox...OK [4910342 terms; 61319238 terms/docs pairs; 70293065 tokens]
          test: stored fields.......OK [1647171 total field count; avg 3 fields per doc]
          test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]
          test: docvalues...........OK [0 total doc count; 1 docvalues fields]
      

      If you compare the "test: terms, freq, prox" line (which takes deletions into account) with the following "test (ignoring deletes)" line, it's confusing because only the 3rd number (tokens) reflects deletions. I think the first two numbers (terms and terms/docs pairs) should also reflect deletions. That way an app could get a sense of how much "deadweight" is in the index due to un-reclaimed deletions.
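
      As a rough illustration of the "deadweight" idea, the figures from the segment above can be turned into fractions. This is a minimal sketch in plain Java, not part of CheckIndex; the class and method names (`DeadweightSketch`, `fraction`) are hypothetical, and the numbers are copied from the sample output.

```java
// Hypothetical helper, not a Lucene API: estimates how much of a segment
// is taken up by deleted-but-unreclaimed documents.
final class DeadweightSketch {

    // Fraction of `total` that `part` represents; 0.0 for an empty total.
    static double fraction(long part, long total) {
        if (total <= 0) {
            return 0.0;
        }
        return (double) part / (double) total;
    }

    public static void main(String[] args) {
        // Values copied from the CheckIndex output for segment _90.
        long docCount = 588408L;
        long deletedDocs = 39351L;
        long tokensLive = 65597188L;            // "test: terms, freq, prox"
        long tokensIgnoringDeletes = 70293065L; // "test (ignoring deletes)"

        // Tokens that belong only to deleted documents.
        long deadTokens = tokensIgnoringDeletes - tokensLive;

        System.out.printf("deleted docs: %.2f%%%n",
                100.0 * fraction(deletedDocs, docCount));
        System.out.printf("dead tokens:  %.2f%%%n",
                100.0 * fraction(deadTokens, tokensIgnoringDeletes));
    }
}
```

      For this segment both ratios come out a little under 7%. With the first two numbers also reflecting deletions, the same comparison could be made for terms and terms/docs pairs.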

      1. LUCENE-4485.patch
        4 kB
        Michael McCandless

        Activity

        Michael McCandless added a comment -

        Simple patch ...

        Robert Muir added a comment -

        +1

        Commit Tag Bot added a comment -

        [branch_4x commit] Michael McCandless
        http://svn.apache.org/viewvc?view=revision&revision=1399031

        LUCENE-4485: CheckIndex's terms, terms/docs pairs counts don't include deleted docs
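
        To make the commit message concrete, here is a toy sketch of counting terms and terms/docs pairs while skipping deleted documents. It is not the patch itself and does not use Lucene's classes: postings are modelled as a plain map from term to doc IDs, live documents as a `java.util.BitSet`, and the names `LiveStatsSketch` and `count` are hypothetical. It also assumes that a term is counted only if at least one live document contains it.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model only: real Lucene postings are not stored as a Map.
final class LiveStatsSketch {

    // Returns {termCount, termDocPairCount}, counting only live documents.
    // A term whose documents are all deleted contributes nothing.
    static long[] count(Map<String, int[]> postings, BitSet liveDocs) {
        long terms = 0;
        long pairs = 0;
        for (int[] docIds : postings.values()) {
            long livePairs = 0;
            for (int docId : docIds) {
                if (liveDocs.get(docId)) {
                    livePairs++;
                }
            }
            if (livePairs > 0) {
                terms++;
                pairs += livePairs;
            }
        }
        return new long[] {terms, pairs};
    }

    public static void main(String[] args) {
        Map<String, int[]> postings = new LinkedHashMap<>();
        postings.put("apple", new int[] {0, 1, 2});
        postings.put("banana", new int[] {1});
        postings.put("cherry", new int[] {2, 3});

        BitSet liveDocs = new BitSet(4);
        liveDocs.set(0, 4);  // docs 0..3 start out live
        liveDocs.clear(1);   // doc 1 is deleted

        long[] stats = count(postings, liveDocs);
        System.out.println(stats[0] + " terms; " + stats[1] + " terms/docs pairs");
    }
}
```

        In this toy data, ignoring deletes would give 3 terms and 6 terms/docs pairs; with doc 1 deleted the sketch reports 2 terms and 4 pairs, since "banana" occurs only in the deleted document.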


          People

          • Assignee:
            Michael McCandless
            Reporter:
            Michael McCandless
          • Votes:
            0
            Watchers:
            2

