[LUCENE-4485] CheckIndex's term stats should not include deleted docs

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 4.1, 6.0
Component/s: None
Labels: None
Lucene Fields: New

Description

I was looking at the CheckIndex output on an index that has deletions, e.g.:

  4 of 30: name=_90 docCount=588408
    codec=Lucene41
    compound=false
    numFiles=14
    size (MB)=265.318
    diagnostics = {os=Linux, os.version=3.2.0-23-generic, mergeFactor=10, source=merge, lucene.version=5.0-SNAPSHOT, os.arch=amd64, mergeMaxNumSegments=-1, java.version=1.7.0_07, java.vendor=Oracle Corporation}
    has deletions [delGen=1]
    test: open reader.........OK [39351 deleted docs]
    test: fields..............OK [8 fields]
    test: field norms.........OK [2 fields]
    test: terms, freq, prox...OK [4910342 terms; 61319238 terms/docs pairs; 65597188 tokens]
    test (ignoring deletes): terms, freq, prox...OK [4910342 terms; 61319238 terms/docs pairs; 70293065 tokens]
    test: stored fields.......OK [1647171 total field count; avg 3 fields per doc]
    test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]
    test: docvalues...........OK [0 total doc count; 1 docvalues fields]

If you compare the "test: terms, freq, prox" line (which takes deletions into account) with the next line (which ignores deletions), it's confusing because only the third number (tokens) reflects deletions: the terms and terms/docs pairs counts are identical on both lines. I think the first two numbers should also reflect deletions. That way an app could get a sense of how much "deadweight" is in the index due to unreclaimed deletions.
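
As a rough illustration of the "deadweight" idea, the two term-stats lines from the CheckIndex output above can be parsed and subtracted. This is only a sketch that works on the text output shown in this issue. The regular expression and the helper names (parse_stats, deadweight) are assumptions made for the example and are not part of Lucene or CheckIndex.

```python
import re

# Matches the bracketed stats printed by the "terms, freq, prox" test lines.
# The pattern is an assumption based on the output pasted in this issue.
STATS = re.compile(r"\[(\d+) terms; (\d+) terms/docs pairs; (\d+) tokens\]")


def parse_stats(line):
    """Extract the three counts from one CheckIndex term-stats line."""
    match = STATS.search(line)
    if match is None:
        raise ValueError("not a term-stats line: %r" % line)
    terms, pairs, tokens = (int(g) for g in match.groups())
    return {"terms": terms, "pairs": pairs, "tokens": tokens}


def deadweight(live_line, all_line):
    """Per-stat difference between the 'ignoring deletes' line and the live line."""
    live = parse_stats(live_line)
    everything = parse_stats(all_line)
    return {key: everything[key] - live[key] for key in live}


# The two lines copied from the output in this issue.
LIVE = ("    test: terms, freq, prox...OK "
        "[4910342 terms; 61319238 terms/docs pairs; 65597188 tokens]")
ALL = ("    test (ignoring deletes): terms, freq, prox...OK "
       "[4910342 terms; 61319238 terms/docs pairs; 70293065 tokens]")

dead = deadweight(LIVE, ALL)
# Share of all tokens that belong to deleted documents.
token_share = dead["tokens"] / parse_stats(ALL)["tokens"]

print(dead)
print("deleted-doc share of tokens: %.4f" % token_share)
```

With the output as reported here, the terms and terms/docs pairs differences come out as zero, which is exactly the symptom described: only the token count shows any effect from the 39351 deleted docs.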

Attachments

LUCENE-4485.patch (4 kB), 16/Oct/12 15:50, Michael McCandless

People

Assignee: Michael McCandless

Reporter: Michael McCandless

Votes: 0

Watchers: 1

Dates

Created: 16/Oct/12 11:54

Updated: 28/Aug/22 13:30

Resolved: 16/Oct/12 22:48