Lucene - Core
  1. Lucene - Core
  2. LUCENE-2010

Remove segments with all documents deleted in commit/flush/close of IndexWriter instead of waiting until a merge occurs.

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 2.9
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      I do not know if this is a bug in 2.9.0, but it seems that segments with all documents deleted are not automatically removed:

      4 of 14: name=_dlo docCount=5
        compound=true
        hasProx=true
        numFiles=2
        size (MB)=0.059
        diagnostics = {java.version=1.5.0_21, lucene.version=2.9.0 817268P - 2009-09-21 10:25:09, os=SunOS,
           os.arch=amd64, java.vendor=Sun Microsystems Inc., os.version=5.10, source=flush}
        has deletions [delFileName=_dlo_1.del]
        test: open reader.........OK [5 deleted docs]
        test: fields..............OK [136 fields]
        test: field norms.........OK [136 fields]
        test: terms, freq, prox...OK [1698 terms; 4236 terms/docs pairs; 0 tokens]
        test: stored fields.......OK [0 total field count; avg ? fields per doc]
        test: term vectors........OK [0 total vector count; avg ? term/freq vector fields per doc]
      

      Shouldn't such segments not be removed automatically during the next commit/close of IndexWriter?

      Mike McCandless:
      Lucene doesn't actually short-circuit this case, ie, if every single doc in a given segment has been deleted, it will still merge it [away] like normal, rather than simply dropping it immediately from the index, which I agree would be a simple optimization. Can you open a new issue? I would think IW can drop such a segment immediately (ie not wait for a merge or optimize) on flushing new deletes.

      1. LUCENE-2010.patch
        12 kB
        Michael McCandless

        Issue Links

          Activity

          Hide
          Grant Ingersoll added a comment -

          Bulk close for 3.1

          Show
          Grant Ingersoll added a comment - Bulk close for 3.1
          Hide
          Michael McCandless added a comment -

          Reopen until I can track down the 3.x fail this caused...

          Show
          Michael McCandless added a comment - Reopen until I can track down the 3.x fail this caused...
          Hide
          Uwe Schindler added a comment -

          Thanks for taking care!

          Show
          Uwe Schindler added a comment - Thanks for taking care!
          Hide
          Michael McCandless added a comment -

          Do you want to fix the rest of the tests and remove the text-only keepAllSegments method?

          It's actually only the QueryUtils test class that uses this... it makes an "empty" index by adding N docs and then deleting them all. So the test-only API needs to be public (QueryUtils is in oal.search). I'll mark it as lucene.internal...

          Show
          Michael McCandless added a comment - Do you want to fix the rest of the tests and remove the text-only keepAllSegments method? It's actually only the QueryUtils test class that uses this... it makes an "empty" index by adding N docs and then deleting them all. So the test-only API needs to be public (QueryUtils is in oal.search). I'll mark it as lucene.internal...
          Hide
          Uwe Schindler added a comment -

          Look fine to me! Its indeed quite simple. Will test this later.

          Do you want to fix the rest of the tests and remove the text-only keepAllSegments method? At least this method should be hidden by a package-private accessor or, if not possible, @lucene.internal.

          Show
          Uwe Schindler added a comment - Look fine to me! Its indeed quite simple. Will test this later. Do you want to fix the rest of the tests and remove the text-only keepAllSegments method? At least this method should be hidden by a package-private accessor or, if not possible, @lucene.internal.
          Hide
          Michael McCandless added a comment -

          Patch.

          The change itself is very simple – I added a pruneDeletedSegments method to SegmentInfos, and I call that on commit in IW and IR.

          But, then, various tests assume that Lucene doesn't do this EG asserting maxDoc(), docFreq(), etc... so I had to fix those up...

          Show
          Michael McCandless added a comment - Patch. The change itself is very simple – I added a pruneDeletedSegments method to SegmentInfos, and I call that on commit in IW and IR. But, then, various tests assume that Lucene doesn't do this EG asserting maxDoc(), docFreq(), etc... so I had to fix those up...
          Hide
          Michael McCandless added a comment -

          Note: IndexReader must also do this.

          Show
          Michael McCandless added a comment - Note: IndexReader must also do this.
          Hide
          Michael McCandless added a comment -

          If you delete all documents from the whole index, no segments would keep alive if automatically removed.

          IW now has a dedicated method to [efficiently] delete all docs, but yeah we should also short-circuit this, in case someone didn't use that method and instead actually deleted every doc separately.

          I'd think that our solution here would automatically handle this case (drop all segments) as well.

          On materializing deletes (IndexWriter.applyDeletes) we should simply sweep the segmentInfos, and drop any fully deleted segments. Should be a simple change.

          Show
          Michael McCandless added a comment - If you delete all documents from the whole index, no segments would keep alive if automatically removed. IW now has a dedicated method to [efficiently] delete all docs, but yeah we should also short-circuit this, in case someone didn't use that method and instead actually deleted every doc separately. I'd think that our solution here would automatically handle this case (drop all segments) as well. On materializing deletes (IndexWriter.applyDeletes) we should simply sweep the segmentInfos, and drop any fully deleted segments. Should be a simple change.
          Hide
          Uwe Schindler added a comment -

          There is one special case:
          If you delete all documents from the whole index, no segments would keep alive if automatically removed.
          Can we handle that? It should remain an empty segments_xxx file.

          Show
          Uwe Schindler added a comment - There is one special case: If you delete all documents from the whole index, no segments would keep alive if automatically removed. Can we handle that? It should remain an empty segments_xxx file.

            People

            • Assignee:
              Michael McCandless
              Reporter:
              Uwe Schindler
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development