Lucene - Core
  1. Lucene - Core
  2. LUCENE-2010

Remove segments with all documents deleted in commit/flush/close of IndexWriter instead of waiting until a merge occurs.

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 2.9
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      I do not know if this is a bug in 2.9.0, but it seems that segments with all documents deleted are not automatically removed:

      4 of 14: name=_dlo docCount=5
        compound=true
        hasProx=true
        numFiles=2
        size (MB)=0.059
        diagnostics = {java.version=1.5.0_21, lucene.version=2.9.0 817268P - 2009-09-21 10:25:09, os=SunOS,
           os.arch=amd64, java.vendor=Sun Microsystems Inc., os.version=5.10, source=flush}
        has deletions [delFileName=_dlo_1.del]
        test: open reader.........OK [5 deleted docs]
        test: fields..............OK [136 fields]
        test: field norms.........OK [136 fields]
        test: terms, freq, prox...OK [1698 terms; 4236 terms/docs pairs; 0 tokens]
        test: stored fields.......OK [0 total field count; avg ? fields per doc]
        test: term vectors........OK [0 total vector count; avg ? term/freq vector fields per doc]
      

      Shouldn't such segments not be removed automatically during the next commit/close of IndexWriter?

      Mike McCandless:
      Lucene doesn't actually short-circuit this case, ie, if every single doc in a given segment has been deleted, it will still merge it [away] like normal, rather than simply dropping it immediately from the index, which I agree would be a simple optimization. Can you open a new issue? I would think IW can drop such a segment immediately (ie not wait for a merge or optimize) on flushing new deletes.

      1. LUCENE-2010.patch
        12 kB
        Michael McCandless

        Issue Links

          Activity

          Transition Time In Source Status Execution Times Last Executer Last Execution Date
          Open Open Resolved Resolved
          456d 8h 30m 1 Michael McCandless 25/Jan/11 18:42
          Resolved Resolved Reopened Reopened
          23h 58m 1 Michael McCandless 26/Jan/11 18:41
          Reopened Reopened Resolved Resolved
          3h 15m 1 Michael McCandless 26/Jan/11 21:56
          Resolved Resolved Closed Closed
          62d 17h 54m 1 Grant Ingersoll 30/Mar/11 16:50
          Grant Ingersoll made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Hide
          Grant Ingersoll added a comment -

          Bulk close for 3.1

          Show
          Grant Ingersoll added a comment - Bulk close for 3.1
          Mark Thomas made changes -
          Workflow Default workflow, editable Closed status [ 12564653 ] jira [ 12584646 ]
          Mark Thomas made changes -
          Workflow jira [ 12480418 ] Default workflow, editable Closed status [ 12564653 ]
          Michael McCandless made changes -
          Status Reopened [ 4 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Michael McCandless made changes -
          Resolution Fixed [ 1 ]
          Status Resolved [ 5 ] Reopened [ 4 ]
          Hide
          Michael McCandless added a comment -

          Reopen until I can track down the 3.x fail this caused...

          Show
          Michael McCandless added a comment - Reopen until I can track down the 3.x fail this caused...
          Hide
          Uwe Schindler added a comment -

          Thanks for taking care!

          Show
          Uwe Schindler added a comment - Thanks for taking care!
          Michael McCandless made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Hide
          Michael McCandless added a comment -

          Do you want to fix the rest of the tests and remove the text-only keepAllSegments method?

          It's actually only the QueryUtils test class that uses this... it makes an "empty" index by adding N docs and then deleting them all. So the test-only API needs to be public (QueryUtils is in oal.search). I'll mark it as lucene.internal...

          Show
          Michael McCandless added a comment - Do you want to fix the rest of the tests and remove the text-only keepAllSegments method? It's actually only the QueryUtils test class that uses this... it makes an "empty" index by adding N docs and then deleting them all. So the test-only API needs to be public (QueryUtils is in oal.search). I'll mark it as lucene.internal...
          Hide
          Uwe Schindler added a comment -

          Look fine to me! Its indeed quite simple. Will test this later.

          Do you want to fix the rest of the tests and remove the text-only keepAllSegments method? At least this method should be hidden by a package-private accessor or, if not possible, @lucene.internal.

          Show
          Uwe Schindler added a comment - Look fine to me! Its indeed quite simple. Will test this later. Do you want to fix the rest of the tests and remove the text-only keepAllSegments method? At least this method should be hidden by a package-private accessor or, if not possible, @lucene.internal.
          Michael McCandless made changes -
          Attachment LUCENE-2010.patch [ 12469209 ]
          Hide
          Michael McCandless added a comment -

          Patch.

          The change itself is very simple – I added a pruneDeletedSegments method to SegmentInfos, and I call that on commit in IW and IR.

          But, then, various tests assume that Lucene doesn't do this EG asserting maxDoc(), docFreq(), etc... so I had to fix those up...

          Show
          Michael McCandless added a comment - Patch. The change itself is very simple – I added a pruneDeletedSegments method to SegmentInfos, and I call that on commit in IW and IR. But, then, various tests assume that Lucene doesn't do this EG asserting maxDoc(), docFreq(), etc... so I had to fix those up...
          Michael McCandless made changes -
          Assignee Michael McCandless [ mikemccand ]
          Uwe Schindler made changes -
          Fix Version/s 3.1 [ 12314822 ]
          Fix Version/s 4.0 [ 12314025 ]
          Hide
          Michael McCandless added a comment -

          Note: IndexReader must also do this.

          Show
          Michael McCandless added a comment - Note: IndexReader must also do this.
          Hide
          Michael McCandless added a comment -

          If you delete all documents from the whole index, no segments would keep alive if automatically removed.

          IW now has a dedicated method to [efficiently] delete all docs, but yeah we should also short-circuit this, in case someone didn't use that method and instead actually deleted every doc separately.

          I'd think that our solution here would automatically handle this case (drop all segments) as well.

          On materializing deletes (IndexWriter.applyDeletes) we should simply sweep the segmentInfos, and drop any fully deleted segments. Should be a simple change.

          Show
          Michael McCandless added a comment - If you delete all documents from the whole index, no segments would keep alive if automatically removed. IW now has a dedicated method to [efficiently] delete all docs, but yeah we should also short-circuit this, in case someone didn't use that method and instead actually deleted every doc separately. I'd think that our solution here would automatically handle this case (drop all segments) as well. On materializing deletes (IndexWriter.applyDeletes) we should simply sweep the segmentInfos, and drop any fully deleted segments. Should be a simple change.
          Uwe Schindler made changes -
          Field Original Value New Value
          Link This issue relates to LUCENE-1960 [ LUCENE-1960 ]
          Hide
          Uwe Schindler added a comment -

          There is one special case:
          If you delete all documents from the whole index, no segments would keep alive if automatically removed.
          Can we handle that? It should remain an empty segments_xxx file.

          Show
          Uwe Schindler added a comment - There is one special case: If you delete all documents from the whole index, no segments would keep alive if automatically removed. Can we handle that? It should remain an empty segments_xxx file.
          Uwe Schindler created issue -

            People

            • Assignee:
              Michael McCandless
              Reporter:
              Uwe Schindler
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development