Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-325

[PATCH] new method expungeDeleted() added to IndexWriter

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.4
    • Component/s: core/index
    • Labels:
      None
    • Environment:

      Operating System: Windows XP
      Platform: All

    • Bugzilla Id:
      32712

      Description

      We make use the docIDs in lucene. I need a way to compact the docIDs in segments
      to remove the "holes" created from doing deletes. The only way to do this is by
      calling IndexWriter.optimize(). This is a very heavy call, for the cases where
      the index is large but with very small number of deleted docs, calling optimize
      is not practical.

      I need a new method: expungeDeleted(), which finds all the segments that have
      delete documents and merge only those segments.

      I have implemented this method and have discussed with Otis about submitting a
      patch. I don't see where I can attached the patch. I will do according to the
      patch guidleine and email the lucene mailing list.

      Thanks

      -John

      I don't see a place where I can

      1. ASF.LICENSE.NOT.GRANTED--attachment.txt
        2 kB
        John Wang
      2. ASF.LICENSE.NOT.GRANTED--IndexWriter.patch
        3 kB
        John Wang
      3. ASF.LICENSE.NOT.GRANTED--IndexWriter.patch
        112 kB
        John Wang
      4. ASF.LICENSE.NOT.GRANTED--TestExpungeDeleted.java
        3 kB
        John Wang
      5. LUCENE-325.patch
        14 kB
        Michael McCandless

        Activity

        Hide
        john.wang@gmail.com John Wang added a comment -

        Created an attachment (id=13757)
        Patched IndexWriter.java, contains a new method: expungeDeleted()

        I see now where to attach the patch

        See attached patch.

        Show
        john.wang@gmail.com John Wang added a comment - Created an attachment (id=13757) Patched IndexWriter.java, contains a new method: expungeDeleted() I see now where to attach the patch See attached patch.
        Hide
        otis@apache.org Otis Gospodnetic added a comment -

        Either this Bugzilla diff is not working properly, or you've made a number of
        other more stylistic changes to IndexWriter.java, which make it very hard to see
        the real change you made. Could you provide a diff with just the real changes?

        Also, if you have a unit test that excercises your addition, so we can be sure
        it doesn't break anything, that would be great. You'd want your unit test to
        mix and match Document additions and deletions and make sure it all still works
        properly.
        Thanks.

        Show
        otis@apache.org Otis Gospodnetic added a comment - Either this Bugzilla diff is not working properly, or you've made a number of other more stylistic changes to IndexWriter.java, which make it very hard to see the real change you made. Could you provide a diff with just the real changes? Also, if you have a unit test that excercises your addition, so we can be sure it doesn't break anything, that would be great. You'd want your unit test to mix and match Document additions and deletions and make sure it all still works properly. Thanks.
        Hide
        john.wang@gmail.com John Wang added a comment -

        Created an attachment (id=13759)
        implementation of expungeDeleted method

        Attached is the implementation of IndexWriter.expungeDeleted method. It is self
        contained.

        Thanks

        -John

        Show
        john.wang@gmail.com John Wang added a comment - Created an attachment (id=13759) implementation of expungeDeleted method Attached is the implementation of IndexWriter.expungeDeleted method. It is self contained. Thanks -John
        Hide
        john.wang@gmail.com John Wang added a comment -

        Created an attachment (id=13760)
        Unit test for the patch submited

        Attached please find the unit test for IndexWriter.expungeDelete() request by
        Otis.

        Thanks

        -John

        Show
        john.wang@gmail.com John Wang added a comment - Created an attachment (id=13760) Unit test for the patch submited Attached please find the unit test for IndexWriter.expungeDelete() request by Otis. Thanks -John
        Hide
        john.wang@gmail.com John Wang added a comment -

        Created an attachment (id=13776)
        updated patch with a fixed diff version

        1) As request I reran cvs diff without having done a file format (which screwed
        up cvs)

        2) implementation of expungeDelete was made better to handle/delete the
        compound file after segment info update.

        Show
        john.wang@gmail.com John Wang added a comment - Created an attachment (id=13776) updated patch with a fixed diff version 1) As request I reran cvs diff without having done a file format (which screwed up cvs) 2) implementation of expungeDelete was made better to handle/delete the compound file after segment info update.
        Hide
        gsingers Grant Ingersoll added a comment -

        This seems generally useful. I imagine, though, that the patch is way out of date. I wonder if the new ability to merge some segments might have an option to do this kind of thing.

        Any thoughts on resurrecting this?

        Show
        gsingers Grant Ingersoll added a comment - This seems generally useful. I imagine, though, that the patch is way out of date. I wonder if the new ability to merge some segments might have an option to do this kind of thing. Any thoughts on resurrecting this?
        Hide
        mikemccand Michael McCandless added a comment -

        I think we should resurrect this: I agree it's useful. I'll take it & tentatively mark it 2.4 (hopefully I can make time by then!).

        The original patch would simply merge one segment "in place". I think we can improve this a bit by merging any adjacent series of segments that have deletions? This would still preserve docID ordering, but would also accomplish some merging as a side effect (I think a good thing).

        Show
        mikemccand Michael McCandless added a comment - I think we should resurrect this: I agree it's useful. I'll take it & tentatively mark it 2.4 (hopefully I can make time by then!). The original patch would simply merge one segment "in place". I think we can improve this a bit by merging any adjacent series of segments that have deletions? This would still preserve docID ordering, but would also accomplish some merging as a side effect (I think a good thing).
        Hide
        mikemccand Michael McCandless added a comment -

        Attached patch. All tests pass. I plan to commit in a day or two.

        This adds two methods to IndexWriter:

        expungeDeletes() – defaults to doWait=true
        expungeDeletes(boolean doWait)

        If doWait is false, and you have a MergeScheduler that runs merges in
        BG threads, then the call returns immediately.

        I extended MergePolicy so it decides what "expunge deletes" really
        means (findMergesToExpungeDeletes). Then, in LogMergePolicy, I
        implemented this policy: we merge all adjacent segments (up to
        mergeFactor at once) that have deletes. If only 1 segment has
        deletes, it's a singular merge.

        Show
        mikemccand Michael McCandless added a comment - Attached patch. All tests pass. I plan to commit in a day or two. This adds two methods to IndexWriter: expungeDeletes() – defaults to doWait=true expungeDeletes(boolean doWait) If doWait is false, and you have a MergeScheduler that runs merges in BG threads, then the call returns immediately. I extended MergePolicy so it decides what "expunge deletes" really means (findMergesToExpungeDeletes). Then, in LogMergePolicy, I implemented this policy: we merge all adjacent segments (up to mergeFactor at once) that have deletes. If only 1 segment has deletes, it's a singular merge.
        Hide
        mikemccand Michael McCandless added a comment -

        I just committed this. Thanks John! And sorry for the looooong
        delay.

        I also added an "these APIs are experimental" warning on top of
        MergePolicy and MergeScheduler (which I should have done before
        2.3 , though I don't expect alot of usage of these).

        Show
        mikemccand Michael McCandless added a comment - I just committed this. Thanks John! And sorry for the looooong delay. I also added an "these APIs are experimental" warning on top of MergePolicy and MergeScheduler (which I should have done before 2.3 , though I don't expect alot of usage of these).

          People

          • Assignee:
            mikemccand Michael McCandless
            Reporter:
            john.wang@gmail.com John Wang
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development