Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.5, 4.0-ALPHA
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Similar to optimize(), expungeDeletes() has a misleading name.

      We already had problems with this on the user list because TieredMergePolicy
      didn't 'expunge' all their deletes.

      Also I think expunge is the wrong word, because expunge makes it seem
      like you just wrangle up the deletes and kick them out of the party and
      that it should be fast.

      1. LUCENE-3577.patch
        25 kB
        Michael McCandless

        Activity

        Robert Muir added a comment -

        Also I think this method could do with some javadocs cleanup:
        from the javadocs it is practically begging you to call it if you
        ever delete, but doesn't TieredMP already handle this well?

        Michael McCandless added a comment -

        +1. The name does not indicate how horribly costly the operation is.

        And it leads to apps deleting/updating a few docs and then calling .expungeDeletes.

        We could also remove the method entirely? TieredMP already "favors" merges that reclaim more deletes (other things being equal), and you can change how strongly it does so (TMP.setReclaimDeletesWeight).

        In practice expungeDeletes will usually be just like forceMerge(1) so for apps that must have no deletes (eg maybe they need docFreq to be 100% accurate), they can call forceMerge(1) instead.
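        To make the "favors merges that reclaim more deletes" point concrete, here is a toy sketch of how such a weight could tip the choice between two candidate merges. It is an illustration only, not TieredMergePolicy's actual scoring code: the Candidate class, baseScore, liveFraction, score and pick are all hypothetical names, and the formula (base score multiplied by the surviving fraction raised to the weight, lower being better) is a simplifying assumption.

```java
public class ReclaimWeightSketch {

    /** Hypothetical merge candidate; a lower baseScore means a more attractive merge. */
    public static final class Candidate {
        public final String name;
        public final double baseScore;    // stand-in for size/skew considerations
        public final double liveFraction; // bytes after merge / bytes before merge

        public Candidate(String name, double baseScore, double liveFraction) {
            this.name = name;
            this.baseScore = baseScore;
            this.liveFraction = liveFraction;
        }
    }

    /** Assumed scoring rule: lower is better, and a larger weight rewards reclaiming deletes. */
    public static double score(Candidate c, double reclaimDeletesWeight) {
        return c.baseScore * Math.pow(c.liveFraction, reclaimDeletesWeight);
    }

    /** Returns whichever candidate has the lower (better) score for the given weight. */
    public static Candidate pick(Candidate a, Candidate b, double reclaimDeletesWeight) {
        return score(a, reclaimDeletesWeight) <= score(b, reclaimDeletesWeight) ? a : b;
    }

    public static void main(String[] args) {
        // "tidy" is the better merge on size alone but reclaims only 5% of its bytes;
        // "deleteHeavy" is slightly worse on size but would drop 30% of its bytes.
        Candidate tidy = new Candidate("tidy", 0.8, 0.95);
        Candidate deleteHeavy = new Candidate("deleteHeavy", 1.0, 0.70);

        for (double weight : new double[] {0.0, 2.0}) {
            System.out.println("weight " + weight + " -> " + pick(tidy, deleteHeavy, weight).name);
        }
    }
}
```

        With a weight of 0 the deletes are ignored and the "tidy" candidate wins; with a weight of 2 the delete-heavy candidate wins, which is the kind of knob the comment refers to.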

        Yonik Seeley added a comment -

        In practice expungeDeletes will usually be just like forceMerge(1) so for apps that must have no deletes (eg maybe they need docFreq to be 100% accurate), they can call forceMerge(1) instead.

        If there are just a few deletes in a few small segments, using optimize instead of expungeDeletes is much more expensive?
        Although, it doesn't really seem like an important use case (ensuring there are no deletes).

        Hoss Man added a comment -

        If there are just a few deletes in a few small segments, using optimize instead of expungeDeletes is much more expensive?

        that's what i was wondering ...

        in most incrementally updated indexes i've seen for structured content (eg: products, news, blogs, patents, etc...), the "recent" documents are the only things likely to get updates (eg: a news story published in the past hour has a decent chance of getting an update, a news story published yesterday might get a typo fixed, but a news story published a year ago isn't likely to ever get updated). so in a traditional merged segment structure the newer/smaller segments are the only ones that tend to have deletes – the bigger older segments are mostly stagnant except when involved in merging. An expungeDeletes call that only touches the small "recent" segments is going to be a lot faster than a full optimize, correct?
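        A rough way to see the difference is to count how many live documents each approach would have to rewrite. The sketch below is a toy model, not Lucene code: Segment, costMergeEverything and costMergeDeletedOnly are hypothetical names, the segment sizes are made up, and it assumes that a delete-focused merge rewrites every segment holding at least one delete while ignoring any merge-policy thresholds.

```java
import java.util.Arrays;
import java.util.List;

public class MergeCostSketch {

    /** Toy segment: number of documents written, and how many of them are marked deleted. */
    public static final class Segment {
        public final int maxDoc;
        public final int delCount;

        public Segment(int maxDoc, int delCount) {
            this.maxDoc = maxDoc;
            this.delCount = delCount;
        }

        int liveDocs() {
            return maxDoc - delCount;
        }
    }

    /** Live docs rewritten when every segment is merged into one (optimize-style). */
    public static long costMergeEverything(List<Segment> segments) {
        long cost = 0;
        for (Segment s : segments) {
            cost += s.liveDocs();
        }
        return cost;
    }

    /** Live docs rewritten when only segments that hold deletes are merged. */
    public static long costMergeDeletedOnly(List<Segment> segments) {
        long cost = 0;
        for (Segment s : segments) {
            if (s.delCount > 0) {
                cost += s.liveDocs();
            }
        }
        return cost;
    }

    public static void main(String[] args) {
        // Deletes confined to the small, recent segments.
        List<Segment> recentOnly = Arrays.asList(
            new Segment(1_000_000, 0), new Segment(500_000, 0),
            new Segment(2_000, 50), new Segment(500, 20));
        // Deletes spread across every segment, old and new.
        List<Segment> spread = Arrays.asList(
            new Segment(1_000_000, 10), new Segment(500_000, 10),
            new Segment(2_000, 50), new Segment(500, 20));

        System.out.println("recent-only: everything=" + costMergeEverything(recentOnly)
            + " deleted-only=" + costMergeDeletedOnly(recentOnly));
        System.out.println("spread:      everything=" + costMergeEverything(spread)
            + " deleted-only=" + costMergeDeletedOnly(spread));
    }
}
```

        In the first layout the delete-only pass touches a tiny fraction of the index; in the second, where every segment has at least one delete, the two costs are identical, which matches the earlier remark that the call will usually be just like forceMerge(1).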

        Although, it doesn't really seem like an important use case (ensuring there are no deletes).

        I'm constantly surprised by the number of people who are really picky about ensuring that their tf/idf numbers are exact because they use them in a weird way – it's definitely an expert level concern, but if those people are willing to spend the time expunging deletes and we already have the code, might as well leave it in right?

        i think this is really just a question of naming/documentation: the method doesn't sound as sexy as optimize, but if someone stumbles upon it and thinks "oh wow, i guess i have to call this for my deletes to really be deleted" that's bad. likewise the javadocs encourage/imply that this method should be called, instead of just explaining that it can be called and what it does.

        I don't have a good suggestion for the name, but the doc is really the issue...

        ...When an index has many document deletions (or updates to existing documents), it's best to either call optimize or expungeDeletes to remove all unused data in the index associated with the deleted documents. To see how many deletions you have pending in your index, call IndexReader.numDeletedDocs(). This saves disk space and memory usage while searching. ...

        ...nothing in that description describes the downsides/cost of the method.

        Michael McCandless added a comment -

        How about forceMergeDeletes?

        Robert Muir added a comment -

        I'm constantly surprised by the number of people who are really picky about ensuring that their tf/idf numbers are exact because they use them in a weird way

        Do they know how we store normalization factors?

        Michael McCandless added a comment -

        Patch w/ rote rename to forceMergeDeletes.

        Uwe Schindler added a comment -

        Bulk close after release of 3.5


          People

          • Assignee:
            Michael McCandless
            Reporter:
            Robert Muir