Lucene - Core
  1. Lucene - Core
  2. LUCENE-3569

Consolidate IndexWriter's optimize, maybeMerge and expungeDeletes under one merge(MP) method

    Details

    • Type: Improvement Improvement
    • Status: Open
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Today, IndexWriter exposes 3 methods for 'cleaning up' / 'compacting' / 'optimizing' your index:

      • optimize() – merges as much segments as possible (down to 1 segment), and is discouraged in many cases because of its performance implications.
      • maybeMerge() – runs 'subtle' merges. Attempts to balance the index by not leaving too many segments, yet not merging large segments if unneeded.
      • expungeDeletes() – cleans up deleted documents from segments and on the go merges them.
      • a default MP that can be set on IndexWriterConfig, for ongoing merges IW performs (i.e. as a result of flushing a new segment).

      These methods are confusing in several levels:

      • Their names are misleading, see LUCENE-3454.
      • Why does expungeDeletes need to merge segments?
      • Eventually, they really do what the MergePolicy decides that should be done. I.e., one could write an MP that always merges all segments, and therefore calling maybeMerge would not be so subtle anymore. On the other hand, one could write an MP that never merges large segments (we in fact have several of those), and therefore calling optimize(1) would not end up with one segment.

      So the proposal is to replace all these methods with a single one merge(MergePolicy) (more on the names later). MergePolicy will have only one method findSegmentsForMerge and the caller will be responsible to configure it in order to perform the needed merges. We will provide ready-to-use MPs:

      • LightMergePolicy – for setting on IWC and doing the ongoing merges IW executes. This one will pick segments respecting various parameters such as mergeFactor, segmentSizes etc.
      • HeavyMergePolicy – for doing the optimize()-style merges.
      • ExpungeDeletesMergePolicy – for expunging deletes (my proposal is to drop segment merging from it, by default).

      Now about the names:

      • I think that it will be good, API-backcompat wise and in general, if we name that method doMaintenance (as expungeDeletes does not have to merge anything).
      • Instead of MergePolicy we call it MaintenancePolicy and similarly its single method findSegmentsForMaintenance, or getMaintenanceSpecification.
      • I called the MPs Light and Heavy just for the text, I think a better name should be found, but nothing comes up to mind now.

      It will allow us to use this on 3.x, by deprecating MP and all related methods.

        Activity

        Hide
        Michael McCandless added a comment -

        For natural merges I think the existing MergePolicy makes sense: it's
        embedded into IW (IWC) and is invoked whenever there is a change to
        the segments (eg, new segment flushed).

        But for forced merges (either forceMerge or expungeDeletes)... I don't
        think we need a new MergePolicy-like class? Can't this "outside
        logic" simply invoke registerMerge() directly on the incoming IW?

        So eg in contrib/misc (say), we'd add a new IndexUtils class (or
        something); it has a static method "expungeDeletes", that takes an IW
        instance. When the app calls that method, it inspects the IW's
        segments, chooses its merges, and registers them.

        Just like a MergePolicy, the method would have to check which merges
        are already running/registered (IW.getMergingSegments) and "work
        around" them. EG, if there are 7 segments with deletions, you check
        and see that 4 of them are already merging / scheduled for merge, so
        you know you only have to merge the other 3.

        Show
        Michael McCandless added a comment - For natural merges I think the existing MergePolicy makes sense: it's embedded into IW (IWC) and is invoked whenever there is a change to the segments (eg, new segment flushed). But for forced merges (either forceMerge or expungeDeletes)... I don't think we need a new MergePolicy-like class? Can't this "outside logic" simply invoke registerMerge() directly on the incoming IW? So eg in contrib/misc (say), we'd add a new IndexUtils class (or something); it has a static method "expungeDeletes", that takes an IW instance. When the app calls that method, it inspects the IW's segments, chooses its merges, and registers them. Just like a MergePolicy, the method would have to check which merges are already running/registered (IW.getMergingSegments) and "work around" them. EG, if there are 7 segments with deletions, you check and see that 4 of them are already merging / scheduled for merge, so you know you only have to merge the other 3.
        Hide
        Robert Muir added a comment -

        Mike suggested that we keep MP passed to IWC, and introduce this new class for the explicit merge() call. I like that idea, even though they will both have similar API signatures: findSegmentsForMerge, findSegmentsForMaintenance. But, it will also solve Robert's concern – if we develop an OptimizingMaintenancePolicy, you cannot pass it to IWC.

        After thinking about all this, I honestly don't understand the benefit of this change. It
        just seems to shuffle methods around.

        Having a mergepolicy with findMergesToExpunge() [as 99.999999999999999999999999999999999% of users will never write a mergepolicy] seems a lot better than making a vague class like maintenance policy [that also itself might also contain a merge policy?], just so we have what... one method on indexwriter?

        I don't think we should make this change: as long as we support dangerous things like 'optimize' (force-merge) then I want them to be totally explicit simple methods, not masked under layers. forceMerge(N) solves the naming problem of optimize() completely for me, and its even more explicit with 'force'.

        Show
        Robert Muir added a comment - Mike suggested that we keep MP passed to IWC, and introduce this new class for the explicit merge() call. I like that idea, even though they will both have similar API signatures: findSegmentsForMerge, findSegmentsForMaintenance. But, it will also solve Robert's concern – if we develop an OptimizingMaintenancePolicy, you cannot pass it to IWC. After thinking about all this, I honestly don't understand the benefit of this change. It just seems to shuffle methods around. Having a mergepolicy with findMergesToExpunge() [as 99.999999999999999999999999999999999% of users will never write a mergepolicy] seems a lot better than making a vague class like maintenance policy [that also itself might also contain a merge policy?] , just so we have what... one method on indexwriter? I don't think we should make this change: as long as we support dangerous things like 'optimize' (force-merge) then I want them to be totally explicit simple methods, not masked under layers. forceMerge(N) solves the naming problem of optimize() completely for me, and its even more explicit with 'force'.
        Hide
        Shai Erera added a comment -

        I get your point. Today, it is also possible for IW to initiate a merge (ongoing) and you call optimize(). Both calls generate MergeSpec which tell IW what to do. They 'interact' through IW's running and pending merges lists. Whatever is currently running is managed by a runningMerges<OneMerge> list and there is another list for pendingMerges<OneMerge). Whatever MP returns is added to the pendingMerges list, when MergeScheduler takes one merge from the list it is added to the runningMerges list, and that's more or less it.

        I think that if we make both MPs (Merge + Maintenance) return a MergeSpec, then we will be able to reuse to same interaction that exists today. Though I'm not sure how MergeSpec + MergeScheduler + IndexWriter could understand a OneMerge that denotes "just expunge deletes, don't merge", but that's either a bridge we'll need to cross when we make some progress here, or a requirement that we'll let go (i.e., someone could argue "if you're already rewriting segments X, Y and Z cuz they have deletes, why not merge them?").

        Show
        Shai Erera added a comment - I get your point. Today, it is also possible for IW to initiate a merge (ongoing) and you call optimize(). Both calls generate MergeSpec which tell IW what to do. They 'interact' through IW's running and pending merges lists. Whatever is currently running is managed by a runningMerges<OneMerge> list and there is another list for pendingMerges<OneMerge). Whatever MP returns is added to the pendingMerges list, when MergeScheduler takes one merge from the list it is added to the runningMerges list, and that's more or less it. I think that if we make both MPs (Merge + Maintenance) return a MergeSpec, then we will be able to reuse to same interaction that exists today. Though I'm not sure how MergeSpec + MergeScheduler + IndexWriter could understand a OneMerge that denotes "just expunge deletes, don't merge", but that's either a bridge we'll need to cross when we make some progress here, or a requirement that we'll let go (i.e., someone could argue "if you're already rewriting segments X, Y and Z cuz they have deletes, why not merge them?").
        Hide
        Hoss Man added a comment -

        Why would they need to interact?

        Maybe this is just my IW+MP ignorance bubbling up, but by interact I mean what if background merges are already taking place when an explicit merge(MP) method is called? From a Thread standpoint this is something that's possible right now – but not with conflicting instructions about which segments to merge. Admittedly this question was predicated on the misunderstanding that there would be a single MP API (regardless of whether it's still called MergePolicy or changes to MaintencePolicy) used in both contexts (IWC and IW.merge(...)) but it still seems like something that could be problematic .. maybe i'm just completley misunderstanding though.

        Show
        Hoss Man added a comment - Why would they need to interact? Maybe this is just my IW+MP ignorance bubbling up, but by interact I mean what if background merges are already taking place when an explicit merge(MP) method is called? From a Thread standpoint this is something that's possible right now – but not with conflicting instructions about which segments to merge. Admittedly this question was predicated on the misunderstanding that there would be a single MP API (regardless of whether it's still called MergePolicy or changes to MaintencePolicy) used in both contexts (IWC and IW.merge(...)) but it still seems like something that could be problematic .. maybe i'm just completley misunderstanding though.
        Hide
        Shai Erera added a comment -

        "Maintenance" of what?

        Segment management is all that IW does (in the context of MergePolicy) - it merges them from time to time, and deletes unused segments. I don't mind if we call it SegmentMaintenacePolicy to make it more clear. In general I find terms like 'management' not always clarifying what the object is about, but in this case I think it's not so bad. So if others like the name, I'm ok with it as well.

        how exactly would the MP specified on the IWC interact with the MP passed explicitly to the merge(MP) method?

        Why would they need to interact? I think that the MP passed to IWC should be used for ongoing merges that IW initiates due to flushing new segments, and the one that you pass to merge() should do explicit actions, that are usually not done as part of the ongoing merges (like optimize, expunging deletes etc.).

        Mike suggested that we keep MP passed to IWC, and introduce this new class for the explicit merge() call. I like that idea, even though they will both have similar API signatures: findSegmentsForMerge, findSegmentsForMaintenance. But, it will also solve Robert's concern – if we develop an OptimizingMaintenancePolicy, you cannot pass it to IWC.

        Yet, if someone wants to pass such a policy to IWC, he can create a MergePolicy wrapper to it – but that is an explicit action someone does and not a naive mistake.

        Show
        Shai Erera added a comment - "Maintenance" of what? Segment management is all that IW does (in the context of MergePolicy) - it merges them from time to time, and deletes unused segments. I don't mind if we call it SegmentMaintenacePolicy to make it more clear. In general I find terms like 'management' not always clarifying what the object is about, but in this case I think it's not so bad. So if others like the name, I'm ok with it as well. how exactly would the MP specified on the IWC interact with the MP passed explicitly to the merge(MP) method? Why would they need to interact? I think that the MP passed to IWC should be used for ongoing merges that IW initiates due to flushing new segments, and the one that you pass to merge() should do explicit actions, that are usually not done as part of the ongoing merges (like optimize, expunging deletes etc.). Mike suggested that we keep MP passed to IWC, and introduce this new class for the explicit merge() call. I like that idea, even though they will both have similar API signatures: findSegmentsForMerge, findSegmentsForMaintenance. But, it will also solve Robert's concern – if we develop an OptimizingMaintenancePolicy, you cannot pass it to IWC. Yet, if someone wants to pass such a policy to IWC, he can create a MergePolicy wrapper to it – but that is an explicit action someone does and not a naive mistake.
        Hide
        Robert Muir added a comment -

        Are you also going to hack IndexWriter.forceMerge to throw IAE if the user specifies "1" ? Other then the name, how is this different exactly?

        Because its totally different if someone uses it for a 'one-off' time versus setting it in indexwriterconfig!

        Show
        Robert Muir added a comment - Are you also going to hack IndexWriter.forceMerge to throw IAE if the user specifies "1" ? Other then the name, how is this different exactly? Because its totally different if someone uses it for a 'one-off' time versus setting it in indexwriterconfig!
        Hide
        Hoss Man added a comment -

        I would add an instanceof check so that this returns IAE if you do this

        IWconfig.setMaintenancePolicy(new OptimizingMaintenancePolicy()).

        Are you also going to hack IndexWriter.forceMerge to throw IAE if the user specifies "1" ? Other then the name, how is this different exactly?

        It doesn't seem like there has actually been any fundamental objection to the main idea of this issue, or to renaming "MergePolicy" or "MaintenancePolicy" – the main contention here seems to be the idea of an "OptimizingMaintenancePolicy" in light of the entire "optimize sounds cool there for is bad" meme. So can we agree that the specific example of "OptimizingMaintenancePolicy" is a bad idea and that an instance like that should probably have a name more explicit about it's purpose like "AlwaysForceMergeToASingleSegmentMaintenancePolicy" (or better still: "NeverExceedNSegmentsMaintenancePolicy" with a single constructor that requires an "int maxSegments" property)

        Personally i like the idea of simplifying the methods, and having an abstraction like MergePolicy handle the subtleties of "merge down to N segments" vs "expunge deletes (and maybe merge) but my two concerns (as someone who doesn't understand the nuances of the impl under the IndexWriter covers) are:

        1) "Maintenance" of what? ... if it's just about the segments (but not just merging, because maybe you have one that does delete expunging w/o merging) then "SegmentManagementPolicy" might be better. it's not like this thing is doing logical document maintenance, or cleaning up unused files on disk.

        2) how exactly would the MP specified on the IWC interact with the MP passed explicitly to the merge(MP) method? does the merge(MP) method completely ignore/override the configured MP? ... that seems like something that could be incredibly error prone. Would it be better to use a pattern where some MPs have public methods for "forceMerge(int)" and "expungeDeletes()" and encourage client code that wants programmatic control of this type of thing to keep a ref to the IWC->MP they are using and call methods on it directly?

        Show
        Hoss Man added a comment - I would add an instanceof check so that this returns IAE if you do this IWconfig.setMaintenancePolicy(new OptimizingMaintenancePolicy()). Are you also going to hack IndexWriter.forceMerge to throw IAE if the user specifies "1" ? Other then the name, how is this different exactly? It doesn't seem like there has actually been any fundamental objection to the main idea of this issue, or to renaming "MergePolicy" or "MaintenancePolicy" – the main contention here seems to be the idea of an "OptimizingMaintenancePolicy" in light of the entire "optimize sounds cool there for is bad" meme. So can we agree that the specific example of "OptimizingMaintenancePolicy" is a bad idea and that an instance like that should probably have a name more explicit about it's purpose like "AlwaysForceMergeToASingleSegmentMaintenancePolicy" (or better still: "NeverExceedNSegmentsMaintenancePolicy" with a single constructor that requires an "int maxSegments" property) Personally i like the idea of simplifying the methods, and having an abstraction like MergePolicy handle the subtleties of "merge down to N segments" vs "expunge deletes (and maybe merge) but my two concerns (as someone who doesn't understand the nuances of the impl under the IndexWriter covers) are: 1) "Maintenance" of what? ... if it's just about the segments (but not just merging, because maybe you have one that does delete expunging w/o merging) then "SegmentManagementPolicy" might be better. it's not like this thing is doing logical document maintenance, or cleaning up unused files on disk. 2) how exactly would the MP specified on the IWC interact with the MP passed explicitly to the merge(MP) method? does the merge(MP) method completely ignore/override the configured MP? ... that seems like something that could be incredibly error prone. Would it be better to use a pattern where some MPs have public methods for "forceMerge(int)" and "expungeDeletes()" and encourage client code that wants programmatic control of this type of thing to keep a ref to the IWC->MP they are using and call methods on it directly?
        Hide
        Michael McCandless added a comment -

        Maybe the new "maintenance" class shouldn't even be a MergePolicy? This way you can't accidentally stick it onto your IWC as the natural merge policy.

        Or, we should un-twist things here. Why pass a class to IW only to have IW invoke methods on that class to get the merges out? We do this for the natural MP because on events like flushing / merge finishes we need to invoke it. But maybe this maintenance class could instead be a static method somewhere and you pass your IW to it and it directly invokes .registerMerge/.merge ops against the IW.

        Maybe even MP could be un-twisted... like if it had way to register w/ IW and be notified when segments changed...

        Show
        Michael McCandless added a comment - Maybe the new "maintenance" class shouldn't even be a MergePolicy? This way you can't accidentally stick it onto your IWC as the natural merge policy. Or, we should un-twist things here. Why pass a class to IW only to have IW invoke methods on that class to get the merges out? We do this for the natural MP because on events like flushing / merge finishes we need to invoke it. But maybe this maintenance class could instead be a static method somewhere and you pass your IW to it and it directly invokes .registerMerge/.merge ops against the IW. Maybe even MP could be un-twisted... like if it had way to register w/ IW and be notified when segments changed...
        Hide
        Mark Miller added a comment -

        Not a big fan of the maintenance name. Anyone have any alternatives? MergePolicy still seems better to me - even considering deletes.

        Show
        Mark Miller added a comment - Not a big fan of the maintenance name. Anyone have any alternatives? MergePolicy still seems better to me - even considering deletes.
        Hide
        Shai Erera added a comment -

        Ok then add the instanceof check. Easily override-able, but at least it'll be intentional.

        Show
        Shai Erera added a comment - Ok then add the instanceof check. Easily override-able, but at least it'll be intentional.
        Hide
        Robert Muir added a comment -

        I would add an instanceof check so that this returns IAE if you do this

        IWconfig.setMaintenancePolicy(new OptimizingMaintenancePolicy()).

        i don't care about the fact someone can write their own MP that does this,
        I care about what we provide out of box: this 'looks' like normal code.

        This stuff about not reading javadocs is bullshit. That only covers
        the first time the software is written. Sorry guys, 95% of software
        is maintenance, not silicon valley where all code is new and being
        written once in the hope it will be useful.

        When someone has to pick up the work from another guy and is busy
        tracking down bugs, code like the above looks totally reasonable
        at a glance based on the naming.

        Show
        Robert Muir added a comment - I would add an instanceof check so that this returns IAE if you do this IWconfig.setMaintenancePolicy(new OptimizingMaintenancePolicy()). i don't care about the fact someone can write their own MP that does this, I care about what we provide out of box: this 'looks' like normal code. This stuff about not reading javadocs is bullshit. That only covers the first time the software is written. Sorry guys, 95% of software is maintenance, not silicon valley where all code is new and being written once in the hope it will be useful. When someone has to pick up the work from another guy and is busy tracking down bugs, code like the above looks totally reasonable at a glance based on the naming.
        Hide
        Shai Erera added a comment -

        What instanceof are you talking about? There won't be "just one" MP for optimizing, or regular merges or whatever cleanup. We will provide OOtB MPs for different tasks, default to a one that makes sense for regular merges, and let the user call merge()/doMaintenance()/pickYouFavoriteName() whenever he wants, but we'll tell him to pass an MP for that.

        What's wrong with that? Why our methods can take a maxNumSegments and not timeLimitForMerge or whatever criteria that determines when to stop? Instead of coding that into the method signature, we allow users to pass MPs that they want.

        I really think that you're fooling yourself if you think that forceMerge() is more intimidating than merge(TotalSegmentsMergePolicy) or something. Do you really believe our users are that stupid that they don't read javadocs at all !?

        Show
        Shai Erera added a comment - What instanceof are you talking about? There won't be "just one" MP for optimizing, or regular merges or whatever cleanup. We will provide OOtB MPs for different tasks, default to a one that makes sense for regular merges, and let the user call merge()/doMaintenance()/pickYouFavoriteName() whenever he wants, but we'll tell him to pass an MP for that. What's wrong with that? Why our methods can take a maxNumSegments and not timeLimitForMerge or whatever criteria that determines when to stop? Instead of coding that into the method signature, we allow users to pass MPs that they want. I really think that you're fooling yourself if you think that forceMerge() is more intimidating than merge(TotalSegmentsMergePolicy) or something. Do you really believe our users are that stupid that they don't read javadocs at all !?
        Hide
        Robert Muir added a comment -

        If the MP's name is clear enough that this is a devastating thing to do 'regularly' then isn't it the user's decision whether to set it or not?

        No, if we do this: I will add the instanceof+IllegalArgumentException to IndexWriterConfig myself. I think this is a step backwards from
        renaming optimize to sound cool, now its starting to imply that its regularly scheduled required maintenance, and we are making it
        possible to automate the O(n^2) indexing that some people are doing today...

        Show
        Robert Muir added a comment - If the MP's name is clear enough that this is a devastating thing to do 'regularly' then isn't it the user's decision whether to set it or not? No, if we do this: I will add the instanceof+IllegalArgumentException to IndexWriterConfig myself. I think this is a step backwards from renaming optimize to sound cool, now its starting to imply that its regularly scheduled required maintenance, and we are making it possible to automate the O(n^2) indexing that some people are doing today...
        Hide
        Robert Muir added a comment -

        We cannot really prevent it - someone could write a 'regular' MP that always optimizes.

        My problem is not that, its us taking optimize, renaming it to a maintenance policy that a user
        can configure to run all the time with indexwriter.

        no way!

        Show
        Robert Muir added a comment - We cannot really prevent it - someone could write a 'regular' MP that always optimizes. My problem is not that, its us taking optimize, renaming it to a maintenance policy that a user can configure to run all the time with indexwriter. no way!
        Hide
        Shai Erera added a comment -
        i dont like pushing optimize-type merging into an MP, because then
        its settable in IWConfig (as the regular MP).
        

        So? I have a code path which opens an IndexWriter just to run index optimization and closes it. The thing is - it's the user who sets the MP, and if he wants to set an MP that always does optimize, why should we block him? If the MP's name is clear enough that this is a devastating thing to do 'regularly' then isn't it the user's decision whether to set it or not?

        We cannot really prevent it - someone could write a 'regular' MP that always optimizes. The changes in this issue propose to remove the distinction between optimize, maybeMerge and expungeDeletes. You do index maintenance and you define what maintenance you'd like to do through the maintenance policy (however we'll end up calling it).

        Show
        Shai Erera added a comment - i dont like pushing optimize-type merging into an MP, because then its settable in IWConfig (as the regular MP). So? I have a code path which opens an IndexWriter just to run index optimization and closes it. The thing is - it's the user who sets the MP, and if he wants to set an MP that always does optimize, why should we block him? If the MP's name is clear enough that this is a devastating thing to do 'regularly' then isn't it the user's decision whether to set it or not? We cannot really prevent it - someone could write a 'regular' MP that always optimizes. The changes in this issue propose to remove the distinction between optimize, maybeMerge and expungeDeletes. You do index maintenance and you define what maintenance you'd like to do through the maintenance policy (however we'll end up calling it).
        Hide
        Robert Muir added a comment -

        i dont like pushing optimize-type merging into an MP, because then
        its settable in IWConfig (as the regular MP).

        Show
        Robert Muir added a comment - i dont like pushing optimize-type merging into an MP, because then its settable in IWConfig (as the regular MP).

          People

          • Assignee:
            Unassigned
            Reporter:
            Shai Erera
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:

              Development