Lucene - Core
  1. Lucene - Core
  2. LUCENE-4472

Add setting that prevents merging on updateDocument

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 4.0
    • Fix Version/s: 4.1, 6.0
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      Currently we always call maybeMerge if a segment was flushed after updateDocument. Some apps and in particular ElasticSearch uses some hacky workarounds to disable that ie for merge throttling. It should be easier to enable this kind of behavior.

      1. LUCENE-4472.patch
        16 kB
        Simon Willnauer
      2. LUCENE-4472.patch
        12 kB
        Simon Willnauer
      3. LUCENE-4472.patch
        8 kB
        Simon Willnauer

        Activity

        Hide
        Simon Willnauer added a comment -

        here is a patch that adds such an option to IWC via live settings.

        Show
        Simon Willnauer added a comment - here is a patch that adds such an option to IWC via live settings.
        Hide
        Michael McCandless added a comment -

        Neat How does ES customize / throttle its merging?

        Maybe the setting should mean we never call maybeMerge implicitly? (Ie, neither on close nor NRT reader or any other time), rather than just singling out updateDocument/addDocument? Eg if we add other methods in the future (like field updates), it should also prevent those from doing merges?

        Show
        Michael McCandless added a comment - Neat How does ES customize / throttle its merging? Maybe the setting should mean we never call maybeMerge implicitly? (Ie, neither on close nor NRT reader or any other time), rather than just singling out updateDocument/addDocument? Eg if we add other methods in the future (like field updates), it should also prevent those from doing merges?
        Hide
        selckin added a comment -

        +1, if you have a lot of indexes and they all start merging at the same time it can be quite taxing

        I think ES has dedicated configurable thread pool where for each index a maybeMerge() is scheduled on an interval. (Size of thread pool limits number of concurrent merges)

        Show
        selckin added a comment - +1, if you have a lot of indexes and they all start merging at the same time it can be quite taxing I think ES has dedicated configurable thread pool where for each index a maybeMerge() is scheduled on an interval. (Size of thread pool limits number of concurrent merges)
        Hide
        Shai Erera added a comment -

        Patch looks good. One minor comment – LiveIWC.setMaybeMergeAfterFlush() contains a redundant 'a' in its javadocs --> ".. after a each segment".

        Maybe the setting should mean we never call maybeMerge implicitly?

        Perhaps the issue's description should change to "Add setting to prevent merging on segment flush". Because as I understand the fix, the check is made only after a segment has been flushed, which already covers addDocument/updateDocument as well as future field updates?

        Show
        Shai Erera added a comment - Patch looks good. One minor comment – LiveIWC.setMaybeMergeAfterFlush() contains a redundant 'a' in its javadocs --> ".. after a each segment". Maybe the setting should mean we never call maybeMerge implicitly? Perhaps the issue's description should change to "Add setting to prevent merging on segment flush". Because as I understand the fix, the check is made only after a segment has been flushed, which already covers addDocument/updateDocument as well as future field updates?
        Hide
        Robert Muir added a comment -

        Can we consider instead giving MergePolicy the proper context here instead of adding a boolean?

        This seems more flexible.

        Show
        Robert Muir added a comment - Can we consider instead giving MergePolicy the proper context here instead of adding a boolean? This seems more flexible.
        Hide
        Shay Banon added a comment -

        Agree with Robert on the additional context flag, that would make things most flexible. A flag on IW makes things simpler from the user perspective though, cause then there is no need to customize the built in merge policies.

        Show
        Shay Banon added a comment - Agree with Robert on the additional context flag, that would make things most flexible. A flag on IW makes things simpler from the user perspective though, cause then there is no need to customize the built in merge policies.
        Hide
        Robert Muir added a comment -

        I don't think you need to customize anything built in. Just delegate and forward findMerges()?

        The problem is this doesn't really have the necessary context today: I think we should fix that.
        But I'd like policies around merging to stay in ... MergePolicy

        Otherwise IWC could easily get cluttered with conflicting options which makes it complex.

        Show
        Robert Muir added a comment - I don't think you need to customize anything built in. Just delegate and forward findMerges()? The problem is this doesn't really have the necessary context today: I think we should fix that. But I'd like policies around merging to stay in ... MergePolicy Otherwise IWC could easily get cluttered with conflicting options which makes it complex.
        Hide
        Simon Willnauer added a comment -

        The MergePolicy is tricky here since we clone the MP in IW you need to actually pull and cast the MP from IW to change the setting if you want to do this in realtime. Maybe we can add something like this to MP so we can change MergeSettings in RT too. Otherwise you need to build a special MP but we can certainly do that.

        Show
        Simon Willnauer added a comment - The MergePolicy is tricky here since we clone the MP in IW you need to actually pull and cast the MP from IW to change the setting if you want to do this in realtime. Maybe we can add something like this to MP so we can change MergeSettings in RT too. Otherwise you need to build a special MP but we can certainly do that.
        Hide
        Simon Willnauer added a comment -

        I played around with this and sketched out how this could look like. I don't think we should entirely break BW compat but open up the context like I did in the patch. There are still edges in the patch but some initial feedback would be great.

        Show
        Simon Willnauer added a comment - I played around with this and sketched out how this could look like. I don't think we should entirely break BW compat but open up the context like I did in the patch. There are still edges in the patch but some initial feedback would be great.
        Hide
        Robert Muir added a comment -

        I like this patch much better than the first one.

        As far as back compat, I'm not sure if we should try to do anything tricky, the current patch isn't really a break
        it just allows MP to handle this stuff at a more fine-grained level, I think its fine.

        p.s. UNKOWN and EMPTY_FOCED_SEGMENTS look like typos

        Show
        Robert Muir added a comment - I like this patch much better than the first one. As far as back compat, I'm not sure if we should try to do anything tricky, the current patch isn't really a break it just allows MP to handle this stuff at a more fine-grained level, I think its fine. p.s. UNKOWN and EMPTY_FOCED_SEGMENTS look like typos
        Hide
        Michael McCandless added a comment -

        I like the new MergeCause enum!

        But, instead of folding all parameters into a MergeContext, and exposing a single MergePolicy.findMerges methods, can we keep the methods we have today and just add MergeCause as another parameter? This is a very expert API and I think it's fine to simply change it. I think this approach is more type-safe for the future, ie if we need to add something important such that a custom merge policy should pay attention to it ... apps will see compilation errors on upgrading and know they have to handle the new parameter.

        Show
        Michael McCandless added a comment - I like the new MergeCause enum! But, instead of folding all parameters into a MergeContext, and exposing a single MergePolicy.findMerges methods, can we keep the methods we have today and just add MergeCause as another parameter? This is a very expert API and I think it's fine to simply change it. I think this approach is more type-safe for the future, ie if we need to add something important such that a custom merge policy should pay attention to it ... apps will see compilation errors on upgrading and know they have to handle the new parameter.
        Hide
        Michael McCandless added a comment -

        Actually I think we only need to add the MergeCause (maybe rename this to MergeTrigger?) param to findMerges? That method is invoked for natural merges, and knowing the trigger for the natural merge is useful...

        The other two methods (findForceMerges, findForcedDeletesMerges) are only triggered when the app explicitly asked IndexWriter to do so.

        Show
        Michael McCandless added a comment - Actually I think we only need to add the MergeCause (maybe rename this to MergeTrigger?) param to findMerges? That method is invoked for natural merges, and knowing the trigger for the natural merge is useful... The other two methods (findForceMerges, findForcedDeletesMerges) are only triggered when the app explicitly asked IndexWriter to do so.
        Hide
        Simon Willnauer added a comment -

        The other two methods (findForceMerges, findForcedDeletesMerges) are only triggered when the app explicitly asked IndexWriter to do so.

        I am not sure if we should really do that. I'd rather make those two methods protected and make it a impl detail of merge policy. I think the specialized methods are a poor man's approach to the MergeContext and the api is rather clumsy along those lines. I'd be happy to not break bw. compat but only add a more flexible API that is the authoritative source / single entry point for the IndexWriter. If you think this through finfForcedDeletesMerges and findForcedMerges are really and impl detail of the current IndexWriter and if we would modularize it would become even more obvious.

        Show
        Simon Willnauer added a comment - The other two methods (findForceMerges, findForcedDeletesMerges) are only triggered when the app explicitly asked IndexWriter to do so. I am not sure if we should really do that. I'd rather make those two methods protected and make it a impl detail of merge policy. I think the specialized methods are a poor man's approach to the MergeContext and the api is rather clumsy along those lines. I'd be happy to not break bw. compat but only add a more flexible API that is the authoritative source / single entry point for the IndexWriter. If you think this through finfForcedDeletesMerges and findForcedMerges are really and impl detail of the current IndexWriter and if we would modularize it would become even more obvious.
        Hide
        Michael McCandless added a comment -

        I think forced merges or forcing reclaiming of deletions, both invoked
        by explicit app request, are very different use cases than the
        "natural merging" Lucene does during indexing (not directly invoked by
        the app, but as a side effect of other API calls).

        So I think it makes sense that the MP has separate methods to handle
        these very different use cases.

        I don't thing we should use single param / single method XXXContext
        approach to "bypass" back compat. We already tried this with
        ScorerContext but backed it out because of the loss of type
        safety... for expert APIs like this one I think it's actually good to
        require apps to revisit their impls on upgrading, if we've added
        parameters: it gives them a chance to improve their impls. Plus this
        API is already marked @experimental...

        Also, single method taking a single XXXContext obj means that method
        will have to have a switch or bunch of if statements to handle what
        are in fact very different use cases, which is rather awkward.

        Still, separately I would love to make forceMerge/Deletes un-public so
        you have to work harder to invoke them (eg maybe you invoke the merge
        policy directly and then call IW.maybeMerge ... or something). We can
        do that separately...

        Show
        Michael McCandless added a comment - I think forced merges or forcing reclaiming of deletions, both invoked by explicit app request, are very different use cases than the "natural merging" Lucene does during indexing (not directly invoked by the app, but as a side effect of other API calls). So I think it makes sense that the MP has separate methods to handle these very different use cases. I don't thing we should use single param / single method XXXContext approach to "bypass" back compat. We already tried this with ScorerContext but backed it out because of the loss of type safety... for expert APIs like this one I think it's actually good to require apps to revisit their impls on upgrading, if we've added parameters: it gives them a chance to improve their impls. Plus this API is already marked @experimental... Also, single method taking a single XXXContext obj means that method will have to have a switch or bunch of if statements to handle what are in fact very different use cases, which is rather awkward. Still, separately I would love to make forceMerge/Deletes un-public so you have to work harder to invoke them (eg maybe you invoke the merge policy directly and then call IW.maybeMerge ... or something). We can do that separately...
        Hide
        Simon Willnauer added a comment -

        fair enough mike. I think as long as we don't have a modular IW this is a fair argument to bind it to the IW impl. This patch contains the least intrusive change adding a MergeTrigger enum to findMerges. I think its ready unless somebody find better names or something.

        Show
        Simon Willnauer added a comment - fair enough mike. I think as long as we don't have a modular IW this is a fair argument to bind it to the IW impl. This patch contains the least intrusive change adding a MergeTrigger enum to findMerges. I think its ready unless somebody find better names or something.
        Hide
        Michael McCandless added a comment -

        Thanks Simon, looks good!

        Show
        Michael McCandless added a comment - Thanks Simon, looks good!
        Hide
        Simon Willnauer added a comment -

        I will go ahead and commit this. I will let it bake in a day or two before I backport. I think we should not deprecate but break BWCompat here this is experimental and super expert.

        Show
        Simon Willnauer added a comment - I will go ahead and commit this. I will let it bake in a day or two before I backport. I think we should not deprecate but break BWCompat here this is experimental and super expert.
        Hide
        Simon Willnauer added a comment -

        Committed to trunk in revision 1399712.

        Show
        Simon Willnauer added a comment - Committed to trunk in revision 1399712.
        Hide
        Simon Willnauer added a comment -

        backported to 4.x & Committed revision 1400041.

        Show
        Simon Willnauer added a comment - backported to 4.x & Committed revision 1400041.
        Hide
        Commit Tag Bot added a comment -

        [branch_4x commit] Simon Willnauer
        http://svn.apache.org/viewvc?view=revision&revision=1400041

        LUCENE-4472: provide propper context information to MergePolicy which events triggered a findMerges

        Show
        Commit Tag Bot added a comment - [branch_4x commit] Simon Willnauer http://svn.apache.org/viewvc?view=revision&revision=1400041 LUCENE-4472 : provide propper context information to MergePolicy which events triggered a findMerges

          People

          • Assignee:
            Simon Willnauer
            Reporter:
            Simon Willnauer
          • Votes:
            1 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development