Lucene - Core
  1. Lucene - Core
  2. LUCENE-1584

Callback for intercepting merging segments in IndexWriter

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 2.4.1
    • Fix Version/s: 2.9
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      For things like merging field caches or bitsets, it's useful to
      know which segments were merged to create a new segment.

      1. LUCENE-1584.patch
        0.9 kB
        Jason Rutherglen
      2. LUCENE-1584.patch
        0.9 kB
        Jason Rutherglen
      3. LUCENE-1584.patch
        64 kB
        Jason Rutherglen

        Activity

        Hide
        Jason Rutherglen added a comment -

        Patch is combined with LUCENE-1516.

        IndexWriter has a setSegmentMergerCallback method that is called
        in IW.mergeMiddle where the readers being merged and the newly
        merged reader are passed to the SMC.mergedSegments method.

        I think we need to expose the SegmentReader segment name somehow
        either via IndexReader.getSegmentName or an interface on top of
        SegmentReader?

        Show
        Jason Rutherglen added a comment - Patch is combined with LUCENE-1516 . IndexWriter has a setSegmentMergerCallback method that is called in IW.mergeMiddle where the readers being merged and the newly merged reader are passed to the SMC.mergedSegments method. I think we need to expose the SegmentReader segment name somehow either via IndexReader.getSegmentName or an interface on top of SegmentReader?
        Hide
        Michael McCandless added a comment -

        I think this can be achieved, today, by making your own MergeScheduler wrapper, or by subclassing ConcurrentMergeScheduler and eg overriding the doMerge method? If so, I'd prefer not to add a callback to IW.

        Show
        Michael McCandless added a comment - I think this can be achieved, today, by making your own MergeScheduler wrapper, or by subclassing ConcurrentMergeScheduler and eg overriding the doMerge method? If so, I'd prefer not to add a callback to IW.
        Hide
        Jason Rutherglen added a comment -

        I would like to move away from our current position of somewhat
        closed APIs that require user classes be a part of the Lucene
        packages.

        It's always best to reuse existing APIs, however we've migrated
        to OSGi which means anytime we need to place new classes in
        Lucene packages, we need to rollout specific JARs (I think,
        perhaps it's more complex) for the few classes outside of our
        main package classes. This makes deployment of search
        applications a bit more difficult and time consuming.

        A related thread regarding MergePolicy is at:
        http://markmail.org/thread/h5bxjflpcyejrcqg

        Show
        Jason Rutherglen added a comment - I would like to move away from our current position of somewhat closed APIs that require user classes be a part of the Lucene packages. It's always best to reuse existing APIs, however we've migrated to OSGi which means anytime we need to place new classes in Lucene packages, we need to rollout specific JARs (I think, perhaps it's more complex) for the few classes outside of our main package classes. This makes deployment of search applications a bit more difficult and time consuming. A related thread regarding MergePolicy is at: http://markmail.org/thread/h5bxjflpcyejrcqg
        Hide
        Michael McCandless added a comment -

        I'd like to step back and understand the wider use case / context that's driving this need (to know precisely when segments got merged). EG if we fix Lucene's field cache, and Lucene's near real-time search manages CSF's efficiently in memory, does that address the use case behind this?

        It's possible that we should simply make SegmentInfo(s) public, so that MergePolicy/Scheduler can be fully created external to Lucene, and track all specifics of why/when merges are happening. But those APIs have a high surface area, and we do make changes over time.

        Show
        Michael McCandless added a comment - I'd like to step back and understand the wider use case / context that's driving this need (to know precisely when segments got merged). EG if we fix Lucene's field cache, and Lucene's near real-time search manages CSF's efficiently in memory, does that address the use case behind this? It's possible that we should simply make SegmentInfo(s) public, so that MergePolicy/Scheduler can be fully created external to Lucene, and track all specifics of why/when merges are happening. But those APIs have a high surface area, and we do make changes over time.
        Hide
        Jason Rutherglen added a comment -

        I think it's good to take a step back, "if we fix Lucene's field
        cache, and Lucene's near real-time search manages CSF's
        efficiently in memory" fixes the use case. Relying on CSF coming
        in probably won't help this the case if it doesn't make it into
        the 2.9 release. I like the callback method because it does not
        rely on passing segment infos around and instead uses the
        already public IndexReader classes.

        Show
        Jason Rutherglen added a comment - I think it's good to take a step back, "if we fix Lucene's field cache, and Lucene's near real-time search manages CSF's efficiently in memory" fixes the use case. Relying on CSF coming in probably won't help this the case if it doesn't make it into the 2.9 release. I like the callback method because it does not rely on passing segment infos around and instead uses the already public IndexReader classes.
        Hide
        Michael McCandless added a comment -

        Jason once LUCENE-1516 is in, can you repost this patch? It's hard to see what's new, here.

        Show
        Michael McCandless added a comment - Jason once LUCENE-1516 is in, can you repost this patch? It's hard to see what's new, here.
        Hide
        Earwin Burrfoot added a comment -

        .bq I'd like to step back and understand the wider use case / context that's driving this need (to know precisely when segments got merged)
        This is required in one form or another for any kinds of segment-aware caches.
        We're currently using our own field cache (I doubt we'll ever switch back to lucene's native, fixed one or not) and filter cache. Both caches are warmed up on reopen, asynchronously from search requests and both would warm up considerably faster if we have data on how segments have changed.

        Show
        Earwin Burrfoot added a comment - .bq I'd like to step back and understand the wider use case / context that's driving this need (to know precisely when segments got merged) This is required in one form or another for any kinds of segment-aware caches. We're currently using our own field cache (I doubt we'll ever switch back to lucene's native, fixed one or not) and filter cache. Both caches are warmed up on reopen, asynchronously from search requests and both would warm up considerably faster if we have data on how segments have changed.
        Hide
        Michael McCandless added a comment -

        This is required in one form or another for any kinds of segment-aware caches.

        The problem is you need more information than simply "these segments got merged" to actually do something interesting with your caches.

        EG you'd need to know which deleted docs got zapped, right?

        We're currently using our own field cache (I doubt we'll ever switch back to lucene's native, fixed one or not)

        Can you explain what's missing in Lucene's FieldCache? (Since we are going to build a new one for LUCENE-831 it'd be great to address all known limitations...).

        Show
        Michael McCandless added a comment - This is required in one form or another for any kinds of segment-aware caches. The problem is you need more information than simply "these segments got merged" to actually do something interesting with your caches. EG you'd need to know which deleted docs got zapped, right? We're currently using our own field cache (I doubt we'll ever switch back to lucene's native, fixed one or not) Can you explain what's missing in Lucene's FieldCache? (Since we are going to build a new one for LUCENE-831 it'd be great to address all known limitations...).
        Hide
        Earwin Burrfoot added a comment -

        The problem is you need more information than simply "these segments got merged" to actually do something interesting with your caches.

        Okay, now I've thought a bit. What we need is a notification on which segments remained, which are new and which got toasted, plus docid ranges for them. Their ancestry is irrelevant, because you're right, to exploit it we also need deleted docs, and then replicate some of the merging logic and it gets really messy from here. Dropping parts of the cache related to dead segments, rebasing survivors and doing a fair-and-square load/uninversion/whatever for new ones is enough.

        Can you explain what's missing in Lucene's FieldCache?

        It's not that easy to say. Our version was initially used only for sorting, but without concurrency issues and with async warmup. But then we used it to load docs (way better than storing fields and using IndexReader.document), tied up with our strongly-typed-fields code, added handling for multi-valued fields, used it for faceted searches.
        So now it is essentially just something different from Lucene field cache.

        Show
        Earwin Burrfoot added a comment - The problem is you need more information than simply "these segments got merged" to actually do something interesting with your caches. Okay, now I've thought a bit. What we need is a notification on which segments remained, which are new and which got toasted, plus docid ranges for them. Their ancestry is irrelevant, because you're right, to exploit it we also need deleted docs, and then replicate some of the merging logic and it gets really messy from here. Dropping parts of the cache related to dead segments, rebasing survivors and doing a fair-and-square load/uninversion/whatever for new ones is enough. Can you explain what's missing in Lucene's FieldCache? It's not that easy to say. Our version was initially used only for sorting, but without concurrency issues and with async warmup. But then we used it to load docs (way better than storing fields and using IndexReader.document), tied up with our strongly-typed-fields code, added handling for multi-valued fields, used it for faceted searches. So now it is essentially just something different from Lucene field cache.
        Hide
        Michael McCandless added a comment -

        Moving out.

        Show
        Michael McCandless added a comment - Moving out.
        Hide
        Jason Rutherglen added a comment -

        Can we put this one in 2.9? It seems like a fairly straightfoward change. Or make it a protected method?

        Show
        Jason Rutherglen added a comment - Can we put this one in 2.9? It seems like a fairly straightfoward change. Or make it a protected method?
        Hide
        Michael McCandless added a comment -

        Well, I'm worried about how much [now package private] info you're gonna need about the merging segments – it doesn't seem like a simple change to me.

        Show
        Michael McCandless added a comment - Well, I'm worried about how much [now package private] info you're gonna need about the merging segments – it doesn't seem like a simple change to me.
        Hide
        Jason Rutherglen added a comment -

        We can make it protected that way it's expert level and a user
        needs to inherit from IndexWriter to use it. I don't think today
        it's possible to simply inherit from IW to get the merge
        information because IW.merge is final, and there needs to be a
        way to know the merge was successful.

        Show
        Jason Rutherglen added a comment - We can make it protected that way it's expert level and a user needs to inherit from IndexWriter to use it. I don't think today it's possible to simply inherit from IW to get the merge information because IW.merge is final, and there needs to be a way to know the merge was successful.
        Hide
        Michael McCandless added a comment -

        But you'll still need access to package-private stuff, eg MergePolicy.OneMerge?

        Show
        Michael McCandless added a comment - But you'll still need access to package-private stuff, eg MergePolicy.OneMerge?
        Hide
        Jason Rutherglen added a comment -

        Added a protected IW.mergeSuccess method. We can't really do much more right now until we wrap SegmentInfo with a public class or make it public.

        Show
        Jason Rutherglen added a comment - Added a protected IW.mergeSuccess method. We can't really do much more right now until we wrap SegmentInfo with a public class or make it public.
        Hide
        Michael McCandless added a comment -

        Shouldn't we make that method package private? (Since it's arg (MergePolicy.OneMerge.*) is package private).

        Show
        Michael McCandless added a comment - Shouldn't we make that method package private? (Since it's arg (MergePolicy.OneMerge.*) is package private).
        Hide
        Jason Rutherglen added a comment -

        Yep!

        Show
        Jason Rutherglen added a comment - Yep!
        Hide
        Michael McCandless added a comment -

        OK I committed it! Thanks Jason!

        Show
        Michael McCandless added a comment - OK I committed it! Thanks Jason!

          People

          • Assignee:
            Unassigned
            Reporter:
            Jason Rutherglen
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Time Tracking

              Estimated:
              Original Estimate - 96h
              96h
              Remaining:
              Remaining Estimate - 96h
              96h
              Logged:
              Time Spent - Not Specified
              Not Specified

                Development