Lucene - Core
  1. Lucene - Core
  2. LUCENE-2447

Add support for subsets of searchables inside a MultiSearcher/ParallelMultiSearcher instance's methods at runtime

    Details

    • Lucene Fields:
      New, Patch Available

      Description

      Here's the situation: We have a site with a fair few amount of indexes that we're using MultiSearcher/ParallelMultiSearcher for, but the users can select an arbitrary permutation of indexes to search. For example (contrived, but illustratory): the site has indexes numbered 1 - 10; user A wants to search in all 10; user B wants to search indexes 1, 2 and 3, user C wants to search even-numbered indexes. From Lucene 3.0.1, the only way to do this is to continually instantiate a new MultiSearcher based on every permutation of indexes that a user wants, which is not ideal at all.

      What I've done is add a new parameter to all methods in MultiSearcher that use the searchables array (docFreq, search, rewrite and createDocFrequencyMap), a Set<Searchable> which is checked for isEmpty() and contains() for every iteration over the searchables[]. The actual logic has been moved into these methods and the old methods have become overloads that pass a Collections.emptySet() into those methods, so I do not expect there to be a very noticeable performance impact as a result of this modification, if it's measurable at all.

      I didn't modify the test for MultiSearcher very much, just enough to illustrate the that subsetting of the search results works, since no other logic has changed. If I need to do more for the testing, let me know and I'll do it.

      I've attached the patches for MultiSearcher.java, ParallelMultiSearcher.java and TestMultiSearcher.java.

      1. LUCENE-2447.patch
        16 kB
        Edward Drapkin
      2. LUCENE-2447-predicate.patch
        20 kB
        Edward Drapkin

        Activity

        Hide
        Edward Drapkin added a comment -

        Patch for proposed change.

        Show
        Edward Drapkin added a comment - Patch for proposed change.
        Hide
        Earwin Burrfoot added a comment -

        Is creating MultiSearcher/MultiReader per-request too heavy? I thought these were pretty lightweight.
        You can also stuff everything into the same index and use cached filters.

        Show
        Earwin Burrfoot added a comment - Is creating MultiSearcher/MultiReader per-request too heavy? I thought these were pretty lightweight. You can also stuff everything into the same index and use cached filters.
        Hide
        Edward Drapkin added a comment -

        It's not entirely the fact that creating a MultiSearcher per request is too heavy. if you'll look at 2440, I also modified ParallelMultiSearcher to support a fixed thread pool; what I'm worried about is, even with a fixed thread pool of something small like 4 threads, the concurrent request count could spiral the amount of threads that the JVM has to deal with out of control. If I can use the same ParallelMultiSearcher across requests, with a fixed thread pool of something sane like 16 or 24 threads, then I can be reasonably sure that this particular class isn't going to spiral thread counts out of control.

        As far as stuffing everything into the same index, we've looked into that and determined that it isn't a real possibility because the size of the indexes - there's quite a few ranging from a few MB to a few GB of data - would make the merge process relatively expensive and coupled with the fact that the indexes themselves are built and maintained separately, we'd be needing to run the merging process too frequently for it to be feasible.

        Show
        Edward Drapkin added a comment - It's not entirely the fact that creating a MultiSearcher per request is too heavy. if you'll look at 2440, I also modified ParallelMultiSearcher to support a fixed thread pool; what I'm worried about is, even with a fixed thread pool of something small like 4 threads, the concurrent request count could spiral the amount of threads that the JVM has to deal with out of control. If I can use the same ParallelMultiSearcher across requests, with a fixed thread pool of something sane like 16 or 24 threads, then I can be reasonably sure that this particular class isn't going to spiral thread counts out of control. As far as stuffing everything into the same index, we've looked into that and determined that it isn't a real possibility because the size of the indexes - there's quite a few ranging from a few MB to a few GB of data - would make the merge process relatively expensive and coupled with the fact that the indexes themselves are built and maintained separately, we'd be needing to run the merging process too frequently for it to be feasible.
        Hide
        Earwin Burrfoot added a comment -

        I think there was a recent issue about running ParallelMultiSearcher over Java5 Executor. If you use Executor, it takes care of pooling threads for you, and you create ParallelMultiSearcher per-request without worries.

        Show
        Earwin Burrfoot added a comment - I think there was a recent issue about running ParallelMultiSearcher over Java5 Executor. If you use Executor, it takes care of pooling threads for you, and you create ParallelMultiSearcher per-request without worries.
        Hide
        Edward Drapkin added a comment -

        A quick search (in JIRA and markmail) yielded no results. Can you link me to the issue? While that may (yet to be seen) be a solution to my problem, is that a reason to not accept the proposed patches (2440 and this one, 2447)?

        Show
        Edward Drapkin added a comment - A quick search (in JIRA and markmail) yielded no results. Can you link me to the issue? While that may (yet to be seen) be a solution to my problem, is that a reason to not accept the proposed patches (2440 and this one, 2447)?
        Hide
        Earwin Burrfoot added a comment -

        I think it was LUCENE-2041?
        But yep, as you noted in LUCENE-2440 (missed this issue), while they moved over to Executor, they haven't provided a way to supply an external one. So I fully support LUCENE-2440.

        As for the current issue, I'm always butting in when people try to add something that can already be done with existing APIs. Less APIs = better design, easier to support.

        Show
        Earwin Burrfoot added a comment - I think it was LUCENE-2041 ? But yep, as you noted in LUCENE-2440 (missed this issue), while they moved over to Executor, they haven't provided a way to supply an external one. So I fully support LUCENE-2440 . As for the current issue, I'm always butting in when people try to add something that can already be done with existing APIs. Less APIs = better design, easier to support.
        Hide
        Edward Drapkin added a comment -

        Ah, cool, regarding LUCENE-2440

        You mention that it's possible to accomplish what this accomplishes with the current API, via instantiating a MultiSearcher per request, which is possible, but I think this way would be much simpler and while increasing the complexity of the API, it does so in a consistent way that's easy to understand and use (and doesn't break BC); if the difference between the proposed change of the API and the current API is too different, maybe splitting the API change into a new class would be the solution (i.e. two classes: MultiSearcher and SplittableMultiSearcher). Either way, under the current API, calls look like this:

        public void doSearch()

        { Set<Searchable> searchables = this.getSearchablesFromRequestParams(); //faux method MultiSearcher mSearcher = new MultiSearcher(searchables); mSearcher.search(someQuery, 1000); //... }

        Compare with, under my proposed API:

        public void doSearch()

        { this.mSearcher.search(this.getSearchablesFromRequestParams(), someQuery, 1000); //... }

        Keeping in mind that I'm not sure this is an entirely esoteric/niche requirement (surely I can't be the only one who has this issue) and this doesn't break any existing code or significantly increase its execution time, the end result is much cleaner code (from userland) that's also less resource intensive (however cheap - on my completely idle Q9300 it takes about 3us (20us for ParallelMultiSearcher) to instantiate* - it may be to instantiate MultiSearcher, it's still more expensive that keeping one instance around, especially in a heavily trafficked environment), especially regarding memory usage and garbage collection times.

        • I created 100 indexes, each with 10,000 documents (each of which had 100 fields named name1, name2, etc. with 128 bytes of random string) and then tested that - each index was ~60MB. I can paste the code I used if you would like.
        Show
        Edward Drapkin added a comment - Ah, cool, regarding LUCENE-2440 You mention that it's possible to accomplish what this accomplishes with the current API, via instantiating a MultiSearcher per request, which is possible, but I think this way would be much simpler and while increasing the complexity of the API, it does so in a consistent way that's easy to understand and use (and doesn't break BC); if the difference between the proposed change of the API and the current API is too different, maybe splitting the API change into a new class would be the solution (i.e. two classes: MultiSearcher and SplittableMultiSearcher). Either way, under the current API, calls look like this: public void doSearch() { Set<Searchable> searchables = this.getSearchablesFromRequestParams(); //faux method MultiSearcher mSearcher = new MultiSearcher(searchables); mSearcher.search(someQuery, 1000); //... } Compare with, under my proposed API: public void doSearch() { this.mSearcher.search(this.getSearchablesFromRequestParams(), someQuery, 1000); //... } Keeping in mind that I'm not sure this is an entirely esoteric/niche requirement (surely I can't be the only one who has this issue) and this doesn't break any existing code or significantly increase its execution time, the end result is much cleaner code (from userland) that's also less resource intensive (however cheap - on my completely idle Q9300 it takes about 3us (20us for ParallelMultiSearcher) to instantiate* - it may be to instantiate MultiSearcher, it's still more expensive that keeping one instance around, especially in a heavily trafficked environment), especially regarding memory usage and garbage collection times. I created 100 indexes, each with 10,000 documents (each of which had 100 fields named name1, name2, etc. with 128 bytes of random string) and then tested that - each index was ~60MB. I can paste the code I used if you would like.
        Hide
        Edward Drapkin added a comment - - edited

        I talked to Uri in IRC, and he suggested making a Predicate<T> utility interface and using that instead of forcing people to use Set<Searchable>. This seemed like a much better idea, so I went ahead and implemented that and this is the patch that reflects that.

        (Note: patch name is LUCENE-2447-predicate.patch).

        Show
        Edward Drapkin added a comment - - edited I talked to Uri in IRC, and he suggested making a Predicate<T> utility interface and using that instead of forcing people to use Set<Searchable>. This seemed like a much better idea, so I went ahead and implemented that and this is the patch that reflects that. (Note: patch name is LUCENE-2447 -predicate.patch).
        Hide
        Yonik Seeley added a comment - - edited

        I think I'm leaning toward Earwin's thinking... MultiSearcher seems to be mostly behavior and almost no state. Push a little more in that direction and you have a lot of flexibility. Perhaps allow the CachedDfSource to be specified, don't copy the Searchable[] array, and all of a sudden it's as light weight as a String.

        edit: actually, I just realized that CachedDfSource isn't even maintained state... it already is really light weight. Adding the ability to specify an Executor gets you 99% there.

        Show
        Yonik Seeley added a comment - - edited I think I'm leaning toward Earwin's thinking... MultiSearcher seems to be mostly behavior and almost no state. Push a little more in that direction and you have a lot of flexibility. Perhaps allow the CachedDfSource to be specified, don't copy the Searchable[] array, and all of a sudden it's as light weight as a String. edit: actually, I just realized that CachedDfSource isn't even maintained state... it already is really light weight. Adding the ability to specify an Executor gets you 99% there.
        Hide
        Edward Drapkin added a comment -

        I think that's where I have a misunderstanding and disagreement with the function of the multisearcher. Because it's mostly stateless, I can't help but thing that it's designed to be instantiated on demand, every time that it needs to be used. While with the "traditional" MultiSearcher, this doesn't seem to be much of a problem (3us is no time at all), it gets progressively heavier and more cumbersome to keep creating Executors every time with ParallelMultiSearcher. OTOH, given the ability to specify an Executor at object creation "solves" this issue, making ParallelMultiSearcher almost as light as MultiSearcher itself.

        Not to digress into architectural theory, but I have a disagreement with the way that these classes are designed to be used. In order to make MultiSearcher even functional, the application needs to maintain its list of Searchables outside MultiSearcher itself; given the ability to specify an Executor at construction time (for ParallelMultiSearcher), you're now maintaining an array of Searchables (because we it's expensive and wasteful to create those every time they're needed) and a thread pool management object in the calling object. This reeks of state leakage to me, where the state of [Parallel]MultiSearcher is being maintained by an external, calling object and is being re-created every time it's needed, violating encapsulation conventions and practice. Furthermore, (with the caveat that I'm relatively new to lucene) MultiSearcher is itself a Searchable and this behavior is inconsistent with the way that I've seen other Searchables handled. I'm not sure how much sense it makes to trade maintaining a reference to one Searchable (that encapsulates several) to maintaining a list/array/collection of references to other Searchables... especially when you look into multi-threaded apps and non-final collections of Searchables that may start to modify the state (however transient) of your MultiSearcher outside of the class itself.

        I'm not sure how clear it is from my previous comments and the code itself, but the idea behind the patch was that the user (in this case, me) wouldn't be maintaining anything for the state of the [Parallel]MultiSearcher except for the instance of the class itself. Right now, it's possible to do this (not keep any permanent references to anything that's fed into the MultiSearcher constructor) but only if you intend to always search all of your Searchables. The patch takes a few steps in the direction of making keeping references to the Searchables outside of the MultiSearcher unnecessary (although, come to think about it, if this is the direction this class heads in, getSearchables() needs to return a [deep] clone of multiSearcher.searchables rather than the array itself), but without any method defined in Searchable that allows you to identify a Searchable without a reference, I'm not sure that there is much more that can be done.

        Show
        Edward Drapkin added a comment - I think that's where I have a misunderstanding and disagreement with the function of the multisearcher. Because it's mostly stateless, I can't help but thing that it's designed to be instantiated on demand, every time that it needs to be used. While with the "traditional" MultiSearcher, this doesn't seem to be much of a problem (3us is no time at all), it gets progressively heavier and more cumbersome to keep creating Executors every time with ParallelMultiSearcher. OTOH, given the ability to specify an Executor at object creation "solves" this issue, making ParallelMultiSearcher almost as light as MultiSearcher itself. Not to digress into architectural theory, but I have a disagreement with the way that these classes are designed to be used. In order to make MultiSearcher even functional, the application needs to maintain its list of Searchables outside MultiSearcher itself; given the ability to specify an Executor at construction time (for ParallelMultiSearcher), you're now maintaining an array of Searchables (because we it's expensive and wasteful to create those every time they're needed) and a thread pool management object in the calling object. This reeks of state leakage to me, where the state of [Parallel] MultiSearcher is being maintained by an external, calling object and is being re-created every time it's needed, violating encapsulation conventions and practice. Furthermore, (with the caveat that I'm relatively new to lucene) MultiSearcher is itself a Searchable and this behavior is inconsistent with the way that I've seen other Searchables handled. I'm not sure how much sense it makes to trade maintaining a reference to one Searchable (that encapsulates several) to maintaining a list/array/collection of references to other Searchables... especially when you look into multi-threaded apps and non-final collections of Searchables that may start to modify the state (however transient) of your MultiSearcher outside of the class itself. I'm not sure how clear it is from my previous comments and the code itself, but the idea behind the patch was that the user (in this case, me) wouldn't be maintaining anything for the state of the [Parallel] MultiSearcher except for the instance of the class itself. Right now, it's possible to do this (not keep any permanent references to anything that's fed into the MultiSearcher constructor) but only if you intend to always search all of your Searchables. The patch takes a few steps in the direction of making keeping references to the Searchables outside of the MultiSearcher unnecessary (although, come to think about it, if this is the direction this class heads in, getSearchables() needs to return a [deep] clone of multiSearcher.searchables rather than the array itself), but without any method defined in Searchable that allows you to identify a Searchable without a reference, I'm not sure that there is much more that can be done.

          People

          • Assignee:
            Unassigned
            Reporter:
            Edward Drapkin
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:

              Development