Lucene - Core
LUCENE-6066

Collector that manages diversity in search results

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.1
    • Component/s: core/query/scoring
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      This issue provides a new collector for situations where a client doesn't want more than N matches for any given key (e.g. no more than 5 products from any one retailer in a marketplace). In these circumstances a document that was previously considered competitive during collection may have to be removed from the final PQ and replaced with another doc (e.g. a retailer who already has 5 matches in the PQ receives a 6th match that is better than one of the existing ones). This requires a new remove method on the existing PriorityQueue class.
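
      A minimal sketch of what such a remove could look like, assuming Lucene's 1-based heap array and hypothetical upHeap(int)/downHeap(int) variants that restore ordering from a given slot (upHeap(i) is assumed to return true if the element moved); the actual patch evolved through the comments below:

         // Hedged sketch only, not the committed patch. heap[] is the 1-based
         // backing array of Lucene's PriorityQueue; lessThan() defines order.
         public boolean remove(T element) {
           for (int i = 1; i <= size; i++) {
             if (heap[i] == element) {      // linear scan: O(size)
               heap[i] = heap[size];        // fill the hole with the last element
               heap[size] = null;
               size--;
               if (i <= size) {
                 // The moved element may violate the heap property in either
                 // direction, so try moving it up first, then down.
                 if (!upHeap(i)) {
                   downHeap(i);
                 }
               }
               return true;
             }
           }
           return false;
         }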

      Attachments

      1. LUCENE-6066.patch
        35 kB
        Adrien Grand
      2. LUCENE-PQRemoveV8.patch
        35 kB
        Mark Harwood
      3. LUCENE-PQRemoveV9.patch
        35 kB
        Mark Harwood

        Activity

        Mark Harwood added a comment -

        New remove(element) method in PriorityQueue and related test

        Adrien Grand added a comment -

        Ensuring diversity of the search results is a requirement I have often heard, so I think it would be nice to see what we can do. However I'm not sure that adding a linear-time removal method to PriorityQueue is the right thing to do; maybe we need a different data structure that would make removal less costly?

        Mark Harwood added a comment -

        If the PQ set the current array position as a property of each element every time it moved them around, I could pass the array index to remove() rather than an object that had to be scanned for
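
        A hedged sketch of that idea (illustrative names only; none of this exists in Lucene's PriorityQueue): every write into the backing array also records the slot on the element, so removal can jump straight to it and sift in O(log n):

           // Hypothetical position-aware element, sketching the idea above.
           interface PositionAware {
             void setHeapSlot(int slot);
           }

           // Inside an imagined forked PriorityQueue<T extends PositionAware>,
           // with a T[] heap: every placement records the element's slot.
           private void place(T element, int slot) {
             heap[slot] = element;
             element.setHeapSlot(slot);
           }

           // Removal no longer scans: it starts at the recorded slot. As in
           // the earlier sketch, upHeap(int) is assumed to return true if the
           // element moved up.
           public void removeAt(int slot) {
             place(heap[size], slot);              // move last element into the hole
             heap[size--] = null;
             if (slot <= size && !upHeap(slot)) {  // restore order either way
               downHeap(slot);
             }
           }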

        Michael McCandless added a comment -

        I agree it would be nice to make diversity work well with Lucene. Isn't it essentially the same thing as grouping, which, in a second pass, holds the top N hits for the top M groups found in the first pass?

        Mark Harwood added a comment -

        I guess it's different from grouping in that:
        1) it only involves one pass over the data
        2) the client doesn't have to guess up-front the number of groups he is going to need
        3) we don't get any "filler" docs in each group's results, i.e. a bunch of irrelevant docs for an author with one good hit.

        Mark Harwood added a comment -

        An analogy might be making a compilation album of 1967's top hit records:

        1) A vanilla Lucene query's results might look like a "Best of the Beatles" album - no diversity
        2) A grouping query would produce "The 10 top-selling artists of 1967 - some killer and quite a lot of filler"
        3) A "diversified" query would be the top 20 hit records of that year - with a max of 3 Beatles hits to maintain diversity

        Michael McCandless added a comment -

        Thanks Mark, that makes sense, and that's a great example to help understand it. The fact that diversity doesn't need to keep the "filler" is what allows it to be a single pass.

        If we have this linear cost remove, what's the worst case complexity? When all N hits have the same "key" but are visited from worst to best score? Is it then O(N * M), where M is number of top hits I want?
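
        (For intuition: each of the N collected hits can trigger at most one removal, and each removal linearly scans a queue of at most M retained hits, so the worst case is indeed on the order of N * M comparisons, on top of the usual O(log M) sift per insertion.)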

        If the PQ set the current array position as a property of each element every time it moved them around, I could pass the array index to remove() rather than an object that had to be scanned for

        This seems promising, maybe as a separate dedicated (forked) PQ impl? But how will you track the min element for each key in the PQ (to know which element to remove, when a more competitive hit with that key arrives)?

        Mark Harwood added a comment -

        But how will you track the min element for each key in the PQ (to know which element to remove, when a more competitive hit with that key arrives)?

        I was thinking of this as a foundation (pseudocode):

        DiversifyingPriorityQueue.java

           abstract class KeyedElement {
               int pqPos;
               abstract Object getKey();
           }

           class DiversifyingPriorityQueue<T extends KeyedElement> extends PriorityQueue<T> {
               FastRemovablePriorityQueue<T> mainPQ;
               Map<Object, PriorityQueue> perKeyQueues;
           }
        

        You can probably guess at the logic but it is based around:

        • making sure each key has a max of n entries using an entry in perKeyQueues
        • evictions from the mainPQ will require removal from the related perKeyQueue
        • emptied perKeyQueues can be recycled for use with other keys
        • evictions from the perKeyQueue will require removal from the mainPQ

        This seems promising, maybe as a separate dedicated (forked) PQ impl?

        Yes, introducing a linear-cost remove by marking elements with a position is an added cost that not all PQs will require, so forking seems necessary. In this case a common abstraction over these different PQs would be useful for the places where results are consumed, e.g. TopDocsCollector
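
        A hedged sketch of how that two-queue bookkeeping might look on the insert path (illustrative only; maxHitsPerKey, newOrRecycledQueue() and the exact method shapes are assumptions, not code from any attached patch):

           // Each key gets a small queue capped at maxHitsPerKey; mainPQ holds
           // the globally competitive hits and supports the new remove().
           public T insertWithOverflow(T element) {
             PriorityQueue<T> keyQueue = perKeyQueues.computeIfAbsent(
                 element.getKey(), k -> newOrRecycledQueue());
             if (keyQueue.size() >= maxHitsPerKey) {
               T keyWorst = keyQueue.top();
               if (lessThan(element, keyWorst)) {
                 return element;               // not competitive within its key
               }
               keyQueue.pop();                 // per-key eviction...
               mainPQ.remove(keyWorst);        // ...must also leave the main PQ
             }
             keyQueue.add(element);
             T evicted = mainPQ.insertWithOverflow(element);
             if (evicted != null) {            // global eviction must also leave
               PriorityQueue<T> q =            // its per-key queue
                   perKeyQueues.get(evicted.getKey());
               q.remove(evicted);
               if (q.size() == 0) {
                 perKeyQueues.remove(evicted.getKey());  // recycle empty queues
               }
             }
             return evicted;
           }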

        Stefan Pohl added a comment -

        I also needed this missing functionality and hence implemented it in the specialized heap used within the MinShouldMatchSumScorer (LUCENE-4571). I also went for the O(n) solution so as not to have the position-update overhead, but in my scoring use-case the removed item is often found near the beginning of the heap array.

        Regarding your patch, please note that it's not enough to sift down the inserted last element; you first have to try bubbling it up in order to deal with all possible heap states. A more randomized test case would generate plenty of examples where the heap property would otherwise be violated.
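
        A concrete illustration of the failure mode (min-heap, 1-based array):

           // Why downHeap alone is not enough after remove():
           //
           //          1
           //        /   \
           //      10     2
           //     /  \   /
           //   11    12 3         heap = [_, 1, 10, 2, 11, 12, 3]
           //
           // remove(11): the last element, 3, fills slot 4 -> [_, 1, 10, 2, 3, 12]
           // downHeap(4) is a no-op (slot 4 now has no children), yet 3 < its
           // parent 10, so the heap property is broken; upHeap(4) is needed first.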

        It would be nice if there would be only one implementation of heap functionality, so that they could centrally be tested thoroughly, but it still seems that specialized versions can outperform generic ones (e.g. LUCENE-6028)...

        Mark Harwood added a comment -

        Added missing upHeap call to remove method.
        Added extra randomized tests and a method to check the validity of PQ elements as mutations are made.
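
        One way such a validity check can work, as a hedged sketch (imagined as a method on the test's PriorityQueue subclass so it can use the protected lessThan(); getHeapArray() does exist, with slot 0 of the 1-based array unused):

           // Illustrative invariant check: no child may be "less than" its
           // parent under the queue's lessThan() ordering.
           void assertHeapInvariant() {
             Object[] heap = getHeapArray();   // 1-based; heap[0] unused
             for (int i = 2; i <= size(); i++) {
               @SuppressWarnings("unchecked") T child = (T) heap[i];
               @SuppressWarnings("unchecked") T parent = (T) heap[i / 2];
               // lessThan(child, parent) means the child belongs above its
               // parent, i.e. some mutation broke the heap property.
               assert !lessThan(child, parent);
             }
           }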

        Mark Harwood added a comment -

        Thanks for your comments, Stefan.

        I believe the remove method is implemented correctly now.

        it still seems that specialized versions can outperform generic ones

        Yes, the DiversifyingPriorityQueue I imagined, which would need access to a new remove method in the existing PriorityQueue, looks like it is better implemented as a fork of the existing PriorityQueue. I'll attach this fork here in a future update.
        Maybe, with these differing implementations, there is a need for a common interface that provides an abstraction for things like TopDocsCollector to add and pop results.

        Mark Harwood added a comment -

        Updated patch.
        Added DiversifiedTopDocsCollector and an associated test. This class represents the primary use case for wanting a new remove() method in PriorityQueue.
        The PriorityQueue keeps the original upHeap/downHeap methods unchanged, to guard against any performance regression, and adds a new specialised upHeap/downHeap pair that takes a position to support the new remove function.
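
        For orientation, a hedged usage sketch of the collector (the "bandId" field name and the surrounding wiring are illustrative assumptions):

           // Collect the top 20 hits with at most 3 per band, assuming band
           // keys were indexed as a NumericDocValuesField named "bandId".
           DiversifiedTopDocsCollector collector =
               new DiversifiedTopDocsCollector(20, 3) {
                 @Override
                 protected NumericDocValues getKeys(LeafReaderContext context) {
                   try {
                     return DocValues.getNumeric(context.reader(), "bandId");
                   } catch (IOException e) {
                     throw new RuntimeException(e);
                   }
                 }
               };
           searcher.search(query, collector);
           TopDocs diversifiedHits = collector.topDocs();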

        Michael McCandless added a comment -

        This looks great!

        Maybe advertise in the javadocs the runtime cost of the collector? (The linear remove method).

        In TestPriorityQueue, instead of int numDocsInPQ = (int) (1 + (random.nextFloat() * 100)); you could use TestUtil.nextInt(random(), 1, 100).

        Why couldn't you just pass your custom queue instead of null to super() in DiversifiedTopDocsCollector ctor? Scary to have to pass null to super ... something is wrong.

        Can you fix diversifying vs diversified naming (pick one)?

        Maybe add a random test, in addition to the fixed scores/MAX_HITS_PER_KEY? The random test could do the "slow and stupid but probably correct" diversified collection to get truth, and assert it equals the result using the new collector.
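
        One shape such a "slow and stupid but probably correct" truth computation could take, as a hedged sketch (docToKey and the method name are illustrative; assumes the usual java.util imports plus org.apache.lucene.search.ScoreDoc):

           // Greedy baseline: walk all hits best-first and keep a hit only
           // while its key still has capacity, until numHits are kept.
           static List<ScoreDoc> bruteForceDiversified(List<ScoreDoc> allHits,
               int numHits, int maxHitsPerKey, Map<Integer, Long> docToKey) {
             allHits.sort((a, b) -> Float.compare(b.score, a.score));
             Map<Long, Integer> perKeyCounts = new HashMap<>();
             List<ScoreDoc> kept = new ArrayList<>();
             for (ScoreDoc hit : allHits) {
               long key = docToKey.get(hit.doc);
               int seen = perKeyCounts.getOrDefault(key, 0);
               if (seen < maxHitsPerKey) {
                 perKeyCounts.put(key, seen + 1);
                 kept.add(hit);
                 if (kept.size() == numHits) {
                   break;
                 }
               }
             }
             return kept;
           }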

        The javadocs are nice and clear (diversify by "band"), but then the abstract method returns NumericDocValues, which is confusing: how does "beatles" become a number? Why not e.g. SortedDVs? Maybe the test case can show how the nice Beatles example would actually work?

        I saw a test about paging; how does/should paging work with such a collector?

        I think the changes to PQ are OK. Scary to add a linear-cost method, but it advertises this. If this is too scary to anyone we could always fully fork to a private (for this collector) impl.

        Michael McCandless added a comment -

        This can come later, but have you thought about how this collector could also handle sorting by something other than score? E.g. take a Sort, or delegate/wrap another collector or comparator or something...

        Mark Harwood added a comment -

        Thanks for the review, Mike. I'm working through changes.

        Why couldn't you just pass your custom queue instead of null to super() in DiversifiedTopDocsCollector ctor?

        Oops. That was a cut/paste error transferring code from Elasticsearch that relied on a forked PriorityQueue, which is obviously incompatible with the Lucene TopDocsCollector base class.

        the abstract method returns NumericDocValues, which is confusing: how does "beatles" become a number? Why not e.g. SortedDVs

        I originally had a getKey(docId) method that returned an object - anything which implements hashCode and equals. When I talked it through with Adrien he suggested NumericDocValues as a better abstraction, which could be backed by any system based on ordinals. We need to decide what this abstraction should be. One of the things I've been grappling with is whether the collector should support multi-keyed docs, e.g. a field containing hashes for near-duplicate detection to avoid too-similar texts. This would require extra code in the collector to determine if any one key had exceeded its limit (and ideally some memory safeguard for docs with too many keys).

        I saw a test about paging; how does/should paging work with such a collector?

        In regular collections, TopScoreDocCollector provides all of the smarts for in-order/out-of-order collection and for starting from the ScoreDoc at the bottom of the previous page. I expect I would have to reimplement all of its logic for a new DiversifiedTopScoreKeyedDocCollector because it makes some assumptions about using updateTop() that don't apply when we have a two-tier system for scoring (globally competitive and within-key competitive).
        My vague assumption was that any per-key constraints would have to apply across multiple pages, e.g. having had 5 Beatles hits on pages 1 and 2, you wouldn't expect to find any more the deeper you go into the results, because the "max 5 per key" limit had been exhausted. This logic would probably preclude the deep-paging optimisation where you pass just the ScoreDoc of the last entry on the previous page to minimise the size of the PQ created for subsequent pages.

        Adrien Grand added a comment -

        the abstract method returns NumericDocValues, which is confusing: how does "beatles" become a number? Why not e.g. SortedDVs?

        I think that ideally it should work with both numerics and strings, but then I'm afraid we would end up with a complicated FieldComparator-like API? One idea I had (that might be bad) is that if we use NumericDocValues as the way to get keys, then we could also make it work for strings by having a NumericDocValues instance that returns an OrdinalMap's global ordinals as keys. Another option would be to have something that produces an object key (with equals/hashcode) for every document, but for numerics it would probably mean lots of boxing, and for strings it would imply value lookups in every collect() call, so it's not very appealing?

        Michael McCandless added a comment -

        I think that ideally it should work with both numerics and strings, but then I'm afraid we would end up with a complicated FieldComparator-like API

        That's a great and scary point. We do NOT want another FieldComparator. And we don't need the generic Object...

        I agree: let's use NumericDV, but can we add comments / maybe examples in the test case showing how the "beatles" example can work w/ this API?

        Mark Harwood added a comment -

        Added a JUnit test showing use with String-based dedup keys using 2 lookup impls: slow but accurate global ords, and fast but potentially inaccurate hashing of BinaryDocValues
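
        The hashing variant could look roughly like this (a hedged sketch against the 5.x doc-values API; the "artist" field name is an assumption):

           // Fast but potentially inaccurate: hash collisions can merge two
           // distinct string keys, so some diversity may be lost.
           @Override
           protected NumericDocValues getKeys(LeafReaderContext context) {
             try {
               final BinaryDocValues vals =
                   DocValues.getBinary(context.reader(), "artist");
               return new NumericDocValues() {
                 @Override
                 public long get(int docID) {
                   // String.hashCode() of the term: an int, widened to long
                   return vals.get(docID).utf8ToString().hashCode();
                 }
               };
             } catch (IOException e) {
               throw new RuntimeException(e);
             }
           }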

        Mark Harwood added a comment -

        What feels awkward in the example JUnit test is that diversified collections are not compatible with the existing Sort functionality - I had to use a custom Similarity class to sort by the popularity of songs in my test data.
        Combining the diversified collector with any other form of existing collector (e.g. TopFieldCollector to achieve field-based sorting) via wrapping is problematic because the other collectors all work on the assumption that previously collected elements are never recalled. The diversifying collector needs the ability to recall previously collected elements when new elements with the same key need to be substituted.

        Mark Harwood added a comment -

        Removed outdated acceptDocsInOrder() method.

        Mark Harwood added a comment -

        Fixed the test PQ's impl of lessThan(), which was causing test failures when duplicate Integers were placed into the queue.

        Michael McCandless added a comment -

        Thanks Mark, the patch looks great. I like the added test cases! But I'm disappointed that Stayin' Alive didn't make the cut...

        You could probably use CustomScoreQuery instead of having to make a whole Sim impl ... but no need to fix that now.

        The patch has tabs ... need to change those to spaces ("ant precommit" will catch that).

        The comment for the approximate version says it uses longs, but it actually uses ints, right (String.hashCode())?

        Mark Harwood added a comment -

        Tabs removed. Ant precommit now passes. Still no Bee Gees (sorry, Mike).
        Will commit to trunk and 5.1 in a day or 2 if no objections.

        Adrien Grand added a comment -

        Hey Mark, I have some comments:

        • maybe we should have this feature in lucene/sandbox or in lucene/misc first instead of lucene/core?
        • I think we should also add a lucene.experimental annotation to this collector?
        • the `if (size == 0)` condition at the top of PQ.remove looks to be already covered by the for-loop below?
        • Should PQ.downHeap and upHeap delegate to their counterparts that take a position?
        Mark Harwood added a comment -

        maybe we should have this feature in lucene/sandbox or in lucene/misc first instead of lucene/core?

        It relies on a change to core's PriorityQueue (which was the original focus of this issue, but the issue then extended into the specialized collector that is possibly the only justification for introducing a "remove" method on PQ).

        I think we should also add a lucene.experimental annotation to this collector?

        That seems fair.

        the `if (size == 0)` condition at the top of PQ.remove looks to be already covered by the for-loop below?

        Good point, will change.

        Should PQ.downHeap and upHeap delegate to their counterparts that take a position?

        I wanted to avoid the possibility of introducing any slowdown to the PQ impl by keeping the existing upHeap/downHeap methods intact and duplicating most of their logic in the version that takes a position.

        Adrien Grand added a comment -

        It relies on a change to core's PriorityQueue (which was the original focus of this issue, but the issue then extended into the specialized collector that is possibly the only justification for introducing a "remove" method on PQ).

        I was thinking of just moving the collector there?

        I wanted to avoid the possibility of introducing any slowdown to the PQ impl by keeping the existing upHeap/downHeap methods intact and duplicating most of their logic in the version that takes a position.

        Fair enough, we can check with luceneutil later to see if we can merge them. Since regular queries only actually update the PQ on competitive hits, that should hopefully not be an issue.

        Mark Harwood added a comment -

        Moved DiversifiedTopDocsCollector and the related unit test to "misc".
        Added the "experimental" annotation.
        Removed the superfluous "if (size == 0)" test in PriorityQueue.

        Thanks, Adrien.

        Adrien Grand added a comment -

        Hi Mark,

        I played with your patch to see if removing the code duplication in PriorityQueue would hurt the benchmark, and everything looks ok:

                            TaskQPS baseline      StdDev   QPS patch      StdDev                Pct diff
                 LowSloppyPhrase       69.77      (4.5%)       69.25      (3.8%)   -0.7% (  -8% -    7%)
                        PKLookup      259.92      (3.2%)      258.50      (2.0%)   -0.5% (  -5% -    4%)
                HighSloppyPhrase       13.96      (5.1%)       13.92      (4.8%)   -0.3% (  -9% -   10%)
                    OrNotHighLow     1135.87      (6.6%)     1132.89      (5.5%)   -0.3% ( -11% -   12%)
                      AndHighLow     1075.94      (5.2%)     1073.63      (4.6%)   -0.2% (  -9% -   10%)
                       LowPhrase      124.58      (2.0%)      124.49      (1.8%)   -0.1% (  -3% -    3%)
                       MedPhrase       78.58      (1.5%)       78.56      (1.6%)   -0.0% (  -3% -    3%)
                         Prefix3       77.58      (4.7%)       77.59      (3.6%)    0.0% (  -7% -    8%)
                      HighPhrase       14.14      (1.4%)       14.16      (1.6%)    0.1% (  -2% -    3%)
                      AndHighMed      248.72      (3.9%)      249.23      (3.5%)    0.2% (  -6% -    7%)
                          Fuzzy1       72.16      (5.6%)       72.32      (6.2%)    0.2% ( -10% -   12%)
                        HighTerm       71.70      (5.3%)       71.91      (5.1%)    0.3% (  -9% -   11%)
                       OrHighLow       68.70      (5.4%)       68.91      (5.7%)    0.3% ( -10% -   11%)
                         MedTerm      220.94      (5.8%)      221.62      (5.4%)    0.3% ( -10% -   12%)
                        Wildcard       20.86      (1.6%)       20.92      (1.4%)    0.3% (  -2% -    3%)
                     LowSpanNear       16.46      (2.3%)       16.51      (2.3%)    0.3% (  -4% -    5%)
                     MedSpanNear       18.46      (2.4%)       18.52      (2.1%)    0.3% (  -3% -    4%)
                          IntNRQ        6.63      (4.0%)        6.65      (4.0%)    0.4% (  -7% -    8%)
                   OrHighNotHigh       38.52      (1.7%)       38.65      (1.5%)    0.4% (  -2% -    3%)
                   OrNotHighHigh       79.04      (2.3%)       79.33      (1.8%)    0.4% (  -3% -    4%)
                    OrHighNotMed       52.77      (1.9%)       52.97      (1.5%)    0.4% (  -2% -    3%)
                 MedSloppyPhrase       44.24      (2.9%)       44.42      (2.5%)    0.4% (  -4% -    5%)
                       OrHighMed       47.19      (5.2%)       47.37      (5.4%)    0.4% (  -9% -   11%)
                    OrHighNotLow       85.13      (2.7%)       85.48      (2.1%)    0.4% (  -4% -    5%)
                      OrHighHigh       26.42      (5.1%)       26.55      (5.0%)    0.5% (  -9% -   11%)
                     AndHighHigh       84.14      (3.6%)       84.67      (3.0%)    0.6% (  -5% -    7%)
                    HighSpanNear       50.80      (1.8%)       51.18      (1.4%)    0.7% (  -2% -    4%)
                          Fuzzy2       38.02      (8.4%)       38.54      (7.5%)    1.3% ( -13% -   18%)
                         LowTerm     1395.69      (8.9%)     1420.90      (8.4%)    1.8% ( -14% -   20%)
                    OrNotHighMed      310.39      (4.4%)      316.65      (3.8%)    2.0% (  -5% -   10%)
                         Respell       82.66      (4.7%)       84.39      (4.4%)    2.1% (  -6% -   11%)
        

        I attached the patch that I tested with.

        +1 to commit

        ASF subversion and git services added a comment -

        Commit 1659186 from mharwood@apache.org in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1659186 ]

        LUCENE-6066: new DiversifiedTopDocsCollector in misc and PriorityQueue.remove method

        ASF subversion and git services added a comment -

        Commit 1659195 from mharwood@apache.org in branch 'dev/trunk'
        [ https://svn.apache.org/r1659195 ]

        LUCENE-6066: new DiversifiedTopDocsCollector in misc and PriorityQueue.remove method

        Mark Harwood added a comment -

        Committed to trunk and 5x branch. Thanks for reviews Adrien and Mike.


          People

          • Assignee: Unassigned
          • Reporter: Mark Harwood
          • Votes: 1
          • Watchers: 10
