Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.8, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      When costly scoring factors are used during searching, a common
      approach is to do a cheaper / basic query first, collect the top few
      hundred hits, and then rescore those hits using the more costly
      query.

      It's not clear/simple to do this with Lucene today; I think we should
      make it easier.

      1. LUCENE-5489.patch
        23 kB
        Michael McCandless
      2. LUCENE-5489.patch
        22 kB
        Michael McCandless
      3. LUCENE-5489.patch
        10 kB
        Michael McCandless
      4. LUCENE-5489.patch
        16 kB
        Michael McCandless

        Issue Links

          Activity

          Hide
          Michael McCandless added a comment -

          An initial patch, extracted from LUCENE-5288. It just takes the TopDocs from the first pass search, makes a Filter, runs the second pass search with that Filter, and then does a linear combination of the scores and resorts by that.

          This is just a start; we need to iterate.

          E.g. it won't work with "deep paging", when the user wants to page beyond the topN of the first pass search.

          Show
          Michael McCandless added a comment - An initial patch, extracted from LUCENE-5288 . It just takes the TopDocs from the first pass search, makes a Filter, runs the second pass search with that Filter, and then does a linear combination of the scores and resorts by that. This is just a start; we need to iterate. E.g. it won't work with "deep paging", when the user wants to page beyond the topN of the first pass search.
          Hide
          Michael McCandless added a comment -

          New patch, just fixing javadocs / nocommits ... I think this is ready, and we can iterate later for future improvements.

          Show
          Michael McCandless added a comment - New patch, just fixing javadocs / nocommits ... I think this is ready, and we can iterate later for future improvements.
          Hide
          Robert Muir added a comment -

          +1 to commit this. API is very simple and nice. I tried to think of a better name for the filter, but its private, so its not important.

          Show
          Robert Muir added a comment - +1 to commit this. API is very simple and nice. I tried to think of a better name for the filter, but its private, so its not important.
          Hide
          Simon Willnauer added a comment -

          it's awesome that we are adding this. I have almost the same code on a high level in Elasticsearch and it works just great. Ideally I'd love to drop the code from ES and replace it with the lucene one but I think we need to iterate more on this. From what I can see the biggest issues are :

          • the way how scores are combined is hardcoded (we just multiply)
          • we only have one way to rescore ie. we use another query by default which would be nice if the interface would allow us to use other ways of rescorers.
          • most of the goodies are hidden in a static method I think we should add a nice interface / abstract class
          • it would be awesome if the interface could provide a way to get an Explanation like other queries...

          what I think would make sense is something like this:

          
          public class AbstractRescorer implements Rescore {
          
            @Override
            public TopDocs rescore(IndexSearcher searcher, TopDocs topDocs, int topN) {
               // do what is done int he static method
            }
          
            protected abstract float combine(float originalScore, float resocredScore);
            
            @Override 
            public Explanation explain(IndexSearcher searcher, Explanation sourceExplain, int docId) {
              // impl explain
            }
          
          }
          

          I hope this makes sense?

          Show
          Simon Willnauer added a comment - it's awesome that we are adding this. I have almost the same code on a high level in Elasticsearch and it works just great. Ideally I'd love to drop the code from ES and replace it with the lucene one but I think we need to iterate more on this. From what I can see the biggest issues are : the way how scores are combined is hardcoded (we just multiply) we only have one way to rescore ie. we use another query by default which would be nice if the interface would allow us to use other ways of rescorers. most of the goodies are hidden in a static method I think we should add a nice interface / abstract class it would be awesome if the interface could provide a way to get an Explanation like other queries... what I think would make sense is something like this: public class AbstractRescorer implements Rescore { @Override public TopDocs rescore(IndexSearcher searcher, TopDocs topDocs, int topN) { // do what is done int he static method } protected abstract float combine( float originalScore, float resocredScore); @Override public Explanation explain(IndexSearcher searcher, Explanation sourceExplain, int docId) { // impl explain } } I hope this makes sense?
          Hide
          Robert Muir added a comment -

          Can it just be a one method abstract class (essentially the combine)? The patch could even keep the simple method signature of today, implemented via this (which means there is basically a usable example sitting there in the source code, too)

          Show
          Robert Muir added a comment - Can it just be a one method abstract class (essentially the combine)? The patch could even keep the simple method signature of today, implemented via this (which means there is basically a usable example sitting there in the source code, too)
          Hide
          Simon Willnauer added a comment -

          I'd be totally ok with skipping the Explain that can be fixed later or in user code. I think as long as I can override the combine it's actually helpful. I think for users the static simple method should remain but I don't think the (non-static) rescore method should be polluted with a Query since that is impl. specifc and can be passed via Ctor args or so but still be hidden if the user uses the simple static API.

          Show
          Simon Willnauer added a comment - I'd be totally ok with skipping the Explain that can be fixed later or in user code. I think as long as I can override the combine it's actually helpful. I think for users the static simple method should remain but I don't think the (non-static) rescore method should be polluted with a Query since that is impl. specifc and can be passed via Ctor args or so but still be hidden if the user uses the simple static API.
          Hide
          Michael McCandless added a comment -

          OK, new patch, folding in Simon's & Rob's feedback (thanks!), adding more tests.

          I made an entirely abstract class Rescorer, and then a QueryRescorer subclass that uses a Query to get the 2nd pass scores. In the future we can have other sources of scores, e.g. an ExpressionRescorer.

          Show
          Michael McCandless added a comment - OK, new patch, folding in Simon's & Rob's feedback (thanks!), adding more tests. I made an entirely abstract class Rescorer, and then a QueryRescorer subclass that uses a Query to get the 2nd pass scores. In the future we can have other sources of scores, e.g. an ExpressionRescorer.
          Hide
          Simon Willnauer added a comment -

          hey mike, thanks for the new patch I think you overlooked that one but the signature looks funky:

          protected abstract float combine(float firstPassScore, Float secondPassScore);
          

          I guess we can use the primitive for both args? I also think this method should only be on QueryRescorer and not in the interface?

          I also wonder why you extract the IDs and Scores, I think you should clone the scoreDocs array and sort that first. Then you can just sort the rescored scoreDocs array and simply merge the scores. Once you are done you resort the previously cloned array and we don't need to do all the auto boxing in that hashmap and it's the same sorting we already do?

          Show
          Simon Willnauer added a comment - hey mike, thanks for the new patch I think you overlooked that one but the signature looks funky: protected abstract float combine( float firstPassScore, Float secondPassScore); I guess we can use the primitive for both args? I also think this method should only be on QueryRescorer and not in the interface? I also wonder why you extract the IDs and Scores, I think you should clone the scoreDocs array and sort that first. Then you can just sort the rescored scoreDocs array and simply merge the scores. Once you are done you resort the previously cloned array and we don't need to do all the auto boxing in that hashmap and it's the same sorting we already do?
          Hide
          Simon Willnauer added a comment -

          oh I see the Float was to mark a match / non-match.. I guess we should really just pass a boolean to make thinks clear.

          Show
          Simon Willnauer added a comment - oh I see the Float was to mark a match / non-match.. I guess we should really just pass a boolean to make thinks clear.
          Hide
          Michael McCandless added a comment -

          I guess we should really just pass a boolean to make thinks clear.

          I'll switch to a boolean; I agree the sig is weird now.

          Show
          Michael McCandless added a comment - I guess we should really just pass a boolean to make thinks clear. I'll switch to a boolean; I agree the sig is weird now.
          Hide
          Michael McCandless added a comment -

          I also think this method should only be on QueryRescorer and not in the interface?

          Woops, right, I'll move it.

          I also wonder why you extract the IDs and Scores, I think you should clone the scoreDocs array and sort that first. Then you can just sort the rescored scoreDocs array and simply merge the scores. Once you are done you resort the previously cloned array and we don't need to do all the auto boxing in that hashmap and it's the same sorting we already do?

          I think this can wait? It's just an optimization (making the code more hairy but a bit faster). I'll put a TODO...

          Show
          Michael McCandless added a comment - I also think this method should only be on QueryRescorer and not in the interface? Woops, right, I'll move it. I also wonder why you extract the IDs and Scores, I think you should clone the scoreDocs array and sort that first. Then you can just sort the rescored scoreDocs array and simply merge the scores. Once you are done you resort the previously cloned array and we don't need to do all the auto boxing in that hashmap and it's the same sorting we already do? I think this can wait? It's just an optimization (making the code more hairy but a bit faster). I'll put a TODO...
          Hide
          Simon Willnauer added a comment -

          yeah it can wait I guess - please go ahead and put a TODO

          Show
          Simon Willnauer added a comment - yeah it can wait I guess - please go ahead and put a TODO
          Hide
          Michael McCandless added a comment -

          New patch, folding in feedback ... I think it's ready.

          Show
          Michael McCandless added a comment - New patch, folding in feedback ... I think it's ready.
          Hide
          Robert Muir added a comment -

          This looks good, thanks for moving combine(), as the expression already indicates how to combine with the score. It would be cool for us to add that subclass in a followup issue, then we have a better feeling the abstractions are really working.

          Show
          Robert Muir added a comment - This looks good, thanks for moving combine(), as the expression already indicates how to combine with the score. It would be cool for us to add that subclass in a followup issue, then we have a better feeling the abstractions are really working.
          Hide
          Simon Willnauer added a comment -

          I think this looks great +1 to commit

          Show
          Simon Willnauer added a comment - I think this looks great +1 to commit
          Hide
          ASF subversion and git services added a comment -

          Commit 1579911 from Michael McCandless in branch 'dev/trunk'
          [ https://svn.apache.org/r1579911 ]

          LUCENE-5489: add Rescorer/QueryRescorer

          Show
          ASF subversion and git services added a comment - Commit 1579911 from Michael McCandless in branch 'dev/trunk' [ https://svn.apache.org/r1579911 ] LUCENE-5489 : add Rescorer/QueryRescorer
          Hide
          ASF subversion and git services added a comment -

          Commit 1579913 from Michael McCandless in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1579913 ]

          LUCENE-5489: add Rescorer/QueryRescorer

          Show
          ASF subversion and git services added a comment - Commit 1579913 from Michael McCandless in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1579913 ] LUCENE-5489 : add Rescorer/QueryRescorer
          Hide
          ASF subversion and git services added a comment -

          Commit 1579914 from Michael McCandless in branch 'dev/trunk'
          [ https://svn.apache.org/r1579914 ]

          LUCENE-5489: add changes

          Show
          ASF subversion and git services added a comment - Commit 1579914 from Michael McCandless in branch 'dev/trunk' [ https://svn.apache.org/r1579914 ] LUCENE-5489 : add changes
          Hide
          ASF subversion and git services added a comment -

          Commit 1579915 from Michael McCandless in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1579915 ]

          LUCENE-5489: add changes

          Show
          ASF subversion and git services added a comment - Commit 1579915 from Michael McCandless in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1579915 ] LUCENE-5489 : add changes
          Hide
          Michael McCandless added a comment -

          Thanks guys!

          Show
          Michael McCandless added a comment - Thanks guys!
          Hide
          ASF subversion and git services added a comment -

          Commit 1579948 from Michael McCandless in branch 'dev/trunk'
          [ https://svn.apache.org/r1579948 ]

          LUCENE-5489: fix test bug

          Show
          ASF subversion and git services added a comment - Commit 1579948 from Michael McCandless in branch 'dev/trunk' [ https://svn.apache.org/r1579948 ] LUCENE-5489 : fix test bug
          Hide
          ASF subversion and git services added a comment -

          Commit 1579949 from Michael McCandless in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1579949 ]

          LUCENE-5489: fix test bug

          Show
          ASF subversion and git services added a comment - Commit 1579949 from Michael McCandless in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1579949 ] LUCENE-5489 : fix test bug
          Hide
          Uwe Schindler added a comment -

          Close issue after release of 4.8.0

          Show
          Uwe Schindler added a comment - Close issue after release of 4.8.0

            People

            • Assignee:
              Michael McCandless
              Reporter:
              Michael McCandless
            • Votes:
              2 Vote for this issue
              Watchers:
              9 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development