Uploaded image for project: 'Solr'
  1. Solr
  2. SOLR-9331

Can we remove ReRankQuery's length constructor argument?

    Details

    • Type: Wish
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 6.2, 7.0
    • Component/s: None
    • Security Level: Public (Default Security Level. Issues are Public)
    • Labels:
      None

      Description

      Can we remove ReRankQuery's length constructor argument? It is a ReRankQParserPlugin private class.

      proposed patch summary:

      • change ReRankQuery.getTopDocsCollector to use its len argument (instead of the length member)
      • remove ReRankQuery's length member and constructor argument
      • remove ReRankQParser.parse's use of the rows and start parameters

      motivation: towards ReRankQParserPlugin and LTRQParserPlugin (SOLR-8542) sharing (more) code

      1. SOLR-9331.patch
        3 kB
        Christine Poerschke
      2. SOLR-9331.patch
        3 kB
        Christine Poerschke

        Issue Links

          Activity

          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 02527909625763871f0cf5a4aea353462299cda1 in lucene-solr's branch refs/heads/branch_6x from Christine Poerschke
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=0252790 ]

          SOLR-9331: Remove ReRankQuery's length constructor argument and member.

          Show
          jira-bot ASF subversion and git services added a comment - Commit 02527909625763871f0cf5a4aea353462299cda1 in lucene-solr's branch refs/heads/branch_6x from Christine Poerschke [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=0252790 ] SOLR-9331 : Remove ReRankQuery's length constructor argument and member.
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit bc25a565d23a7f791272be02685e71217234704b in lucene-solr's branch refs/heads/master from Christine Poerschke
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=bc25a56 ]

          SOLR-9331: Remove ReRankQuery's length constructor argument and member.

          Show
          jira-bot ASF subversion and git services added a comment - Commit bc25a565d23a7f791272be02685e71217234704b in lucene-solr's branch refs/heads/master from Christine Poerschke [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=bc25a56 ] SOLR-9331 : Remove ReRankQuery's length constructor argument and member.
          Hide
          cpoerschke Christine Poerschke added a comment -

          Previous patch rebased against latest master with conflicts resolved.

          Show
          cpoerschke Christine Poerschke added a comment - Previous patch rebased against latest master with conflicts resolved.
          Hide
          joel.bernstein Joel Bernstein added a comment -

          I haven't had time to review but if you're feeling comfortable with the changes then feel free to commit.

          The re-ranker is going to be a major focus of the 6.3 release for me, so I'll be able to provide feedback if I run into any issues.

          Show
          joel.bernstein Joel Bernstein added a comment - I haven't had time to review but if you're feeling comfortable with the changes then feel free to commit. The re-ranker is going to be a major focus of the 6.3 release for me, so I'll be able to provide feedback if I run into any issues.
          Hide
          cpoerschke Christine Poerschke added a comment -

          Hi Joel Bernstein - no problem. Thanks for your input on my re-ranker small refactor work. My plan is to commit SOLR-9353 later this afternoon and then this SOLR-9331 here Monday or Tuesday. Would that work for you too? From quick look at the latest SOLR-9252 patch there should be no merge conflicts.

          Show
          cpoerschke Christine Poerschke added a comment - Hi Joel Bernstein - no problem. Thanks for your input on my re-ranker small refactor work. My plan is to commit SOLR-9353 later this afternoon and then this SOLR-9331 here Monday or Tuesday. Would that work for you too? From quick look at the latest SOLR-9252 patch there should be no merge conflicts.
          Hide
          joel.bernstein Joel Bernstein added a comment -

          Hi Christine Poerschke,

          Sorry I've been slow responding to your work with the re-ranker.

          Now that SOLR-9252 is complete the next step is to deploy stored models in the reranker. I'll be working on this for the Solr 6.3 release. That will be a good time for me to test out any changes you make in the reranker.

          My plan for deploying models in the reranker is to add a classify() function query that returns the score from a logistic regression classifier. The classify() function query will initially work with the logistic regression models created in SOLR-9252.

          Then the rerank query would be:

          {!func}classify(...)
          
          Show
          joel.bernstein Joel Bernstein added a comment - Hi Christine Poerschke , Sorry I've been slow responding to your work with the re-ranker. Now that SOLR-9252 is complete the next step is to deploy stored models in the reranker. I'll be working on this for the Solr 6.3 release. That will be a good time for me to test out any changes you make in the reranker. My plan for deploying models in the reranker is to add a classify() function query that returns the score from a logistic regression classifier. The classify() function query will initially work with the logistic regression models created in SOLR-9252 . Then the rerank query would be: {!func}classify(...)
          Hide
          cpoerschke Christine Poerschke added a comment -

          Been reading and investigating some more w.r.t. SOLR-9331 here and SOLR-9336 also - here's my now improved understanding of how the request parameters rows, start and reRankDocs and the solrconfig.xml element queryResultWindowSize combine as far as the ReRank(Query|Collector) and the QueryResultCache are concerned.

          The start parameter defaults to 0 if not supplied and combined with the rows parameter it is used for paging, for example if each page is to contain five documents then the requests would be:

          # page 1
          ...&rows=5
          # page 2
          ...&rows=5&start=5
          # page 3
          ...&rows=5&start=10
          

          Next, let's say we wish to apply some sort of reranking to improve search relevance.

          • Here's what the requests would look like if we were to rerank/reorder just the first page of documents:
            # page 1
            ...&rq={!rerank+reRankDocs=5+reRankQuery=$rrq+...}&rrq=...&rows=5
            # cost: 5 docs retrieved, 5 docs reordered, 5 docs returned
            #
            # page 2
            ...&rq={!rerank+reRankDocs=5+reRankQuery=$rrq+...}&rrq=...&rows=5&start=5
            # cost: 10 docs retrieved, 5 docs reordered, 5 docs skipped and then 5 docs returned
            #
            # page 3
            ...&rq={!rerank+reRankDocs=5+reRankQuery=$rrq+...}&rrq=...&rows=5&start=10
            # cost: 15 docs retrieved, 5 docs reordered, 10 docs skipped and then 5 docs returned
            
          • Here's what the requests would look like if we were to rerank/reorder the first five pages of documents:
            # page 1
            ...&rq={!rerank+reRankDocs=25+reRankQuery=$rrq+...}&rrq=...&rows=5
            # cost: 25 docs retrieved, 25 docs reordered, 5 docs returned
            #
            # page 2
            ...&rq={!rerank+reRankDocs=25+reRankQuery=$rrq+...}&rrq=...&rows=5&start=5
            # cost: 25 docs retrieved, 25 docs reordered, 5 docs skipped and then 5 docs returned
            #
            # page 3
            ...&rq={!rerank+reRankDocs=25+reRankQuery=$rrq+...}&rrq=...&rows=5&start=10
            # cost: 25 docs retrieved, 25 docs reordered, 10 docs skipped and then 5 docs returned
            

          Next, let's think about query result caching and demonstrate why reRankDocs needs to be part of the ReRankQuery.equalTo formula:

          # reranking logic: 'odd ahead of even'
          # results without            reranking: 1 2 3 4 5 6 7 8 9 10
          # results with reRankDocs=7  reranking: 1 3 5 7 2 4 6 8 9 10
          # results with reRankDocs=10 reranking: 1 3 5 7 9 2 4 6 8 10
          
          • conclusion: reRankDocs influences the end result and thus it must form part of the query result caching logic.

          Next, let's consider query result caching, with rows+start combined into a length variable:

          # reranking logic: 'odd ahead of even'
          #
          #  input: reRankDocs=3&start=0&rows=6&rq=...
          # output: [ 1 3 2 4 5 6 ] // populate cache (length=6)
          #
          #  input: reRankDocs=3&start=0&rows=3&rq=...
          # output: [ 1 3 2 ] // use (length=6) cache (we need only first half subset of what is cached)
          #
          #  input: reRankDocs=3&start=3&rows=3&rq=...
          # output: [ 4 5 6 ] // use (length=6) cache (we need only second half subset of what is cached)
          #
          #  input: reRankDocs=3&start=0&rows=4&rq=...
          # output: [ 1 3 2 4 ] // use (length=6) cache (we need only first two thirds subset of what is cached)
          #
          #  input: reRankDocs=3&start=3&rows=4&rq=...
          # output: cache lookup returns (length=6) cache entry with too few elements and so no cache use here
          #  cache: [ 1 3 2 4 5 6 7 ] // populate cache (length=7)
          # output: [ 4 5 6 7 ]
          
          • conclusions: length not being part of the query result caching logic means that
            • a length=6 cache entry can be used by some (but not all) subsequent length!=6 requests
            • the cache entry's length must be considered relative to the request's length and a cache hit is not always a useable cache hit

          Following on from this, we can think of the queryResultWindowSize config element as a 'rounded up' version of the length variable:

          # reranking logic: 'odd ahead of even'
          #
          #  input: reRankDocs=3&start=0&rows=6&rq=...
          # config: queryResultWindowSize=8
          #  cache: [ 1 3 2 4 5 6 7 8 ] // populate cache (length=8)
          # output: [ 1 3 2 4 5 6 ]
          #
          #  input: reRankDocs=3&start=3&rows=4&rq=...
          # output: [ 4 5 6 7 ] // use (length=8) cache (we need only a middle subset of what is cached)
          
          • notes:
            • the first query slightly overpopulated the cache and thus the second query could use the cache
            • the first query became slightly more expensive (8 vs. 6 docs retrieved)
            • the second query became cheaper since the query result cache could be used

          Finally, let's reassure ourselves that the queryResultWindowSize 'rounding up' does not alter query results themselves:

          • The mainCollector is used to retrieve documents. The ReRankCollector constructor specifies Math.max(reRankDocs, length) as the numHits for the mainCollector.
          • The ReRankCollector.topDocs method obtains Math.max(reRankDocs, length) documents from the mainCollector and then reranks/reorders up to reRankDocs of the obtained documents.

          So, based on the above analysis I would conclude that ReRankQuery's length constructor argument can safely be removed (as is proposed by the attached patch) and that doing so would be in keeping with the queryResultWindowSize logic itself.

          How does that sound? What do you think?

          Show
          cpoerschke Christine Poerschke added a comment - Been reading and investigating some more w.r.t. SOLR-9331 here and SOLR-9336 also - here's my now improved understanding of how the request parameters rows , start and reRankDocs and the solrconfig.xml element queryResultWindowSize combine as far as the ReRank(Query|Collector) and the QueryResultCache are concerned. The start parameter defaults to 0 if not supplied and combined with the rows parameter it is used for paging, for example if each page is to contain five documents then the requests would be: # page 1 ...&rows=5 # page 2 ...&rows=5&start=5 # page 3 ...&rows=5&start=10 Next, let's say we wish to apply some sort of reranking to improve search relevance. Here's what the requests would look like if we were to rerank/reorder just the first page of documents: # page 1 ...&rq={!rerank+reRankDocs=5+reRankQuery=$rrq+...}&rrq=...&rows=5 # cost: 5 docs retrieved, 5 docs reordered, 5 docs returned # # page 2 ...&rq={!rerank+reRankDocs=5+reRankQuery=$rrq+...}&rrq=...&rows=5&start=5 # cost: 10 docs retrieved, 5 docs reordered, 5 docs skipped and then 5 docs returned # # page 3 ...&rq={!rerank+reRankDocs=5+reRankQuery=$rrq+...}&rrq=...&rows=5&start=10 # cost: 15 docs retrieved, 5 docs reordered, 10 docs skipped and then 5 docs returned Here's what the requests would look like if we were to rerank/reorder the first five pages of documents: # page 1 ...&rq={!rerank+reRankDocs=25+reRankQuery=$rrq+...}&rrq=...&rows=5 # cost: 25 docs retrieved, 25 docs reordered, 5 docs returned # # page 2 ...&rq={!rerank+reRankDocs=25+reRankQuery=$rrq+...}&rrq=...&rows=5&start=5 # cost: 25 docs retrieved, 25 docs reordered, 5 docs skipped and then 5 docs returned # # page 3 ...&rq={!rerank+reRankDocs=25+reRankQuery=$rrq+...}&rrq=...&rows=5&start=10 # cost: 25 docs retrieved, 25 docs reordered, 10 docs skipped and then 5 docs returned Next, let's think about query result caching and demonstrate why reRankDocs needs to be part of the ReRankQuery.equalTo formula: # reranking logic: 'odd ahead of even' # results without reranking: 1 2 3 4 5 6 7 8 9 10 # results with reRankDocs=7 reranking: 1 3 5 7 2 4 6 8 9 10 # results with reRankDocs=10 reranking: 1 3 5 7 9 2 4 6 8 10 conclusion: reRankDocs influences the end result and thus it must form part of the query result caching logic. Next, let's consider query result caching, with rows + start combined into a length variable: # reranking logic: 'odd ahead of even' # # input: reRankDocs=3&start=0&rows=6&rq=... # output: [ 1 3 2 4 5 6 ] // populate cache (length=6) # # input: reRankDocs=3&start=0&rows=3&rq=... # output: [ 1 3 2 ] // use (length=6) cache (we need only first half subset of what is cached) # # input: reRankDocs=3&start=3&rows=3&rq=... # output: [ 4 5 6 ] // use (length=6) cache (we need only second half subset of what is cached) # # input: reRankDocs=3&start=0&rows=4&rq=... # output: [ 1 3 2 4 ] // use (length=6) cache (we need only first two thirds subset of what is cached) # # input: reRankDocs=3&start=3&rows=4&rq=... # output: cache lookup returns (length=6) cache entry with too few elements and so no cache use here # cache: [ 1 3 2 4 5 6 7 ] // populate cache (length=7) # output: [ 4 5 6 7 ] conclusions: length not being part of the query result caching logic means that a length=6 cache entry can be used by some (but not all) subsequent length!=6 requests the cache entry's length must be considered relative to the request's length and a cache hit is not always a useable cache hit Following on from this, we can think of the queryResultWindowSize config element as a 'rounded up' version of the length variable: # reranking logic: 'odd ahead of even' # # input: reRankDocs=3&start=0&rows=6&rq=... # config: queryResultWindowSize=8 # cache: [ 1 3 2 4 5 6 7 8 ] // populate cache (length=8) # output: [ 1 3 2 4 5 6 ] # # input: reRankDocs=3&start=3&rows=4&rq=... # output: [ 4 5 6 7 ] // use (length=8) cache (we need only a middle subset of what is cached) notes: the first query slightly overpopulated the cache and thus the second query could use the cache the first query became slightly more expensive (8 vs. 6 docs retrieved) the second query became cheaper since the query result cache could be used Finally, let's reassure ourselves that the queryResultWindowSize 'rounding up' does not alter query results themselves: The mainCollector is used to retrieve documents. The ReRankCollector constructor specifies Math.max(reRankDocs, length) as the numHits for the mainCollector. The ReRankCollector.topDocs method obtains Math.max(reRankDocs, length) documents from the mainCollector and then reranks/reorders up to reRankDocs of the obtained documents. So, based on the above analysis I would conclude that ReRankQuery 's length constructor argument can safely be removed (as is proposed by the attached patch) and that doing so would be in keeping with the queryResultWindowSize logic itself. How does that sound? What do you think?
          Hide
          cpoerschke Christine Poerschke added a comment -

          Thanks Joel Bernstein for the QueryResultCache and getDocListNC pointers. I now realize why and agree that the length passed to the constructor and the len passed to getTopDocsCollector are potentially different and so the constructor's length argument should stay therefore. Whilst looking at the code I noticed that the hashCode and equalsTo in ReRankQuery don't use the length member, SOLR-9336 created with proposed patch to change that - what do you think?

          Show
          cpoerschke Christine Poerschke added a comment - Thanks Joel Bernstein for the QueryResultCache and getDocListNC pointers. I now realize why and agree that the length passed to the constructor and the len passed to getTopDocsCollector are potentially different and so the constructor's length argument should stay therefore. Whilst looking at the code I noticed that the hashCode and equalsTo in ReRankQuery don't use the length member, SOLR-9336 created with proposed patch to change that - what do you think?
          Hide
          joel.bernstein Joel Bernstein added a comment - - edited

          Also let's take close at all the adjustments done to the length in SolrIndexSearcher.getDocListNC(QueryResult qr, QueryCommand cmd).

          Show
          joel.bernstein Joel Bernstein added a comment - - edited Also let's take close at all the adjustments done to the length in SolrIndexSearcher.getDocListNC(QueryResult qr, QueryCommand cmd).
          Hide
          joel.bernstein Joel Bernstein added a comment -

          So the length that is passed into the constructor would be the exact length requested by query. The len being passed in getTopDocsCollector would be adjusted for the query result cache I believe. I'm not sure there is any issue though with using the len being passed in getTopDocsCollector.

          Show
          joel.bernstein Joel Bernstein added a comment - So the length that is passed into the constructor would be the exact length requested by query. The len being passed in getTopDocsCollector would be adjusted for the query result cache I believe. I'm not sure there is any issue though with using the len being passed in getTopDocsCollector.
          Hide
          joel.bernstein Joel Bernstein added a comment - - edited

          It's been a while since I looked at this code. I'm wondering if I originally implemented it like this because of issues with the QueryResultCache. But I don't remember exactly the reason for having a separate length variable.

          Show
          joel.bernstein Joel Bernstein added a comment - - edited It's been a while since I looked at this code. I'm wondering if I originally implemented it like this because of issues with the QueryResultCache. But I don't remember exactly the reason for having a separate length variable.

            People

            • Assignee:
              cpoerschke Christine Poerschke
              Reporter:
              cpoerschke Christine Poerschke
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development