Solr / SOLR-5463

Provide cursor/token based "searchAfter" support that works with arbitrary sorting (ie: "deep paging")

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.7, 5.0
    • Component/s: None
    • Labels: None

      Description

      I'd like to revisit a solution to the problem of "deep paging" in Solr, leveraging an HTTP based API similar to how IndexSearcher.searchAfter works at the Lucene level: require the clients to provide back a token indicating the sort values of the last document seen on the previous "page". This is similar to the "cursor" model I've seen in several other REST APIs that support "pagination" over large sets of results (notably the Twitter API and its "since_id" param) except that we'll want something that works with arbitrary multi-level sort criteria that can be either ascending or descending.

      SOLR-1726 laid some initial groundwork here and was committed quite a while ago, but the key bit of argument parsing to leverage it was commented out due to some problems (see comments in that issue). It's also somewhat out of date at this point: at the time it was committed, IndexSearcher only supported searchAfter for simple scores, not arbitrary field sorts; and the params added in SOLR-1726 suffer from this limitation as well.

      I think it would make sense to start fresh with a new issue with a focus on ensuring that we have deep paging which:

      • supports arbitrary field sorts in addition to sorting by score
      • works in distributed mode
      Basic Usage
      • send a request with sort=X&start=0&rows=N&cursorMark=*
        • sort can be anything, but must include the uniqueKey field (as a tie breaker)
        • "N" can be any number you want per page
        • start must be "0"
        • "*" denotes you want to use a cursor starting at the beginning mark
      • parse the response body and extract the (String) nextCursorMark value
      • Replace the "*" value in your initial request params with the nextCursorMark value from the response in the subsequent request
      • repeat until the nextCursorMark value stops changing, or you have collected as many docs as you need (a SolrJ sketch of this loop follows)
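
      Putting those steps together, here is a minimal SolrJ sketch of the fetch loop (a sketch assuming the Solr 4.7 Java client and a collection whose uniqueKey field is "id"; the URL and page size are illustrative):

        import org.apache.solr.client.solrj.SolrQuery;
        import org.apache.solr.client.solrj.impl.HttpSolrServer;
        import org.apache.solr.client.solrj.response.QueryResponse;
        import org.apache.solr.common.SolrDocument;
        import org.apache.solr.common.params.CursorMarkParams;

        public class CursorWalk {
          public static void main(String[] args) throws Exception {
            HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
            SolrQuery q = new SolrQuery("*:*");
            q.setRows(100);                            // "N" docs per page
            q.setSort(SolrQuery.SortClause.asc("id")); // sort must include the uniqueKey field
            String cursorMark = CursorMarkParams.CURSOR_MARK_START; // "*"
            while (true) {
              q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
              QueryResponse rsp = solr.query(q);
              for (SolrDocument doc : rsp.getResults()) {
                System.out.println(doc.getFieldValue("id")); // process each doc
              }
              String next = rsp.getNextCursorMark();
              if (cursorMark.equals(next)) break; // mark stopped changing: done
              cursorMark = next;
            }
          }
        }
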
      Attachments

      1. SOLR-5463__straw_man__MissingStringLastComparatorSource.patch
        3 kB
        Steve Rowe
      2. SOLR-5463__straw_man.patch
        99 kB
        Hoss Man
      3. SOLR-5463__straw_man.patch
        97 kB
        Hoss Man
      4. SOLR-5463__straw_man.patch
        94 kB
        Hoss Man
      5. SOLR-5463__straw_man.patch
        94 kB
        Hoss Man
      6. SOLR-5463__straw_man.patch
        94 kB
        Hoss Man
      7. SOLR-5463__straw_man.patch
        70 kB
        Hoss Man
      8. SOLR-5463__straw_man.patch
        70 kB
        Hoss Man
      9. SOLR-5463__straw_man.patch
        64 kB
        Hoss Man
      10. SOLR-5463__straw_man.patch
        58 kB
        Hoss Man
      11. SOLR-5463__straw_man.patch
        39 kB
        Hoss Man
      12. SOLR-5463.patch
        132 kB
        Hoss Man
      13. SOLR-5463.patch
        126 kB
        Hoss Man
      14. SOLR-5463.patch
        118 kB
        Hoss Man
      15. SOLR-5463.patch
        114 kB
        Steve Rowe
      16. SOLR-5463.patch
        111 kB
        Hoss Man
      17. SOLR-5463.patch
        115 kB
        Hoss Man
      18. SOLR-5463.patch
        110 kB
        Hoss Man
      19. SOLR-5463-randomized-faceting-test.patch
        8 kB
        Steve Rowe

          Activity

          Hoss Man added a comment -

          I've been reading up on the internals of IndexSearcher.searchAfter and the associated PagingFieldCollector it uses (as well as some of the problems encountered in SOLR-1726) and I'm not convinced it would be a slam dunk to try and use them directly in Solr:

          • IndexSearcher.searchAfter/PagingFieldCollector relies on the "client" (ie: Solr) passing back the FieldDoc of the last doc returned, and has expectations that the (lucene) docid contained in that FieldDoc will be meaningful
            • We could perhaps serialize a representation of the "last" FieldDoc to include in the response of each request, and then deserialize that into a suitable imposter object on the "searchAfter" request – but there is still the problem of the internal docid, which will be misleading in a multi-shard distributed Solr setup
          • There are a variety of code paths in SolrIndexSearcher for executing searches, and it's not immediately obvious (to me) if/when it would make sense to augment each of those paths with PagingFieldCollector (see yonik's comment in SOLR-1726 about faceting).

          With that in mind, the approach I'm going to pursue (largely for my own sanity) is:

          • Attempt a minimally invasive straw man implementation of "searchAfter" type functionality that works in distributed mode – ideally w/o modifying any existing Solr code.
          • Use this straw man implementation to sanity check that the end user API is useful
          • Build up good comprehensive (passing) tests against this straw man
          • Circle back and revisit the implementation details looking for opportunities to:
            • refactor to eliminate similar code duplication
            • improve performance

          My current idea is to implement this straw man solution using a new SearchComponent that would run after QueryComponent, along the lines of the following (a rough code sketch appears after this list)...

          • prepare:
            • No-Op unless "searchAfter" param is specified
              • Use some marker value to mean "first page"
            • assert that start==0 (doesn't make sense when using searchAfter)
            • assert that uniqueKey is one of the sort fields (to ensure consistent ordering)
            • if searchAfter param value indicates this is not the first request:
            • deserialize the token into a list of sort values
              • add a new PostFilter that restricts to documents based on those values and the sort directions (same basic logic as PagingFieldCollector)
          • process:
            • No-Op unless "searchAfter" param is specified
            • do nothing if this is a shard request
            • for regular old single node Solr requests: serialize the sort values of the last doc in the Doc List (that QueryComponent has already built) and put it in the response as the "next" searchAfter token
          • finishStage:
            • No-Op unless "searchAfter" param is specified and stage is "DONE"
            • serialize the sort values of the last doc in the Doc List (that QueryComponent already merged) and put it in the response as the "next" searchAfter token
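
          A rough skeleton of what such a component might look like (a sketch with hypothetical names – parseTotem, buildPostFilter, and serializeLastDoc stand in for logic not shown, and the actual patch differs):

            import java.io.IOException;

            import org.apache.solr.common.SolrException;
            import org.apache.solr.common.params.CommonParams;
            import org.apache.solr.common.params.ShardParams;
            import org.apache.solr.handler.component.ResponseBuilder;
            import org.apache.solr.handler.component.SearchComponent;

            public class DeepPagingComponent extends SearchComponent {

              @Override
              public void prepare(ResponseBuilder rb) throws IOException {
                String totem = rb.req.getParams().get("searchAfter");
                if (totem == null) return; // No-Op unless "searchAfter" is specified
                if (0 != rb.req.getParams().getInt(CommonParams.START, 0)) {
                  throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
                      "start must be 0 when using searchAfter");
                }
                // (also assert here that the sort includes the uniqueKey field)
                if (!"*".equals(totem)) { // "*" is the "first page" marker
                  // deserialize the totem into sort values and add a PostFilter
                  // restricting results to docs "after" those values, e.g.:
                  // rb.getFilters().add(buildPostFilter(parseTotem(totem), rb.getSortSpec()));
                }
              }

              @Override
              public void process(ResponseBuilder rb) throws IOException {
                if (rb.req.getParams().get("searchAfter") == null) return;
                if (rb.req.getParams().getBool(ShardParams.IS_SHARD, false)) return;
                // single node: serialize the sort values of the last doc in the
                // DocList that QueryComponent already built, e.g.:
                // rb.rsp.add("nextSearchAfter", serializeLastDoc(rb));
              }

              @Override
              public void finishStage(ResponseBuilder rb) {
                if (rb.req.getParams().get("searchAfter") == null) return;
                if (rb.stage != ResponseBuilder.STAGE_DONE) return;
                // distributed: serialize the sort values of the last doc that
                // QueryComponent merged, e.g.:
                // rb.rsp.add("nextSearchAfter", serializeLastDoc(rb));
              }

              @Override
              public String getDescription() { return "deep paging straw man (sketch)"; }

              @Override
              public String getSource() { return null; }
            }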
          Robert Muir added a comment -

          We could perhaps serialize a representation of the "last" FieldDoc to include in the response of each request, and then deserialize that into a suitable imposter object on the "searchAfter" request – but there is still the problem of the internal docid, which will be misleading in a multi-shard distributed Solr setup

          I disagree. The FieldDoc only contains the values that were sorted on. This is what is minimal and necessary to do paging.

          If Solr wants to avoid Lucene docids for some reason (e.g. because it does not yet implement searcher leases), then perhaps when using this feature (at least for now) the unique id should always be added as a tiebreaker to the sort.

          Hoss Man added a comment -

          I disagree. The FieldDoc only contains the values that were sorted on. This is what is minimal and necessary to do paging.

          FieldDoc subclasses ScoreDoc, which includes the internal docid – and PagingFieldCollector does look at it. But as you say: as long as we include uniqueKey in the fields (which I already mentioned) then the docid in the FieldDoc shouldn't matter since (I think?) it's only used as a tie breaker.
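
          For context, this is the call shape at the Lucene level that we're talking about emulating (a sketch assuming a score-then-id sort; the FieldDoc values are illustrative):

            import java.io.IOException;

            import org.apache.lucene.search.FieldDoc;
            import org.apache.lucene.search.IndexSearcher;
            import org.apache.lucene.search.Query;
            import org.apache.lucene.search.Sort;
            import org.apache.lucene.search.TopDocs;
            import org.apache.lucene.util.BytesRef;

            public class SearchAfterShape {
              // Fetch the page after a doc whose sort values were (lastScore, lastId).
              // The docid argument (0) is a placeholder: with uniqueKey in the sort,
              // it only matters as a tie breaker.
              static TopDocs nextPage(IndexSearcher searcher, Query q, Sort sort,
                                      float lastScore, BytesRef lastId) throws IOException {
                FieldDoc after = new FieldDoc(0, Float.NaN, new Object[] { lastScore, lastId });
                return searcher.searchAfter(after, q, 20, sort);
              }
            }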

          If Solr wants to avoid Lucene docids for some reason (e.g. because it does not yet implement searcher leases) ...

          I'm glad you brought up searcher leases, because I wanted to mention it before but I forgot...

          • I have no idea how to even try to implement searcher leases in a sane way in a distributed Solr setup, given that we want clients to be able to hit any replica on subsequent requests.
          • For my use cases, I actively do NOT want a searcher lease when doing deep paging: if documents matching my search, but on later pages I have not loaded yet, get deleted from the index, I don't want them included in the results once I get to that page just because they were a match X minutes ago when my search started.

          I think what makes the most sense is to ensure we can support deep paging w/o searcher leases, and then if/when searcher leases are supported people who want both can have both.


          I'm attaching my current progress with a straw man impl + tests. It includes the basic functionality & tests for doing deep paging on a single node Solr setup using numeric sorts.

          There are an absurd number of nocommits in this patch: most of them are in the impl and I'm not worried about them because I'm hoping the impl can ultimately be thrown out; some are in the test because of additional tests I want to write; some are in the test because of silly limitations in the impl.

          Only one class of nocommits really concerns me at this point, and that's the issue of dealing with String sorts – the way Solr's distributed sorting code deals with fields that use SortField.Type.STRING (and presumably SortField.Type.STRING_VAL) results in the coordinator node having a String object even though the underlying FieldComparator expects/uses BytesRef as the comparison value.

          I could probably hack around this, and convert the Strings back to BytesRef myself in the DeepPaging code – but this actually smells like a more fundamental problem we should address. It seems to be the same root problem that sarowe has been looking into in SOLR-5354 in order to play nicer with custom FieldTypes: safely "serializing" the true sort object (regardless of what it is) between shards->coordinator, and then deserializing it & using the real FieldComparator for each field to do the aggregated sorting of the docs from each shard.


          In any case, my next step is to get some distributed tests set up and working against this straw man impl, and then dig into throwing away the straw man impl and trying to replace it with PagingFieldCollector – possibly with a side diversion to help sarowe fix the underlying problems in SOLR-5354 first.

          Hoss Man added a comment -

          Updated the straw man to include distributed search support.

          I still want to beef up the randomized testing some more before moving on to replacing the straw man with a lower level impl using the PagingFieldCollector.

          Hoss Man added a comment -

          Added simple support for sorts involving score, and added randomized testing of multi-level sorts, both in single node and distributed modes.

          Next up I'm going to look into improving the serialization of the totem to make it work better with strings and CUSTOM SortFields – which requires leveraging the improvements sarowe is working on in SOLR-5354.

          Hoss Man added a comment -

          The totem serialization now takes advantage of the work done in SOLR-5354 so that searchAfter now plays nice with basic string sorting and custom sort functions.

          One limitation I uncovered, however, is that MissingStringLastComparatorSource (which ironically is used by both sortMissingLast and sortMissingFirst) throws UnsupportedOperationException for a large number of its methods, which makes it impossible to use to filter out docs that are "before" the searchAfter totem. It should be fixable, but I've punted on this for now.

          Hoss Man added a comment -

          Whoops ... last patch was stale and didn't have the randomized string fields for testing.

          Hoss Man added a comment -

          Patch update...

          • additional tests that mix deletes & updates with walking a cursor
          • more randomization of the types of queries being run
          • more hardening of the SearchAfterTotem class (it should be useful beyond the strawman)
          • more tests for the SearchAfterTotem serialization
          • more tests of bad input/usage
          • hook strawman component into example to try it out

          ...at this point I was going to pursue tweaking the user facing API a bit, so that the "next" totem was never null, it always corresponds to the last doc returned, and clients would check for 0 docs coming back to know when they are "done" and the "next" totem returned would be the same as the one they sent. If we do this, use cases like "I want every doc matching this query, and if there are no more, I want to remember where I left off and continue again later" would be possible if the client uses a compatible sort (ie: a timestamp field).

          However.... when manually using this with the example configs in order to sanity check that it was going to feel right as an API, I discovered problems where the queryResultCache was coming into play and never getting past "page" 3. I felt stupid for not thinking to test this earlier, and updated the test configs to include queryResultCache, but I can't reproduce the failure ... still not sure why.

          Need to investigate this further before doing anything else.

          Steve Rowe added a comment - edited

          However.... when manually using this with the example configs in order to sanity check that it was going to feel right as an API, I discovered problems where the queryResultCache was coming into play and never getting past "page" 3. I felt stupid for not thinking to test this earlier, and updated the test configs to include queryResultCache, but I can't reproduce the failure ... still not sure why.

          All the new tests pass for me with the latest patch.

          I'm never getting past "page" 2 (rather than 3) - searchAfter=<the page 2 totem> returns the same totem, along with the same page 2 results.

          Maybe not unexpected, but when I commented out <queryResultWindowSize>20</queryResultWindowSize> and set <queryResultMaxDocsCached> to a number smaller than the rows param, I can get past page 2 - success getting as far as page 5, in fact.

          Seems like the queryResultCache isn't paying attention to all query params? (I don't know the code myself.) From solrconfig.xml - a literal reading is that only the q, sort, start and rows params are taken into consideration:

              <!-- Query Result Cache
                   
                   Caches results of searches - ordered lists of document ids
                   (DocList) based on a query, a sort, and the range of documents requested.  
                -->
          
          Steve Rowe added a comment -

          I'm never getting past "page" 2 (rather than 3) - searchAfter=<the page 2 totem> returns the same totem, along with the same page 2 results.

          I was wrong - I wasn't including page 1 (searchAfter=*) in the page count - searchAfter=<the page 3 totem> returns itself for me.

          Also, if it were the case that the queryResultCache were ignoring the searchAfter param, then pages 2 and 3 would return the same results as the first page, which is not the case...

          Hoss Man added a comment -

          All the new tests pass for me with the latest patch.

          Right, I may not have been clear before, but the problem that confuses me is: even with queryResultCaching enabled in the tests, they pass – and do not exhibit the same problems I'm seeing when manually using searchAfter with the example configs.

          Also, if it were the case that the queryResultCache were ignoring the searchAfter param, then pages 2 and 3 would return the same results as the first page, which is not the case...

          When searchAfter=* is used, the search itself is no different from a regular query, it can be cached, or re-used from an existing cached query – the only thing special that happens is that the DeepPagingComponent knows it needs to compute the "nextSearchAfter" totem. Once you request "page #2" (with a searchAfter other than "*"), a PostFilter is used, and you can see from the cache stats that it's not considered a hit on the initial query – so far so good, but as you mentioned, "page #3" is where we start to see the same data as page #2 returned because it's a "cache hit".

          It would not surprise me at all if there is a bug in my DeepPagingComponent that was causing caching to be used when it shouldn't be (even though I think I did everything right in the PostFilter) ... what surprises me is how easy it is to reproduce manually, but how hard it is to reproduce in the automated tests.

          Hoss Man added a comment -

          Ah, HA!

          The reason the tests weren't failing is that I was an idiot and hadn't configured the queryResultCache properly in the test configs.

          Now that I've made the tests fail, I can finally start trying to figure out why they fail.

          Steve Rowe added a comment -

          I was an idiot and hadn't configured the queryResultCache properly in the test configs.

          aha - no wrapping <query> tags... yeah when I add that, boom, failures
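
          For anyone following along: the cache declarations are only picked up inside the <query> element of solrconfig.xml, so a working test config needs something along these lines (sizes illustrative):

            <query>
              <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
              <queryResultWindowSize>20</queryResultWindowSize>
              <queryResultMaxDocsCached>200</queryResultMaxDocsCached>
            </query>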

          Hoss Man added a comment -

          The cause of the caching bug was trivially silly...

          I hadn't bothered implementing hashCode or equals on the PostFilter used in the straw man because:

          • I implemented it to always return "false" from "getCache()" so Solr should never try to cache them anyway
          • I figured the Object base impls (based on Object identity) would be fine in any cases where they might get called.
            ...forgetting that...
          • even though my PostFilters were never getting cached, they were still getting used as part of the QueryResultCache key for the main query
          • The Query class overrides Object's hashCode & equals to compare boosts, and my PostFilter class was inheriting that w/o augmenting it – so instances of my PostFilter were all being treated as equal

          Everything works as expected now with caching enabled, and all tests pass.
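
          For the record, a sketch of the shape of the fix (hypothetical class and field names, not the actual patch):

            import java.io.IOException;
            import java.util.Arrays;

            import org.apache.lucene.search.IndexSearcher;
            import org.apache.solr.search.DelegatingCollector;
            import org.apache.solr.search.ExtendedQueryBase;
            import org.apache.solr.search.PostFilter;

            public class SearchAfterFilter extends ExtendedQueryBase implements PostFilter {
              private final Object[] afterValues; // sort values deserialized from the totem

              public SearchAfterFilter(Object[] afterValues) {
                this.afterValues = afterValues;
              }

              @Override
              public boolean getCache() { return false; } // never cache the filter itself

              @Override
              public DelegatingCollector getFilterCollector(IndexSearcher searcher) {
                return new DelegatingCollector() {
                  @Override
                  public void collect(int doc) throws IOException {
                    // real impl: only delegate for docs sorting "after" afterValues
                    super.collect(doc);
                  }
                };
              }

              // Even though the filter is never cached itself, it is part of the main
              // query's queryResultCache key, so identity-based equality breaks paging:
              @Override
              public int hashCode() {
                return 31 * super.hashCode() + Arrays.hashCode(afterValues);
              }

              @Override
              public boolean equals(Object other) {
                return super.equals(other) // checks class & boost
                    && Arrays.equals(afterValues, ((SearchAfterFilter) other).afterValues);
              }
            }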

          Hoss Man added a comment -

          Ok, updated patch making the change in user semantics I mentioned wanting to try last week. The best way to explain it is with a walk-through of a simple example (note: if you try the current strawman code, the "numFound" and "start" values returned in the docList don't match what I've pasted in the examples below – these examples show what the final results should look like in the finished solution).

          Initial requests using searchAfter should always start with a totem value of "*":

          http://localhost:8983/solr/deep?q=*:*&rows=20&sort=id+desc&searchAfter=*
          {
            "responseHeader":{
              "status":0,
              "QTime":2},
            "response":{"numFound":32,"start":-1,"docs":[
                // ...20 docs here...
              ]
            },
            "nextSearchAfter":"AoEjTk9L"}
          

          The nextSearchAfter token returned by this request tells us what to use in the second request...

          http://localhost:8983/solr/deep?q=*:*&rows=20&sort=id+desc&searchAfter=AoEjTk9L
          {
            "responseHeader":{
              "status":0,
              "QTime":7},
            "response":{"numFound":32,"start":-1,"docs":[
                // ...12 docs here...
              ]
            },
            "nextSearchAfter":"AoEoMDU3OUIwMDI="}
          

          Since this result block contains fewer rows than were requested, the client could automatically stop, but the nextSearchAfter is still returned, and it's still safe to request a subsequent page (this is the fundamental difference from the previous patches, where nextSearchAfter was set to null anytime the code could tell there were no more results)...

          http://localhost:8983/solr/deep?q=*:*&wt=json&indent=true&rows=20&fl=id,price&sort=id+desc&searchAfter=AoEoMDU3OUIwMDI=
          {
            "responseHeader":{
              "status":0,
              "QTime":1},
            "response":{"numFound":32,"start":-1,"docs":[]
            },
            "nextSearchAfter":"AoEoMDU3OUIwMDI="}
          

          Note that in this case, with no docs included in the response, the nextSearchAfter totem is the same as the input.

          For some sorts this makes it possible for clients to "resume" a full walk of all documents matching a query – picking up where they left off if more documents are added to the index that match (for example: when doing an ascending sort on a numeric uniqueKey field that always increases as new docs are added, or sorting by a timestamp field (asc) indicating when documents are crawled, etc...)

          This also works as you would expect for searches that don't match any documents...

          http://localhost:8983/solr/deep?q=text:bogus&rows=20&sort=id+desc&searchAfter=*
          {
            "responseHeader":{
              "status":0,
              "QTime":21},
            "response":{"numFound":0,"start":-1,"docs":[]
            },
            "nextSearchAfter":"*"}
          
          Hoss Man added a comment -

          The one significant change I still want to make before abandoning this straw man and moving on to using PagingFieldCollector under the covers is to rethink the vocabulary.

          At the Lucene/IndexSearcher level, this functionality is leveraged using a "searchAfter" param which indicates the exact "FieldDoc" returned by a previous search. The name makes a lot of sense in this API given that the FieldDoc you specify is expected to come from a previous search, and you are specifying that you want to "search for documents after this document" in the context of the specified query/sort.

          For the Solr request API however, I feel like this terminology might confuse people. I'm concerned people might think they can use the uniqueKey of the last document they got on the previous page (instead of realizing they need to specify the special token they were returned as part of that page).

          My thinking is that from a user perspective, we should call this functionality a "Result Cursor" and rename the request param and response key appropriately, something along the lines of...

          http://localhost:8983/solr/deep?q=*:*&rows=20&sort=id+desc&cursor=AoEjTk9L
          {
            "responseHeader":{
              "status":0,
              "QTime":7},
            "response":{"numFound":32,"start":-1,"docs":[
                // ... docs here...
              ]
            },
            "cursorContinue":"AoEoMDU3OUIwMDI="}
          
          • searchAfter => cursor
          • nextSearchAfter => cursorContinue

          What do folks think?

          Steve Rowe added a comment - edited
          • searchAfter => cursor
          • nextSearchAfter => cursorContinue

          +1

          I'm concerned people might think they can use the uniqueKey of the last document they got on the previous page

          I tried making this mistake (using the trailing unique id ("NOK" in this example) as the searchAfter param value), and I got the following error message:

          {
            "responseHeader":{
              "status":400,
              "QTime":2},
            "error":{
              "msg":"Unable to parse search after totem: NOK",
              "code":400}}
          

          (edit: cursorContinue => cursor in the sentence below)

          I think that error message should include the param name (cursor) that couldn't be parsed.

          Also, maybe it would be useful to include a prefix that will (probably) never be used in unique ids, to visually identify the cursor as such: like always prepending '*'? So your example of the future would become:

          http://localhost:8983/solr/deep?q=*:*&rows=20&sort=id+desc&cursor=*AoEjTk9L
          {
            "responseHeader":{
              "status":0,
              "QTime":7},
            "response":{"numFound":32,"start":-1,"docs":[
                // ... docs here...
              ]
            },
            "cursorContinue":"*AoEoMDU3OUIwMDI="}
          

          The error message when someone gives an unparseable cursor could then include this piece of information: "cursors begin with an asterisk".

          Steve Rowe added a comment -

          Another idea about the cursor: the Base64-encoded text is used verbatim, including the trailing padding '=' characters - these could be stripped out for external use (since they're there just to make the string length divisible by four), and then added back before Base64-decoding. In a URL, non-metacharacter '='s look weird, since '=' is already used to separate param names and values.

          Hoss Man added a comment -

          I think that error message should include the param name (cursor) that couldn't be parsed.

          Agreed ... the current error text is basically just a placeholder, ideally it should be something like...

          Unable to parse cursor param: value must either be '*' or the cursorContinue value from a previous search: NOK
          

          Also, maybe it would be useful to include a prefix that will (probably) never be used in unique ids, to visually identify the cursor as such: like always prepending '*'?

          Hmmm, I'm not sure if that's really worth the added bytes & parsing.

          If folks really felt like the param name should be "searchAfter" then I could certainly see the value in having some clear prefix, since the param name might lead folks into assuming they know what the input should be; but with "cursor" I don't think we need to worry as much about people assuming they know what to put there, and with a clear error message instructing people how to get a valid cursor (from cursorContinue), that seems good enough. (right?)

          the Base64-encoded text is used verbatim, including the trailing padding '=' characters - these could be stripped out for external use (since they're there just to make the string length divisible by four), and then added back before Base64-decoding. In a URL, non-metacharacter '='s look weird, since '=' is already used to separate param names and values.

          Interesting idea ... again: I'm not sure how I feel about the added overhead to the parsing just to shorten the totem – especially since clients will always need to safely URL-encode anyway, since Base64 strings can also include "+".

          However....

          In the current patch, I used the Base64 utility class Solr already had (used by BinaryField and a few other places). But your suggestion reminds me that commons-codec's Base64 class (jar already used by Solr) supports a "url safe" variant of base64 (which looks like it's defined in RFC 4648?)...

          https://commons.apache.org/proper/commons-codec/javadocs/api-release/org/apache/commons/codec/binary/Base64.html#encodeBase64URLSafeString(byte[])

          ...something to consider.
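
          For reference, a tiny sketch of the commons-codec calls in question (the input bytes are arbitrary):

            import org.apache.commons.codec.binary.Base64;

            public class Base64Variants {
              public static void main(String[] args) {
                byte[] totem = {0x02, (byte) 0x81, 0x23, 0x4e, 0x4f, 0x4b};
                String standard = Base64.encodeBase64String(totem);       // may contain '+', '/', '=' padding
                String urlSafe = Base64.encodeBase64URLSafeString(totem); // '-' and '_', no '=' padding
                byte[] decoded = Base64.decodeBase64(urlSafe);            // decodeBase64 accepts both variants
                System.out.println(standard + " / " + urlSafe + " / " + decoded.length);
              }
            }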


          One other comment I got from a coworker offline was why I liked cursorContinue instead of nextCursor or cursorNext. My thinking was that since 'cursor' (as a concept) is a noun, "next cursor" might suggest that it was a (different) cursor than the one currently in use. I don't want people to think these strings are names of cursors that they re-use until they are done. I want to make it clear that to continue fetching results from this cursor, you have to specify the new value.

          Would "cursorAdvance" convey that better then cursorContinue ?

          Steve Rowe added a comment -

          In the current patch, I used the Base64 utility class Solr already had (used by BinaryField and a few other places). But your suggestion reminds me that commons-codec's Base64 class (jar already used by Solr) supports a "url safe" variant of base64 (which looks like it's defined in RFC 4648?)...
          https://commons.apache.org/proper/commons-codec/javadocs/api-release/org/apache/commons/codec/binary/Base64.html#encodeBase64URLSafeString(byte[])

          I forgot to mention that the "url safe" variant was discussed on the issue where the Base64 utility class was introduced: SOLR-1116, and if I understand correctly, people thought the "url safe" variant wasn't necessary, since all modern browsers accept URLs with embedded standard Base64.

          Steve Rowe added a comment -

          One other comment I got from a coworker offline was why I liked cursorContinue instead of nextCursor or cursorNext. My thinking was that since 'cursor' (as a concept) is a noun, "next cursor" might suggest that it was a (different) cursor than the one currently in use. I don't want people to think these strings are names of cursors that they re-use until they are done. I want to make it clear that to continue fetching results from this cursor, you have to specify the new value.

          Would "cursorAdvance" convey that better then cursorContinue ?

          You want to convey a (resumption) position in a result sequence. A cursor is not itself a position; it's a movable pointer to a position. The value of the cursor param is not the cursor itself; rather, it's the position from which to resume iterating over a result sequence.

          The problem as I see it is that the cursor itself has to be anonymous, since the implementation stores no server-side state; passing in an opaque label thus is not possible. So you're asking people to understand that they're moving a thing-that-can't-have-a-fixed-label, and that's cognitively dissonant.

          Maybe "continuation"/"nextContinuation" or "continue"/"nextContinue" or "pos"/"nextPos"? (I don't love any of them.)

          Hoss Man added a comment -

          Maybe "continuation"/"nextContinuation" or "continue"/"nextContinue" or "pos"/"nextPos"? (I don't love any of them.)

          Of all the ideas so far, I dislike cursor & cursorAdvance the least.

          But I think you're on to something here...

          You want to convey a (resumption) position in a result sequence. A cursor is not itself a position; it's a movable pointer to a position. The value of the cursor param is not the cursor itself; rather, it's the position from which to resume iterating over a result sequence.

          Perhaps the key here is to pick a param name that makes it clear it's not a "cursor name" but a "position along the progression of a cursor" ... things like cursorPosition, cursorPoint, and cursorValue seem like they could all easily suffer the same confusion as searchAfter ... but perhaps we could find a suitable "attribute" qualifier on cursor that is more clearly and obviously disjoint from something the user might think they can define on their own?

          • cursorMark / nextCursorMark
          • cursorPhase / nextCursorPhase
          • cursorToken / nextCursorToken
          • cursorPosToken / nextCursorPosToken

          Does any of this sound resoundingly better than the existing ideas to anyone? I think my new favorite is cursorMark / nextCursorMark ... they seem suitably abstract that people won't presume they know what they mean.

          David Smiley added a comment -

          Nice work thus far Hoss!

          FWIW I dislike using the word "token" as it might be mistakenly associated with text analysis. I suggest cursorKey.

          Hoss Man added a comment -

          FWIW I dislike using the word "token" as it might be mistakenly associated with text analysis.

          Great point.

          I suggest cursorKey.

          I'm concerned that people might read that with the same implied assumptions that we've been worried about with "cursor" – that it's a "cursor name" or "cursor identifier" that they can define on their own and/or should reuse in subsequent requests.

          Steve Rowe added a comment -

          I think my new favorite is cursorMark / nextCursorMark

          +1

          I suggest cursorKey.

          I like "mark" better than "key", since in addition to conveying that it's a sign/symbol for something else (as "key" does), it also has a positional meaning. And it does both those things without inviting people to use unique ids in their place.

          Bill Bell added a comment -

          How does this work across slaves? Won't we need to set a sticky session - or are you hashing the key for the slaves?

          Hoss Man added a comment -

          How does this work across slaves? Won't we need to set a sticky session - or are you hashing the key for the slaves?

          Nope. All of the information needed (the sort values to "search after") is encoded in the totem string. So in multi-node setups (either simple replication or multi-replica SolrCloud) no request affinity is needed.
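
          To make the statelessness concrete, here's roughly what a client-side walk looks like with the cursorMark API as it was ultimately committed (a minimal SolrJ sketch, assuming a 4.x HttpSolrServer, a hypothetical collection1 core, and a hypothetical price sort field — not code from the patch). Because each mark is self-contained, every iteration could hit a different replica behind a load balancer:

          import org.apache.solr.client.solrj.SolrQuery;
          import org.apache.solr.client.solrj.SolrServerException;
          import org.apache.solr.client.solrj.impl.HttpSolrServer;
          import org.apache.solr.client.solrj.response.QueryResponse;
          import org.apache.solr.common.params.CursorMarkParams;

          public class CursorWalk {
            public static void main(String[] args) throws SolrServerException {
              // no session affinity needed: the mark carries all cursor state
              HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");

              SolrQuery q = new SolrQuery("*:*");
              q.setRows(100);
              q.addSort("price", SolrQuery.ORDER.desc); // hypothetical sort field
              q.addSort("id", SolrQuery.ORDER.asc);     // uniqueKey tiebreaker is required

              String cursorMark = CursorMarkParams.CURSOR_MARK_START; // "*"
              while (true) {
                q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
                QueryResponse rsp = server.query(q);
                // ... process rsp.getResults() ...
                String next = rsp.getNextCursorMark();
                if (cursorMark.equals(next)) {
                  break; // mark stopped changing: the walk is complete
                }
                cursorMark = next;
              }
            }
          }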


          FYI: I've posted a blog with some general discussion about the goals of this feature and some performance numbers based on the straw man implementation so far...

          http://searchhub.org/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/


          I'm going to move forward with one final straw man patch renaming everything to use the cursorMark / nextCursorMark convention discussed above, and then proceed with trying to throw out the straw man and integrate directly into SolrIndexSearcher using the PagingFieldCollector.

          Hoss Man added a comment -

          Final strawman patch i plan to work on...

          • SearchAfterTotem -> CursorMark
          • DeepPagingTest -> CursorPagingTest
          • TestDistribDeepPaging -> DistribCursorPagingTest
          • searchAfter -> cursorMark
          • nextSearchAfter -> nextCursorMark
          • improved javadocs
          • CursorMark.toString -> CursorMark.getSerializedTotem

          ...next up, throwing out DeepPagingComponent and hooking this logic directly into QueryComponent and SolrIndexSearcher (via PagingFieldCollector) using the same tests.

          Hoss Man added a comment -

          Baby steps towards a real solution. This still has a DeepPagingComponent for doing setup & managing the sort values, but...

          • Gutted the vestigial remnants of SOLR-1726
          • replaced CursorMark.getPagingFilter with CursorMark.getSearchAfterFieldDoc
          • made ResponseBuilder and QueryCommand keep track of a CursorMark
          • refactored SolrIndexSearcher to add a buildTopDocsCollector helper
          • made buildTopDocsCollector aware of CursorMark.getSearchAfterFieldDoc
          • added a test to ensure that the non-cacheability of the cursor query wouldn't affect the independent caching of the filter queries.
          Steve Rowe added a comment - edited

          (edit: fixed misspelled PaginationCollector -> PagingFieldCollector)

          One limitation i uncovered, however, is that MissingStringLastComparatorSource (which ironically is used by both sortMissingLast and sortMissingFirst) throws UnsupportedOperationException for a large number of its methods, which makes it impossible to use to filter out docs that are "before" the searchAfter totem. It should be fixable, but i've punted on this for now.

          I investigated this, and added an implementation for the only method that was needed (TermOrdValComparator_SML.compareDocToValues(int doc, Comparable docValue)). The attached patch adds this implementation and uncomments the elements in schema-sorts.xml that Hoss commented out to get tests to pass, and assumes Hoss's most recent strawman patch has already been applied. (I also tried converting the class to extend FieldComparator<BytesRef> instead of the current FieldComparator<Comparable>, but that causes TestDistributedGrouping to fail with a message about Integer not being castable to BytesRef - that's why I introduced an instanceof BytesRef check to convert the incoming value to BytesRef if necessary, via toString().)

          NOTE: this patch is not required by Hoss's latest (non-strawman) patch - when I uncomment the commented-out elements in schema-sorts.xml that triggered test failures in the strawman, all tests still pass, without this patch. So I'm attaching the patch to this issue just for anybody who wants to try out the strawman implementation.

          I believe my patch is not needed in the latest non-strawman implementation because TermOrdValComparator_SML delegates to inner class AnyOrdComparator for per-segment sorting, and AnyOrdComparator has a functional implementation of compareDocToValues(); unlike the strawman implementation, the latest version invokes sorts on a per-segment basis, via PagingFieldCollector.

          Hoss Man added a comment -

          Small bits of progress (went down a bad fork in the road and got sidetracked with a bad idea)...

          • re-enable missingLast test fields (thanks sarowe!)
          • refactored (single node) next CursorMark sort value extraction into SolrIndexSearcher
          • improved tests based on things that occurred to me as the implementation changed
          Hoss Man added a comment -

          DeepPagingComponent is dead, long live CursorMark!

          This patch removes DeepPagingComponent completely, as all of the necessary functionality is now integrated nicely into various places in QueryComponent, ResponseBuilder, and SolrIndexSearcher.

          There are still plenty of nocommits, but those are mostly around test improvements and/or code paths that i want to give some more review before committing.

          I'm getting ready to go on vacation for a week+ so now is a really good time for people to take the patch for a spin and try it out with their use cases (nudge, nudge) w/o needing to worry that i'll upload a new one as soon as they download it.

          Hoss Man added a comment -

          FYI: I've updated my previous blog with new graphs showing the improvements the current patch has over the (lazy) strawman...

          http://searchhub.org/2013/12/12/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/#update_2013_12_18

          Steve Rowe added a comment -

          Hoss Man, I accidentally discovered that a blank cursorMark request param is ignored - is this intentional? I ask because although CursorMark.parseSerializedTotem("") throws an exception about the bad format of an empty totem, QueryComponent.prepare() ignores a blank cursorMark param:

          // note: the "" default makes a present-but-blank cursorMark param
          // indistinguishable from an absent one, so it is silently ignored
          final String cursorStr = rb.req.getParams().get(CursorMark.CURSOR_MARK_PARAM,"");
          if (! StringUtils.isBlank(cursorStr) ) {
            final CursorMark cursorMark = new CursorMark(rb.req.getSchema(),
                                                         rb.getSortSpec());
            cursorMark.parseSerializedTotem(cursorStr);
            rb.setCursorMark(cursorMark);
          }
          

          Shouldn't this instead throw an exception when the param is present but has a blank value? Something like the following would allow parseSerializedTotem() to throw its exception for a blank cursorMark:

          // with no default value, a present-but-blank param reaches
          // parseSerializedTotem(), which rejects it
          final String cursorStr = rb.req.getParams().get(CursorMark.CURSOR_MARK_PARAM);
          if (null != cursorStr) {
            final CursorMark cursorMark = new CursorMark(rb.req.getSchema(),
                                                         rb.getSortSpec());
            cursorMark.parseSerializedTotem(cursorStr);
            rb.setCursorMark(cursorMark);
          }
          
          Steve Rowe added a comment -

          Hoss Man, you have nocommits wondering what the value of start should be in returned results when using cursor functionality, e.g. from DistribCursorPagingTest.doSimpleTest():

          // assertStartAt(-1, results); // nocommit: what should start be? -1?
          

          What would be cool is if start could be the number of docs in previous pages. PagingFieldCollector.collect() throws this info away now, though.

          Hoss Man added a comment -

          I accidentally discovered that a blank cursorMark request param is ignored - is this intentional? I ask because although CursorMark.parseSerializedTotem("") throws an exception about the bad format of an empty totem, QueryComponent.prepare() ignores a blank cursorMark param:

          Yeah ... that was intentional. My thinking was that from CursorMark's perspective, attempting to parse a null or blank string was not valid – but from QueryComponent's perspective, a null or blank string meant "do not construct a CursorMark object at all"

          The basic motivation was that it didn't seem like it should be an error to have something like /select?q=foo&cursorMark=&... ... but i'm not adamant that it should work that way.

          In fact, thinking about it more, and looking at how some other params (like start, rows, facet, etc...) deal with blank strings, i agree with you – it should be an error.


          What would be cool is if start could be the number of docs in previous pages.

          PagingFieldCollector.collect() throws this info away now though.

          Yeah ... i was initially thinking that "-1" or some other marker value would be handy to help make it clear that they shouldn't infer any meaning from it when using a cursor – but then i realized that was probably more dangerous than just using "0", because at least then there was less risk of confusing off-by-one errors if they blindly use it in some context (but i forgot to remove those nocommit questions)

          If you can see an easy way to get the "real" number from PagingFieldCollector, that might be handy too ... i'm not sure.

          I'm perfectly happy with a fixed value of 0 at the moment ... it could always be revisited in the future.

          Steve Rowe added a comment -

          Patch with a few changes added onto Hoss's most recent patch.

          I accidentally discovered that a blank cursorMark request param is ignored - is this intentional? I ask because although CursorMark.parseSerializedTotem("") throws an exception about the bad format of an empty totem, QueryComponent.prepare() ignores a blank cursorMark param:

          Yeah ... that was intentional. My thinking was that from CursorMark's perspective, attempting to parse a null or blank string was not valid – but from QueryComponent's perspective, a null or blank string meant "do not construct a CursorMark object at all"
          The basic motivation was that it didn't seem like it should be an error to have something like /select?q=foo&cursorMark=&... ... but i'm not adamant that it should work that way.
          In fact, thinking about it more, and looking at how some other params (like start, rows, facet, etc...) deal with blank strings, i agree with you – it should be an error.

          I changed the behavior to make blank cursorMark params raise an error, and added a couple tests for it to CursorMarkTest.

          I fixed a few misspellings in comments, and removed a few unused imports.

          I added a new test TestCursorMarkWithoutUniqueKey.

          I substituted CURSOR_MARK_PARAM for "cursorMark", and CURSOR_MARK_START for "*" in CursorPagingTest.

          Hoss Man added a comment -

          More patch improvements..

          • cleaned up the test nocommits related to the startAt value (for now it's just 0, sarowe opened LUCENE-5380 with his other idea, we can revisit and change the cursor code/test later if/when that goes through)
          • added some comments regarding potential improvements via SOLR-5595
          • cleaned up some nocommits that were acting as reminders of things i wanted to review later with fresh eyes
          • converted some nocommit asserts to clean user exceptions
          • user exception if attempting to combine cursor with timeAllowed (sarowe pointed this out to me in IRC: the nextCursorMark would make no sense)
          • added more tests for bad user input conditions
          Hoss Man added a comment -

          Improvements in this patch...

          • Moved cursor param constants to .common.params.CursorMarkParams
          • Added QueryResponse.getNextCursorMark
          • switch cloud tests to use QueryResponse.getNextCursorMark
          • fixed a small ordering inconsistency between the single core response and the cloud response
          • fix TestCursorMarkWithoutUniqueKey's lifecycle so it plays nicely with multiple runs
          • add a custom sorting field type into the tests, leveraging sarowe's work in SOLR-5354
          Hoss Man added a comment -

          New in this patch...

          • user error if trying to use cursor with grouping
          • I actually thought i put this in a while ago, because i couldn't wrap my head around what it should mean, in a perfect world, to use grouping with a cursor (let alone how to implement it), but reviewing the tests i realized it wasn't there yet.
          • fixed the last remaining nocommit: sortDocSet
            • working through the code, i realized we could actually leverage the cached docSets in the useFilterForSortedQuery situation – so I refactored the method a bit to give it more context (so it could call the buildTopDocsCollector helper method i added), removed the restriction on useFilterCache for sorted doc set when a cursor is used, and randomized useFilterForSortedQuery in the cursor test configs
          • added simple test combining faceting w/cursor. this was something i was pretty certain would work fine, but reviewing the clover coverage reports when running just the cursor tests, i realized that the getDocListAndSet paths in SolrIndexSearcher weren't being hit, so we needed a test to prove it.

          There are probably still a lot more permutations of things that could be tested, but i'm feeling really good about the state of this patch – i think it's ready to commit (to trunk) and let jenkins churn away at it.

          i'll plan on pushing to trunk on Monday unless anyone has concerns.

          ASF subversion and git services added a comment -

          Commit 1556036 from hossman@apache.org in branch 'dev/trunk'
          [ https://svn.apache.org/r1556036 ]

          SOLR-5463: new 'cursorMark' request param for deep paging of sorted result sets

          Hoss Man added a comment -

          I just committed the latest patch to trunk...
          Committed revision 1556036.

          Since this is a pretty big change, I'm going to let it soak for a few days and get hammered by jenkins a bit before attempting to backport.

          Hoss Man added a comment -

          I haven't seen any negative feedback or suspicious jenkins failures, so unless someone sees a problem i'll start backporting tomorrow.

          Joel Bernstein added a comment -

          This is a great feature. I think this should work automatically with the CollapsingQParserPlugin so there's some grouping support. I'll do some testing on this to confirm.

          Hoss Man added a comment -

          I think this should work automatically with the CollapsingQParserPlugin so there's some grouping support. I'll do some testing on this to confirm.

          Cool – thanks.

          I think what we have on trunk now is pretty stable, so I'll go ahead and backport to 4x, and if any tweaks are needed to play nice with CollapsingQParser (or at a minimum: tests to show how to use them in combination and what the results are) we should probably track those in a new issue...

          Cool with you, Joel Bernstein?

          Joel Bernstein added a comment -

          Sounds like a plan. I'll check in after testing.

          ASF subversion and git services added a comment -

          Commit 1557192 from hossman@apache.org in branch 'dev/trunk'
          [ https://svn.apache.org/r1557192 ]

          SOLR-5463: move CHANGES to 4.7 for backporting

          ASF subversion and git services added a comment -

          Commit 1557196 from hossman@apache.org in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1557196 ]

          SOLR-5463: new 'cursorMark' request param for deep paging of sorted result sets (merge r1556036 and r1557192)

          Hoss Man added a comment -

          Backported to 4x w/o much complication (a small autoboxing situation caused the java6 compile to complain w/o an explicit cast)

          I'll get started on some new refguide content.....

          ASF subversion and git services added a comment -

          Commit 1557370 from Robert Muir in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1557370 ]

          SOLR-5463: disable codecs that don't support docvalues in these tests

          Show
          ASF subversion and git services added a comment - Commit 1557370 from Robert Muir in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1557370 ] SOLR-5463 : disable codecs that don't support docvalues in these tests
          Hide
          Steve Rowe added a comment -

          Patch adding a randomized faceting test to CursorPagingTest to validate that aggregating field value counts via a deep paging full walk arrives at the same results as faceting. Also checks that facet results are the same with each page.

          I'm running this test in a loop 100 times - once that finishes with no failures (none yet at ~75 iterations), I'll commit.

          ASF subversion and git services added a comment -

          Commit 1557800 from Steve Rowe in branch 'dev/trunk'
          [ https://svn.apache.org/r1557800 ]

          SOLR-5463: added randomized faceting test to CursorPagingTest

          ASF subversion and git services added a comment -

          Commit 1557821 from Steve Rowe in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1557821 ]

          SOLR-5463: added randomized faceting test to CursorPagingTest (merged trunk r1557800)

          ASF subversion and git services added a comment -

          Commit 1558939 from hossman@apache.org in branch 'dev/trunk'
          [ https://svn.apache.org/r1558939 ]

          SOLR-5463: more details in case of spooky 'walk already seen' errors

          ASF subversion and git services added a comment -

          Commit 1558945 from hossman@apache.org in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1558945 ]

          SOLR-5463: more details in case of spooky 'walk already seen' errors (merge r1558939)

          yuanyun.cn added a comment - edited

          This feature is great, and I am trying to use it.
          But I found one issue. For the first query I set start=0&rows=10&cursorMark=*, and it returns the first 10 docs; then I update cursorMark with the nextCursorMark value returned in the last response.

          Then the request failed with the error: Cursor functionality requires start=0
          This is caused by the CursorMark constructor, which checks that the offset must be 0 when cursorMark is set:

          if (0 != sortSpec.getOffset()) {
            throw new SolrException(ErrorCode.BAD_REQUEST, "Cursor functionality requires start=0");
          }

          Did I miss anything?

          Thanks..

          Steve Rowe added a comment -

          For the first query I set start=0&rows=10&cursorMark=*, and it returns the first 10 docs; then I update cursorMark with the nextCursorMark value returned in the last response.

          On the second request, what value did you give the start param?

          yuanyun.cn added a comment -

          Thanks, Steve.

          I made a mistake in the second query: I set start to 10, as I usually did before.

          I changed start=0 in all subsequent queries, updated cursorMark accordingly, and it works well.

          But we should always use start=0 when using the cursorMark feature.
          This feature is great. Thanks again.
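
          In other words, only the mark advances between requests; start stays 0 on every page. A parameter sketch (the mark value is whatever nextCursorMark the previous response returned; the sort shown is just an example):

          # first page
          q=*:*&sort=score+desc,id+asc&rows=10&start=0&cursorMark=*
          # every subsequent page: same query and sort, start=0, new mark
          q=*:*&sort=score+desc,id+asc&rows=10&start=0&cursorMark=<nextCursorMark from previous response>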

          Yonik Seeley added a comment -

          Nice work guys!

          Some further thoughts:

          We should consider allowing non-zero "start" parameters with cursorMark. The primary use case is when someone is skipping pages (perhaps trying to get to a different section of results, or trying to get much later in a time based search, or just viewing the long tail).

          For example, a user at page 50 clicks on page 60. It would be nice to support this by just specifying start=90 (i.e. 600-510) assuming 10 docs per page, along with the normal cursorMark (that would have started at page 51 / doc 510). Currently, the prohibition on non-zero start parameters would mean that we would either have to abandon cursoring altogether, or we would have to actually retrieve 100 documents to continue it.

          The other thought is around how to do reverse paging efficiently. One way is to save previous cursorMarks on the client side and just return to them if one wants to page backwards. The other potential way is to reverse the sort parameters and use the current cursorMark. The only pitfall to this approach is that you don't get the current document you are on (because we "searchAfter").

          Hoss Man added a comment -

          Some further thoughts: ...

          Yonik: no disagreement from me, but since what we've got so far has already been committed and backported to 4x, i think it would make sense to track your enhancement ideas in new issues (unless you think you can help bang these out before 4.7).

          Alexander S. added a comment -

          Inability to use this without sorting by a unique key (e.g. id) makes this feature useless. The same could be achieved previously by sorting on id and searching for docs where the id is >/< the last one received. See how cursors work in MongoDB; that's the right direction.

          Alexander S. added a comment -

          http://docs.mongodb.org/manual/core/cursors/
          Alexander S. added a comment -

          Sorry for spamming, but I can't edit my previous message. I just found that in Mongo, cursors also aren't isolated and can return duplicates; I thought they were. But sorting docs by id is not acceptable in 99% of use cases, especially in Solr, where results are usually expected to be sorted by relevance.

          Yonik Seeley added a comment -

          But sorting docs by id is not acceptable in 99% of use cases, especially in Solr, where results are usually expected to be sorted by relevance.

          It's only a tiebreak by "id" that is needed. So "sort=score desc, id asc" is fine.

          Alexander S. added a comment -

          Oh, that's awesome, thanks for the tip.

          David Smiley added a comment -

          I think Solr could be more user-friendly here by auto-adding the ", id asc" if it's not there.

          Hoss Man added a comment -

          I think Solr could be more user-friendly here by auto-adding the ", id asc" if it's not there.

          The reason the code currently throws an error is that i figured it was better to force the user to choose which tie breaker they wanted (asc vs desc) than to just magically pick one arbitrarily.

          If folks think a magic default is better, i've got no serious objections – just open a new issue.
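
          In the meantime, clients that want the convenience can append the tiebreaker themselves. A trivial sketch (assuming the uniqueKey field is "id"; the helper name and the naive check are mine, not anything in Solr):

          // hypothetical client-side helper: append the uniqueKey tiebreaker
          // to a sort spec if it isn't already present
          static String withIdTiebreaker(String sort) {
            if (sort == null || sort.trim().isEmpty()) {
              return "id asc";
            }
            // naive check: assumes "id" only ever appears as a sort field name
            return sort.matches(".*\\bid\\b.*") ? sort : sort + ",id asc";
          }

          The client would then send sort=withIdTiebreaker(userSort) along with the cursorMark param, and still get an explicit asc tiebreak by default.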

          Alexander S. added a comment -

          If, as David mentioned, Solr adds it only when it is not there, that should preserve the ability for users to manually specify another key and order when required (a rare case, it seems).

          Alexander S. added a comment -

          I have another idea about the cursors implementation. It's just an idea; I am not sure if it's possible to do.

          Is it possible to use cursors together with the "start" and "rows" parameters? That would allow pagination and drawing links for prev, next, 1, 2, 3, n+1 pages, as we can do now. So instead of using cursorMark we'd use cursorName, which could be static. The request start:0, rows:10, cursorName:* would return the first page of results and a static cursor name, which could then be used for all other pages (i.e. start:10, rows:10, cursorName:#{received_cursor_name}).

          Does that make sense?


            People

            • Assignee: Hoss Man
            • Reporter: Hoss Man
            • Votes: 12
            • Watchers: 23