SOLR-2218: Performance of start= and rows= parameters is exponentially slow with large data sets

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 1.4.1
    • Fix Version/s: None
    • Component/s: Build
    • Labels: None

      Description

      With large data sets (> 10M rows):

      Setting start=<large number> and rows=<large number> is slow, and gets slower the farther you get from start=0 with a complex query. Random ordering also makes this slower.

      I would like to somehow make looping through large data sets faster. It would be nice if we could pass a pointer to the result set to loop over, or support very large rows=<number> values.

      Something like:
      rows=1000
      start=0
      spointer=string_my_query_1

      Then, within some interval (say, 5 minutes), I could continue the loop with something like:
      rows=1000
      start=1000
      spointer=string_my_query_1

      What do you think? Since the data set is so large, the cache is not helping.
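      For concreteness, a minimal sketch of the paging loop being described (hypothetical Python using the requests library; hostname, query, and field names are illustrative):

      import requests  # assumed HTTP client; any client would do

      SOLR = "http://hostname/solr/select"

      # Naive deep paging: each request makes Solr re-collect and re-sort
      # the top (start + rows) documents, so latency grows as start grows.
      start, rows = 0, 1000
      while True:
          resp = requests.get(SOLR, params={
              "q": "*:*", "fl": "id",
              "start": start, "rows": rows, "wt": "json",
          }).json()
          docs = resp["response"]["docs"]
          if not docs:
              break
          # ... process the chunk here ...
          start += rows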

          Activity

          Grant Ingersoll added a comment -

          Dup of SOLR-1726

          jess canabou added a comment -

          Hi all,

          I'm a bit confused by this thread, but I think I have the same or almost the same issue. I'm searching an index with over 7,000,000 entries. I'm using the start and rows parameters (querying 30000 records at a time), and I notice the query times getting increasingly large the further into the result set I get. Unlike Bill, I do not care about scores or relevancy, and I am having difficulty understanding whether the docid approach is a suitable solution to my problem. Is there something I can simply tack onto the end of my query to help speed up these query times? From what I understand, it should not be necessary to sort all the rows before the chunk of data I'm querying on.

          My query looks like this:
          http://hostname/solr/select/?q=blablabla&version=2.2&start=4000000&rows=30000&indent=on&fl=<bunch of fields>

          Any help would be greatly appreciated

          Hoss Man added a comment -

          "Unfortunately, I need the results by highest score. Does fq support score?"

          As I mentioned:

          "if you are sorting on score this becomes trickier, but should be possible using the 'frange' parser with the 'query' function"

          I think something like:

          LAST_SCORE=5.6
          ...?q=...&fq={!frange u=5.6}query($q)&sort=score+desc

          ...should work (but you have the issue of docs with identical scores to worry about – something that's not a problem with unique ids).

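          A sketch of how a client might drive that score cursor (hypothetical Python; the incu=false exclusive upper bound is an assumption about the frange parser, and ties on the boundary score remain the caveat noted above):

          import requests  # assumed HTTP client

          SOLR = "http://hostname/solr/select"

          params = {"q": "blablabla", "fl": "id,score", "rows": 1000,
                    "sort": "score desc", "wt": "json"}
          last_score = None
          while True:
              if last_score is not None:
                  # keep only docs scoring strictly below the last score seen;
                  # docs tied exactly at that score would be skipped
                  params["fq"] = "{!frange u=%s incu=false}query($q)" % last_score
              docs = requests.get(SOLR, params=params).json()["response"]["docs"]
              if not docs:
                  break
              last_score = docs[-1]["score"]
              # ... process the chunk here ...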
          Bill Bell added a comment - edited

          Hoss,

          So what you are saying is that instead of:

          http://hostname/solr/select?fl=id&start=20000&rows=1000&q=*:*&sort=id asc

          I should use:

          LAST_ID=20000
          http://hostname/solr/select?fl=id&rows=1000&q=*:*&sort=id asc&fq=id:[<LAST_ID> TO *]

          This should definitely be faster. Unfortunately, I need the results by highest score. Does fq support score?

          SCORE=5.6
          http://hostname/solr/select?fl=id,score&rows=1000&q=*:*&sort=score desc&fq=score:[0 TO <SCORE>]

          Thoughts?

          I get an error when using fq=score:...

          HTTP ERROR 400
          Problem accessing /solr/provs/select. Reason:

          undefined field score

          Hoss Man added a comment -

          The performance gets slower as start increases because, in order to give you rows N...M sorted by score, Solr must collect the top M documents in sorted order. Lance's point is that if you use "sort=_docid_+asc", this collection of top-ranking documents in sorted order doesn't have to happen.

          If you have to use sorting, keep in mind that the decrease in performance as the "start" param increases without bound is primarily driven by the number of documents that have to be collected/compared on the sort field – something that wouldn't change if you had a named cursor (you would just be paying that cost up front instead of per request).

          You should be able to get equivalent functionality by reducing the number of collected documents – instead of increasing the start param, add a filter on the sort field indicating that you only want documents with a field value higher (or lower, if using "desc" sort) than the last document encountered so far. (If you are sorting on score this becomes trickier, but should be possible using the "frange" parser with the "query" function.)

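          In code, that filter-on-the-sort-field loop might look like this for an ascending sort on a unique id field (hypothetical Python; host and field names are illustrative):

          import requests  # assumed HTTP client

          SOLR = "http://hostname/solr/select"

          params = {"q": "*:*", "fl": "id", "rows": 1000,
                    "sort": "id asc", "wt": "json"}
          last_id = None
          while True:
              if last_id is not None:
                  # inclusive lower bound re-returns the boundary doc; dropped below
                  params["fq"] = "id:[%s TO *]" % last_id
              docs = requests.get(SOLR, params=params).json()["response"]["docs"]
              if last_id is not None and docs and docs[0]["id"] == last_id:
                  docs = docs[1:]  # skip the doc we already processed
              if not docs:
                  break
              last_id = docs[-1]["id"]
              # ... process the chunk here ...

          Note that start stays at 0 on every request, so each query only collects rows documents regardless of how deep the loop has gone.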
          Bill Bell added a comment -

          Lance,

          I know how to do that. That is not the issue. Let me explain again.

          This is a performance issue.

          When you loop through results "deeply", the queries get SLOWER and SLOWER:

          1. http://hostname/solr/select?fl=id&start=0&rows=1000&q=*:*
          <int name="QTime">2</int>

          2. http://hostname/solr/select?fl=id&start=10000&rows=1000&q=*:*
          <int name="QTime">8</int>

          3. http://hostname/solr/select?fl=id&start=20000&rows=1000&q=*:*
          <int name="QTime">38</int>

          It keeps getting slower!!

          We need it to be consistently fast at QTIME=2.

          Any solutions?

          Lance Norskog added a comment -

          The search returns many things, including a Solr issue with this title: "Enable sort by docid".

          Bill Bell added a comment -

          Lance,

          Can you point me directly to the document on Lucid's website? That search returns a Luke handler, which is not what I am asking about.

          1. I have a query that returns thousands of results.
          2. I want to return fl=id, start=1000, rows=1000, and as I move start farther from 0, the results slow down substantially.
          3. I need the results to come back quickly, even when start=10000, when I am looping across all the results.

          Peter Karich added a comment -

          Lance, would you mind explaining this in a bit more detail?

          The idea is to grab all (or a lot of) documents from Solr even if the data set is very large, if I haven't misunderstood what Bill was requesting. This would be very useful IMHO.

          Lance Norskog added a comment -

          There is a workaround for this called _docid_.

          http://www.lucidimagination.com/search/?q=_docid_#/p:solr
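          For scans where raw index order is acceptable (no scoring, no field sort), a request might look like this (hypothetical Python; hostname is illustrative, and _docid_ is the magic sort value the linked search turns up):

          import requests  # assumed HTTP client

          # Fetch a chunk in internal index order via the _docid_ sort value,
          # avoiding the score-sorted priority-queue collection.
          resp = requests.get("http://hostname/solr/select", params={
              "q": "*:*", "fl": "id", "start": 0, "rows": 30000,
              "sort": "_docid_ asc", "wt": "json",
          }).json()
          docs = resp["response"]["docs"]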


            People

            • Assignee: Unassigned
            • Reporter: Bill Bell
            • Votes: 2
            • Watchers: 3
