Details

    • Type: New Feature New Feature
    • Status: Resolved
    • Priority: Minor Minor
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: core/search
    • Labels:
      None

      Description

      Per http://search.lucidimagination.com/search/document/350c54fc90d257ed/lots_of_results#fbb84bd297d15dd5, it would be nice to offer some other Collectors that are better at handling really large number of results. This could be implemented in a variety of ways via Collectors. For instance, we could have a raw collector that does no sorting and just returns the ScoreDocs, or we could do as Mike suggests and have Collectors that have heuristics about memory tradeoffs and only heapify when appropriate.

      1. LUCENE-2127.patch
        43 kB
        Grant Ingersoll
      2. LUCENE-2127.patch
        49 kB
        Grant Ingersoll

        Issue Links

          Activity

          Hide
          Uwe Schindler added a comment -

          Grant: In your latest patch, the collector seems to ignore docBase (or the next IndexReader), so the whole thing only works with optimized indexes. It should simply store the base doc and add it on new ScoreDoc.

          Show
          Uwe Schindler added a comment - Grant: In your latest patch, the collector seems to ignore docBase (or the next IndexReader), so the whole thing only works with optimized indexes. It should simply store the base doc and add it on new ScoreDoc.
          Hide
          Grant Ingersoll added a comment -

          Hey Jason,

          My tests are inconclusive on the patch posted above. LUCENE-2215, however, seems promising and good to get into Solr as well.

          Show
          Grant Ingersoll added a comment - Hey Jason, My tests are inconclusive on the patch posted above. LUCENE-2215 , however, seems promising and good to get into Solr as well.
          Hide
          Jason Rutherglen added a comment -

          What's the status of this one? I'm qausi interested in getting it into Solr.

          Show
          Jason Rutherglen added a comment - What's the status of this one? I'm qausi interested in getting it into Solr.
          Hide
          Adam Heinz added a comment -

          Entered a placeholder issue for Aaron's patch: LUCENE-2215

          Show
          Adam Heinz added a comment - Entered a placeholder issue for Aaron's patch: LUCENE-2215
          Hide
          Grant Ingersoll added a comment -

          So, my patch seems to be faster than ordered collectors and about the same (or a little slower) as unordered collectors. At least that was my observation running collector.alg, but there still might be room for improvement.

          Show
          Grant Ingersoll added a comment - So, my patch seems to be faster than ordered collectors and about the same (or a little slower) as unordered collectors. At least that was my observation running collector.alg, but there still might be room for improvement.
          Hide
          Otis Gospodnetic added a comment -

          +1 for Aaron's patch in a separate issue, too.

          Show
          Otis Gospodnetic added a comment - +1 for Aaron's patch in a separate issue, too.
          Hide
          Jason Rutherglen added a comment -

          jira table output.

          Mike's Python script which compares the results of two benchmark runs and outputs in jira format.

          Show
          Jason Rutherglen added a comment - jira table output. Mike's Python script which compares the results of two benchmark runs and outputs in jira format.
          Hide
          Grant Ingersoll added a comment -

          Added a task to try out the specifics of different array sizes with the PostCollectSortCollector. Still looks pretty promising (collector-small.alg: 100,000 hits retrieved: TopDocs: 20.38 elapsed sec, PostCollect: 13.25 elapsed seconds), but needs more validation.

          Show
          Grant Ingersoll added a comment - Added a task to try out the specifics of different array sizes with the PostCollectSortCollector. Still looks pretty promising (collector-small.alg: 100,000 hits retrieved: TopDocs: 20.38 elapsed sec, PostCollect: 13.25 elapsed seconds), but needs more validation.
          Hide
          Grant Ingersoll added a comment -

          The benchmark code has jira table output. It might help to make the results more readable.

          Huh?

          It doesn't support sort queries?

          Not yet, see discussion earlier.

          Show
          Grant Ingersoll added a comment - The benchmark code has jira table output. It might help to make the results more readable. Huh? It doesn't support sort queries? Not yet, see discussion earlier.
          Hide
          Jason Rutherglen added a comment -

          There's a bunch of formatting changes in the patch making it slightly hard to parse. It doesn't support sort queries?

          Show
          Jason Rutherglen added a comment - There's a bunch of formatting changes in the patch making it slightly hard to parse. It doesn't support sort queries?
          Hide
          Jason Rutherglen added a comment -

          The benchmark code has jira table output. It might help to make the results more readable.

          Show
          Jason Rutherglen added a comment - The benchmark code has jira table output. It might help to make the results more readable.
          Hide
          Grant Ingersoll added a comment -

          OK, I think this has some legs, assuming I did everything right (especially the benchmarker stuff).

          Here's what I did:
          1. Added postCollect() method to Collector as an empty method
          2. Hooked it into IndexSearcher, MultiSearcher and ParallelMultiSearcher. I'm not sure I have all of the search paths covered yet, but...
          3. Hooked in the ability to specify the collector in benchmarker (see collector.alg)
          4. Added a new LongToEnglishContentSource and QueryMaker to create pretty much infinitely scalable number of docs based off the English.java test util.

          Prelim results (unvalidated) retrieving up to 1M records (out of 2M):

          ------------> Report sum by Prefix (SearchCollector) and Round (4 about 4 out of 8000034)
          Operation round coll runCnt recsPerRun rec/s elapsedSec avgUsedMem avgTotalMem
          SearchCollector_10 0org.apache.lucene.search.PostCollectSortCollector 1 10 0.14 73.32 290,371,776 386,625,536
          SearchCollector_10 - 1topDocOrdered - - 1 - - - 10 - - - 0.10 - - 98.37 - 449,582,048 - 588,189,696
          SearchCollector_10 2org.apache.lucene.search.PostCollectSortCollector 1 10 0.14 71.47 964,864,512 1,016,311,808
          SearchCollector_10 - 3topDocOrdered - - 1 - - - 10 - - - 0.10 - - 98.73 - 791,313,664 1,016,311,808

          Still lots to do, but wanted to put it up for people to look at and tell me what I'm doing wrong. I'd also love to hook in FieldComparator stuff. Even if we could have a wrapper that took in FieldComparator inside of a regular Comparator would be cool.

          Show
          Grant Ingersoll added a comment - OK, I think this has some legs, assuming I did everything right (especially the benchmarker stuff). Here's what I did: 1. Added postCollect() method to Collector as an empty method 2. Hooked it into IndexSearcher, MultiSearcher and ParallelMultiSearcher. I'm not sure I have all of the search paths covered yet, but... 3. Hooked in the ability to specify the collector in benchmarker (see collector.alg) 4. Added a new LongToEnglishContentSource and QueryMaker to create pretty much infinitely scalable number of docs based off the English.java test util. Prelim results (unvalidated) retrieving up to 1M records (out of 2M): ------------> Report sum by Prefix (SearchCollector) and Round (4 about 4 out of 8000034) Operation round coll runCnt recsPerRun rec/s elapsedSec avgUsedMem avgTotalMem SearchCollector_10 0org.apache.lucene.search.PostCollectSortCollector 1 10 0.14 73.32 290,371,776 386,625,536 SearchCollector_10 - 1topDocOrdered - - 1 - - - 10 - - - 0.10 - - 98.37 - 449,582,048 - 588,189,696 SearchCollector_10 2org.apache.lucene.search.PostCollectSortCollector 1 10 0.14 71.47 964,864,512 1,016,311,808 SearchCollector_10 - 3topDocOrdered - - 1 - - - 10 - - - 0.10 - - 98.73 - 791,313,664 1,016,311,808 Still lots to do, but wanted to put it up for people to look at and tell me what I'm doing wrong. I'd also love to hook in FieldComparator stuff. Even if we could have a wrapper that took in FieldComparator inside of a regular Comparator would be cool.
          Hide
          Grant Ingersoll added a comment -

          Another issue w/ this "raw" collector is it sure would be nice if it could also be a TopDocsCollector, but that is so priority queue centric as well. I could extract a simple interface w/ just:

          public TopDocs topDocs();
          

          but that is currently marked final.

          Show
          Grant Ingersoll added a comment - Another issue w/ this "raw" collector is it sure would be nice if it could also be a TopDocsCollector, but that is so priority queue centric as well. I could extract a simple interface w/ just: public TopDocs topDocs(); but that is currently marked final.
          Hide
          Michael McCandless added a comment -

          It's not fixed.

          Ahh OK. We could presumably improve the FieldComparator impls to grow-on-demand as well...

          Show
          Michael McCandless added a comment - It's not fixed. Ahh OK. We could presumably improve the FieldComparator impls to grow-on-demand as well...
          Hide
          Grant Ingersoll added a comment -

          A fixed array of ScoreDocs should work fine? But you do need to tell FieldComparator up front the biggest it will be

          It's not fixed. I want to experiment with the idea of collecting all the results and then sorting, based on the discussion in the link above. I'm seeing more and more cases, esp. for heavy machine learning/post processing apps, where the app will retrieve 1M+ docs per query.

          Thus, I modified ArrayUtils to have a grow method that takes in a ScoreDoc array, but can also pass in a preallocated array. I'm not convinced this is going to be better, but I think it's interesting to find out. Should have a patch today.

          FWIW, I think you approximate PQ makes a lot of sense too.

          Show
          Grant Ingersoll added a comment - A fixed array of ScoreDocs should work fine? But you do need to tell FieldComparator up front the biggest it will be It's not fixed. I want to experiment with the idea of collecting all the results and then sorting, based on the discussion in the link above. I'm seeing more and more cases, esp. for heavy machine learning/post processing apps, where the app will retrieve 1M+ docs per query. Thus, I modified ArrayUtils to have a grow method that takes in a ScoreDoc array, but can also pass in a preallocated array. I'm not convinced this is going to be better, but I think it's interesting to find out. Should have a patch today. FWIW, I think you approximate PQ makes a lot of sense too.
          Hide
          Michael McCandless added a comment -

          I was just thinking of an array of ScoreDocs. I'm writing the benchmark code right now. With this approach, though, it seems like the FieldComparator doesn't quite work, b/c you have to pass in numHits, which then goes and allocates another array. Seems like it would need to be modified to take in an already filled array.

          A fixed array of ScoreDocs should work fine? But you do need to tell FieldComparator up front the biggest it will be, eg 20K in my example above (when you are intending to only return 10K in the end). You then refer to the slots in your ScoreDoc array when interacting w/ the comparator... FieldComparator needs to allocate a separate array to track whatever its internal values are, to do the comparisons.

          Fixing benchmark to allow for custom collector sounds great!

          Show
          Michael McCandless added a comment - I was just thinking of an array of ScoreDocs. I'm writing the benchmark code right now. With this approach, though, it seems like the FieldComparator doesn't quite work, b/c you have to pass in numHits, which then goes and allocates another array. Seems like it would need to be modified to take in an already filled array. A fixed array of ScoreDocs should work fine? But you do need to tell FieldComparator up front the biggest it will be, eg 20K in my example above (when you are intending to only return 10K in the end). You then refer to the slots in your ScoreDoc array when interacting w/ the comparator... FieldComparator needs to allocate a separate array to track whatever its internal values are, to do the comparisons. Fixing benchmark to allow for custom collector sounds great!
          Hide
          Grant Ingersoll added a comment -

          If your "large queue" is a list/array, then it has slots, and you just reference those slots when asking FieldComparator to compare.

          I was just thinking of an array of ScoreDocs. I'm writing the benchmark code right now. With this approach, though, it seems like the FieldComparator doesn't quite work, b/c you have to pass in numHits, which then goes and allocates another array. Seems like it would need to be modified to take in an already filled array.

          But... have you thought about theoretical cost of "true pqueue" vs "approximate pqueue"? I think in the worst case, where results are returned in precisely reverse sort order, so that you always fully turnover the queue, is tricky.

          Yes, but haven't looked at implementing yet.

          But actually that 1/10K constant = 1/M, and so I think the approximate PQ works out to O(N) cost, which is in fact much cheaper. I think?

          Sounds right to me. Once we have a benchmarker in place that allows for replacement of the Collector, we can try all of this fun stuff out. We also need a collection where we can return large numbers of results with scores (in other words, not just MatchAllDocsQuery, although that still might be valid)

          Show
          Grant Ingersoll added a comment - If your "large queue" is a list/array, then it has slots, and you just reference those slots when asking FieldComparator to compare. I was just thinking of an array of ScoreDocs. I'm writing the benchmark code right now. With this approach, though, it seems like the FieldComparator doesn't quite work, b/c you have to pass in numHits, which then goes and allocates another array. Seems like it would need to be modified to take in an already filled array. But... have you thought about theoretical cost of "true pqueue" vs "approximate pqueue"? I think in the worst case, where results are returned in precisely reverse sort order, so that you always fully turnover the queue, is tricky. Yes, but haven't looked at implementing yet. But actually that 1/10K constant = 1/M, and so I think the approximate PQ works out to O(N) cost, which is in fact much cheaper. I think? Sounds right to me. Once we have a benchmarker in place that allows for replacement of the Collector, we can try all of this fun stuff out. We also need a collection where we can return large numbers of results with scores (in other words, not just MatchAllDocsQuery, although that still might be valid)
          Hide
          Michael McCandless added a comment -

          I think you should be able to use FieldComparator for the "large queue".

          If your "large queue" is a list/array, then it has slots, and you just reference those slots when asking FieldComparator to compare.

          The only difference from the FieldComparator's standpoint is you are doing far fewer compare calls as-you-go, then periodically you have a big burst of compares (the "selection algorithm", which should be implementable on FieldComparator) to find a new rough bottom.

          But... have you thought about theoretical cost of "true pqueue" vs "approximate pqueue"? I think in the worst case, where results are returned in precisely reverse sort order, so that you always fully turnover the queue, is tricky.

          Say your queue size is 10K, but you tolerate say 20 K until pruning. With worst case, every 10K hits you will have to re-select to dump the 10K worst. So cost is O(N * M), where M = queue size and N = total number of hits, though with a smallish 1/10K constant.

          With true pqueue, every hit pays log(M) price, so total cost is O(N * log(M)).

          But actually that 1/10K constant = 1/M, and so I think the approximate PQ works out to O(N) cost, which is in fact much cheaper. I think?

          Show
          Michael McCandless added a comment - I think you should be able to use FieldComparator for the "large queue". If your "large queue" is a list/array, then it has slots, and you just reference those slots when asking FieldComparator to compare. The only difference from the FieldComparator's standpoint is you are doing far fewer compare calls as-you-go, then periodically you have a big burst of compares (the "selection algorithm", which should be implementable on FieldComparator) to find a new rough bottom. But... have you thought about theoretical cost of "true pqueue" vs "approximate pqueue"? I think in the worst case, where results are returned in precisely reverse sort order, so that you always fully turnover the queue, is tricky. Say your queue size is 10K, but you tolerate say 20 K until pruning. With worst case, every 10K hits you will have to re-select to dump the 10K worst. So cost is O(N * M), where M = queue size and N = total number of hits, though with a smallish 1/10K constant. With true pqueue, every hit pays log(M) price, so total cost is O(N * log(M)). But actually that 1/10K constant = 1/M, and so I think the approximate PQ works out to O(N) cost, which is in fact much cheaper. I think?
          Hide
          Grant Ingersoll added a comment -

          Very cool, Aaron! I've kicked around that idea, but never implemented. That actually solves a slightly different problem that what I'm working on with this one, but is definitely a welcome addition, too. Please feel free to attach or open a new issue.

          Show
          Grant Ingersoll added a comment - Very cool, Aaron! I've kicked around that idea, but never implemented. That actually solves a slightly different problem that what I'm working on with this one, but is definitely a welcome addition, too. Please feel free to attach or open a new issue.
          Hide
          Aaron McCurry added a comment -

          I have implemented a paging collector for 2.4 and I am porting it to 3.0 that allows for users to page to any depth in their results.

          Depending on how you are using lucene it may work a little bit differently but essentially it does an initial search and collects X number of documents. Once that page of results is exhausted, it re-searches feeding the last docid and score into the next page of collected documents. Assuming that the documents are collected in order, if the score of the current docid being collected is less than the previous passes score then it may be collected. If the scores are equal, then compare the docids, and if the current docid being scored is greater than the previous docid, then it may be collected.

          Basically it throws out the previous page of results.

          This can be a little bit difficult to explain, but using this technique and with our internal collector pages being set to 50,000 scoredocs we have a production system that allows users to page to the last page of their results no matter how large the results, and we have results in the 10s of millions at times.

          I can provide my 2.4 collector, and/or my 3.0 (once complete), this solution may solve this issue.

          Show
          Aaron McCurry added a comment - I have implemented a paging collector for 2.4 and I am porting it to 3.0 that allows for users to page to any depth in their results. Depending on how you are using lucene it may work a little bit differently but essentially it does an initial search and collects X number of documents. Once that page of results is exhausted, it re-searches feeding the last docid and score into the next page of collected documents. Assuming that the documents are collected in order, if the score of the current docid being collected is less than the previous passes score then it may be collected. If the scores are equal, then compare the docids, and if the current docid being scored is greater than the previous docid, then it may be collected. Basically it throws out the previous page of results. This can be a little bit difficult to explain, but using this technique and with our internal collector pages being set to 50,000 scoredocs we have a production system that allows users to page to the last page of their results no matter how large the results, and we have results in the 10s of millions at times. I can provide my 2.4 collector, and/or my 3.0 (once complete), this solution may solve this issue.
          Hide
          Jason Rutherglen added a comment -

          Grant,

          This issue could be useful and I may be able to test out the
          resulting patches.

          What's the specific problem with the
          Sort/SortField/FieldComparator apparatus? We don't want to
          generate new objects using DocComparator.value(int slot) and
          store these in FieldDoc? We probably don't want to excessively
          generate objects. Maybe we can figure out a way to access to the
          primitive array and compare values which seems doable?

          Show
          Jason Rutherglen added a comment - Grant, This issue could be useful and I may be able to test out the resulting patches. What's the specific problem with the Sort/SortField/FieldComparator apparatus? We don't want to generate new objects using DocComparator.value(int slot) and store these in FieldDoc? We probably don't want to excessively generate objects. Maybe we can figure out a way to access to the primitive array and compare values which seems doable?
          Hide
          Grant Ingersoll added a comment -

          I've started work on this by looking at a collector that does sorting after all the results are collected. Now, I could simply sort on an array of ScoreDocs, but I'd really like to be able to take advantage of the existing Sort/SortField/FieldComparator stuff, but it seems really geared towards using a PriorityQueue and managing an internal values array, whereas in my case I already have the array allocated that I want to manage (well, for the most part.)

          The obvious choice is to punt and just provide score and doc id sorting, but wanted to hear other people's thoughts

          Show
          Grant Ingersoll added a comment - I've started work on this by looking at a collector that does sorting after all the results are collected. Now, I could simply sort on an array of ScoreDocs, but I'd really like to be able to take advantage of the existing Sort/SortField/FieldComparator stuff, but it seems really geared towards using a PriorityQueue and managing an internal values array, whereas in my case I already have the array allocated that I want to manage (well, for the most part.) The obvious choice is to punt and just provide score and doc id sorting, but wanted to hear other people's thoughts

            People

            • Assignee:
              Grant Ingersoll
              Reporter:
              Grant Ingersoll
            • Votes:
              5 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development