Lucene - Core
  LUCENE-1483

Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.9
    • Fix Version/s: 2.9
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

      Description

      This issue changes how an IndexSearcher searches over multiple segments. The current method of searching multiple segments is to use a MultiSegmentReader and treat all of the segments as one. This causes filters and FieldCaches to be keyed to the MultiReader and makes reopen expensive. If only a few segments change, the FieldCache is still loaded for all of them.

      This patch changes things by searching each individual segment one at a time while sharing a single HitCollector across the segments. This allows FieldCaches and Filters to be keyed on individual SegmentReaders, making reopen much cheaper. FieldCache loading over multiple segments can also be much faster: with the old method, every unique term across all segments is enumerated against each segment, which is wasteful given the roughly logarithmic growth of unique terms per segment. Searching individual segments avoids this cost. The term/document statistics from the MultiReader are still used to score results for each segment.

      When sorting, it is more difficult to share a single HitCollector across sub-searchers, because ordinals are not comparable across segments. To account for this, a new field-sort-enabled HitCollector is introduced that can collect and sort across segments (because of its ability to compare ordinals/values across segments). This TopFieldCollector class collects the values/ordinals for a given segment and, upon moving to the next segment, lazily translates any ordinals/values so that they can be compared against the values for the new segment.

      All in all, the switch provides numerous performance benefits for both sorted and non-sorted search. We were seeing a noticeable loss on indices with many segments (around 1000) at certain queue sizes and queries, but the latest results show that has mostly been taken care of (you shouldn't be using such a large queue on such a heavily segmented index anyway).
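
      As a rough sketch of the approach (illustrative, not the patch itself: getSequentialSubReaders mirrors a name discussed in this issue, MultiReaderHitCollector is the collector introduced below, and the setNextReader signature is an assumption):

          import java.io.IOException;
          import org.apache.lucene.index.IndexReader;
          import org.apache.lucene.search.Scorer;
          import org.apache.lucene.search.Weight;

          class PerSegmentSearch {
            // Search each segment in turn with one shared collector, telling the
            // collector each segment's docID base so hits can be rebased.
            static void searchSegments(IndexReader[] subReaders, Weight weight,
                                       MultiReaderHitCollector collector) throws IOException {
              int docBase = 0;
              for (int i = 0; i < subReaders.length; i++) {
                collector.setNextReader(subReaders[i], docBase); // docIDs are now segment-local
                Scorer scorer = weight.scorer(subReaders[i]);    // per-segment scorer
                scorer.score(collector);  // FieldCaches/Filters keyed on subReaders[i]
                docBase += subReaders[i].maxDoc();
              }
            }
          }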

      • Introduces
        • MultiReaderHitCollector - a HitCollector that can collect across multiple IndexReaders. Old HitCollectors are wrapped to support multiple IndexReaders.
        • TopFieldCollector - a HitCollector that can compare values/ordinals across IndexReaders and sort on fields.
        • FieldValueHitQueue - a Priority queue that is part of the TopFieldCollector implementation.
        • FieldComparator - a new comparator class that works across IndexReaders; part of the TopFieldCollector implementation (a hedged sketch of the idea follows this list).
        • FieldComparatorSource - a new class that allows custom FieldComparators.
      • Alters
        • IndexSearcher uses a single HitCollector to collect hits against each individual SegmentReader. All of the other changes stem from this.
      • Deprecates
        • TopFieldDocCollector
        • FieldSortedHitQueue
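
      As a hedged illustration of the comparator idea above (the naming follows what this patch introduces, but the simplified signatures are assumptions, not the patch's actual API):

          import java.io.IOException;
          import org.apache.lucene.index.IndexReader;
          import org.apache.lucene.search.FieldCache;

          // Sketch of a cross-reader int comparator: raw values (not per-segment
          // ordinals) are copied into slots, so slots stay comparable across segments.
          public class SketchIntFieldComparator {
            private final int[] slotValues;    // one collected value per queue slot
            private int[] currentReaderValues; // FieldCache array for the current segment
            private final String field;

            public SketchIntFieldComparator(String field, int numHits) {
              this.field = field;
              this.slotValues = new int[numHits];
            }

            // Called as the search advances to each segment: load that segment's
            // FieldCache, keyed on the SegmentReader rather than the MultiReader.
            public void setNextReader(IndexReader reader) throws IOException {
              currentReaderValues = FieldCache.DEFAULT.getInts(reader, field);
            }

            // Copy the current (segment-local) doc's value into a queue slot.
            public void copy(int slot, int doc) {
              slotValues[slot] = currentReaderValues[doc];
            }

            // Compare two previously collected values; valid across segments.
            public int compare(int slot1, int slot2) {
              int v1 = slotValues[slot1], v2 = slotValues[slot2];
              return v1 < v2 ? -1 : (v1 > v2 ? 1 : 0);
            }
          }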

      Attachments

      1. LUCENE-1483.patch (6 kB) - Michael McCandless
      2. LUCENE-1483.patch (198 kB) - Michael McCandless
      3. LUCENE-1483.patch (199 kB) - Michael McCandless
      4. LUCENE-1483.patch (199 kB) - Michael McCandless
      5. LUCENE-1483.patch (199 kB) - Michael McCandless
      6. LUCENE-1483.patch (192 kB) - Michael McCandless
      7. LUCENE-1483.patch (206 kB) - Michael McCandless
      8. LUCENE-1483.patch (207 kB) - Michael McCandless
      9. LUCENE-1483.patch (218 kB) - Michael McCandless
      10. LUCENE-1483.patch (207 kB) - Michael McCandless
      11. LUCENE-1483.patch (202 kB) - Michael McCandless
      12. LUCENE-1483.patch (187 kB) - Mark Miller
      13. LUCENE-1483.patch (184 kB) - Mark Miller
      14. LUCENE-1483.patch (183 kB) - Mark Miller
      15. LUCENE-1483.patch (183 kB) - Mark Miller
      16. LUCENE-1483.patch (171 kB) - Michael McCandless
      17. LUCENE-1483.patch (171 kB) - Michael McCandless
      18. LUCENE-1483.patch (167 kB) - Michael McCandless
      19. LUCENE-1483.patch (167 kB) - Michael McCandless
      20. LUCENE-1483.patch (159 kB) - Mark Miller
      21. LUCENE-1483.patch (160 kB) - Mark Miller
      22. LUCENE-1483.patch (168 kB) - Mark Miller
      23. LUCENE-1483.patch (108 kB) - Mark Miller
      24. LUCENE-1483.patch (108 kB) - Mark Miller
      25. LUCENE-1483.patch (78 kB) - Michael McCandless
      26. LUCENE-1483.patch (61 kB) - Mark Miller
      27. LUCENE-1483.patch (18 kB) - Mark Miller
      28. LUCENE-1483.patch (18 kB) - Mark Miller
      29. LUCENE-1483.patch (18 kB) - Mark Miller
      30. LUCENE-1483.patch (12 kB) - Mark Miller
      31. LUCENE-1483.patch (11 kB) - Mark Miller
      32. LUCENE-1483.patch (11 kB) - Michael McCandless
      33. LUCENE-1483.patch (10 kB) - Mark Miller
      34. LUCENE-1483.patch (11 kB) - Mark Miller
      35. LUCENE-1483.patch (12 kB) - Mark Miller
      36. LUCENE-1483-backcompat.patch (4 kB) - Michael McCandless
      37. LUCENE-1483-partial.patch (62 kB) - Michael McCandless
      38. sortBench.py (6 kB) - Michael McCandless
      39. sortCollate.py (2 kB) - Michael McCandless

          Activity

          Michael McCandless added a comment -

          Mark did you intend to attach the patch here?

          Mark Miller added a comment -

          I had meant to attach a patch, but then a bunch of stuff wasn't working...

          This is still a poor man's patch. I need to switch to using the expose-subreaders patch. This also doesn't include the multisearcher sort patch yet, because when I tried the first one (2nd rev), everything broke. I'll work on integrating that later.

          I think all tests pass except for the very last sort test.

          Some cleanup needed, including the possible drop of using MultiSearcher itself.

          Basically, it's still in a proof-of-concept stage.

          Marvin Humphrey added a comment -

          > Quick micro bench - did it twice and both times came out 17% slower.

          I'd guess that all the OO construction/destruction costs in this part of your patch are slowing things down.

          +    Searchable[] searchers = new Searchable[readers.length];
          +    for(int i = 0; i < readers.length; i++) {
          +      searchers[i] = new IndexSearcher(readers[i]);
          +    }
          +
          +    MultiSearcher multiSearcher = new MultiSearcher(searchers);
          +    return multiSearcher.search(weight, filter, nDocs, sort);
          
          Mark Miller added a comment - edited

          I'll be sure to include that info with the next set of results.

          I don't think those results represent getting lucky though: it's 4 rounds and 2 runs with the same results (17% both runs). Nothing scientific; I just did it real quick to get a base feel for the slowdown before the patch is finished up.

          EDIT: Just like I forgot to take the optimize out of the sort alg when I pasted it here, it looks like I missed it for the benches as well. Disregard those numbers.

          Here is the alg I used:

          merge.factor=mrg:50
          compound=false
          
          sort.rng=20000:10000:20000:10000
          
          analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
          directory=FSDirectory
          #directory=RamDirectory
          
          doc.stored=true
          doc.tokenized=true
          doc.term.vector=false
          doc.add.log.step=100000
          
          docs.dir=reuters-out
          
          doc.maker=org.apache.lucene.benchmark.byTask.feeds.SortableSimpleDocMaker
          
          query.maker=org.apache.lucene.benchmark.byTask.feeds.SimpleQueryMaker
          
          # task at this depth or less would print when they start
          task.max.depth.log=2
          
          log.queries=true
          # -------------------------------------------------------------------------------------
          
          { "Rounds"
          	{ "Run"
                ResetSystemErase
          
                { "Populate"
                  -CreateIndex
                  { "MAddDocs" AddDoc(100) > : 500000
                  -CloseIndex
                }
              
                { "TestSortSpeed"
                  OpenReader  
                  { "LoadFieldCacheAndSearch" SearchWithSort(sort_field:int) > : 1 
                  { "SearchWithSort" SearchWithSort(sort_field) > : 5000
                  CloseReader 
                
                }
              
                NewRound
               } : 4
          
          } 
          
          RepSumByName
          
          
          Mark Miller added a comment -

          Okay, I straightened things out, and now it looks like possibly no loss (for few segments, anyway). Last I looked at the index, there were only 6 segments. I've got to put real time into all this later though; I've only been able to give it some very background-ish time this morning.

          Yonik Seeley added a comment -

          > Okay, I straightened things out, and now it looks like possibly no loss

          So if there was a 17% loss on the optimized index, and very little loss on a segmented index, I assume that means that matching/scoring is enough slower on the segmented index that the loss in sorting performance doesn't matter as much?

          Mark Miller added a comment -

          Ignore those first results entirely. It turns out I had the latest 1471 patched in. That shouldn't slow down a single segment though, and neither this nor 1471 should have slowed things down, because I thought they only affect multisegment and multi-index searches. Odd, but I just junked all of that and started fresh, did the tests a little closer to right, and the numbers look the same. I didn't want to get too into benching before it's sorted out a bit more. I'll try to get enough time to be more rigorous later though. My free moments are under heavy attack by the female that appears to have made herself at home in my house.

          As a side note, 1471 doesn't work in a couple of ways with this patch - it throws both a NullPointerException and a ClassCastException in different circumstances.

          Michael McCandless added a comment -

          I think there should be very little impact to performance, for single or multi segment indices, for the search itself against a warmed reader. (And actually LUCENE-1471 should make things a wee bit faster, especially if n,m are largeish, though this will typically be in the noise).

          But warming after reopen should be much faster with this patch (we should try to measure that too).

          Mark Miller added a comment -

          > I think there should be very little impact to performance, for single or multi segment indices, for the search itself against a warmed reader. (And actually LUCENE-1471 should make things a wee bit faster, especially if n,m are largeish, though this will typically be in the noise).

          That seems to be in line with what I got with 6 segments. I'm running some tests in the 30-50 segment range on my other laptop now.

          > But warming after reopen should be much faster with this patch (we should try to measure that too).

          I've got a base alg for that type of thing around somewhere too, from 831. It should be about the same, which means pretty dramatic reopen improvements if you have multiple segments, especially if the new segment is small. It's likely to be small in comparison to all of the segments anyway, which means pretty great improvements.

          Mark Miller added a comment -

          I'll bench again after this issue is polished up, but it looks like at 100 segments I am seeing the 20% drop. I didn't see any drop at 6 segments in a retest.

          I'll do some longer, more thought-out benchmarks when the patch is in better shape.

          Michael McCandless added a comment -

          Hmmmmmmm.

          OK I think I see what could explain this: insertion into the pqueue is
          fairly costly. So, because we now make 100 pqueues, each gathering
          top N results, we are paying much more insertion cost overall than the
          single queue that IndexSearcher(MultiReader) uses.

          So.... how about still doing the searches per-sub-reader(searcher),
          but, make a HitCollector that gathers the results into a single
          pqueue, passing that HitCollector to each sub-searcher?

          If that turns out OK, then I think it would make LUCENE-1471 moot
          because we should similarly change MultiSearcher to use a single
          shared pqueue.

          Actually I think this approach should be a bit faster, because there
          is some very small method call overhead to how MultiReader implements
          TermDocs/Positions by "concatenating" its sub-readers. So by pushing
          Searcher down onto each SegmentReader we should gain a bit, but it
          could very well be in the noise. For this reason we may in fact want
          to do this same thing for the "normal" (sort by relevance)
          IndexSearcher.search.

          I wish I'd thought of this sooner. Sorry for the runaround, Mark!

          Doug Cutting added a comment -

          > make a HitCollector that gathers the results into a single pqueue

          That's good when everything's local, but bad when things are distributed. If we move RemoteSearchable to contrib (as discussed in LUCENE-1314) then this may not be a problem, but we might still leave hooks so that someone can write a search that uses a separate top-queue per remote segment.

          Michael McCandless added a comment -

          >> make a HitCollector that gathers the results into a single pqueue
          >
          > That's good when everything's local, but bad when things are distributed. If we move RemoteSearchable to contrib (as discussed in LUCENE-1314) then this may not be a problem, but we might still leave hooks so that someone can write a search that uses a separate top-queue per remote segment.

          Good point; so this means we can't blindly do this optimization to MultiSearcher (w/o having option to do separate queues, merged in the end). But for IndexSearcher(Multi*Reader).search it should be safe?

          Doug Cutting added a comment -

          > But for IndexSearcher(Multi*Reader).search it should be safe?

          Right. Perhaps this is a reason to encourage folks to use MultiReader instead of MultiSearcher. Are there cases, other than distributed, where MultiSearcher is required? If not, perhaps it could be moved to the contrib/distributed layer too.

          Mark Miller added a comment -

          > make a HitCollector that gathers the results into a single pqueue

          Well, it certainly made the code cleaner and the patch a bit nicer, but on a first quick test I still see the 20% slowdown with 100 or so segments.

          I'm looking through to see where I may have done something funny.

          Michael McCandless added a comment -

          > on a first quick test I still see the 20% slowdown with 100 or so segments.

          Argh! Can you post your current patch?

          Mark Miller added a comment -

          Here is what I've got. The final sort test still fails, but the rest should pass.

          - Mark
          Mark Miller added a comment -

          I also did a quick reopen alg. The speed gain on this can really vary depending on index access patterns. I tried adding 500,000 (very small) docs with a random sort field, then added 50,000 docs and reopened 10 times, repeating it all 4 times. The comparison is of the time it takes to load the FieldCache and do one search. With this patch it came out about 40-50% faster. Obviously this is going to depend on many factors in the real world; in certain applications I'm sure it could be many times faster or slower.

          Mark Miller added a comment - edited

          Doing a little profiling on the new code, the top results of interest are:

          FieldSortedHitQueue.lessThan(Object,Object) approx 12%
          FieldSortedHitQueue.insertWithOverflow(Object) approx 12%
          MultiReaderTopFieldDocCollector.collect(int,float) 6.3%
          FieldSortedHitQueue$4.compare() 5.3%

          and on...

          For Lucene trunk, a day or two ago:

          FieldSortedHitQueue.insertWithOverflow(Object) approx 11%
          TopFieldDocCollector.collect(int,float) 7.1%
          FieldSortedHitQueue.lessThan(Object,Object) approx 6.7%
          FieldSortedHitQueue.updateMaxScore 3.2%
          FieldSortedHitQueue$4.compare() 3.2%

          Mark Miller added a comment -

          I think I see my mistake. A dumb one - I think I did it while trying to get things to work and didn't realize I'd left it in. I'll put up another patch after some benches finish.

          Mark Miller added a comment -

          Okay, now it's as fast, if not a bit faster. I was creating a priority queue per reader (n queues for n readers) as part of getting tests to pass early, left it in after fixing things elsewhere, and boom - that explains all the lessThan nonsense in the profiling. Whoops.

          Looks good now though - I still need to investigate the failure of the last sort test.

          Michael McCandless added a comment -

          Excellent! That was a sneaky one.

          I attached a tiny change to the patch, which is to set the docBase in MultiReaderTopFieldDocCollector; this saves the lookup into starts in each collect call.

          Michael McCandless added a comment -

          OK, I ran a quick perf test on a 100 segment index with 1 million docs
          (10K docs per segment), for a single TermQuery ("text"), and I'm
          seeing 11.1% speedup (best of 4: 20.36s -> 18.11s) with this patch, on
          Mac OS X. On Linux I see 6.3% speedup (best of 4: 23.31s -> 21.84s).

          Single segment index shows no difference, as expected.

          I think the speedup is due to avoiding the extra method call plus 2nd
          pass through the int docs[] to add in the doc base, in
          MultiSegmentReader.MultiTermDocs.read(int[] docs, int[] freqs).

          This is a nice "side effect", ie in addition to getting faster reopen
          performance (the original goal here), we get a bump in single term
          search performance.

          I think given this, we should cutover other search methods
          (sort-by-relevance, custom HitCollector) to this approach? Maybe if
          we add a new Scorer.score method that can accept a "docBase" which it
          adds into the doc() before calling collect()? In fact, if we do that,
          we may not even need the new MultiReaderTopFieldDocCollector at all?

          Hmm, though, a Scorer may override that score(HitCollector), eg
          BooleanScorer does. Maybe we have to make a wrapper HitCollector that
          simply adds in the docBase and then invokes the real
          HitCollector.collect after shifting the docBase? Though that costs us
          an extra method call per collect().

          Here's the alg I used (slightly modified from the one above):

          merge.factor=1000
          compound=false
          
          analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
          directory=FSDirectory
          #directory=RamDirectory
          
          doc.tokenized=true
          doc.term.vector=false
          doc.add.log.step=100000
          max.buffered=10000
          ram.flush.mb=1000
          
          work.dir = /lucene/work
          
          doc.maker=org.apache.lucene.benchmark.byTask.feeds.SortableSimpleDocMaker
          
          query.maker=org.apache.lucene.benchmark.byTask.feeds.FileBasedQueryMaker
          file.query.maker.file = test.queries
          
          task.max.depth.log=2
          
          log.queries=true
          
          { "Populate"
            -CreateIndex
            { "MAddDocs" AddDoc(100) > : 1000000
            -CloseIndex
          }
              
          { "Rounds"
            { "Run"
              { "TestSortSpeed"
                OpenReader  
                { "LoadFieldCacheAndSearch" SearchWithSort(sort_field:int) > : 1 
                { "SearchWithSort" SearchWithSort(sort_field) > : 500
                CloseReader 
              }
              NewRound
            } : 4
          } 
          
          RepSumByPrefRound SearchWithSort
          

          It creates the index once, then does 4 rounds of searching with the
          single query "text" in test.queries (SimpleQueryMaker was creating
          other queries that were getting 0 or 1 hits).

          I'm running with "java -Xms1024M -Xmx1024M -Xbatch -server"; java is
          1.6.0_07 on Mac Pro OS X 10.5.5 and 1.6.0_10-rc on 2.6.22.1 linux
          kernel.

          Mark Miller added a comment -

          Thanks Mike.

          Are we going to be screwed by this filter stuff though? I'm looking into the last sort test failing - my first thought is that a filter with, say, 8 docs is getting pushed down onto 4 searches with 2 docs in each reader. A filter that allows the 8th doc won't do very well on a bunch of 2-doc searches... or hopefully I don't know what I'm talking about. Looking to see if I can figure my way around it now.

          Mark Miller added a comment - edited

          Hmmm...well this makes the test pass (I subtract the base from the filter doc id)...

          Let me know if I'm all wet on this...

          EDIT

          that can't be quite right... I'll try making a different test.

          Michael McCandless added a comment -

          One thing to fix is ParallelReader: it currently defines a getSubReaders method, but this won't work for searching (because ParallelReader's sub-readers are not "concatenated"). I think we need a more specific name than "getSubReaders" (there's some initial discussion of this in LUCENE-1475). Maybe getConcatenatedReaders? getSequentialReaders? Something else...?

          Mark Miller added a comment -

          > I think given this, we should cutover other search methods (sort-by-relevance, custom HitCollector) to this approach? Maybe if we add a new Scorer.score method that can accept a "docBase" which it adds into the doc() before calling collect()? In fact, if we do that, we may not even need the new MultiReaderTopFieldDocCollector at all?

          I like this idea. I'd love to figure out how 'outside' systems could get the reopen benefit as well (solr caches and such, beyond internal sorting). This seems like a first step towards that possibility (though I admittedly don't see a clear path yet).

          > Hmm, though, a Scorer may override that score(HitCollector), eg BooleanScorer does. Maybe we have to make a wrapper HitCollector that simply adds in the docBase and then invokes the real HitCollector.collect after shifting the docBase? Though that costs us an extra method call per collect().

          Well, we might as well bench and see what we lose...

          Michael McCandless added a comment -

          Maybe we should add "setDocBase" to HitCollector, then fix all core/contrib HitCollectors to respect it, and then add a method "supportsDocBase()" which returns false in the default impl in HitCollector. Then, in search(), if we are dealing with a HitCollector that does not supportsDocBase(), we have to wrap?

          Alternatively, we could throw an UnsupportedOperationException in setDocBase() by default, catch that, and fallback to wrapping.

          This way we avoid the extra collect() method call in the common cases (builtin HitCollectors). Also, we save an add when the doc is not competitive.
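
          A minimal sketch of the wrapping fallback discussed here, assuming the hypothetical setDocBase/DocBaseCollector names from this thread:

            import org.apache.lucene.search.HitCollector;

            // Wraps a legacy HitCollector that only understands top-level docIDs:
            // collect() receives segment-local docIDs and rebases them before
            // delegating, at the cost of one extra method call per hit.
            class DocBaseCollector extends HitCollector {
              private final HitCollector delegate;
              private int docBase;

              DocBaseCollector(HitCollector delegate) {
                this.delegate = delegate;
              }

              void setDocBase(int docBase) { // called as search moves to each segment
                this.docBase = docBase;
              }

              public void collect(int doc, float score) {
                delegate.collect(docBase + doc, score);
              }
            }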

          Michael McCandless added a comment -

          You shouldn't need to subtract base from the filter's docID, since the filter is operating in the docID space of the sub-reader.

          Is it TestSort.testTopDocScores that you see failing (that's what I see)?

          Unfortunately, the filter in that test is assuming that the IndexReader it's passed is equal to the "full" IndexReader, because it references docs1.scoreDocs[0].doc, which is in the docID space of the full reader.

          I would say the test is buggy (it's making an assumption about the API that happens to be true but was not guaranteed). However, this could mean similar filters "out there" think they can grab docIDs from the top-level IndexReader, cache them, and then assign them inside the getDocIdSet method.
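
          To make that contract concrete, here is a hedged sketch of a filter written against the guarantee described above (TermMatchFilter is a made-up name): it derives its bits only from the reader it is handed, so it is correct whether that reader is the full reader or a single segment.

            import java.io.IOException;
            import org.apache.lucene.index.IndexReader;
            import org.apache.lucene.index.Term;
            import org.apache.lucene.index.TermDocs;
            import org.apache.lucene.search.DocIdSet;
            import org.apache.lucene.search.Filter;
            import org.apache.lucene.util.OpenBitSet;

            // Accepts docs containing the given term. All docIDs come from the
            // passed-in reader, so the set is valid only in that reader's space.
            class TermMatchFilter extends Filter {
              private final Term term;

              TermMatchFilter(Term term) {
                this.term = term;
              }

              public DocIdSet getDocIdSet(IndexReader reader) throws IOException {
                OpenBitSet bits = new OpenBitSet(reader.maxDoc());
                TermDocs td = reader.termDocs(term);
                try {
                  while (td.next()) {
                    bits.set(td.doc()); // never a docID cached from another reader
                  }
                } finally {
                  td.close();
                }
                return bits;
              }
            }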

          Mark Miller added a comment -

          A filter fix that's closer to correct, and a new test.

          Mark Miller added a comment - edited

          > Is it TestSort.testTopDocScores that you see failing (that's what I see)?

          right

          > Unfortunately, the filter in that test is assuming that the IndexReader it's passed is equal to the "full" IndexReader, because it references docs1.scoreDocs[0].doc, which is in the docID space of the full reader.

          Right, it's using the full id space - isn't that legal though? It did work.

          > I would say the test is buggy (it's making an assumption about the API that happens to be true but was not guaranteed). However, this could mean similar filters "out there" think they can grab docIDs from the top-level IndexReader, cache them, and then assign them inside the getDocIdSet method.

          That was my first thought - are they allowed to build the filter this way? But it worked to do it, right? I like your "assumption" argument though - it suits my lazy side.

          What if someone made a filter that is supposed to allow only the first doc? So it is set for 0... all of the first docs of each sub-reader would match. That seemed to be supported before, right?

          EDIT

          nm, I guess that still fits the "assumption" thing. It still feels odd changing this behavior though - I guess you have this same issue over MultiSearchers anyway...

          How about those keeping their own reader-to-filter cache or something? A single filter could correctly apply against a single MultiReader before... now you would need a different filter for each sub-reader, right?

          Michael McCandless added a comment -

          > How about those keeping their own reader-to-filter cache or something? A single filter could correctly apply against a single MultiReader before... now you would need a different filter for each sub-reader, right?

          Such cases would then be caching by sub-reader, right? That's the benefit of this approach. EG I had been thinking we'd need to fix the recently added FieldCacheRangeFilter to also "understand" when it's dealing w/ concatenated sub-readers, but in fact with this change that filter is only given the sub-readers, one at a time, and so it only asks FieldCache per sub-reader, and we automatically get faster reopen performance "for free" (no change to FieldCacheRangeFilter required).

          Still, I agree this is probably dangerous to suddenly change, since there could easily be filters out there that are [illegally] using a docID not belonging/corresponding to the reader that was passed in. So maybe we should provide a migration path. EG, add "allowSubReaders" to Filter, defaulting to "return false" so that any external Filter impls still get passed the Multi*Reader, and then fix all core/contrib filters to return true from that method?
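
          A hedged sketch of the reader-keyed caching pattern this enables, in the spirit of CachingWrapperFilter (PerReaderCachingFilter is a made-up name): because search now hands each sub-reader to the filter separately, cache entries for unchanged segments survive a reopen.

            import java.io.IOException;
            import java.util.WeakHashMap;
            import org.apache.lucene.index.IndexReader;
            import org.apache.lucene.search.DocIdSet;
            import org.apache.lucene.search.Filter;

            // Caches the wrapped filter's DocIdSet per reader. Under per-segment
            // search the keys are SegmentReaders, so a reopen only recomputes new
            // segments; weak keys let closed readers' entries be collected.
            class PerReaderCachingFilter extends Filter {
              private final Filter wrapped;
              private final WeakHashMap cache = new WeakHashMap(); // reader -> DocIdSet

              PerReaderCachingFilter(Filter wrapped) {
                this.wrapped = wrapped;
              }

              public synchronized DocIdSet getDocIdSet(IndexReader reader) throws IOException {
                DocIdSet cached = (DocIdSet) cache.get(reader);
                if (cached == null) {
                  cached = wrapped.getDocIdSet(reader);
                  cache.put(reader, cached);
                }
                return cached;
              }
            }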

          Mark Miller added a comment -

          Okay, I follow now. If you did things correctly, you'll always be passed the right SegmentReader through a hook. Nice. I think we really do want to do this for the other search methods.

          > Still, I agree this is probably dangerous to suddenly change, since there could easily be filters out there that are [illegally] using a docID not belonging/corresponding to the reader that was passed in. So maybe we should provide a migration path. EG, add "allowSubReaders" to Filter, defaulting to "return false" so that any external Filter impls still get passed the Multi*Reader, and then fix all core/contrib filters to return true from that method?

          This seems reasonable. I am following your [illegal] argument better now though, so I wouldn't feel so bad leaving it out. If it's unsupported behavior, I like the idea of adding backward-compat cruft much less. I had it in my head that you might be caching things based on the top-level MultiReader, but it looks like now you should always be using the reader provided by a hook - which will be the single segment reader.

          Mark Miller added a comment -

          I think this is almost there: I added the new logic to the relevance and HitCollector searches as well.

          Mark Miller added a comment -

          That actually needed a couple of tweaks.

          Michael McCandless added a comment -

          Looks good! Do we really need IndexReader.supportsSequentialReaders? Because the default impl (returning a length-1 array of itself) seems sufficient?

          Also, I don't think ParallelReader needs to throw UnsupportedOperationException in that method – it can fallback to the default?

          It would be nice to somehow deprecate "supportsDocBase" so that all outside HitCollectors would need to support it on upgrading to 3.0, but, I'm not sure how to cleanly do that. (Ie I'd rather not have that method continue to exist in 3.0).

          It's a delightfully small patch now!

          Michael McCandless added a comment -

          Also, the back-compat tests fail to compile because we renamed Multi*Reader.getSubReaders --> getSequentialSubReaders. So when committing this, be sure to also fix the test-tag branch!

          Mark Miller added a comment -

          > Looks good! Do we really need IndexReader.supportsSequentialReaders? Because the default impl (returning a length-1 array of itself) seems sufficient?

          Let me investigate. If you try using the default impl, and you change the ParallelReader reopen test to use getSequentialReaders rather than getSubReaders, the test will throw a stack overflow when checking if the reader is closed. I put the exception in there to make sure I wasn't calling it, because upon switching back to getSubReaders the problem goes away. Seemed fair enough, but I guess I'll have to understand what was going on to really respond.

          > It would be nice to somehow deprecate "supportsDocBase" so that all outside HitCollectors would need to support it on upgrading to 3.0, but, I'm not sure how to cleanly do that. (Ie I'd rather not have that method continue to exist in 3.0).

          +1. I don't see what we can do but release it deprecated with a note explaining. Fair enough for 3.0, I think.

          > It's a delightfully small patch now!

          Yeah, this one had the great feeling of the multiterm patch - it rolled right up into something nice. You've got to love the Lucene API, flaws or not.

          Michael McCandless added a comment -

          > If you try using the default impl, and you change the ParallelReader reopen test to use getSequentialReaders rather than getSubReaders

          But you shouldn't change that test (it should continue to call ParallelReader.getSubReaders).

          > I don't see what we can do but release it deprecated with a note explaining. Fair enough for 3.0, I think.

          The problem is this doesn't create the right sequence. Ie, if we mark supportsDocBase as deprecated, then some external HitCollector won't see the deprecation warning (they have not overridden supportsDocBase), and so we can't remove it in 3.0 since their code would then silently break. I think the only thing to do would be to deprecate collect() in favor of another method, or deprecate HitCollector entirely in favor of a new class DocBaseHitCollector (don't like that name). Sigh...

          > Yeah, this one had the great feeling of the multiterm patch - it rolled right up into something nice. You've got to love the Lucene API, flaws or not.

          I love this kind.

          Hide
          Mark Miller added a comment -

          So I don't know what that stack overflow was, but I just put things back to how they should be and all tests pass.

          So very close now. I'm not ready to commit myself though. I do most of my thinking after the work.

          Michael McCandless added a comment -

          Oh one more thing: I think we don't need both supportsDocBase and setDocBase throwing UOE by default? How about just the 2nd one, and you fix search to catch the UOE and wrap in DocBaseCollector?

          Mark Miller added a comment -

          But you shouldn't change that test (it should continue to call ParallelReader.getSubReaders).

          Right, I was coming at it from the wrong angle. I had refactored in the change with Eclipse, and seeing that ParallelReader.getSequentialReaders could cause a stack overflow, that's why I put in the isSequentialSupported check. Guess it's not a concern though. I didn't take much time to understand it.

          The problem is this doesn't create the right sequence. Ie, if we mark supportsDocBase as deprecated, then some external HitCollector won't see the deprecation warning (they have not overridden supportsDocBase), and so we can't remove it in 3.0 since their code will then silently break. I think the only thing to do would be to deprecate collect() in favor of another method. Or, deprecate HitCollector entirely in favor of a new class DocBaseHitCollector (don't like that name). Sigh...

          This depends, right? Don't we have a lot of latitude with 3.0? I would think we could require that you read some upgrade notes on changes...3.0 is our hope to change some things we couldn't normally get away with, I thought. I agree we should be friendly, like 1 to 2, but it's tempting to use 3.0 to do some things more cleanly rather than less.

          Mark Miller added a comment -

          Oh one more thing: I think we don't need both supportsDocBase and setDocBase throwing UOE by default? How about just the 2nd one, and you fix search to catch the UOE and wrap in DocBaseCollector?

          If you'd like...I'm not a fan of flow control with exceptions. I am also not a fan of isSupported methods though...I was leaning that way over exceptions...sounds like you lean the other way...? I guess it has its appeal in this case...

          Doug Cutting added a comment -

          Could we add a new class like

          public abstract class Hitable {
            /** Sets the doc id base of the segment about to be searched. */
            public abstract void setBase(int base);
            /** Called per matching doc; the doc id is relative to the base. */
            public abstract void hit(int doc, float score);
          }
          

          upgrade everything in trunk to use this, and change HitCollector to:

          /** @deprecated */
          public abstract class HitCollector extends Hitable {
            // Legacy collectors expect absolute doc ids, so they can't accept a base.
            public void setBase(int base) { throw new UnsupportedOperationException(); }
          }
          

          then, for back-compatibility, wrap anything that extends HitCollector to rebase?

          Then you'd have neither an isSupported method nor use exceptions for control.
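
          To make the wrapping concrete, here is a minimal sketch of such an
          adapter, assuming the Hitable sketch above and the existing
          HitCollector.collect(int, float); the class name RebasingHitable is
          hypothetical:

          // Hypothetical back-compat adapter: presents a legacy HitCollector,
          // which expects absolute doc ids, through the new Hitable API.
          public class RebasingHitable extends Hitable {
            private final HitCollector legacy;
            private int base;

            public RebasingHitable(HitCollector legacy) { this.legacy = legacy; }

            public void setBase(int base) { this.base = base; } // called per segment

            public void hit(int doc, float score) {
              legacy.collect(base + doc, score); // rebase to an absolute doc id
            }
          }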

          Mark Miller added a comment -

          That's a great idea, I'll try that route.

          Another tiny win in this is that MultiSearcher doesn't have to use a doubled-up HitCollector anymore.

          Mark Miller added a comment -

          upgrade everything in trunk to use this,

          If users are using a HitCollector as a ref to a trunk HitCollector that becomes a Hitable, won't their code break? I really like this idea, but does it get around back compat?

          Doug Cutting added a comment -

          > If users are using a HitCollector as a ref to a trunk HitCollector that becomes a Hitable, won't their code break?

          Lucene never passes someone a HitCollector that they didn't create. So all the classes that folks create may need to remain subclasses of HitCollector, but classes used internally do not need to, and we can deprecate all public HitCollector implementations and provide new versions that extend Hitable.

          http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/class-use/HitCollector.html

          Does that address your concern?

          I'm not too fond of Hitable, but can't think of anything better. With that name, the method might better be called hit() than collect(), but that's a more invasive change. DocCollector? Collector?

          Mark Miller added a comment - edited

          Okay, so subclasses of HitCollector stay subclasses until 3.0, and then they extend Hitable? Sounds good.
          EDIT
          No, that won't work - I read your comment wrong: we have to provide new impls, and old impls will be wrapped. Got it. Patch coming.

          I kind of like Hitable myself. You are collecting a 'hit' with a Hitable object. A hit is a doc and score. Any other opinions? I'm less fond of other Collector variations. hit() could be better than collect(), but collect() is not such a stretch in my mind.

          EDIT

          One further thought, if we are replacing TopDocCollector, what would it become? TopDocDocCollector, TopDocHitable? I'm liking these names less...

          Michael McCandless added a comment -

          > I'm not a fan of control with exceptions. I am also not a fan of of isSupported methods though...I was leaning that way over exceptions...sounds like you lean the other way...? I guess it has its appeal in this case...

          I agree, though I leaned towards using an exception because 1) no new
          API needing future deprecation would then have been added, and 2) the
          expectation (but not forced, and this is a problem) is that over time
          all HitCollector subclasses would implement setBase, so the exception
          would be the exception (heh) not the rule.

          But I like Doug's proposal instead, since on upgrading to 2.9 you'll
          see your code is deprecated, which then allows us to drop it in 3.0.
          I have a slight preference for DocCollector.

          This means any methods that accept a HitCollector would also be
          deprecated, and we'd add a new method that takes DocCollector instead,
          and change the deprecated one to wrap a DocBaseCollector around the
          HitCollector and invoke the new method.

          > we can deprecate all public HitCollector implementations and provide new versions that extend Hitable

          Could we leave (eg) TopDocCollector extending HitCollector in 2.9, and
          then in 3.0 change it to directly extend DocCollector (Hitable)?
          (This saves having to deprecate TopDocCollector and at least 2
          others).

          > Don't we have a lot of latitude with 3.0?

          I think in 3.0, when changing APIs, we are only allowed to remove
          deprecated APIs from 2.9? Ie we can't do more drastic changes.

          Mark Miller added a comment -

          Could we leave (eg) TopDocCollector extending HitCollector in 2.9, and then in 3.0 change it to directly extend DocCollector (Hitable)? (This saves having to deprecate TopDocCollector and at least 2 others).

          If we do that, I don't think we can do our check to use base or not by checking for HitCollector, right?

          Mark Miller added a comment -

          Is that right? We better get going on the Java 5 conversion then...

          Michael McCandless added a comment -

          > If we do that, I don't think we can do our check to use base or not by checking for HitCollector right?

          Ahh right, so if a user outside creates a TopDocCollector & passes it in, we can't tell that it can handle setBase natively. Or we could test & catch the UOE (since HitCollector is deprecated, using an exception here will go away in 3.0).

          Michael McCandless added a comment -

          > Is that right? We better get going on java5 conversion then...

          Ahh right, that too. But, as of 3.0 we can start accepting/making changes to Lucene that require a 1.5 JRE; still, we can't just swap out APIs w/o going through deprecation first?

          Mark Miller added a comment -

          I thought there were some that wanted to change some of the API to Java
          5 for the 3.0 release, because I thought back compat was less restricted
          from 2 to 3. I guess maybe that won't end up happening; if it was going
          to, it seems we'd want to deprecate what will be changed in 2.9.

          - Mark
          Mark Miller added a comment -

          I think it's almost there. I still want to spend a little time looking it over, but I think it's looking good.

          Hoss Man added a comment -

          I haven't been following this issue too closely (or truthfully: at all) but if there's talk of deprecating HitCollector and introducing a new superclass to replace it, would it also make sense to revisit the idea of adding a return type to the collect/hit method? ... ie: an enum style result indicating "OK" or "ABORT" (with the potential of adding additional constants later ala FieldSelectorResult)

          I remember this coming up as a "wish list" back when TimeLimitedCollector was added (but I don't really remember if it was decided that the current implementation was actually better than if collect() did have a return value)

          Anyway, just something to ponder...

          public abstract class Hitable {
            public abstract void setBase(int base);
            public abstract HitStatus hit(int doc, float score);
          }
          /** @deprecated */
          public abstract class HitCollector extends Hitable {
            public abstract void collect(int doc, float score);
            public void setBase(int base) { throw new UnsupportedOperationException(); }
            public HitStatus hit(int doc, float score) { 
              collect(doc, score); 
              return HitStatus.OK;
            }
          }
          
          Michael McCandless added a comment -

          Mark, I got one hunk (HitCollector) that failed on applying the patch – looks like it's the $Id$ issue again (your area doesn't expand $Id$ tags). No problem – I just applied it manually.

          Michael McCandless added a comment -

          adding a return type to the collect/hit method? ... ie: an enum style result indicating "OK" or "ABORT" (with the potential of adding additional constants later ala FieldSelectorResult)

          I think we should consider this, though this then implies an if statement checking the return result & doing something on each hit, so we should test the cost of doing so vs the cost of throwing an exception instead (eg we could define a typed exception in this new interface which means "abort the search now" and maybe another to mean "stop searching & return the results you got so far", etc.).

          Uwe Schindler added a comment -

          adding a return type to the collect/hit method? ... ie: an enum style result indicating "OK" or "ABORT" (with the potential of adding additional constants later ala FieldSelectorResult)

          I think we should consider this, though this then implies an if statement checking the return result & doing something on each hit, so we should test the cost of doing so vs the cost of throwing an exception instead (eg we could define a typed exception in this new interface which means "abort the search now" and maybe another to mean "stop searching & return the results you got so far", etc.).

          This looks like a really good idea. Currently, to stop an iterator, I use an exception class that extends RuntimeException (to have it unchecked) to cancel a search. Very nice if you support it directly.
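
          For illustration, a minimal sketch of that pattern against the existing
          HitCollector API (the collector and exception names here are
          hypothetical, not Lucene classes):

          // Unchecked exception used purely to unwind out of the collect loop.
          class StopCollectingException extends RuntimeException {}

          class BoundedCollector extends org.apache.lucene.search.HitCollector {
            private int remaining;

            BoundedCollector(int limit) { this.remaining = limit; }

            public void collect(int doc, float score) {
              if (--remaining < 0) {
                throw new StopCollectingException(); // aborts the search early
              }
              // ... record the hit ...
            }
          }

          // Caller side:
          // try { searcher.search(query, new BoundedCollector(1000)); }
          // catch (StopCollectingException e) { /* search cancelled as intended */ }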

          Michael McCandless added a comment -

          Duh: I just realized that when we switched back to a single pqueue for
          gathering results across the N subreaders, we lost the original
          intended "benefit" for this issue. Hard to keep the forest in mind
          when looking at all the trees....

          Ie, we are now (again) creating a single FieldSortedHitQueue, which
          pulls the FieldCache for the entire MultiReader, not per-segment. So
          warming time is still slow, when sorting by fields.

          Really we've "stumbled" on 2 rather different optimizations:

          1. Run Scorer at the "sub reader" level: this gains performance
            because you save the cost of going through a MultiReader. This
            requires the new DocCollector class, so we can setDocBase(...).
          2. Do collection (sort comparison w/ pqueue) at the "sub reader"
            level: this gains warming performance because we only ask for
            FieldCache for each subreader. But, it seems to hurt search
            performance (pqueue comparison & insertion cost went up), so it's
            no longer a no-brainer tradeoff (by default at least).

          Given that #1 has emerged as a tentatively fairly compelling gain, I
          now think we should decouple it from #2. Even though #2 was the
          original intent here, let's now morph this issue into addressing #1
          (since that's what current patch does), and I'll open a new issue for
          #2?
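
          For illustration, optimization #1 amounts to a loop like this sketch
          (not the actual patch; DocCollector and setDocBase are the names under
          discussion, and the Scorer/Weight calls are the 2.4-era API):

          // (imports from org.apache.lucene.index/search and java.io omitted)

          // Hypothetical collector interface for the sketch.
          interface DocCollector {
            void setDocBase(int docBase);       // called once per sub-reader
            void collect(int doc, float score); // doc is relative to the base
          }

          // Score each sub-reader directly instead of going through a
          // MultiReader, telling the collector each segment's doc id base.
          static void searchSegments(IndexReader[] subReaders, Weight weight,
                                     DocCollector collector) throws IOException {
            int docBase = 0;
            for (int i = 0; i < subReaders.length; i++) {
              collector.setDocBase(docBase);
              Scorer scorer = weight.scorer(subReaders[i]);
              while (scorer.next()) { // 2.4-era Scorer iteration
                collector.collect(scorer.doc(), scorer.score());
              }
              docBase += subReaders[i].maxDoc(); // next segment's ids start here
            }
          }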

          Mark Miller added a comment -

          Ugg...you know I was afraid of that when I was making the change, but I easily convinced myself that FieldSortedHitQueue was just taking that Reader for AUTO detect and didn't really re-look. It also makes the comparators. Bummer. I guess let's open a new issue if we can't easily deal with it here (I've got to look at it some more).

          Mark Miller added a comment -

          I've got a quick idea I want to try to fix it.

          Mark Miller added a comment -

          Bah. You can't share that queue and get the reopen benefit without jumping through too many hoops. All of a sudden you can't use ordinals, comparators need to know how to compare across segments, and it just breaks down fast. How disappointing.

          Michael McCandless added a comment -

          Yeah, ugg. This is the nature of "progress"! It's not exactly a
          straight line from point A to B. Lots of fits & starts, dead ends,
          jumps, etc.

          We could simply offer both ("collect into single pqueue but pay high
          warming cost" or "collect into separate pqueues, then merge, and pay
          low warming cost"), but that sure is an annoying choice to have to
          make.

          Oh, here's another idea: do separate pqueues (again!), but after the
          first segment is done, grab the values for the worst scoring doc in
          the pqueue (assuming the queue filled up to its numHits) and use this
          as the "cutoff" before inserting into the next segment's pqueue.

          In grabbing that cutoff we'd have to 1) map ord->value for segment 1,
          then 2) map value->ord for segment 2, then 3) use that cutoff for
          segment 2. (And likewise for all segment N -> N+1).

          I think this'd greatly reduce the number of inserts & comparisons done
          in subsequent queues because it mimics how a single pqueue behaves:
          you don't bother re-considering hits that won't be globally
          competitive.

          We could also maybe merge after each segment is processed; that way
          the cutoff we carry to the next segment is "true" so we'd reduce
          comparisons even further.

          Would this work? Let's try to think hard before writing code.

          Mark Miller added a comment -

          We could simply offer both ("collect into single pqueue but pay high warming cost" or "collect into separate pqueues, then merge, and pay low warming cost"), but that sure is an annoying choice to have to make.

          Agreed. I really hope we don't have to settle for this.

          Oh, here's another idea:

          Good one! Keep those ideas coming.

          Would this work?

          It sounds like you've nailed it to me, but I'll let it float around in my head for a bit while I work on some other things.

          Let's try to think hard before writing code

          Now there's a new concept for me. My brain will work itself to death trying to avoid real work.

          Doug Cutting added a comment -

          > public abstract void setBase(int base);

          It occurred to me last night that this really has no place in HitCollector. We're forcing applications to handle an implementation detail that they really shouldn't have to know about. It would be better to pass the base down to the scorer implementations and have them add it on before they call collect(), no?

          Michael McCandless added a comment -

          > It would be better to pass the base down to the scorer implementations and have them add it on before they call collect(), no?

          So we'd add Scorer.setDocBase instead?

          The only downside I can think of here is that often you will perform the addition when it wasn't necessary.

          Ie, if the score is not competitive at all, then you wouldn't need to create the full docID and so you'd save one add opcode.

          Admittedly, this is a very small (tiny) cost, and I do agree that making HitCollector know about docBase is really an abstraction violation...

          Mark Miller added a comment - edited

          Oh, here's another idea: do separate pqueues (again!), but after the first segment is done, grab the values for the worst scoring doc in the pqueue (assuming the queue filled up to its numHits) and use this as the "cutoff" before inserting into the next segment's pqueue.

          We've got to try it. What's the hard part in this? Converting a value to an ord?

          EDIT

          Okay, I see, we can just find our place by running through new value Comparables.

          An added cost of going back to per-reader queues is that all doc id values (not ords) also need to be adjusted (for the MultiSearcher).

          Michael McCandless added a comment -

          OK, here's another tweak on the last proposal: maybe we could,
          instead, take the pqueue produced by segment 1 and "convert" it into
          the ords matching segment 2, and then do normal searching for segment
          2 using that single pqueue (and the same for all seg N -> N+1
          transitions)?

          For all numeric fields, the conversion is a no-op (their ord is
          currently the actual numeric byte, short, int, etc. value, though
          conceivably that could change in the future); only String fields, and
          custom (hmm) would need to do something.

          This should be more efficient than the cutoff approach because it'd
          result in fewer comparisons/inserts. Ie, it's exactly a single pqueue
          again, just with some "conversion" between segments. The conversion
          cost is near zero for numeric fields, and for string fields it'd be
          O(numHits*log2(numValue)), where numValue is the number of unique
          string values in the next segment for that sort field. I think for
          most cases (many more docs than numHits requested) this would be
          faster than the cutoff approach.

          Would that work?

          Yonik Seeley added a comment -

          segment 1 has terms: apple, banana, orange
          segment 2 has terms: apple, orange

          What is the ord of banana in segment2?

          Michael McCandless added a comment -

          > What is the ord of banana in segment2?

          How about 0.5?

          Ie, we just need an ord that means it's in-between two ords for the current segment.

          On encountering that, we'd also need to record its real value so that subsequent segments could look it up properly (or, if it survives until the end, to return the correct value "banana").

          Mark Miller added a comment - edited

          Okay, but how am I going to squeeze between two customs? I guess you'd have to store as a compare against either side?

          EDIT

          There is also the problem that all compares are done based on ScoreDocs that index into a single ord array by doc. The previous pq's ScoreDocs will not compare right - they won't index into the ord array for the current Reader - they are indexes into the array for the previous Reader. This is what made me give up on single pq earlier.

          EDIT

          I guess we put them on the ScoreDoc like we do the values for multisearcher? Then we could use a PQ like FieldDocPQ that used ords rather than vals?

          EDIT

          Hmmm...How do I get at the ordinals though? The value is exposed, but the ordinals are hidden behind a compare method...

          Mark Miller added a comment -

          Okay, in a pinch I guess we just grab the ordinals straight from the field cache and violate the comparator a bit. But we don't have the score docs until after running the collector, so we can't perch stuff on them.

          Hmm - we need something like: the current hits work with the standard get-ord mechanism (which has its own problems because we compare, we don't look at ords), and the last hits work with an ord on the ScoreDoc or something. It's all ugly stuff in my head.

          Michael McCandless added a comment -

          I'm exploring one possible approach, with a new Comparator API that's told when to switch to the next subReader (which gives it the chance to translate the ords in the queue). Not sure it'll work out yet though...

          Mark Miller added a comment -

          That's where I don't follow though - it's not ords in the queue, right? It's ScoreDocs. That's what's getting me at the moment.

          Michael McCandless added a comment -

          Attached initial patch (derived from one of the earlier patches).
          A lot of work remains. TestSort (and likely others) fail.

          > That's where I don't follow though - it's not ords in the queue, right? It's ScoreDocs. That's what's getting me at the moment.

          Exactly – so I built first cut at the alternative "copy value"
          approach, where the comparator (new FieldComparator abstract class) is
          responsible for holding the values it needs for docs inserted into the
          queue. I also added TopFieldValueDocCollector (extends DocCollector),
          and ByValueFieldSortedHitQueue (extends PriorityQueue) that interacts
          with the FieldComparators. (We can change these names...). I updated
          IndexSearcher to use this new queue for field sorting.

          This patch only handles SortField.{DOC,SCORE,INT} now, but I think the
          approach has early surprising promise: I'm seeing a sizable
          performance gain for the "sort by int field" case (13.76 sec vs 17.95
          sec for 300 queries getting top 100 hits from 1M results) --> 23%
          faster. I verified for the test sort alg (above) it's producing the
          right results (at least top 40 docs match).

          I didn't expect such a performance gain (I was hoping for not much
          performance loss, actually). I think it may be that although the
          initial value copy adds some cost, the within-queue comparisons are
          then faster because you don't have to deref back to the FieldCache
          array. It seems we keep accidentally discovering performance gains
          here.

          If we go forward with this approach I think it'd mean deprecating
          FieldSortedHitQueue & ScoreDocComparator, because I think there's no
          back-compatible way to migrate forward. I also like that this
          approach means we only need an iterator interface to FieldCache
          values (for LUCENE-831).

          Mark, can you look this over and see if it makes sense, and maybe try
          to tackle the other sort types? String will be the most interesting,
          but I think very doable.
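
          For a concrete picture of the "copy value" idea, here is a
          stripped-down sketch for the int case (hypothetical names and
          signatures, not the patch's exact API):

          // Values for queued docs are copied into per-slot storage, so
          // within-queue comparisons never dereference back into the
          // FieldCache array of a (possibly previous) reader.
          abstract class FieldComparator {
            abstract int compare(int slot1, int slot2); // compare two queue slots
            abstract void copy(int slot, int doc);      // stash doc's value in a slot
            abstract void setNextReader(int[] values);  // next reader's FieldCache
          }

          class IntComparator extends FieldComparator {
            private final int[] slotValues; // one value per queue slot
            private int[] current;          // current sub-reader's FieldCache ints

            IntComparator(int numHits) { slotValues = new int[numHits]; }

            int compare(int slot1, int slot2) {
              int a = slotValues[slot1], b = slotValues[slot2];
              return a < b ? -1 : (a == b ? 0 : 1);
            }

            void copy(int slot, int doc) { slotValues[slot] = current[doc]; }

            void setNextReader(int[] values) { current = values; }
          }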

          Mark Miller added a comment -

          Fantastic Mike! I've started working through the tests and I've corrected a small ordering problem and added most of the other basic types. This is great stuff I think, but I need some more time to digest it all. Still looks like String will be the interesting piece. I think we have to fill fields for the MultiSearcher as well, which is annoying, because the MultiSearcher will have to let the IndexSearcher know to do it, or we may do it for no reason. Then we just have to clear up the setBase stuff. Light at the end of the tunnel!

          Mark Miller added a comment -

          It seems there are a few hoops to jump through with strings beyond what I was thinking. We don't have the full array of unique terms now, but rather a bunch of smaller arrays of unique terms, with overlap. That makes converting Strings to ords very difficult, right? We almost have to create the full array. Also, we still have to do some fancy footwork to get a proper ord. Then we have to juggle and track things so that we can return the String value in getValue (we are more concerned with ords, so it's not fun). It seems a lot of spinning/tracking unless we go back to a full StringIndex. Perhaps it's just as good to just compare by String value and give up on ords? It really seems that by the time we jump through every hoop to create ords from Strings on a bunch of smaller StringIndexes, we will have at least eaten as much as it costs to do String.compareTo? Not a conclusion really though, I am still mucking...

          Any insight?

          Mark Miller added a comment - edited

          I was off on the fillFields stuff, I forgot we are back to a single HitCollector. EDIT - can't even follow my own thoughts - I wasn't off, you are just already handling the adjustment that needs to be made. I'd like to avoid filling fields unless we are in a MultiSearcher still though...

          I've got everything working except custom and locale stuff, in a suboptimal way anyway. String values rather than ords, and there is plenty that can probably be improved.

          4 tests still fail (3 with custom, 1 with locale). Still trying to lock down the best way to deal with String ords/values.

          Michael McCandless added a comment -

          > I've started working through the tests and I've corrected a small ordering problem and added most of the other basic types.

          Great!

          > Then we just have to clear up the setbase stuff.

          Yeah I think we should remove that (and use setNextReader instead).

          > That makes converting Strings to ords very difficult, right?

          Right (this was the challenging example Yonik brought up above).

          How about something like this: at any given time, the slots are filled
          with an instance that has 1) the ord (that "matches" the current
          reader) and 2) the actual value. When transitioning readers you have
          to remap all ords to the new reader (but keep the value unchanged):
          for each slot, you take its String value and look it up in the new
          reader w/ binary search. If it's present, assign it the corresponding
          ord. If it's not present, it must fall between ord X and X+1, so
          assign it an ord of X+0.5.

          Then proceed like normal, since all ords are now "matched" to your
          current reader. You compare with ords, never with values.

          The one caveat is... we have to take care with precision. A float
          X+0.5 won't be precise enough. We could use double, or, we could use
          long and "remap" all valid ords to be 2X (even) their value, and all
          "in between" ords (carried over from the previous reader) to be odd
          values. I think we'd need long for this since in the worst case, if
          you have the max number of docs and each doc has a unique string, you
          would have 2^31 unique ords to represent.
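
          For illustration, that remap step could look like the following
          sketch, using long ords and the even/odd trick to keep enough
          precision (names are hypothetical; lookup stands in for the new
          reader's sorted unique terms):

          // On transitioning readers, remap each queued slot's ord into the
          // new reader's term space. Terms that are present get even ords
          // (2*i); values that fall between two terms get the odd ord in
          // between, so cross-segment comparisons stay exact.
          static void remapOrds(String[] slotValues, long[] slotOrds, String[] lookup) {
            for (int slot = 0; slot < slotValues.length; slot++) {
              int idx = java.util.Arrays.binarySearch(lookup, slotValues[slot]);
              if (idx >= 0) {
                slotOrds[slot] = 2L * idx;           // present: even "matched" ord
              } else {
                int insertion = -idx - 1;            // first term greater than value
                slotOrds[slot] = 2L * insertion - 1; // absent: odd "in between" ord
              }
            }
          }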

          Uwe Schindler added a comment -

          You do not need to map to long, if you would use negative integers as a marker for in between. So doc 0 has -1, doc 1 has -2, doc 2 has -3, ..., doc (2^31-1 == Integer.MAX_VALUE) would have (-2^31 == Integer.MIN_VALUE).

          Just an idea.

          Uwe Schindler added a comment -

          Sorry, that does not work because of comparing, but you could subtract 2^31 from the doc number and multiply by two.

          Mark Miller added a comment -

          How about something like this:

          Okay, I actually got a ways down that path before I gave up...it was clearer in my head before I got the new code - that had me trying to map one ord at a time (my fault, not the code of course) - but right, I'd have to map them all on reader change. Clears up the haze a bit.

          Locale is in, custom coming. That will make all tests pass. Then I'll get back on this String stuff (switch to ords from the value stuff I have going now). Then cleanup. Very awesome stuff, thanks Mike.

          Mark Miller added a comment -

          What do we do about the old SortComparatorSource et al? I can't figure out a way to make old custom implementations work with this - they implement logic that compares ScoreDocs, and I can't see how I can make a new comparator that works based on custom SortComparators.

          Mark Miller added a comment - - edited

> You compare with ords, never with values.

But I still have to implement getValue for fillFields. That means either storing the value along with the ords (at each slot), or storing the orig ord plus the orig StringIndex it came out of for each slot... right?

          EDIT

          Okay, that answer is in your earlier comment anyway - we have to track value by slot as well. I'm almost there...
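
A sketch of that per-slot storage (the names here are invented for illustration, not taken from the patch):

class StringOrdSlots {
  final int[] ords;       // ord of each slot against the current reader
  final String[] values;  // original value, kept around for fillFields
  StringOrdSlots(int numHits) {
    ords = new int[numHits];
    values = new String[numHits];
  }
  Comparable sortValue(int slot) {
    return values[slot];  // what fillFields would report
  }
}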

          Michael McCandless added a comment -

          > You do not need to map to long, if you would use negative integers as marker for in between

          Actually a variation on this would work: we could logically do the even/odd ord mapping, but shift it down into negative numbers to make use of the full int range. That'd give us enough precision.
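
A minimal sketch of that shifted even/odd encoding, assuming fewer than 2^30 unique terms per reader so the shift cannot overflow (helper names are invented):

class ShiftedOrds {
  // exact matches get even codes, shifted down by Integer.MIN_VALUE
  static int exact(int ord) {
    return (ord << 1) + Integer.MIN_VALUE;
  }
  // a value falling between lowerOrd and lowerOrd+1 gets the odd code in between
  static int between(int lowerOrd) {
    return ((lowerOrd << 1) + 1) + Integer.MIN_VALUE;
  }
  public static void main(String[] args) {
    // a carried-over value between ords 0 and 1 sorts between their codes
    System.out.println(exact(0) < between(0) && between(0) < exact(1)); // true
  }
}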

          Michael McCandless added a comment -

          > What do we do about the old SortComparatorSource et al?

I think we have to deprecate SortComparatorSource and ScoreDocComparator (in favor of FieldComparator, and we should add a FieldComparatorSource too, I guess). Ie, custom sort implementations would need to migrate to the new API?

          Michael McCandless added a comment -

> Sorry, that does not work because of comparing, but you could subtract 2^31 from the doc number and multiply by two.

          Ahh I missed that (I said the same thing 4 comments later... moving fast here!).

          Mark Miller added a comment -

Does this have to wait for 3.0 then? How can we back-compatibly tell everyone they have to write new custom comparator implementations?

          Michael McCandless added a comment -

> Does this have to wait for 3.0 then? How can we back-compatibly tell everyone they have to write new custom comparator implementations?

          If we deprecate old and introduce new in 2.9, then we can remove old in 3.0, right?

          Mark Miller added a comment -

But if you are using the old one, it won't work with the new Searcher methods? I guess I'll have to think about it some more.

I've got the String stuff working (I think) except for one issue... if a bunch of ords already in the queue map to the same index, they need to be adjusted to be in order (eg when you convert a value, it's not in the terms list for the next binary search). I tried some naive tricks to adjust the number based on the old ord, but no real progress yet...

          Michael McCandless added a comment -

> But if you are using the old one, it won't work with the new Searcher methods? I guess I'll have to think about it some more.

I think the search methods, on seeing that there is a SortField.CUSTOM type and then seeing that it's a SortComparatorSource in there, would fall back to the current impl, but all other Sorts would use the new one?

          > if a bunch of ords already in the queue map to the same index, they need to be adjusted to be in order

Hmmm! Perhaps on comparing two "odd" values that are the same we could fall back to a true String.compareTo (I don't like that added if, though; it should be rare). Or, better, we could add a "subord" to break the tie. That subord'd have to be computed by gathering all String values that sorted to the same "odd ord" and sorting them to assign subords.

          Yonik Seeley added a comment -

> If it's not present, it must fall between ord X and X+1 so assign it an ord of X+0.5.

          I went down this thought path in the past... float doesn't have the precision, double might, could use long and left shift to make space for "inbetween", etc.

          One problem is that putting an ord "inbetween" isn't good enough since you may be mapping many values. At that point, one starts wondering if it's worth getting an ord, and what is solved or optimized by using ords.

          Why are ords useful:
          (a) fast comparison when reordering the priority queue (insertion/removal of items)
          (b) fast comparison to determine if something should be inserted into the priority queue
          (c) other (non sorting) usage, using term numbers instead of term values.

          Using ords for (a) seems expensive, as all ords in the queue must be translated.
          Using ords for (b) means calculating an ord only for the smallest element of the priority queue.

When doing a large search with a diverse set of keys to sort by, there are many more checks to see if an item should be inserted into the queue than actual insertions. Also, when using an ord for (b), it doesn't have to be exact - it can be rounded, since its purpose isn't an exact comparison, but just an optimization to determine if insertion into the priority queue can be skipped.
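
A sketch of case (b), just to pin down why a rounded ord is safe there: skipping must be conservative, so the cheap test may only reject hits that definitely cannot enter the queue (names are invented):

class BottomGate {
  int bottomOrdUpper;  // upper bound on the bottom entry's position in this segment
  String bottomValue;  // exact value of the weakest queued entry

  // ascending sort: true only when the hit definitely cannot make the queue
  boolean skip(int docOrd, String docValue) {
    if (docOrd > bottomOrdUpper) {
      return true;                               // cheap ord-only reject
    }
    return docValue.compareTo(bottomValue) >= 0; // exact compare as fallback
  }
}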

          Mark Miller added a comment -

> One problem is that putting an ord "inbetween" isn't good enough since you may be mapping many values.

> I don't like that added if, though; it should be rare

Not rare, right? Let's say there are 20 values in the queue, and now the new reader has 2 values. Lots will map to the same index.

I'm still digesting the rest of what Yonik said. I don't have this String ord stuff working 100%, but I have it sort of working (it seems to sort pieces right, but then the pieces are out of order). I don't think the precision exists to do what I'm doing anyway, but since it kind of works, I could prob bench the diff of using values rather than ords.

          Mark Miller added a comment -

> But if you are using the old one, it won't work with the new Searcher methods? I guess I'll have to think about it some more.

>> I think the search methods, on seeing that there is a SortField.CUSTOM type and then seeing that it's a SortComparatorSource in there, would fall back to the current impl, but all other Sorts would use the new one?

          Ahhh...I was afraid of the ugliness. All right.

          Michael McCandless added a comment -

> Not rare, right? Let's say there are 20 values in the queue, and now the new reader has 2 values. Lots will map to the same index.

True, not rare in that case, but then the added cost will be tiny since the new segment has relatively few hits.

          Michael McCandless added a comment -

          > At that point, one starts wondering if it's worth getting an ord, and what is solved or optimized by using ords.

          I agree: all this complexity may not be worth it, if simple
          compare-by-value in fact performs well enough. I think we should
          benchmark both.

Mark, if you get things to a semi-stable state & post a patch we
can do some benching. You could actually just have two SortField
types (STRING and STRING_ORD); then we could easily swap
back & forth.

          Mark Miller added a comment - - edited

> Or, better, we could add a "subord" to break the tie. That subord'd have to be computed by gathering all String values that sorted to the same "odd ord" and sorting them to assign subords.

That seems like a bit of work to manage. What if I just kept a separate double subords array that holds the old mapping? If two slots are equal, we can fall back to that? We would have the whole new array, but the other way we have to track what maps to the same index, sort that, and then have a map or another array anyway. What do we save? We could have an int[] rather than a double[]? But a lot more juggling...

          EDIT
Nm... that logic doesn't quite work...

          EDIT

Or it does... man, I have a scatterbrain and a half. I got it to work (at least partially - the tests pass). I'll now work on getting something up for you to test with. My main worry at this point is that I'm doing things inefficiently enough that it's not a fair test.

          Mark Miller added a comment -

          Quick question: what should I use for the value version...Strings or StringIndex? StringIndex takes less mem, but requires two array lookups.

          Mark Miller added a comment -

Here you go. I'm sorry I've made everything so messy - I'll clean it up later.

          There is now SortField.STRING_VAL and STRING_ORD.

STRING uses StringIndex,
STRING_VAL uses String[],
STRING_ORD uses ordinals.

I haven't benched anything yet myself.

There is still a lot of cleanup to do beyond this String stuff (and the old comparators still don't work).

I haven't gotten over everything post-scramble yet: hope I'm not doing anything too stupid. Sort tests pass other than the custom comparator.

          Mark Miller added a comment -

          One small error:

+ public int sortType() {
+   return SortField.STRING_VA;
+ }

should be

+ public int sortType() {
+   return SortField.STRING_VAL;
+ }

          Mark Miller added a comment -

That actually had a System.out as well. Another patch that takes care of the above error and the System.out, and a manual fix of the HitCollector $Id$.

          Michael McCandless added a comment -

Thanks Mark. FWIW, when I generate a patch (with "svn diff") that is near the $Id$ tag, I too cannot apply the patch. So it seems like "svn diff" has some "smarts" whereby it un-expands a keyword, thus screwing up the patch. We really need that "svn patch" command... (which IIRC is coming in an upcoming svn release).

          Mark Miller added a comment -

          Some more quick comments:

the StringIndex value version does too much array dereferencing, so I'd fix that, or just use the String version.

I'm doing ordinals by keeping a second double subord array. Every time ords are mapped to the new IndexReader, if an ord doesn't map directly onto the new terms array and the subord is not 0, I multiply the old ord mapping into the subord and give it the new ord - i.e. the subord becomes the old ord times the current subord, and the ord is updated. When comparing, if two ords are the same, we drop to the subord.

I haven't thought about precision issues (it prob doesn't work, but I don't know), but it works for the tests.

          Michael McCandless added a comment -

BTW, one important difference w/ the new TopFieldValueDocCollector is that it does not track the max score – we probably need to add that back in, until Hits is removed in 3.0 (is it needed beyond that?).

          Michael McCandless added a comment -

          There's something wrong w/ the SortField.STRING_ORD case – I'm trying "sort by title" w/ a Wikipedia index, and while I see the same results in a clean checkout vs STRING_VAL, I get different (wrong) results for STRING_ORD.

          Clean checkout & STRING_VAL get:

          hit 0: docID=1974521 title="Born into Trouble as the Sparks Fly Upward."
          hit 1: docID=688913 title="Into The Open" Exhibition
          hit 2: docID=1648 title="Love and Theft"
          hit 3: docID=599545 title="Repent, Harlequin!" Said the Ticktockman
          hit 4: docID=349499 title="The Spaghetti Incident?"
          

          but STRING_ORD gets this:

          hit 0: docID=599545 title="Repent, Harlequin!" Said the Ticktockman
          hit 1: docID=688913 title="Into The Open" Exhibition
          hit 2: docID=1974521 title="Born into Trouble as the Sparks Fly Upward."
          hit 3: docID=992439 title='Abd al-Malik II
          hit 4: docID=1951563 title='Auhelawa language
          

          I haven't tried to track it down yet...

          Mark Miller added a comment -

> I haven't tried to track it down yet...

I wouldn't. The method I am using passes the tests, but it's not likely viable (maybe even on a smaller scale than I would have guessed, based on what you're seeing). But since it's close, I figure a benchmark of it against using values should tell us a lot about whether it makes sense to keep pushing with ord. Ord will no doubt end up a little slower if it's to work properly, but comparing them now should give us a gauge of using values instead. That's what we were looking for, right?

I still have tons I want to look at in this patch (and hopefully some ideas/suggestions). I haven't looked at it at all in that context yet though. I merely sat down and made the tests pass one by one, with little consideration for anything else.

          Mark Miller added a comment -

Another thing I'll do is add another sort test - the tests may not hit all of the edge cases - I don't think they hit compare(ord, doc, score) at all, for one (if I am remembering right).

          Mark Miller added a comment -

I just made a quick new test for what I'm doing with ords - it seems once I add more than about 500 docs, one or two are out of order - and that problem compounds as the number goes up. Either a special case I am missing, or precision issues.

          Michael McCandless added a comment -

          OK I ran an initial test, though since the ord approach is a "bit"
          buggy we can't be sure how well to trust these results.

I indexed the first 2M docs from Wikipedia into a 101-segment index,
then searched for "text" (97K hits), sorting by title and pulling the
best 100 hits. I do the search 1000 times in each round.

          Current trunk (best 107.1 searches/sec):

Operation            round   runCnt   recsPerRun        rec/s  elapsedSec    avgUsedMem    avgTotalMem
XSearchWarm              0        1            1          0.0       93.64   463,373,760  1,029,046,272
XSearchWithSort_1000     0        1         1000        100.6        9.94   463,373,760  1,029,046,272
XSearchWithSort_1000     1        1         1000        107.1        9.34   572,969,344  1,029,046,272
XSearchWithSort_1000     2        1         1000        105.5        9.48   572,969,344  1,029,046,272
XSearchWithSort_1000     3        1         1000        106.2        9.41   587,068,928  1,029,046,272
          

          Patch STRING_ORD (best 102.0 searches/sec):

Operation            round   runCnt   recsPerRun        rec/s  elapsedSec    avgUsedMem    avgTotalMem
XSearchWarm              0        1            1          0.5        2.16   384,153,600  1,029,046,272
XSearchWithSort_1000     0        1         1000         94.1       10.63   439,173,824  1,029,046,272
XSearchWithSort_1000     1        1         1000        100.7        9.93   439,173,824  1,029,046,272
XSearchWithSort_1000     2        1         1000        101.9        9.81   573,822,208  1,029,046,272
XSearchWithSort_1000     3        1         1000        102.0        9.81   573,822,208  1,029,046,272
          

          Patch STRING_VAL (best 34.6 searches/sec):

Operation            round   runCnt   recsPerRun        rec/s  elapsedSec    avgUsedMem    avgTotalMem
XSearchWarm              0        1            1          0.4        2.24   368,201,088  1,029,046,272
XSearchWithSort_1000     0        1         1000         34.6       28.94   415,107,648  1,029,046,272
XSearchWithSort_1000     1        1         1000         33.9       29.54   415,107,648  1,029,046,272
XSearchWithSort_1000     2        1         1000         33.9       29.46   545,339,904  1,029,046,272
XSearchWithSort_1000     3        1         1000         34.0       29.40   545,339,904  1,029,046,272
          

          Notes:

          • Populating the field cache on trunk for MultiReader is
            fantastically costly (94 sec). The IO cache was already hot so
            this isn't IO latency. I think MultiTermEnum/Docs behaves badly
            for this use case (single unique term (title) per doc). We really
            need to switch to column-stride fields, not un-invert, for this.
          • For this case at least STRING_ORD is still quite a bit faster than
            STRING_VAL; however, it's still buggy. Maybe a smaller queue size
            (eg 10 or 20) would make them closer.
          • STRING_ORD is still a bit slower than trunk's sort; hopefully once
            tuned it'll be closer.

          I think we now need to fix the STRING_ORD bug & retest.

          Mark Miller added a comment -

Thanks Mike. Yup - that says enough for me. We have to get ords working. I don't think ords will get faster though (you wouldn't surprise me, though) - I think they will get slower. Certainly they should stay much faster than value though - what a dog. But who knows - we can test falling back to value compare instead of subords as well. My naive ords attempt now keeps the previous mapping's ord order by multiplying it as we move to new readers - that's going to explode into some very large numbers pretty fast, and I don't expect we can get by so easily. Either fall back to value will be good enough, or we will prob have to map to new ords rather than simply multiplying to retain each stage's ordering.

I'll keep playing with the ord on my end though - I only got it to pass those tests moments before that patch went up. I try to keep a frantic pace because I never know if my spare cycles will go away - I have juggled a defensive plate for a while. No doubt I'll squeeze some more hours in tonight though.

          Mark Miller added a comment -

I'm starting to think fall back to value won't be so bad. I'll give you another cut that does fall back to value, and whatever I have to do to get the correct subord stuff right.

          Michael McCandless added a comment -

          We should definitely try fallback compare-by-value.

But, sort by title (presumably a unique key) is actually a worst case
for us, because all values in the queue will not exist in the next
segment. So it's a good test. We should also test sorting by an
enum field ("country", "state").

          Thinking more about how to compute subords... I think we could store
          ord & subord each as int, and then efficiently translate them to the
          next segment with a single pass through the queue, in sort key order.
          This would ensure we hit all the dups (different Strings that map to
          the same ord in the next segment, but different subords) in one
          cluster. And, the subord could be easily computed by simply
          incrementing (starting with 1) in key sort order, until the cluster is
          done.

          It should be simple to step through the pqueue's heap in sort order
          min->max (w/o removing the entries which is the "normal" heapsort way
          to sort the elements); you'd need to maintain some sort of queue to
          keep track of the "frontier" as you walk down the heap. But I haven't
          found a cookbook example yet... It should be fast since we can use
          the ord/subords in the queue for all within-queue comparisons.

          We could also save time on the binary search by bounding the search by
          where we just found the last key. It may be worth tracking the max
          value in the queue, to bound the other end of the search. For a big
          search the queue should have a fairly tight bound.
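
A sketch of that bounded lookup, using the same miss encoding as java.util.Arrays.binarySearch (a standalone helper; the patch may structure this differently):

class BoundedSearch {
  // search only between the previous hit's position (lo) and the
  // position bounding the queue's max value (hi), both inclusive
  static int binarySearch(String[] a, String key, int lo, int hi) {
    while (lo <= hi) {
      int mid = (lo + hi) >>> 1;
      int cmp = a[mid].compareTo(key);
      if (cmp < 0) lo = mid + 1;
      else if (cmp > 0) hi = mid - 1;
      else return mid;
    }
    return -(lo + 1);  // -(insertionPoint) - 1, like Arrays.binarySearch
  }
}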

          Mark Miller added a comment -

> We should also test sorting by an enum field ("country", "state").

I'm actually trying it on random data that I am creating, and I'm getting different results. Oddly, in many cases Values seem to beat Trunk.

> We could also save time on the binary search by bounding the search by where we just found the last key. It may be worth tracking the max value in the queue, to bound the other end of the search. For a big search the queue should have a fairly tight bound.

Right, I've been thinking of ways to cut that down too. It's definitely binary searching much more than it needs to. That seems to be a big ord slowdown from what I can tell - the fallback-to-value compare is actually appearing slower than straight by-value. I've made a couple of small optimizations, but there's definitely more.

> Thinking more about how to compute subords...

Cool. Great idea to think about. My main search has been for a way to get subord as an int. Getting both to int would certainly be optimal. Everything I've come up with seems too expensive though - I'll try to run with that idea.

          Michael McCandless added a comment -

          Ugh – that approach won't work, because the pqueue in the collector
          is not necessarily sorted primarily by our field (eg if the String
          sort field is not the first SortField). So we don't have a fast way
          to visit the keys in sorted order.

          Mark Miller added a comment -

          Okay, I just full on did it inefficiently, and maybe I can work backwards a little. Seems to be solid.

I just collect the old ords by making a map with the mapped-to index as the key. The map has List values, and when a list gets more than one entry, the entry is added to a morethanone set. After mapping all the ords, I go through the morethanone set and sort each list - each subord is then set based on its index in the sorted list.

We already knew that was easy enough - I just think it's probably on the terribly inefficient side. Now to think about whacking pieces off. It just makes me not very hopeful to start at something so slow. And still the double ords. Perhaps the negative int could still come into play, though.

          Way too many little objects being made...
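
Roughly, that pass might look like the following sketch (invented names, generics for brevity, and an int lower-bound ord convention; the actual patch keeps different bookkeeping):

import java.util.*;

class NaiveSubordRemap {
  static void remap(final String[] values, int[] ords, int[] subords,
                    String[] newLookup) {
    // bucket the slots whose values miss in the new segment, keyed by mapped-to ord
    Map<Integer,List<Integer>> buckets = new HashMap<Integer,List<Integer>>();
    for (int slot = 0; slot < values.length; slot++) {
      int idx = Arrays.binarySearch(newLookup, values[slot]);
      if (idx >= 0) {
        ords[slot] = idx;
        subords[slot] = 0;      // exact match: no tie to break
      } else {
        ords[slot] = -idx - 2;  // lower bound in the new segment
        List<Integer> bucket = buckets.get(ords[slot]);
        if (bucket == null) {
          bucket = new ArrayList<Integer>();
          buckets.put(ords[slot], bucket);
        }
        bucket.add(slot);
      }
    }
    // sort each colliding bucket by value, then assign subords in sorted order
    for (List<Integer> bucket : buckets.values()) {
      Collections.sort(bucket, new Comparator<Integer>() {
        public int compare(Integer a, Integer b) {
          return values[a].compareTo(values[b]);
        }
      });
      for (int i = 0; i < bucket.size(); i++) {
        subords[bucket.get(i)] = i + 1;  // subords start at 1
      }
    }
  }
}

A length-1 bucket trivially gets subord 1, which matches the shortcut suggested a few comments below.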

          Michael McCandless added a comment -

          Can you post your inefficient remapping version?

          I too created the fallback version (just set subord to -1 when value
          isn't found in the new index & respect that in the two compare
          methods). I confirmed it gives correct top 50 results. This then
          brings ord perf to 97.6 searches/sec (vs trunk 107.1 searches/sec), so
          that's our number to beat since it seems to be bug-free.
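
A sketch of those two compare paths (field names invented; ords here are small per-segment positions, so plain subtraction is safe):

class OrdFallbackCompare {
  int[] ords;      // per-slot ord against the current reader
  int[] subords;   // -1 marks "value not present in this segment"
  String[] values; // kept per slot for the fallback compare

  int compare(int slot1, int slot2) {
    if (ords[slot1] != ords[slot2]) {
      return ords[slot1] - ords[slot2];
    }
    if (subords[slot1] == -1 || subords[slot2] == -1) {
      return values[slot1].compareTo(values[slot2]);  // fall back to value
    }
    return subords[slot1] - subords[slot2];
  }
}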

          Then I ran "MatchAllDocsQuery" (to test a much larger result set –
          this returns 2M hits but previous query "text" returned ~97K hits),
          sorting by title, queue size=100. Trunk still has unbelievely slow
          warming (95 sec), and then gets 7.6 searches/sec. Patch ord search
          (with fallback) gets 30.7 searches/sec.

          This is very interesting and odd – I can't explain why ord searching
          w/ fallback is so much faster than current trunk when the number of
          hits is large (2M). I think this is very important because it's the
          big slow queries that are most important to improve here, even if it's
          at some cost to the queries that are already fast.

          Ie, we still need to do more tests, but if this result holds (and we
          need to explain the difference), I think it's a strong vote for the
          ord+fallback approach. Not to mention, it also sidesteps the absurdly
          slow warming time of FieldCache.StringIndex on a Multi*Reader.

          Michael McCandless added a comment -

          When I run MatchAllDocsQuery (2M hits), with a queue size of 10 instead of 100, trunk still gets 7.6 searches/sec and ord w/ fallback gets 33.1 searches/sec.

          Michael McCandless added a comment -

          I tested the query "1", which gets 384K hits. Trunk gets 48.1 searches/sec and ord w/ fallback gets 55.0.

          So somehow as the result set gets larger, with a crossover somewhere between 97K hits (query "text") and 384K hits (query "1"), the ord w/ fallback becomes faster and then gets much faster as the result set gets quite large (2M hits).

          Mark Miller added a comment -

          Okay, here is the super inefficient ords version.

          SortField.STRING_VAL: sort by val
          SortField.STRING_ORD: sort by ord and subord
          SortField.STRING_ORD_VAL: sort by ord fallback to val

          Those multireader fieldcache loading times blow me away...crazy.
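
For reference, selecting among the modes would look something like this (STRING_VAL / STRING_ORD / STRING_ORD_VAL are patch-only constants, not part of any released SortField):

import java.io.IOException;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TopFieldDocs;

class SortModeExample {
  // swap the constant to bench STRING, STRING_VAL, STRING_ORD, STRING_ORD_VAL
  static TopFieldDocs topByTitle(IndexSearcher searcher, Query query)
      throws IOException {
    Sort sort = new Sort(new SortField("title", SortField.STRING_ORD_VAL));
    return searcher.search(query, null, 100, sort);
  }
}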

          Mark Miller added a comment -

> I too created the fallback version (just set subord to -1 when value isn't found in the new index & respect that in the two compare methods).

Yours may be better than mine then - I got rid of the subord array for the fallback, and if the ords are equal, I do a compare on the values.

          Mark Miller added a comment - - edited

> This is very interesting and odd - I can't explain why ord searching w/ fallback is so much faster than current trunk when the number of hits is large (2M).

I was testing with randomly created data (just a random number of digits (2-8), each digit randomly 0-9), and in a lot of what I was doing, straight values seemed to handily beat straight ords! It depended on how many docs I was making and how segmented I made it, I think. It wasn't very official, and the test was somewhat short, but I ran it over and over... seemed odd.

          Mark Miller added a comment - - edited

I guess I can't get away with my new index calculation either: ords[i] = ((-index << 1) - 3) / 2.0d;

Index is an int, so it's going to overflow.

EDIT

Wait... that's based on unique terms per reader, not the possible number of docs... guess it can stay.
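
Unpacking that expression against java.util.Arrays.binarySearch's miss encoding (a miss returns -(insertionPoint) - 1), it simplifies to insertionPoint - 0.5, i.e. halfway between the two surrounding ords. A tiny check:

import java.util.Arrays;

class InBetweenOrdCheck {
  public static void main(String[] args) {
    String[] lookup = { "apple", "orange" };            // the new segment's terms
    int index = Arrays.binarySearch(lookup, "banana");  // miss: returns -2
    double ord = ((-index << 1) - 3) / 2.0d;
    System.out.println(ord);  // 0.5: between apple (ord 0) and orange (ord 1)
  }
}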

          Mark Miller added a comment - - edited

Or can it? Over int.max/2 unique ids and then sorting on id would be broken, right? Okay, it would be kind of nuts to try and sort on that many unique terms, but in the future?...

EDIT

Actually, one seg would need int.max/2, but you know what I mean...

EDIT

Okay, I guess my argument with JIRA cleared up - you'd have to have the second segment or later with over int.max/2 terms. Do we care about such an insane possibility?

          Michael McCandless added a comment -

> I was testing with randomly created data (just a random number of digits (2-8), each digit randomly 0-9)

          Can you post a patch for this? Seems handy to fix contrib/benchmark to be able to generate such a field...

          > straight values seemed to handily beat straight ords!

          I'll try to test this case too; we need to understand why natural data (Wiki titles) shows one thing but synthetic data shows the opposite. And we still need to test the enum case.

          Michael McCandless added a comment -

> I guess I can't get away with my new index calculation either: ords[i] = ((-index << 1) - 3) / 2.0d;

          I think you can use an int for the ords? Now that we have subord,
          when you get negative index back from binary search, you can set ord
          to -index-1 which is the "lower bound", and then as long as subord is
          at least 1 it should compare correctly.

          Also, in your 2nd pass, if the list is length 1 then you can
          immediately set subord to 1 and move on.

          In your first pass, in the "else" clause (when the value was found in
          the next segment) don't you need to set subord to 0?

          Mark Miller added a comment - - edited

> I think you can use an int for the ords? Now that we have subord, when you get negative index back from binary search, you can set ord to -index-1 which is the "lower bound", and then as long as subord is at least 1 it should compare correctly.

Ah, indeed - no need for an in-the-middle value if you can fall to the subord. I think I may have been thinking it would be nice not to fall through to the subords so much, but surely it's worth losing the double and doing the average. It seems to like -2, not -1, though; then I start the subords at 1 rather than 0... I'll bring real thought to it later, but that's generating some good-looking results.

          EDIT

Arg - with -2 one test fails, with -1 it passes, but reverse sort fails. Prob is 1 one then, and I've got a small issue elsewhere.

> In your first pass, in the "else" clause (when the value was found in the next segment) don't you need to set subord to 0?

Hmm... I had that commented out as I was playing... let me think - if you don't set it, and multiple numbers don't map to clean new ords and they are the same, it will fall to the subords... so right, you wouldn't want an old subord around.

          I'm hoping there is a lot more optimization we can do to the pure ords case, but frankly I have low hopes for it - it should be a bit more competitive though.

I'm going to take some time to finish up some other pieces and then come back again, I think. I've polished up the comparators a bit, so once I get some other work in I'll put up another rev.
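
The -2 matches java.util.Arrays.binarySearch's encoding: a miss returns -(insertionPoint) - 1, so the lower-bound ord is -index - 2 (while -index - 1 is the insertion point itself). A quick check:

import java.util.Arrays;

class LowerBoundCheck {
  public static void main(String[] args) {
    String[] lookup = { "apple", "orange" };
    int index = Arrays.binarySearch(lookup, "banana");  // -2
    int insertionPoint = -index - 1;  // 1: would slot in before "orange"
    int lowerBoundOrd = -index - 2;   // 0: the ord of "apple"
    System.out.println(insertionPoint + " " + lowerBoundOrd);  // prints "1 0"
  }
}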

          Mark Miller added a comment -

> Admittedly, this is a very small (tiny) cost, and I do agree that making HitCollector know about docBase is really an abstraction violation...

          I'm not sold either way. Push to scorer?

          Michael McCandless added a comment -

          But, with this new approach (single pqueue that "knows" when we transition to the next segment) aren't we back to HitCollector not only knowing the doc base but also the IndexReader we are now advancing to? (We should remove the setDocBase call).

          Or I guess we could pre-add the doc (in Scorer) so that collect is called with the full docID.

Still, that "tiny" performance cost nags at me. Most of the time the add would not have been necessary, since the number of inserts into the pqueue should be a small percentage for a large number of hits. And this is the hotspot of searching for Lucene, so maybe we should not add on this cost even if it's tiny? And we can always wrap a collector that doesn't implement setIndexReader and pre-add the docId for it. It's like an "expert" DocCollector API vs the normal one.
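
As a sketch of the two options side by side (rough shape only, with invented class names; the exact committed API may differ):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.HitCollector;

// the "expert" collector: the searcher announces each segment's reader
// and doc base, and collect() then sees cheap per-segment docIDs
abstract class ExpertCollector extends HitCollector {
  public abstract void setNextReader(IndexReader reader, int docBase)
      throws IOException;
}

// the wrapping fallback for legacy collectors: pre-add the doc base so
// the old collector keeps seeing absolute docIDs
class LegacyAdapter extends ExpertCollector {
  private final HitCollector legacy;
  private int docBase;
  LegacyAdapter(HitCollector legacy) { this.legacy = legacy; }
  public void setNextReader(IndexReader reader, int docBase) {
    this.docBase = docBase;
  }
  public void collect(int doc, float score) {
    legacy.collect(docBase + doc, score);  // re-base to an absolute id
  }
}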

          Michael McCandless added a comment -

          > Prob is 1 one then, and I've got a small issue elsewhere.

          I think we want ord to be the lower bound. EG if seg 1 has:

          apple -> 0
          banana -> 1
          orange -> 2

          and then seg 2 has just apple & orange, then banana should map to ord 0 subord 1, meaning it's between ord 0 & 1, I think?

          And an exact match (apple & orange in this case) should have subord 0.

          > I'm hoping there is a lot more optimization we can do to the pure ords case, but frankly I have low hopes for it - it should be a bit more competitive though.

          I think the perf gains are very compelling, already, for the ord fallback case & title sorting. Small result sets are slower, but large result sets are substantially faster, than current trunk. Not to mention much faster warming time (side-stepping the weirdness with Multi*Reader, and only loading the FieldCache for new segments).

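          (Aside: a small self-contained demo of this lower-bound mapping. It is illustrative only; none of these names come from the patch.)

            import java.util.Arrays;

            // Demo of the ord/subord idea for the example above: seg 2 holds
            // {apple, orange}, and banana from seg 1 must land between them.
            // An exact hit keeps subord 0; a miss maps to the lower-bound ord
            // with subord > 0.
            public class OrdMappingDemo {
              public static void main(String[] args) {
                String[] seg2Terms = {"apple", "orange"};
                String[] queued = {"apple", "banana", "orange"};
                for (int i = 0; i < queued.length; i++) {
                  int index = Arrays.binarySearch(seg2Terms, queued[i]);
                  int ord, subord;
                  if (index >= 0) {
                    ord = index;       // exact match
                    subord = 0;
                  } else {
                    ord = -index - 2;  // lower bound (term just below)
                    subord = 1;        // non-zero: between ord and ord + 1
                  }
                  System.out.println(queued[i] + " -> ord " + ord
                      + " subord " + subord);
                }
                // prints: apple -> ord 0 subord 0
                //         banana -> ord 0 subord 1
                //         orange -> ord 1 subord 0
              }
            }
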
          Mark Miller added a comment - - edited

          Okay, I guess that's fair enough. I was going to push the fallback sorted search (with the old Customs) down to a single reader too, but that's not actually worth keeping setDocBase for (or needed, now that I think about it). So setNextReader brings the same abstraction argument though. But what can you do, I guess - the benefits are clearly worth it, and those comparators need access to the current subreader.

          Mark Miller added a comment -

          I think the perf gains are very compelling, already, for the ord fallback case & title sorting. Small result sets are slower, but large result sets are substantially faster, than current trunk.

          Oh, I agree there - I think this patch still makes perfect sense - it brings a lot of gains. I just don't think that ords without fallback is going to get very good. I'm wondering if we should even try too hard if ord with val fallback does so well.

          Mark Miller added a comment -

          and then seg 2 has just apple & orange, then banana should map to ord 0 subord 1, meaning it's between ord 0 & 1, I think?

          apple -> 0
          orange -> 1

          the binary search gives back -insertionPoint - 1, the insertion point for banana is 1, so -1 - 1 = -2. So I reverse that and subtract 2 to get 0, right? It lands on apple. Then on sort, apple comes first for 0, 1 and then orange is 0, 2.

          (I don't remember offhand why subord has to start at 1 not 0, but I remember it didn't work otherwise.)

          Michael McCandless added a comment -

          > the binary search gives back -insertionPoint - 1, the insertion point for banana is 1, so -1 - 1 = -2. So I reverse that and subtract 2 to get 0, right? It lands on apple.

          Hmm – I didn't realize binarySearch is returning the insertion point on a miss. So your logic (negate then subtract 2) makes perfect sense now.

          Just to be sure... maybe you should temporarily add asserts, when a negative index is returned, that values[-index-2].compareTo(newValue) < 0 and values[-index-1].compareTo(newValue) > 0 (making sure those array accesses are in bounds)?

          > (I don't remember offhand why subord has to start at 1 not 0, but I remember it didn't work otherwise.)

          This is very important – that 1 is "equivalent" to the original 0.5 proposal, i.e., think of subord as the 2nd digit in a 2-digit number. That 2nd digit being non-zero is how we know that even though banana's ord landed on apple's, banana is in fact not equal to apple (because the subord for banana is > 0) and is instead between apple and orange.

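          (Aside: those temporary asserts, sketched as a helper. lookup stands in for the segment's sorted term array, as in FieldCache.StringIndex; the method name is invented.)

            // Hypothetical helper wrapping the lower-bound binary search with
            // the suggested sanity checks, bounds-checked as noted above.
            static int checkedLowerBound(String[] lookup, String val) {
              int index = java.util.Arrays.binarySearch(lookup, val);
              if (index >= 0) {
                return index;                      // exact match
              }
              int ord = -index - 2;
              // the entry at ord must sort below val...
              assert ord < 0 || lookup[ord].compareTo(val) < 0;
              // ...and the next entry must sort above it
              assert ord + 1 >= lookup.length
                  || lookup[ord + 1].compareTo(val) > 0;
              return ord;
            }
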
          Michael McCandless added a comment -

          > I just don't think that ords without fallback is going to get very good. I'm wondering if we should even try too hard if ord with val fallback does so well.

          Maybe we can try a bit more (I'll run perf tests on your next iteration here?) and then start wrapping things up? Progress not perfection! We can further improve this later.

          Mark Miller added a comment -

          I'm on board with whatever you think is best.

          I'll keep playing with ords.

          I spent some time last night putting in most of the remaining cleanup/finish-up outside of the comparators. There's a handful of tests outside the SortTest classes that still fail though, so I still have to fix those. I'll do that, give ords a little play time, and then I think the patch will be fairly close. Then we can take it in and bench on a fairly close-to-done version.

          Uwe Schindler added a comment -

          I still have one question: why do we need the new DocCollector? Is it really needed? Would it not be OK to just add the offset before calling collect()?

          Mark Miller added a comment -

          I still have one question: why do we need the new DocCollector? Is it really needed? Would it not be OK to just add the offset before calling collect()?

          If it's not needed, let's get rid of it. We don't want to deprecate HitCollector if we don't have to. The main reason I can see that we are doing it at the moment is that TopFieldValueDocCollector needs that hook so that it can set the next IndexReader for each Comparator. The Comparator needs it to create the FieldCaches and map ords from one reader to the next. Also, it lets us do the docBase stuff, which is nice because you add the docBase less often if it's done in the collector.

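          (Aside: a rough sketch of that last point. The queue, minScoreInQueue and docBase fields are assumptions loosely modeled on TopDocCollector, not the patch's code.)

            // The docBase addition is only paid when a hit actually enters
            // the priority queue, which is rare for large result sets; a
            // scorer-side "pre-add" would pay it on every collect() call.
            public void collect(int doc, float score) {
              totalHits++;
              if (score > minScoreInQueue) {
                queue.insert(new ScoreDoc(docBase + doc, score));
                minScoreInQueue = ((ScoreDoc) queue.top()).score;
              }
            }
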
          Michael McCandless added a comment -

          > Why do we need the new DocCollector? Is it really needed? Would it not be OK to just add the offset before calling collect()?

          I'd like to allow for 'expert' cases, where the collector is told when
          we advance to the next sequential reader and can do something at that
          point (like our sort-by-field collector does).

          But then still allow for 'normal' cases, where the collector is
          unchanged with what we have today (ie it receives the "real" docID).

          The core collectors would use the expert API to eke out all
          performance; external collectors can use either, but the 'normal' one
          would be simplest (and match back compat).

          So then how to "implement" this approach... I would actually be fine
          with keeping HitCollector, adding a default "setNextReader" method,
          that either throws UOE or (if we are strongly against exceptions)
          returns "false" indicating it cannot handle sequential readers.

          Then when we run searches we simply check if the collector is an
          "expert" one (does not throw UOE or return false from setNextReader)
          and if it isn't we wrap it with DocBaseCollector (which adds the doc
          base for every collect() call).

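          (Aside: a minimal sketch of that wrapping, with DocBaseCollector written out. The setNextReader signature here is an assumption taken from this thread, not a committed API.)

            // Plain collectors keep seeing "real" docIDs: the wrapper hears
            // about segment transitions and re-bases every collected doc.
            class DocBaseCollector extends HitCollector {
              private final HitCollector delegate;
              private int docBase;

              DocBaseCollector(HitCollector delegate) {
                this.delegate = delegate;
              }

              void setNextReader(IndexReader reader, int base) {
                docBase = base;        // docID offset of the new segment
              }

              public void collect(int doc, float score) {
                delegate.collect(docBase + doc, score);
              }
            }
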
          Mark Miller added a comment -

          Hmmm...we had a reason for deprecating HitCollector, though. At first it was to do the capability check (an instance of HitCollector would be wrapped), but that didn't pan out. I think we also liked it because people got deprecation warnings though - so that they would know to implement that method for 3.0 when we would take out the wrapper.

          Michael McCandless added a comment -

          > so that they would know to implement that method for 3.0 when we would take out the wrapper.

          Right, but the new insight (for me at least) is that it's OK for external collectors not to code to the expert API.

          I.e., previously we wanted to force migration to the expert API, but now I think it's OK to allow the normal API and the expert API to exist together.

          Mark Miller added a comment -

          Okay, I hate the idea of leaving in the wrapper, but it's true that's too difficult a method for HitCollector (to be required, anyway). setNextReader is a jump in understanding above setDocBase, which was bad enough.

          Mark Miller added a comment - - edited

          Hey Mike, how about this one? BooleanScorer can collect hits out of order if you force it (against the contract). I think it's an issue with the docBase type stuff.

          Actually, I'll clarify that - I think it's an issue with the multiple-reader mojo - didn't mean to put it solely on adding bases in particular yet.

          Michael McCandless added a comment -

          > BooleanScorer can collect hits out of order if you force it (against the contract).

          Hmmm... right. You mean if you pass in allowDocsOutOfOrder=true (defaults to false).

          I think this should not be a problem? (Though, I really don't fully understand BooleanScorer!). Since we are running scoring per-segment, each segment might collect its docIDs out of order, but all such docs are still within the current segment. Then when we advance to the new segment, the collector can do something if it needs to, and then collection proceeds again on the next segment's docs, possibly out of order. Ie, the out-of-orderness never jumps across a segment and then back again?

          But this is a challenge for LUCENE-831, if we go with a primarily iterator-driven API.

          Mark Miller added a comment -

          I didn't think it should be a problem either, since we just push everything to one reader. But it seems to be - the only tests not passing involve allowDocsOutOfOrder=true. Do the search with it true, then do the same search with it false, and you get 3 and 4 docs. 2 or 3 tests involving that fail. I don't have time to dig in till tonight though - thought you might shortcut me to the answer.

          Doug Cutting added a comment -

          I would actually be fine with keeping HitCollector, adding a default "setNextReader" method, that either throws UOE or (if we are strongly against exceptions) returns "false" indicating it cannot handle sequential readers.

          Could we instead add a new HitCollector subclass, that adds the setNextReader, then use 'instanceof' to decide whether to wrap or not?

          I really don't fully understand BooleanScorer!

          The original version of BooleanScorer uses a ~16k array to score windows of docs. So it scores docs 0-16k first, then docs 16k-32k, etc. For each window it iterates through all query terms and accumulates a score in table[doc%16k]. It also stores in the table a bitmask representing which terms contributed to the score. Non-zero scores are chained in a linked list. At the end of scoring each window it then iterates through the linked list and, if the bitmask matches the boolean constraints, collects a hit. For boolean queries with lots of frequent terms this can be much faster, since it does not need to update a priority queue for each posting, instead performing constant-time operations per posting. The only downside is that it results in hits being delivered out-of-order within the window, which means it cannot be nested within other scorers. But it works well as a top-level scorer. The new BooleanScorer2 implementation instead works by merging priority queues of postings, albeit with some clever tricks. For example, a pure conjunction (all terms required) does not require a priority queue. Instead it sorts the posting streams at the start, then repeatedly skips the first up to the last. If the first ever equals the last, then there's a hit. When some terms are required and some terms are optional, the conjunction can be evaluated first, then the optional terms can all skip to the match and be added to the score. Thus the conjunction can reduce the number of priority queue updates for the optional terms. Does that help any?

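          (Aside: a toy model of the windowed scoring described above, simplified: no linked list of non-zero slots, no coord factors, and all names invented.)

            // Score one window of docs: accumulate per-slot scores plus a bit
            // per matching term, then sweep the table and collect docs whose
            // mask satisfies the required clauses. Hits come out in table
            // order, which is why this scorer delivers docs out of order.
            static void scoreWindow(int base, int window, int[][] termDocs,
                float[] termScores, int requiredMask, HitCollector hc) {
              float[] scores = new float[window];
              int[] masks = new int[window];
              for (int t = 0; t < termDocs.length; t++) {
                for (int i = 0; i < termDocs[t].length; i++) {
                  int doc = termDocs[t][i];
                  if (doc >= base && doc < base + window) {
                    scores[doc - base] += termScores[t]; // constant-time work
                    masks[doc - base] |= 1 << t;         // which terms hit
                  }
                }
              }
              for (int slot = 0; slot < window; slot++) {
                if (masks[slot] != 0
                    && (masks[slot] & requiredMask) == requiredMask) {
                  hc.collect(base + slot, scores[slot]);
                }
              }
            }
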
          Mark Miller added a comment - - edited

          Could we instead add a new HitCollector subclass, that adds the setNextReader, then use 'instanceof' to decide whether to wrap or not?

          Woah! Don't make me switch all that again! I've got wrist injuries here. The reason I lost the instanceof is that we would have to deprecate the HitCollector implementations because they need to extend HitCollector. Mike seemed against deprecating those if we could get away with it, so I've since dropped that. I've already gone back and forth - what's it going to be? I'll admit I don't like the exception trap I'm using now, but I don't much like the return-true/false method either...

          Edit

          Ah, I see, you have a new tweak this time. Extend HitCollector rather than having HitCollector extend the new type...

          Nice, I think this is the way to go.

          Doug Cutting added a comment -

          > Woah! Don't make me switch all that again!

          Sorry, I'm just tossing out ideas. Don't take me too seriously...

          > The reason I lost the instanceof is that we would have to deprecate the HitCollector implementations because they need to extend HitCollector.

          Would we? I was suggesting that, if we're going to have two APIs, one expert and one non-expert, then we could make the expert API a subclass and not deprecate or otherwise alter HitCollector. I do not like using exceptions for normal control flow. Instanceof is better, but not ideal. A default implementation of an expert method that returns 'false', as Mike suggested, isn't bad and might be best. It requires neither deprecation, exceptions nor instanceof. Would we have a subclass that overrides this that's used as a base class for optimized implementations?

          Michael McCandless added a comment -

          > Would we have a subclass that overrides this that's used as a base class for optimized implementations?

          If we do this, I don't think we need a new base class for "expert" collectors; they can simply subclass HitCollector & override the setNextReader method?

          Though one downside of this approach is that the "simple" HitCollector API is polluted with this advanced method, and HitCollector's collect method gets different args depending on what that method returns. It's a somewhat confusing API.

          I guess I'd actually prefer subclassing HitCollector (SequentialHitCollector? AdvancedHitCollector? SegmentedHitCollector?), adding setNextReader only to that subclass, and using instanceof to wrap HitCollector subclasses.

          Mark Miller added a comment -

          >> Woah! Don't make me switch all that again!

          >Sorry, I'm just tossing out ideas. Don't take me too seriously...

          Same here. If you guys have 100 ideas, I'd do it 100 times. No worries. Just wrist frustration. I misunderstood you anyway.

          It requires neither deprecation, exceptions nor instanceof.

          Okay, fair points. I guess my main dislike was having to call it, see what it returns, and then maybe call it again. That turned me off as much as instanceof. I'm still liking the suggestion you just made myself...

          Mike?

          Michael McCandless added a comment -

          > Does that help any?

          Yes, thanks! So much so that I'm going to go add that blurb to the javadocs...

          Mark Miller added a comment -

          I guess I'd actually prefer subclassing HitCollector (SequentialHitCollector? AdvancedHitCollector? SegmentedHitCollector?), adding setNextReader only to that subclass, and using instanceof to wrap HitCollector subclasses.

          That's actually what I prefer as well (and what I tried). I used MultiReaderHitCollector. Still thinking about the name...

          Michael McCandless added a comment -

          I like MultiReaderHitCollector!

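          (Aside: the shape this gives, sketched. The exact setNextReader signature is an assumption from this thread, and the wrapping helper is hypothetical.)

            // Expert collectors subclass and hear about segment transitions;
            // plain collectors are wrapped via instanceof so they still
            // receive absolute docIDs.
            public abstract class MultiReaderHitCollector extends HitCollector {
              public abstract void setNextReader(IndexReader reader, int docBase)
                  throws IOException;
            }

            // In the search code, something like:
            static MultiReaderHitCollector asMultiReaderCollector(
                final HitCollector hc) {
              if (hc instanceof MultiReaderHitCollector) {
                return (MultiReaderHitCollector) hc;   // expert: use as-is
              }
              return new MultiReaderHitCollector() {   // normal: wrap
                private int base;
                public void setNextReader(IndexReader reader, int docBase) {
                  base = docBase;
                }
                public void collect(int doc, float score) {
                  hc.collect(base + doc, score);       // restore real docID
                }
              };
            }
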
          Mark Miller added a comment -

          Hmmm... right. You mean if you pass in allowDocsOutOfOrder=true (defaults to false).

          I think this should not be a problem? (Though, I really don't fully understand BooleanScorer!). Since we are running scoring per-segment, each segment might collect its docIDs out of order, but all such docs are still within the current segment. Then when we advance to the new segment, the collector can do something if it needs to, and then collection proceeds again on the next segment's docs, possibly out of order. Ie, the out-of-orderness never jumps across a segment and then back again?

          I was off base with my guess - it's actually only using one reader for that test (3 or 4 docs). It's got to be that the HitCollector the out-of-order scorer uses needs to be tweaked. Last tests to fix.

          Mark Miller added a comment -

          Hmmm...working more with ints as ords rather than doubles...it gives us ints, but it complicates things a bit. Before, the only ords that had to be sorted and suborded were ones that didn't map onto the new Reader exactly. With an int ord, everything you add is going to collide, and you need the ords in the queue added to the double lists, and you need to fall through to the subord much more often...

          interesting...

          I guess I'll go with it for now though...

          Michael McCandless added a comment -

          Hang on – if the value carries over to the new segment (and you set subord to 0) then you don't need to add those ords to the double lists?

          Mark Miller added a comment -

          Yeah - sorry. I actually realized that as I just finished it off, but I'm trying not to spam the dev list so much (not winning that war). But it does drop through more often. Ignore those concerns. I'll put up a patch in a minute.

          Mark Miller added a comment - - edited

          This patch is entering the finishing stages I think. This one is pretty much functionally complete and all tests should pass.

          There is still a bunch of polish to be done though.

          The following sort types are still in: SortField.STRING_VAL, STRING_ORD, and STRING_ORD_VAL; STRING is currently set to straight ord.

          I think the ord case is still pretty slow; I'm sure there are still a few optimizations left, but it would be nice to see where it's at.

          There is still an issue with custom FieldComparators - they are currently passed the top-level reader in the hook - this still needs to be addressed somehow. We also need a test for one.

          • Mark

          (ignore the couple of setDocBases you see in contrib - I've got 'em)

          Mark Miller added a comment -

          Hang on - if the value carries over to the new segment (and you set subord to 0) then you don't need to add those ords to the double lists?

          What was actually happening: I noticed it wasn't quite working right after switching ords to ints from double, and I realized the problem was that there was always going to be a collision for the sort list, whereas before, there was only a sortable collision when more than one mapped-from ord collided. So I thought that out wrong and figured you needed to sort the current ord as well, but in fact, of course you don't: I just needed to assume there is always a collision that adds to the sort list, not wait for 2 mapped-from ords to collide.

          Michael McCandless added a comment -

          Patch is looking good! All tests, and back-compat tests, pass. I'm
          going to run a round of perf tests...

          Some minor things I noticed:

          • Fix indent on FieldComparatorSource.java to 2 spaces
          • Leftover "check val" print in FieldComparator.java
          • Do we need to track maxScore in TopFieldValueDocCollector (but
            mark as deprecated)? (Because Hits isn't removed yet).
          • Got some "nocommit" comments to resolve still
          • I think StringComparatorLocale should call FieldCache.getStrings
            (as it does on trunk now when you do String sort w/ Locale), not
            getStringIndex. Then the queue should just hold the String[]
            values, not StringIndex[]?
            (Aside: we could fix StringIndex computation to take a Locale,
            which'd give better performance, but that's a separate issue.)
          • I think you can improve the ord fallback comparator a bit, by
            keeping a separate "equals" array that's just like subord but is
            instead a boolean. Equals is true when the value is present in
            the new segment and false if the string could not be found in
            the new segment. Then only fall back to String.compareTo when
            equals is false. I think this is important for enum fields
            because the two segments will not have the same String object
            when they are equal (see the sketch after this list).
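          (Aside: a hedged sketch of that "equals" refinement. The ords, exact and values arrays are assumed comparator state from this discussion, not the committed patch.)

            // Fall back to String.compareTo only when a slot's value was
            // missing from the current segment. Two slots whose values both
            // exist in the segment can only share an ord when the values are
            // truly equal, even if (as with enum-like fields from different
            // segments) they are not the same String object.
            public int compare(int slot1, int slot2) {
              if (ords[slot1] != ords[slot2]) {
                return ords[slot1] - ords[slot2];
              }
              if (exact[slot1] && exact[slot2]) {
                return 0;            // equal by ords alone, no String compare
              }
              return values[slot1].compareTo(values[slot2]);
            }
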
          Mark Miller added a comment -

          Fix indent on FieldComparatorSource.java to 2 spaces

          Roger.

          Leftover "check val" print in FieldComparator.java

          I'll be sure to get all of this...I also have a ton of System.outs commented out that I'll remove.

          Do we need to track maxScore in TopFieldValueDocCollector (but mark as deprecated)? (Because Hits isn't removed yet).

          You know it.

          Got some "nocommit" comments to resolve still

          Right - time to start looking at these, and at names and design changes that might make sense, I think.

          I think StringComparatorLocale should call FieldCache.getStrings (as it does on trunk now when you do String sort w/ Locale), not getStringIndex. Then the queue should just hold the String[] values, not StringIndex[]?

          I swear I looked at trunk twice and saw it was using a StringIndex - just looked again and it's not. I love that.

          I think you can improve the ord fallback comparator a bit...

          Let's see if it's even in the ballpark yet...if it is, I'll tweak it all we can.

          Michael McCandless added a comment -

          One question: do you think we should provide a simple "legacy fallback" option, deprecated, so that in case we messed something up here, people can force the sort to use the current approach? We already have automatic "legacy" computation (if the old CUSTOM sort is in use), but we could, e.g., add a deprecated "setLegacy" to SortField or some such? As an insurance policy...

          Mark Miller added a comment -

          Yeah, that seems like a great idea. Will help debugging a lot if someone reports an oddity...

          Mark Miller added a comment -

          How about names?

          TopFieldValueDocCollector

          I'm not digging this one at the moment. I didn't dig TopFieldDocCollector either, though. TopDocCollector makes sense, but shouldn't these be more like TopFieldSortedDocCollector or something?

          ByValueFieldSortedHitQueue

          How about maybe FieldValueSortedHitQueue?

          Michael McCandless added a comment -

          How about ByValueFieldSortedHitQueue --> FieldValueHitQueue (the "sorted" seems redundant?)

          How about TopFieldValueDocCollector --> TopFieldCollector (I actually don't mind the current name TopFieldDocCollector, but, we can't use that name, so I dropped the Doc part).

          Mark Miller added a comment - - edited

          How about ByValueFieldSortedHitQueue --> FieldValueHitQueue (the "sorted" seems redundant?)

          Fair enough, and shorter is better.

          How about TopFieldValueDocCollector --> TopFieldCollector (I actually don't mind the current name TopFieldDocCollector, but, we can't use that name, so I dropped the Doc part).

          Okay...shorter is better again, so agreed. I find TopFieldDocCollector confusing myself - I'd rather it be easy to know: this one sorts by relevance, this one sorts by field value - and those names don't say that to a non-Lucene user (or more fairly, they didn't say that to me). I think: what the heck is a FieldDoc? Moot point though, I agree with both suggestions.

          edit

          Hmm...I guess that it's not just fields, but also doc ID and relevance if you want, which complicates things for that name...I guess in that way I also prefer TopFieldDocCollector over TopFieldCollector - ah well.

          Michael McCandless added a comment -

          Can you change DocComparator to just return doc1 - doc2 (instead of having ifs that translate that into -1/1)? I think that ekes out performance. (And we should fix the javadoc to say "any neg number" and "any pos number" is OK; oh, I see a nocommit asking for that already.)

          Mark Miller added a comment -

          DocComparator? It's not doing the ifs...

          Do you mean relevance? That doesn't work right when you can have negatives, does it? This is what I have for Doc (I don't think I've touched it from what you did):

            public static final class DocComparator extends FieldComparator {

              // nocommit -- maybe "setcurrentscoredoc"?

              private final int[] docIDs;
              private int docBase;
              private int readerMaxDoc;

              DocComparator(int numHits) {
                docIDs = new int[numHits];
              }

              // compare two docs already in the queue
              public int compare(int slot1, int slot2) {
                return docIDs[slot1] - docIDs[slot2];
              }

              // compare a queued doc against an incoming, segment-relative doc
              public int compare(int slot, int doc, float score) {
                return docIDs[slot] - docBase - doc;
              }

              // store the incoming doc as an absolute docID
              public void copy(int slot, int doc, float score) {
                docIDs[slot] = docBase + doc;
              }

              public void setNextReader(IndexReader reader) {
                // TODO: can we "map" our docIDs to the current
                // reader? saves having to then subtract on every
                // compare call
                docBase += readerMaxDoc;
                readerMaxDoc = reader.maxDoc();
              }

              public int sortType() {
                return SortField.DOC;
              }

              public Comparable value(int slot) {
                return new Integer(docIDs[slot]);
              }
            };
          
          Michael McCandless added a comment -

          Woops, you're right, sorry, I was looking at an old version...

          Mark Miller added a comment -

          I left a doubled set in the ords comparator that's no longer needed. We can just use map.values() instead. I'm sure there are small-win tricks we can do too - if it's anywhere near competitive.

          Mark Miller added a comment -

          A slightly improved setNextReader for the ords case:

            public void setNextReader(IndexReader reader) throws IOException {

              // Map ords in the queue to ords in the new reader
              StringIndex currentReaderValues = ExtendedFieldCache.EXT_DEFAULT
                  .getStringIndex(reader, field);

              lookup = currentReaderValues.lookup;
              order = currentReaderValues.order;

              if (lookup.length == 0) {
                return;
              }

              // Bucket queued values by the lower-bound ord they map to;
              // exact matches keep subord 0 and need no further sorting.
              Map map = new HashMap();
              for (int i = 0; i < slot + 1; i++) {
                String val = values[i];
                if (val == null) {
                  continue;
                }

                int index = binarySearch(lookup, val);

                if (index < 0) {
                  // Miss: map to the lower-bound ord, and remember the value
                  // so colliding slots can be ordered by subord below.
                  int ord = -index - 2;
                  Integer intOrd = Integer.valueOf(ord);
                  List slotVals = (List) map.get(intOrd);
                  if (slotVals == null) {
                    slotVals = new ArrayList();
                    map.put(intOrd, slotVals);
                  }
                  slotVals.add(new SlotValPair(i, val));
                  ords[i] = ord;
                } else {
                  // Exact match: the value exists in this segment.
                  ords[i] = index;
                  subords[i] = 0;
                }
              }

              // Assign subords (starting at 1) within each colliding bucket.
              Iterator it = map.values().iterator();
              while (it.hasNext()) {
                List list = (List) it.next();
                if (list.size() > 1) {
                  Collections.sort(list);
                  for (int i = 0; i < list.size(); i++) {
                    SlotValPair svp = (SlotValPair) list.get(i);
                    subords[svp.i] = i + 1;
                  }
                } else {
                  SlotValPair svp = (SlotValPair) list.get(0);
                  subords[svp.i] = 1;
                }
              }
            }
          
          Michael McCandless added a comment -

          OK I ran a bunch of sort perf tests, on trunk & with the patch.
          (Attached the two Python sources for doing this... though they require
          some small local mods to run properly).

          Each alg is run with "java -Xms1024M -Xmx1024M -Xbatch -server" on OS
          X 10.5.5, java 1.6.0_07-b06-153.

          I use two indexes, each with 2M docs. One is docs from Wikipedia
          (labeled "wiki"), the other is SortableSimpleDocMaker docs augmented
          to include a random country field (labeled "simple"). For each I
          created 1-segment, 10-segment and 100-segment indices. I sort by
          score, doc, string (val, ord = true ord+subord, ordval = ord +
          fallback). Queue size is 10.

          I ran various queries... query "147" hits ~5k docs, query "text" hits
          ~97K docs, query "1" hits 386K docs, and the alldocs query hits 2M
          docs. qps is queries per sec and warm is the time for the first
          warmup query, on trunk. qpsnew & warmnew are with the patch. pctg
          shows the % gain in qps performance:

          numSeg index sortBy query topN meth hits warm qps warmnew qpsnew pctg
          1 wiki score 147 10   4984 0.2 5717.6 0.2 5627.5 -1.6%
          1 wiki score text 10   97191 0.3 340.9 0.3 348.8 2.3%
          1 wiki score 1 10   386435 0.3 86.7 0.3 89.3 3.0%
          1 wiki doc 147 10   4984 0.3 4071.7 0.3 4649.0 14.2%
          1 wiki doc text 10   97191 0.3 225.4 0.3 253.7 12.6%
          1 wiki doc 1 10   386435 0.3 56.9 0.3 65.8 15.6%
          1 wiki doc <all> 10   2000000 0.1 23.0 0.1 38.6 67.8%
          1 simple int text 10   2000000 0.7 10.7 0.7 13.5 26.2%
          1 simple int <all> 10   2000000 0.6 21.1 0.6 34.7 64.5%
          1 simple country text 10 ord 2000000 0.6 10.7 0.6 13.2 23.4%
          1 simple country text 10 ordval 2000000 0.6 10.7 0.6 13.3 24.3%
          1 simple country <all> 10 ord 2000000 0.5 20.7 0.6 32.5 57.0%
          1 simple country <all> 10 ordval 2000000 0.5 20.7 0.6 34.6 67.1%
          1 wiki title 147 10 ord 4984 2.1 3743.8 2.0 4210.5 12.5%
          1 wiki title 147 10 ordval 4984 2.1 3743.8 2.0 4288.2 14.5%
          1 wiki title text 10 ord 97191 2.1 144.2 2.1 160.3 11.2%
          1 wiki title text 10 ordval 97191 2.1 144.2 2.1 163.5 13.4%
          1 wiki title 1 10 ord 386435 2.1 51.2 2.1 63.2 23.4%
          1 wiki title 1 10 ordval 386435 2.1 51.2 2.1 64.6 26.2%
          1 wiki title <all> 10 ord 2000000 2.1 21.1 2.1 33.2 57.3%
          1 wiki title <all> 10 ordval 2000000 2.1 21.1 2.1 35.4 67.8%
          numSeg index sortBy query topN meth hits warm qps warmnew qpsnew pctg
          10 wiki score 147 10   4984 0.3 4228.3 0.3 4510.6 6.7%
          10 wiki score text 10   97191 0.3 294.7 0.3 341.5 15.9%
          10 wiki score 1 10   386435 0.4 75.0 0.4 87.0 16.0%
          10 wiki doc 147 10   4984 0.3 3332.2 0.3 4033.9 21.1%
          10 wiki doc text 10   97191 0.4 217.0 0.4 277.0 27.6%
          10 wiki doc 1 10   386435 0.4 54.6 0.4 70.5 29.1%
          10 wiki doc <all> 10   2000000 0.1 12.7 0.1 38.6 203.9%
          10 simple int text 10   2000000 1.2 10.3 0.6 13.5 31.1%
          10 simple int <all> 10   2000000 1.1 11.8 0.8 34.6 193.2%
          10 simple country text 10 ord 2000000 0.7 10.4 0.5 13.2 26.9%
          10 simple country text 10 ordval 2000000 0.7 10.4 0.5 13.3 27.9%
          10 simple country <all> 10 ord 2000000 0.7 11.5 0.5 32.5 182.6%
          10 simple country <all> 10 ordval 2000000 0.7 11.5 0.5 34.1 196.5%
          10 wiki title 147 10 ord 4984 12.5 3004.5 2.1 3124.0 4.0%
          10 wiki title 147 10 ordval 4984 12.5 3004.5 2.1 3353.5 11.6%
          10 wiki title text 10 ord 97191 12.7 139.4 2.1 156.7 12.4%
          10 wiki title text 10 ordval 97191 12.7 139.4 2.1 160.9 15.4%
          10 wiki title 1 10 ord 386435 12.7 50.3 2.1 62.3 23.9%
          10 wiki title 1 10 ordval 386435 12.7 50.3 2.1 64.1 27.4%
          10 wiki title <all> 10 ord 2000000 12.7 11.4 2.1 33.1 190.4%
          10 wiki title <all> 10 ordval 2000000 12.7 11.4 2.1 35.3 209.6%
          numSeg index sortBy query topN meth hits warm qps warmnew qpsnew pctg
          100 wiki score 147 10   4984 0.3 1282.2 1.7 1162.3 -9.4%
          100 wiki score text 10   97191 0.4 232.4 1.3 275.6 18.6%
          100 wiki score 1 10   386435 0.4 65.1 1.4 80.4 23.5%
          100 wiki doc 147 10   4984 0.4 1170.0 0.4 1132.0 -3.2%
          100 wiki doc text 10   97191 0.4 171.7 0.4 230.1 34.0%
          100 wiki doc 1 10   386435 0.4 46.7 0.4 67.9 45.4%
          100 wiki doc <all> 10   2000000 0.2 7.8 0.1 41.6 433.3%
          100 simple int text 10   2000000 3.3 8.9 4.0 11.0 23.6%
          100 simple int <all> 10   2000000 3.4 7.7 1.1 36.5 374.0%
          100 simple country text 10 ord 2000000 1.0 8.8 0.6 10.8 22.7%
          100 simple country text 10 ordval 2000000 1.0 8.8 0.6 11.0 25.0%
          100 simple country <all> 10 ord 2000000 1.0 7.6 0.5 35.0 360.5%
          100 simple country <all> 10 ordval 2000000 1.0 7.6 0.5 36.3 377.6%
          100 wiki title 147 10 ord 4984 94.6 1066.9 2.1 583.7 -45.3%
          100 wiki title 147 10 ordval 4984 94.6 1066.9 2.1 750.1 -29.7%
          100 wiki title text 10 ord 97191 94.9 110.2 2.1 122.7 11.3%
          100 wiki title text 10 ordval 97191 94.9 110.2 2.1 128.4 16.5%
          100 wiki title 1 10 ord 386435 94.3 47.9 2.1 58.2 21.5%
          100 wiki title 1 10 ordval 386435 94.3 47.9 2.1 60.1 25.5%
          100 wiki title <all> 10 ord 2000000 94.6 7.8 2.5 35.6 356.4%
          100 wiki title <all> 10 ordval 2000000 94.6 7.8 2.4 37.0 374.4%

          It's a ridiculous amount of data to digest... but here are some
          initial thoughts:

          • These are only single term queries; I'd expect the gains for multi
            term queries to be smaller since, net/net, a smaller percentage of
            the time is spent collecting.
          • Ord + val fallback (ordval) is generally faster than pure
            ord/subord. I think for now we should run with ord + val
            fallback? (We can leave ord + subord commented out?).
          • It's great that we see decent speedups for "sort by score" which
            is presumably the most common sort used.
          • We do get slower in certain cases (neg pctg in the rightmost
            column): all not-in-the-noise slowdowns were with query "147" on
            the 100 segment index. This query hits relatively few docs (~5K),
            so this is expected: the new approach spends some time updating
            its queue for each subreader (see the sketch after this list),
            and if the time spent searching is relatively tiny then that
            queue update time becomes relatively big. I think with a larger
            queue size the slowdown will be worse. However, I think this is
            an acceptable tradeoff.
          • The gains for field sorting on a single segment (optimized) index
            are impressive. Generally, the more hits encountered the better
            the gains. It's amazing that we see ~67.8% gain sorting by docID,
            country, and title for alldocs query. My only guess for this is
            better cache hit rate (because we gain locality by copying values
            to local compact arrays).
          • Across the board the alldocs query shows very sizable (5X faster
            for 100 segment index; 3X faster for 10 segment index)
            improvements.
          • I didn't break out the %tg difference, but warming time with the
            patch is waaaay faster than trunk when the index has > 1 segment.
            Reopen time should also be fantastically faster (we are
            sidestepping something silly happening w/ FieldCache on a
            Multi*Reader). Warming on trunk takes ~95 seconds with the 100
            segment index!
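
          For anyone following along, the loop behind that queue-update cost
          is roughly the following (a minimal sketch with illustrative names,
          not the patch's exact code; subReaders, weight and searchSegment
          are stand-ins for the IndexSearcher internals). The O(topN)
          conversion in setNextReader is paid once per segment, which is why
          tiny searches on a 100 segment index feel it:

            // One collector shared across all segments:
            for (int i = 0; i < subReaders.length; i++) {
              // sorted collectors remap every queue slot's ord here: O(topN)
              collector.setNextReader(subReaders[i]);
              // then this segment's hits are scored and collected as usual
              searchSegment(subReaders[i], weight, collector);
            }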
          Mark Miller added a comment -

          Awesome results Mike, thanks! You are a wizard.

          Michael McCandless added a comment -

          I set the queue size to 1000 and reran the tests:

          numSeg index sortBy query topN hits warm qps warmnew qpsnew pctg
          1 wiki score 147 1000 4984 0.3 1356.8 0.3 1361.3 0.3%
          1 wiki score text 1000 97191 0.3 224.0 0.3 223.0 -0.4%
          1 wiki score 1 1000 386435 0.3 73.6 0.3 72.8 -1.1%
          1 wiki doc 147 1000 4984 0.3 1527.0 0.3 1475.0 -3.4%
          1 wiki doc text 1000 97191 0.3 182.5 0.3 235.4 29.0%
          1 wiki doc 1 1000 386435 0.3 50.6 0.3 67.7 33.8%
          1 wiki doc <all> 1000 2000000 0.1 22.1 0.1 37.8 71.0%
          1 simple int text 1000 2000000 0.7 10.1 0.7 12.8 26.7%
          1 simple int <all> 1000 2000000 0.6 19.0 0.6 30.5 60.5%
          1 simple country text 1000 2000000 0.9 10.1 0.7 12.5 23.8%
          1 simple country <all> 1000 2000000 0.9 19.5 0.6 29.1 49.2%
          1 wiki title 147 1000 4984 4.0 733.1 2.0 732.2 -0.1%
          1 wiki title text 1000 97191 4.1 109.1 2.1 114.7 5.1%
          1 wiki title 1 1000 386435 4.1 47.1 2.1 55.4 17.6%
          1 wiki title <all> 1000 2000000 4.1 19.4 2.1 30.5 57.2%
          numSeg index sortBy query topN hits warm qps warmnew qpsnew pctg
          10 wiki score 147 1000 4984 0.3 1259.4 0.3 1274.0 1.2%
          10 wiki score text 1000 97191 0.4 215.2 0.4 220.0 2.2%
          10 wiki score 1 1000 386435 0.4 69.6 0.4 72.0 3.4%
          10 wiki doc 147 1000 4984 0.3 1409.0 0.3 1394.7 -1.0%
          10 wiki doc text 1000 97191 0.4 192.0 0.4 232.5 21.1%
          10 wiki doc 1 1000 386435 0.4 53.0 0.4 66.3 25.1%
          10 wiki doc <all> 1000 2000000 0.1 11.9 0.1 37.5 215.1%
          10 simple int text 1000 2000000 1.2 9.8 0.6 12.8 30.6%
          10 simple int <all> 1000 2000000 1.2 11.0 0.8 30.2 174.5%
          10 simple country text 1000 2000000 1.1 9.8 0.6 12.4 26.5%
          10 simple country <all> 1000 2000000 1.1 11.0 0.5 29.1 164.5%
          10 wiki title 147 1000 4984 26.0 655.2 2.1 84.7 -87.1%
          10 wiki title text 1000 97191 26.3 100.4 2.2 77.8 -22.5%
          10 wiki title 1 1000 386435 26.0 42.3 2.6 48.4 14.4%
          10 wiki title <all> 1000 2000000 26.1 10.9 2.6 28.5 161.5%
          numSeg index sortBy query topN hits warm qps warmnew qpsnew pctg
          100 wiki score 147 1000 4984 0.4 704.1 0.5 677.5 -3.8%
          100 wiki score text 1000 97191 0.4 169.5 0.5 186.0 9.7%
          100 wiki score 1 1000 386435 0.4 56.5 0.5 67.9 20.2%
          100 wiki doc 147 1000 4984 0.4 785.0 0.4 724.0 -7.8%
          100 wiki doc text 1000 97191 0.4 159.9 0.4 204.7 28.0%
          100 wiki doc 1 1000 386435 0.4 44.9 0.4 64.8 44.3%
          100 wiki doc <all> 1000 2000000 0.2 7.8 0.1 40.4 417.9%
          100 simple int text 1000 2000000 3.3 8.4 1.4 10.3 22.6%
          100 simple int <all> 1000 2000000 3.4 7.4 1.1 32.4 337.8%
          100 simple country text 1000 2000000 1.4 8.6 0.7 10.0 16.3%
          100 simple country <all> 1000 2000000 1.5 7.3 0.6 28.6 291.8%
          100 wiki title 147 1000 4984 189.0 446.3 2.4 19.8 -95.6%
          100 wiki title text 1000 97191 188.5 87.7 2.3 27.5 -68.6%
          100 wiki title 1 1000 386435 190.4 41.1 2.7 24.6 -40.1%
          100 wiki title <all> 1000 2000000 189.2 7.4 3.0 18.4 148.6%

          Performance clearly gets worse for queries that don't hit many docs,
          with a large queue, against an index with a large number of
          segments, when sorting by a unique String field (like title).

          The slowdown for "147" at 100 segments is quite bad.

          So... I wonder how often users of Lucene set a very large queue size
          (to do some sort of post filtering, which could be more efficiently
          done as a real Filter, but...). I think it may be a non-trivial
          number, so... what to do?

          EG we could offer a different collector that's better optimized
          towards collecting a large topN (probably doing the toplevel
          FieldCache that's done today)? Or, we could explore a hybrid approach
          whereby a slot is only switched to the current segment when it's first
          visited again (instead of updating all of them on switching readers)?
          Or... something else?
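
          For concreteness, the hybrid could be as small as stamping each slot
          with a reader generation and remapping lazily; a minimal sketch
          (every name here is illustrative, not the patch's code, and null
          handling is omitted):

            class LazySlotConverter {
              String[] values;    // per-slot values; copy() (not shown) fills these
              int[] ords;         // per-slot ords, valid only for readerGen[slot]
              int[] readerGen;    // generation each slot's ord was converted for
              int currentGen;     // bumped per reader switch: O(1), not O(topN)
              String[] lookup;    // current reader's sorted term lookup

              // Called on each reader switch; does no per-slot work at all.
              void setNextReader(String[] newLookup) {
                lookup = newLookup;
                currentGen++;
              }

              // Remap a slot's ord only when a comparison actually touches it.
              int ord(int slot) {
                if (readerGen[slot] != currentGen) {
                  int idx = java.util.Arrays.binarySearch(lookup, values[slot]);
                  ords[slot] = idx >= 0 ? idx : -idx - 2; // largest entry below value
                  readerGen[slot] = currentGen;
                }
                return ords[slot];
              }
            }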

          Michael McCandless added a comment -

          One more small thing: StringOrdValComparator.compare uses double ord1 and double ord2 when comparing 2 slots, but in fact they should just be ints?

          Mark Miller added a comment -

          Yeah, wow. I've actually fixed that (though I didn't notice doing so). I've made a half dozen little tweaks trying to eke out some speed and maybe trigger some ideas. Nothing that's made much of a difference yet. We can subtract instead of compare and other tiny, in-the-noise stuff - hoping I trigger a bigger idea though.

          Michael McCandless added a comment -

          Another fix to do: in StringOrdValComparator.compare (the one that takes slot, doc, score), it does not fall back to String value comparison.
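
          For context, the intended behavior (sketched with assumed field
          names: ords[] holding the slot's converted ord, order[]/lookup from
          the current reader's StringIndex; null handling omitted) is to break
          ord ties on the actual values, since converted ords are only
          approximate across readers:

            int compare(int slot, int doc) {
              int cmp = ords[slot] - order[doc];  // cheap ordinal comparison first
              if (cmp != 0) {
                return cmp;
              }
              // converted ords can collide across readers: fall back to values
              return values[slot].compareTo(lookup[order[doc]]);
            }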

          Mark Miller added a comment -

          Thanks. I've got a convert-on-demand version to try too.

          Michael McCandless added a comment -

          OK, wanna post a new patch w/ all these fixes? I'll re-test, and add the on-demand queue. If the perf cost of that one is still too high, we may want to offer a queue that pulls the MultiReader's FieldCache values but uses the slot-based pqueue (we would have to fix the insanely slow warm time if we do that...). I bet for very large queue sizes that'd give the best performance, at the obvious expense of reopen cost.

          Mark Miller added a comment -

          Escaped back to the computer. Here is the latest. I went down a few dead ends trying to improve things, and I don't think I found anything significant.

          There are the following String sort types:

          SortField.STRING_VAL : by value

          SortField.STRING_ORD: by ordinal fall back to subord

          SortField.STRING_ORD_VAL: by ordinal fall back to value

          SortField.STRING_ORD_VAL_DEM: by ordinal fall back to value, convert on demand

          and just for kicks:

          SortField.STRING_ORD_VAL_DEM_WIN: by ordinal fall back to value, convert a window of 20 in each direction of n on demand

          SortField.STRING is set to STRING_ORD_VAL
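
          Assuming this sticks, picking a method from user code would
          presumably look like the following (the constants are from this
          patch, not a released API; searcher and query assumed in scope,
          imports elided):

            // Hypothetical usage: sort by title with ord + value fallback
            Sort byTitle = new Sort(new SortField("title", SortField.STRING_ORD_VAL));
            TopDocs top = searcher.search(query, null, 10, byTitle);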

          Mark Miller added a comment -

          Again with the two $id issues fixed.

          Mark Miller added a comment -

          Was just thinking about the window convert (which I doubt is worth considering anyway) - I wouldn't bench it... I forgot to make it check for conversion as it's working on the window; it just checks the window pivot point. That creates lots of extra work.

          Michael McCandless added a comment -

          New patch attached:

          • Made IndexWriter.getSegmentCount package protected again
          • Fixed StringOrdComparator to use no doubles
          • Added "API is experimental" warnings to the new APIs
          • Deprecated FieldValueHitQueue.getMaxScore – I think in 3.0 we
            should remove this (with Hits)?
          • Tweaked StringOrdValOnDemComparator to avoid Arrays.fill in
            setNextReader
          • Fixed FieldValueHitQueue to actually pull the right "on demand"
            comparator (it was incorrectly pulling StringOrdValComparator)

          I re-ran perf tests:

          numSeg index sortBy method query topN hits warm qps warmnew qpsnew pctg
          1 simple country val text 1000 2000000 0.9 10.0 0.8 11.1 11.0%
          1 simple country ord text 1000 2000000 0.9 10.0 0.7 13.0 30.0%
          1 simple country ordval text 1000 2000000 0.9 10.0 0.7 12.5 25.0%
          1 simple country orddem text 1000 2000000 0.9 10.0 0.7 12.5 25.0%
          1 wiki title val 147 1000 4984 2.1 740.6 2.4 375.7 -49.3%
          1 wiki title ord 147 1000 4984 2.1 740.6 2.1 811.7 9.6%
          1 wiki title ordval 147 1000 4984 2.1 740.6 2.1 806.6 8.9%
          1 wiki title orddem 147 1000 4984 2.1 740.6 2.1 207.6 -72.0%
          1 wiki title val text 1000 97191 2.1 108.7 2.4 32.5 -70.1%
          1 wiki title ord text 1000 97191 2.1 108.7 2.1 121.4 11.7%
          1 wiki title ordval text 1000 97191 2.1 108.7 2.1 119.1 9.6%
          1 wiki title orddem text 1000 97191 2.1 108.7 2.1 90.9 -16.4%
          1 wiki title val 1 1000 386435 2.1 46.2 2.4 12.8 -72.3%
          1 wiki title ord 1 1000 386435 2.1 46.2 2.1 58.6 26.8%
          1 wiki title ordval 1 1000 386435 2.1 46.2 2.1 55.7 20.6%
          1 wiki title orddem 1 1000 386435 2.1 46.2 2.1 50.1 8.4%
          numSeg index sortBy method query topN hits warm qps warmnew qpsnew pctg
          10 simple country val text 1000 2000000 0.8 9.7 0.7 11.0 13.4%
          10 simple country ord text 1000 2000000 0.8 9.7 0.6 13.0 34.0%
          10 simple country ordval text 1000 2000000 0.8 9.7 0.6 12.4 27.8%
          10 simple country orddem text 1000 2000000 0.8 9.7 0.6 12.5 28.9%
          10 wiki title val 147 1000 4984 12.7 664.2 2.5 383.8 -42.2%
          10 wiki title ord 147 1000 4984 12.7 664.2 2.1 86.2 -87.0%
          10 wiki title ordval 147 1000 4984 12.7 664.2 2.1 104.0 -84.3%
          10 wiki title orddem 147 1000 4984 12.7 664.2 2.1 77.0 -88.4%
          10 wiki title val text 1000 97191 12.6 100.4 2.4 33.3 -66.8%
          10 wiki title ord text 1000 97191 12.6 100.4 2.2 80.0 -20.3%
          10 wiki title ordval text 1000 97191 12.6 100.4 2.2 90.3 -10.1%
          10 wiki title orddem text 1000 97191 12.6 100.4 2.1 72.1 -28.2%
          10 wiki title val 1 1000 386435 12.7 42.4 2.5 14.7 -65.3%
          10 wiki title ord 1 1000 386435 12.7 42.4 2.6 50.2 18.4%
          10 wiki title ordval 1 1000 386435 12.7 42.4 2.2 51.3 21.0%
          10 wiki title orddem 1 1000 386435 12.7 42.4 2.2 47.3 11.6%
          numSeg index sortBy method query topN hits warm qps warmnew qpsnew pctg
          100 simple country val text 1000 2000000 1.0 8.5 2.1 9.2 8.2%
          100 simple country ord text 1000 2000000 1.0 8.5 0.6 10.7 25.9%
          100 simple country ordval text 1000 2000000 1.0 8.5 0.6 10.3 21.2%
          100 simple country orddem text 1000 2000000 1.0 8.5 0.6 10.2 20.0%
          100 wiki title val 147 1000 4984 93.8 442.8 3.6 238.8 -46.1%
          100 wiki title ord 147 1000 4984 93.8 442.8 2.3 19.9 -95.5%
          100 wiki title ordval 147 1000 4984 93.8 442.8 2.2 28.0 -93.7%
          100 wiki title orddem 147 1000 4984 93.8 442.8 2.2 54.1 -87.8%
          100 wiki title val text 1000 97191 93.4 88.0 3.1 33.1 -62.4%
          100 wiki title ord text 1000 97191 93.4 88.0 2.3 27.8 -68.4%
          100 wiki title ordval text 1000 97191 93.4 88.0 2.2 40.9 -53.5%
          100 wiki title orddem text 1000 97191 93.4 88.0 2.2 53.3 -39.4%
          100 wiki title val 1 1000 386435 92.8 41.0 3.2 14.7 -64.1%
          100 wiki title ord 1 1000 386435 92.8 41.0 2.7 25.3 -38.3%
          100 wiki title ordval 1 1000 386435 92.8 41.0 2.2 33.8 -17.6%
          100 wiki title orddem 1 1000 386435 92.8 41.0 2.2 42.7 4.1%

          Haven't digested these results yet...

          Michael McCandless added a comment -

          Sorry, disregard those results above... I think there's a bug in the on-demand comparator.

          Michael McCandless added a comment -

          OK new patch attached:

          • Fixed the bug (I had added) in StringOrdValOnDemComparator that
            was doing too much work on first segment (especially skewed
            1-segment index results).
          • Removed "this.slot = slot" from StringOrdComparator,
            StringOrdValComparator and StringOrdValOnDemWinComparator. I
            think this was a bug that was causing setNextReader to not convert
            enough of the queue in my tests. Unfortunately, this makes
            performance worse for these classes.
          • Tweaked null checks for string values to be tiny bit faster if there
            are nulls.

          New benchmark results for topN=1000:

          numSeg index sortBy method query topN hits warm qps warmnew qpsnew pctg
          1 simple country val text 1000 2000000 0.9 10.0 0.7 11.1 11.0%
          1 simple country ord text 1000 2000000 0.9 10.0 0.7 13.3 33.0%
          1 simple country ordval text 1000 2000000 0.9 10.0 0.7 12.7 27.0%
          1 simple country orddem text 1000 2000000 0.9 10.0 0.6 12.5 25.0%
          1 wiki title val 147 1000 4984 2.1 740.6 2.3 369.3 -50.1%
          1 wiki title ord 147 1000 4984 2.1 740.6 2.1 808.1 9.1%
          1 wiki title ordval 147 1000 4984 2.1 740.6 2.0 815.7 10.1%
          1 wiki title orddem 147 1000 4984 2.1 740.6 2.1 731.7 -1.2%
          1 wiki title val text 1000 97191 2.1 108.7 2.4 32.6 -70.0%
          1 wiki title ord text 1000 97191 2.1 108.7 2.1 121.1 11.4%
          1 wiki title ordval text 1000 97191 2.1 108.7 2.1 119.0 9.5%
          1 wiki title orddem text 1000 97191 2.1 108.7 2.1 114.8 5.6%
          1 wiki title val 1 1000 386435 2.1 46.2 2.4 12.7 -72.5%
          1 wiki title ord 1 1000 386435 2.1 46.2 2.1 58.5 26.6%
          1 wiki title ordval 1 1000 386435 2.1 46.2 2.1 56.7 22.7%
          1 wiki title orddem 1 1000 386435 2.1 46.2 2.1 55.2 19.5%
          numSeg index sortBy method query topN hits warm qps warmnew qpsnew pctg
          10 simple country val text 1000 2000000 0.8 9.7 0.6 11.0 13.4%
          10 simple country ord text 1000 2000000 0.8 9.7 0.6 13.1 35.1%
          10 simple country ordval text 1000 2000000 0.8 9.7 0.6 12.5 28.9%
          10 simple country orddem text 1000 2000000 0.8 9.7 0.6 12.5 28.9%
          10 wiki title val 147 1000 4984 12.7 664.2 2.4 382.9 -42.4%
          10 wiki title ord 147 1000 4984 12.7 664.2 2.2 57.9 -91.3%
          10 wiki title ordval 147 1000 4984 12.7 664.2 2.1 71.5 -89.2%
          10 wiki title orddem 147 1000 4984 12.7 664.2 2.1 91.1 -86.3%
          10 wiki title val text 1000 97191 12.6 100.4 2.4 33.4 -66.7%
          10 wiki title ord text 1000 97191 12.6 100.4 2.2 62.9 -37.4%
          10 wiki title ordval text 1000 97191 12.6 100.4 2.2 75.9 -24.4%
          10 wiki title orddem text 1000 97191 12.6 100.4 2.2 79.6 -20.7%
          10 wiki title val 1 1000 386435 12.7 42.4 2.4 14.7 -65.3%
          10 wiki title ord 1 1000 386435 12.7 42.4 2.7 45.2 6.6%
          10 wiki title ordval 1 1000 386435 12.7 42.4 2.1 48.5 14.4%
          10 wiki title orddem 1 1000 386435 12.7 42.4 2.2 50.2 18.4%
          numSeg index sortBy method query topN hits warm qps warmnew qpsnew pctg
          100 simple country val text 1000 2000000 1.0 8.5 0.7 9.2 8.2%
          100 simple country ord text 1000 2000000 1.0 8.5 0.6 10.1 18.8%
          100 simple country ordval text 1000 2000000 1.0 8.5 0.6 9.7 14.1%
          100 simple country orddem text 1000 2000000 1.0 8.5 0.6 10.3 21.2%
          100 wiki title val 147 1000 4984 93.8 442.8 2.3 240.7 -45.6%
          100 wiki title ord 147 1000 4984 93.8 442.8 2.3 12.3 -97.2%
          100 wiki title ordval 147 1000 4984 93.8 442.8 2.2 18.4 -95.8%
          100 wiki title orddem 147 1000 4984 93.8 442.8 2.1 57.7 -87.0%
          100 wiki title val text 1000 97191 93.4 88.0 2.3 33.3 -62.2%
          100 wiki title ord text 1000 97191 93.4 88.0 2.3 17.7 -79.9%
          100 wiki title ordval text 1000 97191 93.4 88.0 2.2 27.8 -68.4%
          100 wiki title orddem text 1000 97191 93.4 88.0 2.2 54.4 -38.2%
          100 wiki title val 1 1000 386435 92.8 41.0 2.4 14.8 -63.9%
          100 wiki title ord 1 1000 386435 92.8 41.0 2.7 16.6 -59.5%
          100 wiki title ordval 1 1000 386435 92.8 41.0 2.2 27.9 -32.0%
          100 wiki title orddem 1 1000 386435 92.8 41.0 2.2 43.2 5.4%
          Michael McCandless added a comment -

          Results with topN=10:

          numSeg index sortBy method query topN hits warm qps warmnew qpsnew pctg
          1 simple country val text 10 2000000 0.6 10.7 0.7 11.7 9.3%
          1 simple country ord text 10 2000000 0.6 10.7 0.6 13.8 29.0%
          1 simple country ordval text 10 2000000 0.6 10.7 0.6 13.3 24.3%
          1 simple country orddem text 10 2000000 0.6 10.7 0.7 13.0 21.5%
          1 wiki title val 147 10 4984 2.1 3743.8 2.3 2441.8 -34.8%
          1 wiki title ord 147 10 4984 2.1 3743.8 2.0 4426.2 18.2%
          1 wiki title ordval 147 10 4984 2.1 3743.8 2.1 4352.7 16.3%
          1 wiki title orddem 147 10 4984 2.1 3743.8 2.0 4063.6 8.5%
          1 wiki title val text 10 97191 2.1 144.2 2.3 39.1 -72.9%
          1 wiki title ord text 10 97191 2.1 144.2 2.1 164.5 14.1%
          1 wiki title ordval text 10 97191 2.1 144.2 2.1 162.6 12.8%
          1 wiki title orddem text 10 97191 2.1 144.2 2.1 157.3 9.1%
          1 wiki title val 1 10 386435 2.1 51.2 2.4 13.6 -73.4%
          1 wiki title ord 1 10 386435 2.1 51.2 2.1 65.7 28.3%
          1 wiki title ordval 1 10 386435 2.1 51.2 2.1 64.7 26.4%
          1 wiki title orddem 1 10 386435 2.1 51.2 2.1 60.4 18.0%
          numSeg index sortBy method query topN hits warm qps warmnew qpsnew pctg
          10 simple country val text 10 2000000 0.7 10.4 0.6 11.6 11.5%
          10 simple country ord text 10 2000000 0.7 10.4 0.6 13.6 30.8%
          10 simple country ordval text 10 2000000 0.7 10.4 0.6 13.1 26.0%
          10 simple country orddem text 10 2000000 0.7 10.4 0.6 13.1 26.0%
          10 wiki title val 147 10 4984 12.5 3004.5 2.5 1732.9 -42.3%
          10 wiki title ord 147 10 4984 12.5 3004.5 2.1 3067.2 2.1%
          10 wiki title ordval 147 10 4984 12.5 3004.5 2.1 3283.5 9.3%
          10 wiki title orddem 147 10 4984 12.5 3004.5 2.1 3237.9 7.8%
          10 wiki title val text 10 97191 12.7 139.4 2.4 38.6 -72.3%
          10 wiki title ord text 10 97191 12.7 139.4 2.1 159.5 14.4%
          10 wiki title ordval text 10 97191 12.7 139.4 2.1 160.5 15.1%
          10 wiki title orddem text 10 97191 12.7 139.4 2.1 154.0 10.5%
          10 wiki title val 1 10 386435 12.7 50.3 2.5 15.6 -69.0%
          10 wiki title ord 1 10 386435 12.7 50.3 2.1 64.8 28.8%
          10 wiki title ordval 1 10 386435 12.7 50.3 2.1 64.0 27.2%
          10 wiki title orddem 1 10 386435 12.7 50.3 2.1 59.4 18.1%
          numSeg index sortBy method query topN hits warm qps warmnew qpsnew pctg
          100 simple country val text 10 2000000 1.0 8.8 2.8 9.4 6.8%
          100 simple country ord text 10 2000000 1.0 8.8 0.6 11.1 26.1%
          100 simple country ordval text 10 2000000 1.0 8.8 0.6 10.7 21.6%
          100 simple country orddem text 10 2000000 1.0 8.8 0.6 10.7 21.6%
          100 wiki title val 147 10 4984 94.6 1066.9 3.3 454.8 -57.4%
          100 wiki title ord 147 10 4984 94.6 1066.9 2.1 519.2 -51.3%
          100 wiki title ordval 147 10 4984 94.6 1066.9 2.1 692.5 -35.1%
          100 wiki title orddem 147 10 4984 94.6 1066.9 2.1 778.1 -27.1%
          100 wiki title val text 10 97191 94.9 110.2 2.7 38.9 -64.7%
          100 wiki title ord text 10 97191 94.9 110.2 2.1 122.8 11.4%
          100 wiki title ordval text 10 97191 94.9 110.2 2.1 126.3 14.6%
          100 wiki title orddem text 10 97191 94.9 110.2 2.1 124.7 13.2%
          100 wiki title val 1 10 386435 94.3 47.9 2.8 15.8 -67.0%
          100 wiki title ord 1 10 386435 94.3 47.9 2.1 59.1 23.4%
          100 wiki title ordval 1 10 386435 94.3 47.9 2.2 58.9 23.0%
          100 wiki title orddem 1 10 386435 94.3 47.9 2.2 54.9 14.6%
          Mark Miller added a comment -

          Hmm... I'm going to have to break down and think hard about the slot issue. You have switched it to values.length - here is what I'm thinking: in the case where 3 hits come from a Reader, but you ask for 1000 back, that will run through that loop 1000 times, but you only need to convert 3, right? This is what I was attempting - what test are you running to see the failure? The idea for me was that if you only have 20 hits so far on a top 1000 search, you only need to hit that loop to convert 20 values to the new Reader, rather than 1000 every time (though you spin fast hitting the ==null). I'm not sure - xmas shopping has burned up what's left of my brain cells. I'll let it stew, perhaps run some code, and come back.

          Michael McCandless added a comment -

          > In the case where 3 hits come from a Reader, but you ask for 1000 back, that will run through that loop 1000 times, but you only need to convert 3 right?

          Well, it's tricky... and indeed all tests pass with the bug (which is
          spooky – I think we need to add cases to TestSort where 1) the index
          has many segments, and 2) the number of hits is much greater than the
          queue size), but I'm pretty sure it's a bug.

          You're right: it'd be nice to only visit the "used" slots in the queue
          on advancing to each reader. During the "startup transient" (when the
          collector has not yet seen enough hits to fill its queue), the slot
          indeed increases one at a time, and you could at that point use it to
          efficiently visit only the used slots.

          However, after the startup transient, the pqueue tracks the weakest
          entry in the queue, which can occur in any of the slots, and when a
          hit that beats that weakest entry arrives, it will call copy() into
          that slot.

          So the slot passed to copy is now a "relatively random" value. For a
          1000 sized queue whose slots are full, you might get a copy into slot
          242. In this case we were incorrectly setting "this.slot" to 242 and
          then only converting the first 242 entries.

          If we changed it to track the maxSlot it should work... but I'm not
          sure this is worthwhile, since it only speeds up already super-fast
          searches and slightly hurts slow searches.
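
          (For the record, the maxSlot variant would be tiny; a sketch with
          assumed names, where convert() stands in for remapping one slot to
          the new reader:

            void copy(int slot, int doc) {
              if (slot > maxSlot) maxSlot = slot;  // grows during startup transient
              values[slot] = lookup[order[doc]];
            }

            void setNextReader(IndexReader reader) throws IOException {
              // ... load the new reader's lookup/order ...
              for (int i = 0; i <= maxSlot; i++) {
                convert(i);  // never-used slots above maxSlot are skipped
              }
            }

          but once the queue fills, maxSlot is pinned at topN-1 and this
          degenerates to converting everything anyway.)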

          Michael McCandless added a comment -

          New patch attached:

          • I moved maxScore tracking up into TopFieldCollector to save a
            method call per collect().
          • Deprecated TopDocs.get/setMaxScore()
          • Inlined the common "only 1 comparator" case
          • Some small optimizations
          • Fixed a bug in reverse sorting & short-circuit testing
          • Renamed a few attrs

          In addition to fixing TestSort to test a multi-segment index where
          totalHits is much greater than topN, we should include reverse
          sorting (to tickle the bug I just found) as well as sorting by 2 or
          more fields. Mark, do you want to take a stab at this? Given how
          many sneaky bugs we're uncovering just by staring at the code (for
          long enough), I'd like to increase test coverage....
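
          A starting point for such a test might be (a sketch only, not the
          actual TestSort change; 2.9-era API assumed, imports elided):

            // Many tiny segments, totalHits >> topN, reverse String sort.
            RAMDirectory dir = new RAMDirectory();
            IndexWriter writer = new IndexWriter(dir, new WhitespaceAnalyzer(), true);
            writer.setMaxBufferedDocs(2);   // flush constantly: lots of small segments
            writer.setMergeFactor(1000);    // ...and keep them from merging away
            for (int i = 0; i < 500; i++) {
              Document doc = new Document();
              doc.add(new Field("title", "title" + i, Field.Store.NO,
                                Field.Index.NOT_ANALYZED));
              writer.addDocument(doc);
            }
            writer.close();

            IndexSearcher searcher = new IndexSearcher(dir);
            Sort sort = new Sort(new SortField("title", SortField.STRING, true)); // reverse
            TopDocs top = searcher.search(new MatchAllDocsQuery(), null, 10, sort);
            // verify against a brute-force sort of all 500 values, so a queue
            // that mis-converts ords across segments fails loudly

          I'd also repeat it with new Sort(new SortField[] { new
          SortField("title", SortField.STRING), SortField.FIELD_DOC }) to
          cover the multi-comparator path.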

          Here're the topN=10 results with this patch:

          numSeg index sortBy method query topN hits warm qps warmnew qpsnew pctg
          1 simple country val text 10 2000000 0.6 10.7 0.7 11.9 11.2%
          1 simple country ord text 10 2000000 0.6 10.7 0.6 14.2 32.7%
          1 simple country ordval text 10 2000000 0.6 10.7 0.6 14.3 33.6%
          1 simple country orddem text 10 2000000 0.6 10.7 0.6 13.4 25.2%
          1 wiki title val 147 10 4984 2.1 3743.8 2.4 2451.8 -34.5%
          1 wiki title ord 147 10 4984 2.1 3743.8 2.1 4459.4 19.1%
          1 wiki title ordval 147 10 4984 2.1 3743.8 2.1 4478.2 19.6%
          1 wiki title orddem 147 10 4984 2.1 3743.8 2.0 4233.9 13.1%
          1 wiki title val text 10 97191 2.1 144.2 2.4 38.9 -73.0%
          1 wiki title ord text 10 97191 2.1 144.2 2.1 165.0 14.4%
          1 wiki title ordval text 10 97191 2.1 144.2 2.1 159.9 10.9%
          1 wiki title orddem text 10 97191 2.1 144.2 2.1 161.4 11.9%
          1 wiki title val 1 10 386435 2.1 51.2 2.4 13.5 -73.6%
          1 wiki title ord 1 10 386435 2.1 51.2 2.1 67.1 31.1%
          1 wiki title ordval 1 10 386435 2.1 51.2 2.1 66.6 30.1%
          1 wiki title orddem 1 10 386435 2.1 51.2 2.1 64.7 26.4%
          numSeg index sortBy method query topN hits warm qps warmnew qpsnew pctg
          10 simple country val text 10 2000000 0.7 10.4 0.7 11.6 11.5%
          10 simple country ord text 10 2000000 0.7 10.4 0.5 13.9 33.7%
          10 simple country ordval text 10 2000000 0.7 10.4 0.5 14.0 34.6%
          10 simple country orddem text 10 2000000 0.7 10.4 0.5 13.2 26.9%
          10 wiki title val 147 10 4984 12.5 3004.5 2.6 1695.3 -43.6%
          10 wiki title ord 147 10 4984 12.5 3004.5 2.1 3072.8 2.3%
          10 wiki title ordval 147 10 4984 12.5 3004.5 2.1 3328.7 10.8%
          10 wiki title orddem 147 10 4984 12.5 3004.5 2.1 3295.1 9.7%
          10 wiki title val text 10 97191 12.7 139.4 2.4 38.7 -72.2%
          10 wiki title ord text 10 97191 12.7 139.4 2.1 158.9 14.0%
          10 wiki title ordval text 10 97191 12.7 139.4 2.1 161.7 16.0%
          10 wiki title orddem text 10 97191 12.7 139.4 2.1 157.7 13.1%
          10 wiki title val 1 10 386435 12.7 50.3 2.5 15.6 -69.0%
          10 wiki title ord 1 10 386435 12.7 50.3 2.1 65.4 30.0%
          10 wiki title ordval 1 10 386435 12.7 50.3 2.1 66.4 32.0%
          10 wiki title orddem 1 10 386435 12.7 50.3 2.1 63.5 26.2%
          numSeg index sortBy method query topN hits warm qps warmnew qpsnew pctg
          100 simple country val text 10 2000000 1.0 8.8 3.1 9.5 8.0%
          100 simple country ord text 10 2000000 1.0 8.8 0.6 11.4 29.5%
          100 simple country ordval text 10 2000000 1.0 8.8 0.6 11.3 28.4%
          100 simple country orddem text 10 2000000 1.0 8.8 0.6 11.0 25.0%
          100 wiki title val 147 10 4984 94.6 1066.9 3.7 456.8 -57.2%
          100 wiki title ord 147 10 4984 94.6 1066.9 2.1 522.2 -51.1%
          100 wiki title ordval 147 10 4984 94.6 1066.9 2.1 667.1 -37.5%
          100 wiki title orddem 147 10 4984 94.6 1066.9 2.1 781.7 -26.7%
          100 wiki title val text 10 97191 94.9 110.2 2.8 38.4 -65.2%
          100 wiki title ord text 10 97191 94.9 110.2 2.1 123.0 11.6%
          100 wiki title ordval text 10 97191 94.9 110.2 2.2 127.3 15.5%
          100 wiki title orddem text 10 97191 94.9 110.2 2.2 126.8 15.1%
          100 wiki title val 1 10 386435 94.3 47.9 2.8 15.8 -67.0%
          100 wiki title ord 1 10 386435 94.3 47.9 2.1 59.8 24.8%
          100 wiki title ordval 1 10 386435 94.3 47.9 2.2 60.8 26.9%
          100 wiki title orddem 1 10 386435 94.3 47.9 2.2 59.0 23.2%
          Mark Miller added a comment -

          Just as a reminder to myself - we also need a custom comparator test.

          Michael McCandless added a comment -

> If we changed it to track the maxSlot it should work... but I'm not
          > sure this is worthwhile, since it only speeds up already super-fast
          > searches and slightly hurts slow searches.

          OK I realized we could do this with very little added cost, by passing in "numFilledSlots" to FieldComparator.setNextReader. I'm attaching the patch with this.
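
Sketched, that tweak looks something like this (abridged, with an invented class name; FieldCache loading and the conversion loop itself are elided):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;

// Abridged sketch: the collector tracks how many queue slots actually hold
// hits and hands that to the comparator, so the per-segment conversion
// runs over [0, numFilledSlots) instead of the whole queue.
abstract class FieldComparatorSketch {
  abstract void setNextReader(IndexReader reader, int numFilledSlots)
      throws IOException;
}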

          Michael McCandless added a comment -

          Current results with topN=1000:

          numSeg index sortBy method query topN hits warm qps warmnew qpsnew pctg
          1 simple country val text 1000 2000000 0.9 10.0 0.8 11.4 14.0%
          1 simple country ord text 1000 2000000 0.9 10.0 0.6 13.7 37.0%
          1 simple country ordval text 1000 2000000 0.9 10.0 0.6 13.3 33.0%
          1 simple country orddem text 1000 2000000 0.9 10.0 0.7 13.1 31.0%
          1 wiki title val 147 1000 4984 2.1 740.6 2.3 381.8 -48.4%
          1 wiki title ord 147 1000 4984 2.1 740.6 2.1 905.2 22.2%
          1 wiki title ordval 147 1000 4984 2.1 740.6 2.1 906.4 22.4%
          1 wiki title orddem 147 1000 4984 2.1 740.6 2.1 834.5 12.7%
          1 wiki title val text 1000 97191 2.1 108.7 2.4 32.9 -69.7%
          1 wiki title ord text 1000 97191 2.1 108.7 2.1 124.6 14.6%
          1 wiki title ordval text 1000 97191 2.1 108.7 2.1 123.7 13.8%
          1 wiki title orddem text 1000 97191 2.1 108.7 2.1 119.9 10.3%
          1 wiki title val 1 1000 386435 2.1 46.2 2.4 12.6 -72.7%
          1 wiki title ord 1 1000 386435 2.1 46.2 2.1 60.3 30.5%
          1 wiki title ordval 1 1000 386435 2.1 46.2 2.1 59.3 28.4%
          1 wiki title orddem 1 1000 386435 2.1 46.2 2.1 57.9 25.3%
          numSeg index sortBy method query topN hits warm qps warmnew qpsnew pctg
          10 simple country val text 1000 2000000 0.8 9.7 0.7 11.2 15.5%
          10 simple country ord text 1000 2000000 0.8 9.7 0.6 13.5 39.2%
          10 simple country ordval text 1000 2000000 0.8 9.7 0.6 13.0 34.0%
          10 simple country orddem text 1000 2000000 0.8 9.7 0.6 12.8 32.0%
          10 wiki title val 147 1000 4984 12.7 664.2 2.4 417.3 -37.2%
          10 wiki title ord 147 1000 4984 12.7 664.2 2.2 58.3 -91.2%
          10 wiki title ordval 147 1000 4984 12.7 664.2 2.1 72.2 -89.1%
          10 wiki title orddem 147 1000 4984 12.7 664.2 2.1 92.5 -86.1%
          10 wiki title val text 1000 97191 12.6 100.4 2.4 33.5 -66.6%
          10 wiki title ord text 1000 97191 12.6 100.4 2.3 65.3 -35.0%
          10 wiki title ordval text 1000 97191 12.6 100.4 2.2 78.7 -21.6%
          10 wiki title orddem text 1000 97191 12.6 100.4 2.2 79.8 -20.5%
          10 wiki title val 1 1000 386435 12.7 42.4 2.5 14.6 -65.6%
          10 wiki title ord 1 1000 386435 12.7 42.4 2.7 46.2 9.0%
          10 wiki title ordval 1 1000 386435 12.7 42.4 2.1 51.1 20.5%
          10 wiki title orddem 1 1000 386435 12.7 42.4 2.1 51.5 21.5%
          numSeg index sortBy method query topN hits warm qps warmnew qpsnew pctg
          100 simple country val text 1000 2000000 1.0 8.5 1.3 9.2 8.2%
          100 simple country ord text 1000 2000000 1.0 8.5 0.6 10.3 21.2%
          100 simple country ordval text 1000 2000000 1.0 8.5 0.6 10.0 17.6%
          100 simple country orddem text 1000 2000000 1.0 8.5 0.6 10.6 24.7%
          100 wiki title val 147 1000 4984 93.8 442.8 2.9 245.2 -44.6%
          100 wiki title ord 147 1000 4984 93.8 442.8 2.3 12.0 -97.3%
          100 wiki title ordval 147 1000 4984 93.8 442.8 2.2 18.0 -95.9%
          100 wiki title orddem 147 1000 4984 93.8 442.8 2.1 58.2 -86.9%
          100 wiki title val text 1000 97191 93.4 88.0 2.5 33.5 -61.9%
          100 wiki title ord text 1000 97191 93.4 88.0 2.3 17.6 -80.0%
          100 wiki title ordval text 1000 97191 93.4 88.0 2.2 29.8 -66.1%
          100 wiki title orddem text 1000 97191 93.4 88.0 2.2 56.6 -35.7%
          100 wiki title val 1 1000 386435 92.8 41.0 2.6 14.9 -63.7%
          100 wiki title ord 1 1000 386435 92.8 41.0 2.7 16.5 -59.8%
          100 wiki title ordval 1 1000 386435 92.8 41.0 2.2 28.6 -30.2%
          100 wiki title orddem 1 1000 386435 92.8 41.0 2.2 44.1 7.6%
          Michael McCandless added a comment -

Given how different the results are, depending on how many segments
the index has, the queue size, how many hits the search gets, etc., I
think we need a dynamic solution: in certain situations (many hits,
small queue depth, small number of large segments) you use ORD, but
other times you use ORDDEM.

          So I'm thinking setNextReader should return a new comparator? Often
          it would simply return itself, but if it deems it worthwhile to switch
          eg from ORD to ORDDEM it would switch to ORDDEM and return that.

          EG for real-time search we may have a tail of a zillion small
          segments...

Then I also thought of a wild possible change: when searching, it'd
be best to visit the segments from largest to smallest, doing ORD in
the beginning and switching to ORDDEM at some point. So, could we do
this? I think we only "require" in-order docs within a segment, so
could we switch up the segment order? We'd need to fix the
setNextReader API to take in reader & docBase. Would that work?
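
A sketch of what that revised contract could look like (class names and the 10k cutoff are invented; the ORDDEM internals are stubbed out):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;

abstract class SwappableComparatorSketch {
  /** Returns the comparator to use for this reader: usually "this", but
   *  possibly a replacement that is cheaper for the remaining segments.
   *  docBase travels with the reader so out-of-order visits still work. */
  abstract SwappableComparatorSketch setNextReader(IndexReader reader, int docBase)
      throws IOException;
}

// An ORD comparator that hands over to an on-demand variant once the
// segments get small.
class OrdComparatorSketch extends SwappableComparatorSketch {
  private int segmentsSeen;
  SwappableComparatorSketch setNextReader(IndexReader reader, int docBase)
      throws IOException {
    segmentsSeen++;
    if (segmentsSeen > 1 && reader.numDocs() < 10000) {
      return new OrddemComparatorSketch(); // cheaper segment transitions from here on
    }
    return this; // keep using straight ords
  }
}

class OrddemComparatorSketch extends SwappableComparatorSketch {
  SwappableComparatorSketch setNextReader(IndexReader reader, int docBase) {
    return this; // ORDDEM pays per-compare instead of per-transition, so it stays
  }
}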

          Mark Miller added a comment -

> Given how different the results are, depending on how many segments
> the index has, the queue size, how many hits the search gets, etc., I
> think we need a dynamic solution: in certain situations (many hits,
> small queue depth, small number of large segments) you use ORD, but
> other times you use ORDDEM.

          Sounds interesting...

> So I'm thinking setNextReader should return a new comparator? Often
> it would simply return itself, but if it deems it worthwhile to switch
> eg from ORD to ORDDEM it would switch to ORDDEM and return that.

I like that, I think. The only other option I see offhand is a comparator that can do both, but that's not as clean and probably adds a check in tightly looped code.

> Then I also thought of a wild possible change: when searching, it'd
> be best to visit the segments from largest to smallest, doing ORD in
> the beginning and switching to ORDDEM at some point. So, could we do
> this? I think we only "require" in-order docs within a segment, so
> could we switch up the segment order? We'd need to fix the
> setNextReader API to take in reader & docBase. Would that work?

I think this could work well. Since you are likely to have a few large segments, ord would be fastest, then as you moved through the many small segments, orddem would likely work best. Is largest to smallest best though? You do get to map onto smaller term[] arrays as you go, but that causes more fallback. You are also likely to be carrying more hits in the queue into the next reader, right? From smallest to largest you likely have fewer hits to map as you hit the big segments, and more room to fit in for less fallback. So the question is, what obvious piece am I missing?

Largest to smallest, you fill the queue faster earlier. So there's more to convert as you hit all the other segments - but I guess that will be heavily mitigated by on-demand conversion... You will convert slot.min, and if nothing beats it, I guess that's it... so not so bad actually. And if you go smallest to largest, I guess the queue won't be full, so there will be more 'wins' into the queue, which will cause more conversions over the small segments... in which case, for a ton of them and a big queue, largest to smallest seems better. Still feel like I'm missing something, but I guess I have convinced myself largest to smallest is the way to go.
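
As a sketch of the mechanics (names invented; generics used for brevity even though the 2.9 line targets older Java): compute docBases in index order first, then visit the readers largest-first.

import java.util.Arrays;
import java.util.Comparator;
import org.apache.lucene.index.IndexReader;

class SubReaderOrderSketch {
  // Returns the visit order (indices into subs), largest segment first,
  // and fills docBases with each reader's base in the *original* order so
  // collected doc ids stay correct however the readers are visited.
  static Integer[] largestFirst(final IndexReader[] subs, int[] docBases) {
    int base = 0;
    for (int i = 0; i < subs.length; i++) {
      docBases[i] = base;
      base += subs[i].maxDoc();
    }
    Integer[] visitOrder = new Integer[subs.length];
    for (int i = 0; i < subs.length; i++) visitOrder[i] = i;
    Arrays.sort(visitOrder, new Comparator<Integer>() {
      public int compare(Integer a, Integer b) {
        return subs[b].numDocs() - subs[a].numDocs(); // biggest first
      }
    });
    return visitOrder;
  }
}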

          I'm probably not the first to wish there were more hours in the day...

          I'll put up a patch with the better testing soon just in case.

          Michael McCandless added a comment -

> The only other option I see offhand is a comparator that can do both, but that's not as clean and probably adds a check in tightly looped code.

Right, I wanted to avoid the inner-loop check by swapping out the comparator in between segments. Though, modern CPUs are quite good when an if-statement consistently goes one way, so it could be that a single comparator that does internal switching would perform fine. Still, if we fix the API to return a new comparator, we can then allow both options.

          I think in some cases we'd even fall back to VAL comparison.

          > Is largest to smallest best though?

          Good question; it's not obvious. We should try both, and perhaps allow for the collector to optionally specify the order.

My thinking was the first large segment using ORD is "free" (because ORD is only costly on switching segments). If there are many hits, likely the queue has done most of the work it'll do (ie, the majority of the total # insertions will have been done), unless the search is "degenerate". Perhaps the second segment, if large, warrants ORD, but then sometime soonish you'd switch to ORDDEM or VAL.

          The "long tail" of tiny segments would then normally be zipped through w/ hardly any insertions, so a higher insertion cost (with zero segment transition cost) is OK.

          But you're right: if we do the tiny segments first, then the queue would be small so transition cost is lower.

          We should make it simple to override a method to implement your own "search plan", and then provide a default heuristic that decides when to switch comparators. Probably that default heuristic should be based on how often compare was actually invoked for the segment. EG if the String sort is secondary to a numeric sort then even if there are many hits, if the numeric sort mostly wins (doesn't have many compare(...) == 0's) then the String sort should probably immediately switch to VAL after the first segment.

          Mark Miller added a comment -

Nice Mike! Definitely what needs to be done and very cool. Pluggable policy sounds great. You should be able to ride the right turns often enough to really reduce some of those losses. Ord with no transition cost on the largest segment should be a nice win on its own. I think that's what I was missing - if you come from smallest, you have to map on the ord switch.

          Mark Miller added a comment -

I was playing with this late last night and I got some of the work moving. I've got it so you can work the readers in any order (sorting by size at the moment, largest to smallest), but in doing so, straight ord no longer works. This is likely because the testing has gotten a little better, but I am not sure what's up. The other comparators work fine. Got a little of the "return a new comparator" type stuff in too, but nothing finished.

Kind of got caught up on ord - when that's fixed I'll put up a patch though.

          Mark Miller added a comment -

          Alright, my new years gift to myself is going to be to work on this a bit.

I've got part of the straight ord problem - if multiple queue entries clashed on conversion, and they were indeed not just clashes but clashes with the same original value, they were still getting an incremented subord when they should get an identical subord. So I've taken care of that, but something still fails. Looking.

On a side note though, we probably want another straight ords comparator that does not fall back to subords. Then we can use that as the first comparator on a large segment, and equal values won't have to needlessly fall back to subords for a worthless 0-on-0 check.
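
A sketch of the two compare flavors (field names invented): the subord fallback is only needed once slots may hold converted, approximate ords from earlier segments.

class OrdCompareSketch {
  int[] ords = new int[1000];     // per-slot ords in the current reader's term order
  int[] subords = new int[1000];  // per-slot tie-breaks assigned during conversion

  // With fallback: equal (converted) ords may still differ, so break ties.
  int compareWithFallback(int slot1, int slot2) {
    int cmp = ords[slot1] - ords[slot2]; // ords are small non-negative ints
    return cmp != 0 ? cmp : subords[slot1] - subords[slot2];
  }

  // Pure ord, as proposed above for the first (largest) segment: every
  // slot's ord is exact there, so equal ords really are equal values.
  int comparePure(int slot1, int slot2) {
    return ords[slot1] - ords[slot2];
  }
}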

          Mark Miller added a comment -

Cool, got the other one. When copying the doc into a slot (FieldComparator.copy), the subord had to be set to 0 so that it didn't retain a stale value. Straight ord looks good now, as do the rest, so I'll try and get a custom comparator policy in there so we can benchmark a couple strategies.
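
Sketched, the fix is a one-line reset in copy() (all names invented):

class CopyResetSketch {
  int[] ords = new int[1000];      // per-slot ords
  int[] subords = new int[1000];   // per-slot tie-breaks, assigned on conversion
  String[] values = new String[1000];
  int[] currentReaderOrds;         // doc -> ord for the current reader
  String[] lookup;                 // sorted unique terms for the current reader

  void copy(int slot, int doc) {
    ords[slot] = currentReaderOrds[doc]; // exact ord within this reader
    values[slot] = lookup[ords[slot]];   // keep the value for later re-keying
    subords[slot] = 0;                   // the fix: an exact ord carries no stale subord
  }
}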

Rather than a new Comparator for a pure ord that doesn't fall back to subord, perhaps the policy could return an overridden straight ord if it's first and on the first index reader...

          Mark Miller added a comment -

Still pretty ugly, but this patch has a working ord I think, allows readers in any order (giving largest to smallest now), and has a hacky ComparatorPolicy that can alter the comparators being used. Right now there is an ugly little one for ORD that switches to ORD_DEM after the first segment.

Also added something we may take out - an option to not fill fields (which we can't use for back compat, but a user could). If you are not using a MultiSearcher, it's kind of a waste of time to fill fields.

          Much to do, but fully functional I think.

          Mark Miller added a comment - - edited

          So what looks like a promising strategy?

Off the top of my head, I am thinking something as simple as:

• start with ORD with no fallback on the largest segment
• if the next segments are fairly large, use ORD_VAL
• if the segments get somewhat smaller, move to ORD_DEM

          Oddly, I've seen VAL perform well in certain situations, so maybe it has its place, but I don't know where yet.

Edit:

Oh, yeah, queue size should also play a role in the switching.
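
That strategy might sketch out as a pluggable policy roughly like this (the interface, constants, and thresholds are all invented; the patch's actual ComparatorPolicy API may differ):

import org.apache.lucene.index.IndexReader;

interface StrategySketch {
  int ORD = 0, ORD_VAL = 1, ORD_DEM = 2;
  int methodFor(IndexReader reader, int segmentIndex, int queueSize);
}

class SizeBasedStrategy implements StrategySketch {
  public int methodFor(IndexReader reader, int segmentIndex, int queueSize) {
    if (segmentIndex == 0) return ORD;   // largest segment: no fallback cost yet
    // Queue size matters too: a big queue makes each segment transition
    // expensive, pushing toward on-demand conversion sooner.
    if (reader.numDocs() > 100 * queueSize) return ORD_VAL; // still fairly large
    return ORD_DEM;                      // small tail: convert lazily
  }
}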

          Michael McCandless added a comment -

          Patch looks great! I will run some perf tests once I'm back at home
          base (still on vacation now, far away from home!). Some comments:

          • The comment "use multiple queues" in IndexSearcher.java isn't
            right – it should be "use single reader"?
          • I think FieldValueHitQueue.getComparatorPolicy should receive
            subreaders? EG when there's only one sub-reader, straight ORD is
            best. And possibly its order in the sort? We may want to move
            getComparatorPolicy into SortField so that one may override for
            "expert" cases? But we could do this later.
          • We should fix sorting of sub-readers by maxDoc() to be done once,
            not per query?
          • I think we should sort sub-readers by numDocs() not maxDoc()?
          • We should fix javadocs of MultiReaderHitCollector to explain that
            sub-readers are visited not-in-order. (And other javadoc fixes,
            but this can wait for the "polishing" phase...).
          • I like the "don't fill fields" option
          Mark Miller added a comment -

I'll put up another patch soon with those changes. All look correct. I wasn't sure about maxDoc() or numDocs() (not that I spent much time thinking about it) because the FieldCache loads up all the deleted docs - for things like the terms array to be mapped to, that depends on maxDoc(). But then how many hits are likely to end up in the queue is more related to numDocs(). Switched it to numDocs().

Still have to wrap up the custom comparator, but a ComparatorPolicy changes things there. If you can set a custom ComparatorPolicy (move it to Sort?), then you can easily put in any custom comparators.

- Mark
          Mark Miller added a comment -

          Less ugly, but still some finishing work to do, including some intelligent ComparatorPolicies.

          Mark Miller added a comment -

          No real work, but some more necessary cleanup.

          Michael McCandless added a comment -

          Mark, I see 3 testcase failures in TestSort if I "pretend" that SortField.STRING means STRING_ORD – do you see that?

          I think we should fix TestSort so that it runs N times, each time using a different STRING sort method, to make sure we are covering all these methods?

          Michael McCandless added a comment -

          I prototyped a rough change to the FieldComparator API, whereby
          TopFieldCollector calls setBottom to notify the comparator which slot
          is the bottom of the queue (whenever it changes), and then calls
          compareBottom (which replaces compare(int slot, int doc, float
          score)). This seems to offer decent perf. gains so I think we should
          make this change for real?

          I think it gives good gains because 1) compare to bottom is very
          frequent for a search that has many hits, and where the queue fairly
          quickly converges to the top N, 2) it allows the on-demand comparator
          to pre-cache the bottom's ord, and 3) it saves one array deref.

          TopFieldCollector would guarantee that compareBottom is not called
          unless setBottom was called; during the startup transient, setBottom
          is not called until the queue becomes full.
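
A sketch of the proposed shape (method names from the comment above; bodies and fields invented):

import java.io.IOException;

abstract class BottomAwareComparatorSketch {
  /** Called by the collector whenever the queue's weakest slot changes
   *  (and only once the queue is full). */
  abstract void setBottom(int slot);

  /** Replaces compare(int slot, int doc, float score): compare the incoming
   *  doc against the pre-cached bottom of the queue. */
  abstract int compareBottom(int doc, float score) throws IOException;
}

class StringOrdBottomSketch extends BottomAwareComparatorSketch {
  int[] slotOrds;          // per-slot ords
  int[] currentReaderOrds; // doc -> ord for the current reader (assumed loaded)
  private int bottomOrd;   // cached here: saves an array deref on every hit

  void setBottom(int slot) {
    bottomOrd = slotOrds[slot];
  }

  int compareBottom(int doc, float score) {
    return bottomOrd - currentReaderOrds[doc];
  }
}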

          Michael McCandless added a comment -

          On what ComparatorPolicy to use by default... I think we should start
          with ORD, but gather counters of number of compares vs number of
          copies, and based on those counters (and comparing to numDocs())
          decide "how aggressively" to switch comparators? That determination
          should also take into account the queue size.

          An optimized index would always use ORD (w/o gathering counters),
          which is fastest.
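
A rough cut of that counter-driven default (all names and thresholds invented):

class CountingPolicySketch {
  long compares, copies; // gathered by the comparator, per segment

  boolean shouldSwitchFromOrd(int nextSegmentNumDocs, int queueSize,
                              boolean optimizedIndex) {
    if (optimizedIndex) return false; // single segment: ORD, no counters needed
    if (copies == 0) return true;     // queue stopped changing: transitions are pure cost
    double comparesPerCopy = (double) compares / copies;
    // Many compares per copy means the queue has converged, so per-segment
    // conversion work now dominates and ORD should be abandoned sooner.
    return comparesPerCopy > 100 || nextSegmentNumDocs < 10 * queueSize;
  }
}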

          In the future... we could imagine allowing the query to dictate the
          order that segments are visited. EG if the query can roughly estimate
          how many hits it'll get on a given segment, we could order by that
          instead of simply numDocs().

          The query could also choose an appropriate ComparatorPolicy, eg, if it
          estimates it'll get very few hits, VAL is best right from the start,
          else start with ORD.

          Another future fix would be to implement ORDSUB with a single pass
          through the queue, using a reused secondary pqueue to do the full sort
          of the queue. This would let us assign subords much faster, I think.

          But I don't think we should pursue these optimizations as part of this
          issue... we need to bring closure here; we already have some solid
gains to capture. I think we should wrap up now...
