At first blush, it doesn't look like the lucenebench tests cover sorting and faceting that well.
For example, I tested function queries (ValueSource) and sorting by multiple docvalue fields. Are either of these things tested at all in https://home.apache.org/~mikemccand/lucenebench/ ?
Running diverse fields in the same JVM run is especially important to prevent hotspot from over-optimizing for a single field cardinality (since different cardinalities use different docvalues encodings).
How many different numeric fields are concurrently sorted on for https://home.apache.org/~mikemccand/lucenebench/ ?
The names suggest just one: "TermQuery (date/time sort)"
If that is actually the case, then you're in danger of hotspot over-specializing for that single field/cardinality.
These are all good points, and all things that I would like to improve
about Lucene's nightly benchmarks
(https://home.apache.org/~mikemccand/lucenebench/). Patches welcome!
I'll try to add some low cardinality faceting/sorting coverage, maybe
using month name and day-of-the-year from the last modified date.
The nightly Wikipedia benchmark facets on Date field as a hierarchy
(year/month/day), and sorts on "last modified" (seconds resolution I
think) and title.
I've also long wanted to add highlighters...
A quick test by hand is still more informative than having no information at all.
I disagree: it's reckless to run an overly synthetic benchmark and
then present the results as if they justify making poor API decisions.
If one is measuring performance of a faceting change, then isolate it.
In the ideal world, yes, but this is notoriously problematic to do
with java: hotspot, GC, etc. will all behave very differently if you
are testing a very narrow part of the code.
That's an unnecessary personal dig.
I've already put a lot of effort into benchmarking this, only to have it dismissed with hand waves, for cases that may not even be covered (or may be understated) by your own benchmarks.
I fully intend to dig into the Solr side, but I was waiting until the API stabilizes.
I pointed at specific examples that reside entirely in Lucene code (the sorting examples).
My point is that running synthetic benchmarks and misrepresenting
them as "meaningful" is borderline reckless, and certainly nowhere
near as helpful as, say, improving our default codec, profiling and
removing slow spots, removing extra legacy wrappers, etc. Those are
more positive ways to move our project forward.
Perhaps you feel you have put in a lot of effort here, but from where
I stand I see lots of complaining about how things got slower and little
effort to actually improve the sources. This issue alone was a
tremendous amount of slogging for me, and I had to switch Solr over
without fully understanding its sources: you or other Solr experts
could have stepped in to help me then.
But why not do that now? I.e. review my Solr changes or function
queries, etc.? I could easily have done something silly: it was just
a "rote" cutover to the iterator API.
I think we could nicely optimize the browse only case, by just using
nextDoc to step through all doc values for a given field. Does Solr
do that today?
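The browse-only pattern described above can be sketched in a few lines. Note this is a simplified illustration, not Lucene's actual API: the `NumericValuesIterator` interface below is a hypothetical stand-in modeled on the nextDoc-style iterator contract, backed here by plain arrays so the sketch is runnable on its own.

```java
// Sketch of the "browse" pattern: step through every document that has a
// value via nextDoc(), instead of doing a random-access lookup per docID.
public class BrowseSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    // Hypothetical stand-in for an iterator-style doc-values source.
    interface NumericValuesIterator {
        int nextDoc();    // advance to the next doc that has a value
        long longValue(); // value for the current doc
    }

    // Back a toy iterator with parallel arrays of (docID, value) pairs.
    static NumericValuesIterator fromArrays(int[] docs, long[] values) {
        return new NumericValuesIterator() {
            int pos = -1;
            public int nextDoc() {
                pos++;
                return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
            }
            public long longValue() { return values[pos]; }
        };
    }

    // The browse loop: visit every (doc, value) pair in doc order.
    static long sumAllValues(NumericValuesIterator it) {
        long sum = 0;
        for (int doc = it.nextDoc(); doc != NO_MORE_DOCS; doc = it.nextDoc()) {
            sum += it.longValue();
        }
        return sum;
    }

    public static void main(String[] args) {
        NumericValuesIterator it =
            fromArrays(new int[] {0, 3, 7}, new long[] {10, 20, 30});
        System.out.println(sumAllValues(it)); // prints 60
    }
}
```

The point of the pattern is that the consumer never asks "what is the value for doc N?"; it only ever moves forward, which is exactly what a browse/export pass needs.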
Why not test the patch on
LUCENE-7462 to see if that API change helps?
I am not disagreeing that DV access got slower: the Lucene nightly
benchmarks also show that.
Yet look at sort-by-title: at first, on the initial cutover to
iterators, it got slower, but then, thanks to Adrien Grand (thank you
Adrien!), it's now faster than it was before.
With more iterations I expect we can do the same thing for the other
dense cases. An iteration-only API means we can make all sorts of nice
compression improvements that aren't possible with the random-access
API, we don't need per-lookup bounds checks, etc. We should adopt many
of the techniques we already use to compress postings, which have been
iterator-only forever. And it means the sparse case, as a happy side
effect, gets to improve too.
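As a concrete (and purely illustrative) example of the kind of compression a forward-only API unlocks, here is a delta-encoding sketch: values are stored as differences from the previous value and decoded incrementally during the scan, with no per-lookup bounds check. The class names are hypothetical, not Lucene's; this is the general postings-style idea, not the actual codec.

```java
// Sketch: delta-encoded values that only make sense behind an iterator,
// because random access would require decoding from the start each time.
public class DeltaIterationSketch {
    // Encode each value as the difference from the previous one.
    static long[] deltaEncode(long[] values) {
        long[] deltas = new long[values.length];
        long prev = 0;
        for (int i = 0; i < values.length; i++) {
            deltas[i] = values[i] - prev;
            prev = values[i];
        }
        return deltas;
    }

    // Forward-only decoder: each next() adds the stored delta to a
    // running accumulator, so decoding is a single add per value.
    static class DeltaIterator {
        private final long[] deltas;
        private int pos = 0;
        private long current = 0;
        DeltaIterator(long[] deltas) { this.deltas = deltas; }
        boolean hasNext() { return pos < deltas.length; }
        long next() { current += deltas[pos++]; return current; }
    }

    public static void main(String[] args) {
        long[] values = {100, 105, 105, 230};
        DeltaIterator it = new DeltaIterator(deltaEncode(values));
        StringBuilder out = new StringBuilder();
        while (it.hasNext()) out.append(it.next()).append(' ');
        System.out.println(out.toString().trim()); // prints: 100 105 105 230
    }
}
```

Deltas between nearby values are typically much smaller than the values themselves, so a real codec can then bit-pack them; that packing is exactly what a random-access API would make awkward.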
This could lead to a point in the future where the dense cases perform
better than they did with the random-access API, like sort-by-title
already does. We've only just begun down this path, and in just a few
weeks Adrien Grand has already made big gains.