I haven't been following this...
Thank you for the review Yonik!
I took a quick look at the patch, and at first blush it's hard to tell what changes are cleanup and what changes are cut-over.
It's definitely a big refactoring.
in the QueryComponent, why the change to set the GET_SCORES flag based on the sort(s)?
I assume this is because if you sort or groupSort by a Sort that
contains SortField.SCORE, you need a scorer. Not sure why Solr trunk
gets away with not doing this, though... Martijn do you know? Oh, and
likely because the CachingCollector needs to know up front if scores
I'm not a fan of this new style for matching request parameters to enums... solr does a lot of lookups on a typical request, and a switch to this style everywhere could definitely have an impact (the whole upper-casing the request param so we can match it to the enum name).
The .upper() calls (twice per request, only if the request has
group=true) are negligible here (true, other situations might be
different, though we should fix them to lookup/check once).
And, this approach gives us strong checking of the enum fields.
Contrast that with what's on trunk now:
String format = params.get(GroupParams.GROUP_FORMAT);
Grouping.Format defaultFormat = "simple".equals(format) ? Grouping.Format.Simple : Grouping.Format.Grouped;
If you have a typo in "simple", you'll unexpectedly get
I think we should strongly check all of our request params?
"Accuracy" seems a bit mis-named? It seems to imply an accuracy trade-off, but both methods are 100% accurate here, they just do different things to serve different usecases. At least the name doesn't seem to have made it's way into the external API though.
The parameter "group.totalCount" I would expect to return the total count of something, not control the pre/post faceting thing? (or are the comments just wrong?) If it's to return the number of groups, then perhaps the name should be "group.groupCount" as totalCount is unit-less.
I agree. This setting just controls what the "total hit count" should
count – unique groups vs docs.
How about TotalCount.GROUPS and TotalCount.DOCS?
What does "group.docSet" do? The comments don't quite make sense to me, but the param suggests it's sort of like group.totalCount?
I think we should remove this from this patch (see my comment above –
it applies to
in the interest of reducing the number of parameters, we could dump group.cache and have a single group.cacheMB parameter that uses 0 as no cache, -1 as maximum needed (solr uses -1 in this manner in other places too), and other values as literal number of MB (which I'd discourage people from using personally).
+1, that makes sense.
I'm not sure we should default group.cache to true... there's a downside to the memory usage, and it's fragile: things may be working just fine, and the user may add a few more documents to the index and then the limit is hit and it just stops working (but still consumes much memory and extra log warnings per request).
Enabling caching can make a huge improvement in QPS, especially for
queries that are costly to execute but don't match too many docs.
Maybe instead of a fixed MB we could make it a percentage of the
maxDoc? This would make it less fragile.. So you could say you're
willing to cache up to 50% of the total docs in the index.
FYI: there's a nocommit in there misspelled as "No commit"
Martijn can you fix this one...? And in general try to spell it as
"nocommit" This is what our build catches. (And, fear not, you're
not the only person to have trouble spelling nocommit!!).