I'm seeing a hard-to-reproduce bug that occurs when all of the following conditions hold:
- Distributed search
- group=true with group.field=xxx and group.facet=true
- facet=true with facet.field and facet.range
On a shard request (isShard=true, distrib=false) that has requestPurpose=GET_FIELDS, facet=true is sometimes set and sometimes not. Apparently, the earlier GET_FACETS phase sometimes satisfies the faceting on its own, and sometimes more faceting work is done in GET_FIELDS. If facet=true on such a request and facet.range is set (or perhaps facet.query), the bug is triggered. Specifically, both the facet.range and facet.query logic conditionally call SimpleFacets.getGroupedFacetQueryCount, and both do so when they detect that "group.field" has been set. BUT, for a GET_FIELDS shard request, the "group" parameter flag is explicitly removed from the request by StoredFieldsShardRequestFactory, effectively disabling grouping. So SimpleFacets.getGroupedFacetQueryCount throws an error. The error says that "group.field" hasn't been set, which technically isn't true.
It's 100% reproducible on my customer's system. Reproducing it elsewhere is tricky because it won't happen unless the faceting logic runs on GET_FIELDS, which doesn't seem to happen often. I found that for a test query, if I changed facet.limit to a sufficiently high number then it trips, but if it's low then it doesn't. I presume this has something to do with refining facet counts, such that a higher facet.limit increases the chance that the coordinating node (what do we call that?) will need to ask a shard for more counts beyond what was provided in the initial GET_FACETS phase.
If anyone has pointers on how to reproduce this (such as in a test!), then that would help.
Even though I don't have 100% understanding of the bug and have yet to reproduce it with test data, it seems to me the fix might be as simple as having SimpleFacets.getGroupedFacetQueryCount retrieve the group.field parameter directly from the request parameters instead of possibly failing to find it in rb.getGroupingSpec() (because "group" wasn't set). After all, that is how the callers of this method determine whether or not they want to get a grouped query count.
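To make the suspected mechanism concrete, here is a minimal, self-contained sketch. It does not use any Solr classes: GroupingSpec, buildGroupingSpec, toGetFieldsShardParams and both lookup methods are hypothetical stand-ins, and the error message is made up. It only models the assumption that the grouping spec is absent once the "group" flag has been stripped, while "group.field" is still present in the parameters.

```java
import java.util.HashMap;
import java.util.Map;

public class GroupFieldLookupSketch {

    /** Hypothetical stand-in for a grouping spec (not the Solr class). */
    public static final class GroupingSpec {
        final String[] fields;

        GroupingSpec(String[] fields) {
            this.fields = fields;
        }
    }

    /** Assumption: a grouping spec is only built when the "group" flag is true. */
    public static GroupingSpec buildGroupingSpec(Map<String, String> params) {
        if (!"true".equals(params.get("group"))) {
            return null;
        }
        String field = params.get("group.field");
        return field == null ? null : new GroupingSpec(new String[] {field});
    }

    /** Models the suspected current behaviour: the lookup depends on the grouping spec. */
    public static String groupFieldFromSpec(Map<String, String> params) {
        GroupingSpec spec = buildGroupingSpec(params);
        if (spec == null) {
            // Hypothetical message; the real error text may differ.
            throw new IllegalStateException("group.field not set");
        }
        return spec.fields[0];
    }

    /** Models the proposed behaviour: read group.field straight from the parameters. */
    public static String groupFieldFromParams(Map<String, String> params) {
        String field = params.get("group.field");
        if (field == null) {
            throw new IllegalStateException("group.field not set");
        }
        return field;
    }

    /** Models the GET_FIELDS shard request: same parameters minus the "group" flag. */
    public static Map<String, String> toGetFieldsShardParams(Map<String, String> original) {
        Map<String, String> copy = new HashMap<>(original);
        copy.remove("group");
        return copy;
    }

    public static void main(String[] args) {
        Map<String, String> params = new HashMap<>();
        params.put("group", "true");
        params.put("group.field", "xxx");
        params.put("group.facet", "true");
        params.put("facet", "true");

        Map<String, String> shardParams = toGetFieldsShardParams(params);

        // The parameter-based lookup still finds the field on the shard request.
        System.out.println("from params: " + groupFieldFromParams(shardParams));

        // The spec-based lookup fails even though group.field is present.
        try {
            groupFieldFromSpec(shardParams);
        } catch (IllegalStateException e) {
            System.out.println("from spec failed: " + e.getMessage());
        }
    }
}
```

If the real code behaves like this model, switching the lookup to the parameters would avoid the exception on GET_FIELDS shard requests. Whether the grouped count is then computed correctly with grouping otherwise disabled is something I haven't verified.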