[SOLR-15836] Address counterintuitive behavior of JSON "terms" subfacet refinement - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 9.0, 8.11
Fix Version/s: None
Component/s: Facet Module
Labels: None

Description

In distributed faceting, uneven distribution of terms across different shards can artificially include or exclude terms (this discussion will focus on JSON Facet "terms" faceting).

This is inevitable, and can be mitigated via the overrequest and overrefine parameters – respectively casting a "wider net" for "phase#1" (determining the set of "terms of interest") and "phase#2" (cross-checking "terms of interest" against shards that did not initially report them).
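For reference, a minimal sketch of where these parameters sit in a JSON Facet request, written as a Python dict that serializes to the request body. The field names (`cat_s`, `brand_s`), the facet labels, and all numeric values are placeholders chosen for illustration; only `refine`, `overrequest`, `overrefine`, and `limit` are meant to correspond to the "terms" facet options discussed here.

```python
import json

# Hypothetical field names and values, for illustration only.
facet_request = {
    "query": "*:*",
    "facet": {
        "categories": {
            "type": "terms",
            "field": "cat_s",
            "limit": 5,
            "refine": True,     # enable phase#2 refinement
            "overrequest": 20,  # wider net in phase#1
            "overrefine": 10,   # wider net when choosing terms to refine
            # Nested (sub)facet, the case this issue is concerned with.
            "facet": {
                "brands": {
                    "type": "terms",
                    "field": "brand_s",
                    "limit": 3,
                    "refine": True,
                }
            },
        }
    },
}

# Request body as it would be sent to a collection's query endpoint.
body = json.dumps(facet_request, indent=2)
```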

It is possible to devise artificial situations that push the limit of what overrefine is capable of mitigating, resulting in counterintuitive behavior. But despite such edge cases, in general it is relatively straightforward to reason about how the simple JSON Facet refinement method works for "flat" (i.e., non-hierarchical) terms facets.

This issue discusses some ways in which subfacets (hierarchical or nested facets) can more readily behave counterintuitively in practical usage, and possible ways to address/mitigate such behavior.

---------------------

AFAICT, the simple (default, currently the only) refinement method has two defining requirements:

1. there is at most one refinement request issued to each shard, and
2. any buckets returned are guaranteed to have accurate counts (or perhaps more generally, stats?) reflecting contributions from all shards. (This makes no guarantees about buckets not returned that would in principle be eligible to be returned.)
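To make the flat (non-hierarchical) case concrete, here is a toy model of two-phase refinement under those two requirements. It is a simplified sketch, not Solr's implementation: shards are plain dicts of term counts, and `top_terms` and `simple_refine` are hypothetical helper names.

```python
from collections import Counter

def top_terms(counts, n):
    """Return the n highest-count (term, count) pairs, ties broken by term."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]

def simple_refine(shards, limit, overrequest=0, overrefine=0):
    # Phase 1: each shard reports only its own top terms.
    reported = [dict(top_terms(s, limit + overrequest)) for s in shards]
    merged = Counter()
    for r in reported:
        merged.update(r)
    # Terms of interest, judged by the (possibly partial) phase-1 counts.
    candidates = [t for t, _ in top_terms(merged, limit + overrefine)]
    # Phase 2: a single refinement request per shard, covering every
    # candidate that the shard did not report in phase 1.
    refined = Counter()
    for shard, r in zip(shards, reported):
        for term in candidates:
            refined[term] += r[term] if term in r else shard.get(term, 0)
    # Every returned bucket now has contributions from all shards.
    return top_terms(refined, limit)

shards = [
    {"x": 10, "y": 9, "z": 8},
    {"w": 10, "v": 9, "z": 8},
]
no_overrequest = simple_refine(shards, limit=1)
with_overrequest = simple_refine(shards, limit=1, overrequest=2)
```

In this example "z" has the highest global count (16), but no shard ranks it first, so without overrequest it never becomes a term of interest and the result is `[("w", 10)]`: an accurate count for a bucket that is not the true top term. With `overrequest=2` the result is `[("z", 16)]`.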

The simplest counterintuitive case is when refinement of higher-level facets uncovers more subfacets on shards that have no opportunity to influence results/refinement of the child facet. I'm pretty sure it's this situation that's described in this comment (by hossman?):

    //   - or at the very least, if the purpose of "_l" is to give other buckets a chance to "bubble up"
    //     in phase#2, then shouldn't a "_l" refinement requests still include the buckets choosen in
    //     phase#1, and request that the shard fill them in in addition to returning its own top buckets?

The proposal in the comment quoted above would work if and only if the "own top buckets" returned in phase#2 did not introduce any new/unseen values (and note that returning "own top buckets" only matters in exactly the case where it does introduce new/unseen values). If new values were returned in phase#2, the only way to respect the second requirement (accurate counts for every returned bucket) would be to violate the first (at most one refinement request per shard), i.e., to issue another refinement request to determine whether any other shards have anything to contribute to the previously unseen value.
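The conflict can be illustrated with a small sketch. This is a toy model rather than Solr code: `followup_requests` is a hypothetical helper, and the inputs are simply the child terms each shard reported under one parent bucket in each phase. It computes, for every child term first seen in phase#2, which shards have not yet had a chance to contribute to it.

```python
def followup_requests(all_shards, phase1_terms, phase2_terms):
    """Map each child term first reported in phase 2 to the set of shards
    that would need an additional refinement request for it."""
    # Child terms that every shard could already have been asked about.
    seen = set().union(*phase1_terms.values())
    followups = {}
    for shard, terms in phase2_terms.items():
        for term in terms:
            if term not in seen:
                # Every shard except the ones reporting it is still unchecked.
                followups.setdefault(term, set(all_shards)).discard(shard)
    return followups

all_shards = ["shard1", "shard2"]
# Child terms reported under the same parent bucket, per shard and phase.
phase1_terms = {"shard1": ["c1", "c2"]}
phase2_terms = {"shard2": ["c1", "c3"]}

pending = followup_requests(all_shards, phase1_terms, phase2_terms)
```

Here `pending` is `{"c3": {"shard1"}}`: "c3" was first reported by shard2 in phase#2, so an accurate count would require going back to shard1 a second time. If shard2 had returned only "c1", `pending` would be empty and no extra request would be needed.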

This counterintuitive behavior can't exactly be called a "bug", because IIUC the intuitive behavior is fundamentally incompatible with the current default/only simple refinement method.

Issue Links

fixes

SOLR-12556 JSON Field Facet refinement can return incorrect counts/stats for sorted buckets -- when using processEmpty

Open

links to

GitHub Pull Request #448

People

Assignee: Michael Gibney

Reporter: Michael Gibney

Votes: 0

Watchers: 3

Dates

Created: 07/Dec/21 03:00

Updated: 23/Feb/24 00:00

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 0.5h