[SOLR-12343] JSON Field Facet refinement can return incorrect counts/stats for sorted buckets - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 7.5, 8.0
Component/s: None
Labels: None

Description

The way JSON Facet's simple refinement "re-sorts" buckets after refinement can cause refined buckets to be "bumped out" of the topN based on their refined counts/stats, depending on the sort. This lets unrefined buckets that were originally discounted in phase#2 bubble up into the topN and be returned to clients with inaccurate counts/stats.

The simplest way to demonstrate this bug (in some data sets) is with a sort: 'count asc' facet:

Assume shard1 returns termX & termY in phase#1 because they have very low shard1 counts
- but they are not returned at all by shard2, because both terms have very high shard2 counts.
Assume termX has a slightly lower shard1 count than termY, such that:
- termX "makes the cut" for the limit=N topN buckets
- termY does not make the cut, and is the "N+1" known bucket at the end of phase#1
termX then gets included in the phase#2 refinement request against shard2
- termX now has a much higher known total count than termY
- the coordinator now sorts termX "worse" in the sorted list of buckets than termY
- which causes termY to bubble up into the topN
termY is ultimately included in the final result with incomplete count/stat/sub-facet data instead of termX
- this is all independent of the possibility that termY may actually have a significantly higher total count than termX across the entire collection
- the key problem is that all/most of the other terms returned to the client have counts/stats that are accumulated across all shards, but termY only has the contributions from shard1
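
The scenario above can be reproduced with a toy model. The following is only a sketch in Python, not Solr code: the shard data, the term names termA through termD, the LIMIT and OVERREQUEST values, and the helper functions are all assumptions made for illustration, and the merge logic is a simplified stand-in for what the coordinator does.

```python
# Toy two-phase facet merge with sort: 'count asc'. All data is hypothetical.
LIMIT = 3
OVERREQUEST = 1

# Full per-shard counts; each shard only reveals its lowest-count terms in phase#1.
SHARDS = {
    "shard1": {"termA": 1, "termB": 1, "termX": 2, "termY": 3, "termC": 50, "termD": 60},
    "shard2": {"termA": 1, "termB": 1, "termC": 5, "termD": 6, "termX": 90, "termY": 95},
}


def count_asc(counts):
    """Order terms by ascending count, breaking ties by term name."""
    return sorted(counts, key=lambda term: (counts[term], term))


def phase1(shard_counts):
    """Return the shard's best LIMIT + OVERREQUEST buckets for 'count asc'."""
    keep = count_asc(shard_counts)[:LIMIT + OVERREQUEST]
    return {term: shard_counts[term] for term in keep}


def true_total(term):
    """Count of a term across the whole (toy) collection."""
    return sum(counts.get(term, 0) for counts in SHARDS.values())


def simulate():
    # Phase#1: merge whatever each shard chose to return.
    responses = {name: phase1(counts) for name, counts in SHARDS.items()}
    merged = {}
    for resp in responses.values():
        for term, count in resp.items():
            merged[term] = merged.get(term, 0) + count

    # Only the current topN are refined against shards that did not report them.
    candidates = count_asc(merged)[:LIMIT]
    for term in candidates:
        for name, resp in responses.items():
            if term not in resp:
                merged[term] += SHARDS[name].get(term, 0)

    # Naive "re-sort" over ALL known buckets, refined or not (the buggy step).
    final = count_asc(merged)[:LIMIT]
    return candidates, [(term, merged[term]) for term in final]


candidates, final = simulate()
print("refined in phase#2:", candidates)
print("returned to client:", final)
print("true total for termY:", true_total("termY"))
```

With this data, termX is refined and drops out of the topN, while termY is returned with only its shard1 count of 3 even though its total across both shards is 98.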

Important Notes:

This scenario can happen regardless of the amount of overrequest used. Additional overrequest just increases the number of "extra" terms needed in the index with "better" sort values than termX & termY in shard2
sort: 'count asc' is not just an exceptional/pathological case:
- any function sort where additional data provided by shards during refinement can cause a bucket to "sort worse" can also cause this problem.
- Examples: sum(price_i) asc , min(price_i) desc , avg(price_i) asc|desc , etc...
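
To make the "sort worse" condition concrete, the sketch below compares a bucket's sort value before and after a second shard contributes data. It is an illustration only: the price_i values and the sorts_worse helper are hypothetical and not part of Solr.

```python
def sorts_worse(before, after, direction):
    """True if the new value moves the bucket further from the front of the list."""
    return after > before if direction == "asc" else after < before


def avg(values):
    return sum(values) / len(values)


# Hypothetical price_i values for one bucket, split across two shards.
shard1_prices = [5, 7]
shard2_high = [1, 100]   # refinement data that raises sum and avg, lowers min
shard2_low = [1, 2]      # refinement data that lowers avg

with_high = shard1_prices + shard2_high
with_low = shard1_prices + shard2_low

checks = {
    "sum(price_i) asc": sorts_worse(sum(shard1_prices), sum(with_high), "asc"),
    "min(price_i) desc": sorts_worse(min(shard1_prices), min(with_high), "desc"),
    "avg(price_i) asc": sorts_worse(avg(shard1_prices), avg(with_high), "asc"),
    "avg(price_i) desc": sorts_worse(avg(shard1_prices), avg(with_low), "desc"),
    # Contrast: a count can only grow during refinement, so 'count desc' never sorts worse.
    "count desc": sorts_worse(len(shard1_prices), len(with_high), "desc"),
}

for sort, worse in checks.items():
    print(f"{sort}: {'can sort worse' if worse else 'does not sort worse here'}")
```

Each of the function sorts listed above reports that the bucket can sort worse after refinement, while 'count desc' does not.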

Attachments

__incomplete_processEmpty_microfix.patch (11/Jul/18 18:38, 14 kB, Chris M. Hostetter)
SOLR-12343.patch (16/Jul/18 21:54, 50 kB, Chris M. Hostetter)
SOLR-12343.patch (16/Jul/18 18:45, 49 kB, Chris M. Hostetter)
SOLR-12343.patch (11/Jul/18 21:37, 47 kB, Chris M. Hostetter)
SOLR-12343.patch (09/Jul/18 18:11, 37 kB, Chris M. Hostetter)
SOLR-12343.patch (06/Jul/18 19:02, 33 kB, Chris M. Hostetter)
SOLR-12343.patch (05/Jul/18 17:33, 30 kB, Yonik Seeley)
SOLR-12343.patch (19/Jun/18 00:53, 30 kB, Chris M. Hostetter)
SOLR-12343.patch (23/May/18 18:48, 19 kB, Chris M. Hostetter)
SOLR-12343.patch (10/May/18 21:36, 5 kB, Chris M. Hostetter)

Issue Links

is related to

SOLR-12556 JSON Field Facet refinement can return incorrect counts/stats for sorted buckets -- when using processEmpty (Open)

SOLR-12516 JSON "range" facets can incorrectly refine subfacets for buckets (Closed)

SOLR-11733 add an option make json.facet refinement more "optimistic" like facet.field/facet.pivot so that long tail have a change to bubble up (Open)

People

Assignee: Yonik Seeley

Reporter: Chris M. Hostetter

Votes: 0

Watchers: 4

Dates

Created: 10/May/18 21:35

Updated: 08/Jun/19 15:13

Resolved: 19/Jul/18 17:31