[SOLR-14167] Exact unique counts when shards contain disjoint values - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: Facet Module
Labels: None

Description

Currently, for fields with high cardinality, the Facet module offers two implementations (unique, hll) that give approximate results. There is one corner case in which a distributed search against a high-cardinality field should still be able to provide an exact result efficiently: when the shards are known to contain disjoint values, i.e. there may be duplicates within a shard, but no value exists on more than one shard. In that case the exact overall count is simply the sum of the per-shard unique counts.
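To illustrate the idea, here is a minimal standalone sketch, not Solr code and not the attached implementation. The class and method names (`DisjointUniqueSketch`, `shardUnique`, `mergeDisjoint`, `exactByUnion`) are hypothetical. It assumes each shard can compute its own exact unique count and that the coordinator only needs to add those counts together.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DisjointUniqueSketch {

    /** Exact distinct count of the values held by a single shard (duplicates collapse). */
    static long shardUnique(List<String> shardValues) {
        return new HashSet<>(shardValues).size();
    }

    /**
     * Coordinator-side merge under the disjointness assumption:
     * only one number per shard is needed, and the counts are summed.
     */
    static long mergeDisjoint(List<List<String>> shards) {
        long total = 0;
        for (List<String> shard : shards) {
            total += shardUnique(shard);
        }
        return total;
    }

    /** Reference answer: union every value from every shard (expensive at high cardinality). */
    static long exactByUnion(List<List<String>> shards) {
        Set<String> all = new HashSet<>();
        for (List<String> shard : shards) {
            all.addAll(shard);
        }
        return all.size();
    }

    public static void main(String[] args) {
        // Duplicates within a shard, but no value appears on more than one shard.
        List<List<String>> disjoint = Arrays.asList(
                Arrays.asList("a", "a", "b"),
                Arrays.asList("c", "d", "d", "d"),
                Arrays.asList("e"));
        System.out.println(mergeDisjoint(disjoint)); // 5
        System.out.println(exactByUnion(disjoint));  // 5

        // If the assumption is violated ("b" is on two shards), the sum over-counts.
        List<List<String>> overlapping = Arrays.asList(
                Arrays.asList("a", "b"),
                Arrays.asList("b", "c"));
        System.out.println(mergeDisjoint(overlapping)); // 4
        System.out.println(exactByUnion(overlapping));  // 3
    }
}
```

The second case shows why this would only be safe as an opt-in: the sum is exact only when the caller can guarantee that no value spans shards.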

That happens to be the case in the collection I have, but this feels to me like a very niche use case. Is this functionality too niche for inclusion into the Facet module?

I attach a naive (untested) example implementation. It could be made slightly more efficient if SlotAcc implementations that didn't populate the first 100 values were used (or if this behaviour was made configurable, perhaps via the FacetContext?).

Slightly off topic, but the documentation currently says of unique: "Beyond 100 values it yields not exact estimate". My understanding is that this is only true for distributed faceting, and that the result is exact in the non-distributed case.

UniqueAgg calculates sumUnique, but does not appear to actually use it.

Attachments

UniqueSumPerShard.java
05/Jan/20 18:09
2 kB
Daniel Lowe

Issue Links

is superseded by

SOLR-14518 Add support for partitioned unique agg to JSON facets

Open

Activity

People

Assignee: Unassigned

Reporter: Daniel Lowe

Votes: 0

Watchers: 1

Dates

Created: 05/Jan/20 18:33

Updated: 04/Jun/20 02:38