James: sorry, i overlooked your comments until now...
Regarding the hyperloglog approach:
please raise HLL related discussions in the comments of
SOLR-6968 - been making lots of progress there, would love your opinions & help testing, but i don't want to have the conversation fractured in two diff places for other people watching that issue.
Unfortunately, for me, being able to get exact (or very nearly exact) distinct counts is important for the application I'm building (it's an analytics / stats tool so correct numbers are a must have feature)
I understand ... the reason i popped back up in this issue today was to comment that:
1) as I work more with HLL and really think about the internals, i realized that for people who care a lot about absolutely accurate precision, especially in low cardinality situations, i think you are correct – fixing this issue (countdistinct w/o distinctValues) would be a big win – even once HLL support is added. The simple reason being that HLL is fundamentally based on counting unique hashed values, so even if someone tries to tune it to be as accurate as possible, there's still the risk that two distinct values in a users's set will have the same hash value and prevent 100% accurate cardinality results.
2) as i worked on refactoring the tests in
SOLR-6968 it occurred to me that what you're requesting here would actually be fairly easy to implement and support in a backcompat way – in particular because of the changes already made in SOLR-6349 (distinguishing per-shard dependent stats) and the fact that the current param is "calcdistinct" but the stats currently returned are "countDistinct" and "distinctValues"
In a nutshell, what i think we should do is...
- replace calcdistinct(true) value in the Stat enum with distinctvals(true) and countdistinct(false,distinctvals) (so countdistinct indicates that in a distributed setup it requires distinctvals to be returned by each shard)
- remove all the calcDistinct logic from StatsValuesFactory and instead treat countdistinct and distinctvals as independently requested stats that have a dependent distributed computation relationship (just like "mean" vs "sum / count")
- update the existing special syntactic sugar parsing for the top level "calcdistinct" param to also include syntatic sugar parsing for the calcdistinct local param - in both cases they become sugar for countdistinct=true distinctvals=true
the end result being that users who want the countDistinct response info, w/o the list of all distinctVals would just specify countdistinct=true as a localparam (no other new top level params needed)
How does that sound?
these changes should already be fairly trivial to make if you want to take a stab at a patch – but there may be some patch collision in the code with the work i'm doing in
SOLR-6968. There will definitely be patch collisions in the existing test cases, but once SOLR-6968 is commited the test changes should actually be much easier ... and i'm happy to try and tackle this as soon as i'm done with that one (if nothing else, it will help me do a better apples to apples benchmark of HLL vs countDistinct w/o the added distincatVals in the final result)