really simple straw man implementation using java-hll...
The bulk of the current patch is in test refactoring because all the special case conditionals in StatsComponentTest.testIndividualStatLocalParams were driving me insane.
Currently only cardinality of numeric fields is supported (and even then, only long fields really work "correctly"). Current syntax is...
...but i'm thinking that should change ... there are at least two types of knobs we should support, i'm just not sure which is more important, or if either should be mandatory:
- An indication of whether or not the input is already hashed
  - reading up more on HLL i'm realizing how important it is that the values be hashed (into longs).
  - We should certainly support on the fly hashing, but for people who plan to compute cardinalities a lot, particularly over large sets or strings, we should also have both:
    - an easy way for them to compute those long hashes at index time (simple UpdateProcessor)
    - a stats localparam to indicate that the field they are computing cardinality over is already hashed
- precision / size tuning
  - similar to how we have an optional "tdigestCompression" param we could have an "hllOptions" param for overriding the "log2m" and "regwidth" options
  - or we could require that the value of the "cardinality" param be a value indicating how much the user cares about accuracy vs ram (i.e. a float between 0 and 1 indicating min ram vs max accuracy) and compute log2m+regwidth from that ("false" or negative values could disable it completely, while "true" could be shorthand for some default)
    - this would have the benefit of being something we could continue to support even if a better cardinality algorithm comes along in the future
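
To make the two knobs above a bit more concrete, here is a rough sketch using only the JDK. It is not what the patch does: `CardinalityKnobs`, `hashToLong`, `hllInput` and `optionsFor` are all hypothetical names, the log2m/regwidth bounds are assumed and should be checked against java-hll, the linear mapping from the accuracy float is just one possible choice, and SHA-256 truncated to 8 bytes is only a stand-in for whatever 64-bit hash would really be used.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper for illustration; these names do not exist in Solr or java-hll. */
final class CardinalityKnobs {

  // Assumed bounds for the two HLL options; verify against the java-hll docs.
  static final int MIN_LOG2M = 4;
  static final int MAX_LOG2M = 30;
  static final int MIN_REGWIDTH = 1;
  static final int MAX_REGWIDTH = 8;

  private CardinalityKnobs() {}

  /**
   * On the fly hashing of a raw value into a long.
   * SHA-256 truncated to its first 8 bytes is a stand-in, chosen only
   * because it ships with the JDK; a faster 64-bit hash would be preferable.
   */
  static long hashToLong(String value) {
    try {
      MessageDigest md = MessageDigest.getInstance("SHA-256");
      byte[] digest = md.digest(value.getBytes(StandardCharsets.UTF_8));
      return ByteBuffer.wrap(digest).getLong(); // first 8 bytes, big-endian
    } catch (NoSuchAlgorithmException e) {
      // Every Java platform is required to provide SHA-256.
      throw new IllegalStateException(e);
    }
  }

  /**
   * Knob 1: if the caller says the field is already hashed (e.g. at index
   * time), use the long as-is; otherwise hash the value's string form.
   */
  static long hllInput(Object fieldValue, boolean preHashed) {
    if (preHashed) {
      return ((Number) fieldValue).longValue();
    }
    return hashToLong(String.valueOf(fieldValue));
  }

  /**
   * Knob 2: map an accuracy preference in [0, 1] (0 = min ram, 1 = max
   * accuracy) onto {log2m, regwidth} by linear interpolation.
   * Values outside [0, 1] are rejected here; treating "false" or negative
   * values as "disabled" would be up to the caller.
   */
  static int[] optionsFor(double accuracy) {
    if (Double.isNaN(accuracy) || accuracy < 0.0 || accuracy > 1.0) {
      throw new IllegalArgumentException("accuracy must be in [0, 1]: " + accuracy);
    }
    int log2m = MIN_LOG2M + (int) Math.round(accuracy * (MAX_LOG2M - MIN_LOG2M));
    int regwidth = MIN_REGWIDTH + (int) Math.round(accuracy * (MAX_REGWIDTH - MIN_REGWIDTH));
    return new int[] {log2m, regwidth};
  }
}
```

The point of keeping `optionsFor` behind a single float is the one made in the last bullet: callers never see log2m or regwidth, so the mapping (or the whole algorithm) could change later without changing the request syntax.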
My next steps are to focus on more concrete tests & then refactoring to work with other field types, and think about the knobs/configuration as i go.