Description
Let's consider a table that has string, char and varchar columns and some of the values in these columns are empty strings.
select * from strings;
+-----+------------+-----+
| s   | c          | v   |
+-----+------------+-----+
|     |            |     |
| abc | abc        | abc |
|     |            |     |
+-----+------------+-----+
If I query the number of distinct values with DataSketches HLL, the empty string adds 1 to the result:
select ds_hll_estimate(ds_hll_sketch(s)), ds_hll_estimate(ds_hll_sketch(c)), ds_hll_estimate(ds_hll_sketch(v)) from strings;
+------------+----------+-------------+
| hll_string | hll_char | hll_varchar |
+------------+----------+-------------+
| 2          | 2        | 2           |
+------------+----------+-------------+
However, Hive's implementation omits empty strings, so for the example above Hive would return 1 for each column.
I assume it omits empty strings because of this line:
https://github.com/apache/incubator-datasketches-java/blob/master/src/main/java/org/apache/datasketches/hll/BaseHllSketch.java#L351
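To make the difference concrete, here is a minimal sketch of the two behaviors. It is not the Hive or Impala code: an exact set stands in for the HLL sketch, and count_distinct and skip_empty are hypothetical names used only for illustration. It also assumes that the blank cells in the table above are empty strings and that NULLs are ignored by both engines.

```python
def count_distinct(values, skip_empty):
    """Exact distinct count standing in for an HLL estimate.

    skip_empty=True mimics the behavior assumed for Hive above
    (empty strings are never fed into the sketch);
    skip_empty=False mimics the Impala result reported above.
    """
    seen = set()
    for value in values:
        if value is None:
            continue  # assumption: NULLs are ignored by both engines
        if skip_empty and value == "":
            continue  # the guard that drops empty strings
        seen.add(value)
    return len(seen)


# Column s from the example table: two empty strings and one 'abc'.
column_s = ["", "abc", ""]
impala_like = count_distinct(column_s, skip_empty=False)
hive_like = count_distinct(column_s, skip_empty=True)
print(impala_like, hive_like)
```

With the example column, the first call counts the empty string as a distinct value and the second does not, which matches the 2 versus 1 discrepancy described in this ticket.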
The first step of this task is to decide which approach is correct; the second step is to adjust Impala if we decide that Hive's behavior is the one to follow.
Btw, in Impala this function updates the HLL sketch with string values:
https://github.com/apache/impala/commit/7e456dfa9d932bcdb317ad6477abc3c399abacf2#diff-cb22c62db38ee853b857c3b2302244dfR1661
Issue Links
- Is contained by: IMPALA-9593 Implement count(distinct) function (DataSketches/HLL) (In Progress)