[IMPALA-9942] DataSketches HLL shouldn't take empty strings as distinct values - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: Impala 4.0.0
Fix Version/s: None
Component/s: Backend
Labels:
- newbie
- ramp-up

Description

Consider a table with string, char and varchar columns in which some of the values are empty strings.

select * from strings;
+-----+------------+-----+
| s   | c          | v   |
+-----+------------+-----+
|     |            |     |
| abc | abc        | abc |
|     |            |     |
+-----+------------+-----+

When I estimate the number of distinct values with DataSketches HLL, the empty string counts as a distinct value and adds 1 to the result.

select ds_hll_estimate(ds_hll_sketch(s)) as hll_string, ds_hll_estimate(ds_hll_sketch(c)) as hll_char, ds_hll_estimate(ds_hll_sketch(v)) as hll_varchar from strings;
+------------+----------+-------------+
| hll_string | hll_char | hll_varchar |
+------------+----------+-------------+
| 2          | 2        | 2           |
+------------+----------+-------------+

However, Hive's implementation omits empty strings, so for the example above Hive would return 1 for each column.

I assume Hive omits empty strings because of this line:
https://github.com/apache/incubator-datasketches-java/blob/master/src/main/java/org/apache/datasketches/hll/BaseHllSketch.java#L351
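
To make the difference concrete, here is a minimal Python sketch of the two behaviours. It is not the DataSketches or Impala code: an exact set stands in for the HLL sketch, and the function name `estimate_distinct` and its `skip_empty` flag are hypothetical names made up for this illustration. Skipping NULLs in both variants is also an assumption of the sketch.

```python
def estimate_distinct(values, skip_empty):
    """Exact distinct count, used here as a stand-in for an HLL estimate."""
    seen = set()
    for value in values:
        # Assumption for this sketch: NULLs never reach the sketch.
        if value is None:
            continue
        # The guard under discussion: ignore empty strings when enabled.
        if skip_empty and value == "":
            continue
        seen.add(value)
    return len(seen)


# Column "s" from the example table: two empty strings and "abc".
column_s = ["", "abc", ""]

counts_empty = estimate_distinct(column_s, skip_empty=False)
skips_empty = estimate_distinct(column_s, skip_empty=True)
print(counts_empty, skips_empty)
```

With `skip_empty=False` the empty string is counted as its own value, which matches the Impala result above; with `skip_empty=True` only `abc` is counted, which matches the result expected from Hive.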

The first step of this task is to decide which approach is correct. The second step is to adjust Impala accordingly, if the decision calls for it.

Btw, in Impala this function adds strings to the HLL sketches:
https://github.com/apache/impala/commit/7e456dfa9d932bcdb317ad6477abc3c399abacf2#diff-cb22c62db38ee853b857c3b2302244dfR1661

Issue Links

Is contained by

IMPALA-9593 Implement count(distinct) function (DataSketches/HLL)

In Progress

People

Assignee: Adam Tamas

Reporter: Gabor Kaszab

Votes: 0

Watchers: 2

Dates

Created: 10/Jul/20 08:33

Updated: 31/Jul/20 11:04

Resolved: 31/Jul/20 11:04