IMPALA-9942

DataSketches HLL shouldn't take empty strings as distinct values


Details

    • Improvement
    • Status: Closed
    • Major
    • Resolution: Fixed
    • Impala 4.0.0
    • None
    • Backend

    Description

      Consider a table with string, char and varchar columns in which some of the values are empty strings.

      select * from strings;
      +-----+------------+-----+
      | s   | c          | v   |
      +-----+------------+-----+
      |     |            |     |
      | abc | abc        | abc |
      |     |            |     |
      +-----+------------+-----+
      

      If I query the number of distinct values with DataSketches HLL, the empty string is counted as a distinct value and adds 1 to the result.

      select ds_hll_estimate(ds_hll_sketch(s)), ds_hll_estimate(ds_hll_sketch(c)), ds_hll_estimate(ds_hll_sketch(v)) from strings;
      +------------+----------+-------------+
      | hll_string | hll_char | hll_varchar |
      +------------+----------+-------------+
      | 2          | 2        | 2           |
      +------------+----------+-------------+
      

      However, Hive's implementation omits empty strings, so for the example above Hive would return 1 for each column.

      I assume Hive omits empty strings because of this line:
      https://github.com/apache/incubator-datasketches-java/blob/master/src/main/java/org/apache/datasketches/hll/BaseHllSketch.java#L351
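To make the difference concrete, here is a minimal sketch of the two policies. It uses an exact Python set as a stand-in for the HLL sketch, so it is not the DataSketches API and gives exact counts rather than estimates. The function name `distinct_count` and the `skip_empty` flag are hypothetical, introduced only for illustration.

```python
from typing import Iterable, Optional


def distinct_count(values: Iterable[Optional[str]], skip_empty: bool) -> int:
    """Count distinct strings, optionally ignoring empty strings.

    An exact set stands in for the HLL sketch (assumption for illustration).
    """
    seen = set()
    for v in values:
        if v is None:
            # NULLs are skipped under both policies in this sketch.
            continue
        if skip_empty and v == "":
            # Policy that omits empty strings: do not update the sketch.
            continue
        seen.add(v)
    return len(seen)


# Same data as the 's' column of the example table.
rows = ["", "abc", ""]

count_with_empty = distinct_count(rows, skip_empty=False)
count_without_empty = distinct_count(rows, skip_empty=True)
print(count_with_empty, count_without_empty)
```

With `skip_empty=False` the empty string counts as its own distinct value, which matches the result of 2 reported above; with `skip_empty=True` it is ignored, which matches the result of 1 expected from Hive.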

      The first step of this task is to decide which approach is correct. The second step is to adjust Impala, if we decide that Impala's behavior should change.

      Btw, in Impala this function adds string values to the HLL sketches:
      https://github.com/apache/impala/commit/7e456dfa9d932bcdb317ad6477abc3c399abacf2#diff-cb22c62db38ee853b857c3b2302244dfR1661


            People

              tadam Adam Tamas
              gaborkaszab Gabor Kaszab