IMPALA-9942

DataSketches HLL shouldn't take empty strings as distinct values

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: Impala 4.0.0
    • Fix Version/s: None
    • Component/s: Backend

      Description

      Consider a table with string, char and varchar columns in which some of the values are empty strings.

      select * from strings;
      +-----+------------+-----+
      | s   | c          | v   |
      +-----+------------+-----+
      |     |            |     |
      | abc | abc        | abc |
      |     |            |     |
      +-----+------------+-----+
      

      If I estimate the number of distinct values with DataSketches HLL, the empty string counts as a distinct value and adds 1 to the result.

      select ds_hll_estimate(ds_hll_sketch(s)) as hll_string, ds_hll_estimate(ds_hll_sketch(c)) as hll_char, ds_hll_estimate(ds_hll_sketch(v)) as hll_varchar from strings;
      +------------+----------+-------------+
      | hll_string | hll_char | hll_varchar |
      +------------+----------+-------------+
      | 2          | 2        | 2           |
      +------------+----------+-------------+
      

      However, Hive's implementation omits empty strings, so for the example above Hive would return 1 for each column.

      I assume Hive omits empty strings because of this line:
      https://github.com/apache/incubator-datasketches-java/blob/master/src/main/java/org/apache/datasketches/hll/BaseHllSketch.java#L351
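
      To make the difference concrete, here is a minimal Python model of the two behaviours. It is only a sketch under assumptions: a plain set stands in for the HLL sketch (so the counts are exact, not estimates), the guard is my reading of the null/empty check at the linked line, and the names update, distinct_count and skip_empty are made up for illustration. None of them are Impala or DataSketches APIs.

```python
def update(seen, datum, skip_empty):
    """Add one string to the stand-in 'sketch' (a plain set, not a real HLL)."""
    # Assumed guard, modeled on the null/empty check at the linked line:
    # nulls are always skipped, empty strings only when skip_empty is set.
    if datum is None or (skip_empty and datum == ""):
        return
    seen.add(datum)


def distinct_count(values, skip_empty):
    """Feed every value through update() and return the exact distinct count."""
    seen = set()
    for value in values:
        update(seen, value, skip_empty)
    return len(seen)


column = ["", "abc", ""]  # the values of column s in the example table

impala_like = distinct_count(column, skip_empty=False)  # empty string is counted
hive_like = distinct_count(column, skip_empty=True)     # empty string is dropped

print(impala_like, hive_like)  # prints "2 1"
```

      With skip_empty=False the model reproduces the 2 that Impala returns above, and with skip_empty=True it reproduces the 1 that Hive would return.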

      The first step of this task is to decide which approach is correct. The second step is to adjust Impala, if we decide that Hive's behaviour is the right one.

      By the way, in Impala this function adds string values to the HLL sketch:
      https://github.com/apache/impala/commit/7e456dfa9d932bcdb317ad6477abc3c399abacf2#diff-cb22c62db38ee853b857c3b2302244dfR1661

              People

              • Assignee:
                tadam Adam Tamas
                Reporter:
                gaborkaszab Gabor Kaszab
              • Votes:
                0
                Watchers:
                2
