Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-15872

The PERCENTILE_APPROX UDAF does not work with empty set

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.1.2, 2.2.0
    • Component/s: UDF
    • Labels:
      None
    • Flags:
      Patch

      Description

      1. Original SQL:

      select
          percentile_approx(
              column0,
              array(0.50, 0.70, 0.90, 0.95, 0.99)
          )
      from
          my_table
      where
          date = '20170207'
          and column1 = 'value1'
          and column2 = 'value2'
          and column3 = 'value3'
          and column4 = 'value4'
          and column5 = 'value5'
      

      2. Exception StackTrace:

      Error: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{},"value":{"_col0":[0.0,10000.0]}} at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:256) at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:453) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:401) at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671) at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158) Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{},"value":{"_col0":[0.0,10000.0]}} at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:244) ... 7 more Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.IndexOutOfBoundsException: Index: 2, Size: 2 at org.apache.hadoop.hive.ql.exec.GroupByOperator.process(GroupByOperator.java:766) at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:235) ... 7 more Caused by: java.lang.IndexOutOfBoundsException: Index: 2, Size: 2 at java.util.ArrayList.rangeCheck(ArrayList.java:653) at java.util.ArrayList.get(ArrayList.java:429) at org.apache.hadoop.hive.ql.udf.generic.NumericHistogram.merge(NumericHistogram.java:134) at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFPercentileApprox$GenericUDAFPercentileApproxEvaluator.merge(GenericUDAFPercentileApprox.java:318) at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator.aggregate(GenericUDAFEvaluator.java:188) at org.apache.hadoop.hive.ql.exec.GroupByOperator.updateAggregations(GroupByOperator.java:612) at org.apache.hadoop.hive.ql.exec.GroupByOperator.processAggr(GroupByOperator.java:851) at org.apache.hadoop.hive.ql.exec.GroupByOperator.processKey(GroupByOperator.java:695) at org.apache.hadoop.hive.ql.exec.GroupByOperator.process(GroupByOperator.java:761) ... 8 more
      

      3. review data:

      select
          column0
      from
          my_table
      where
          date = '20170207'
          and column1 = 'value1'
          and column2 = 'value2'
          and column3 = 'value3'
          and column4 = 'value4'
          and column5 = 'value5'
      

      After run this sql, we found the result is NULL.

      4. what's the meaning of [0.0, 10000.0] in stacktrace?

      In GenericUDAFPercentileApproxEvaluator, the method `merge` should process an ArrayList which name is partialHistogram. Normally, the basic structure of partialHistogram is [npercentiles, percentile0, percentile1..., nbins, bin0.x, bin0.y, bin1.x, bin1.y,...]. However, if we process NULL(empty set) column values, the partialHistoram will only contains [npercentiles(0), nbins(10000)]. That's the reason why the stacktrace shows a strange row data: {"key":{},"value":{"_col0":[0.0,10000.0]}}

      Before we call histogram#merge (on-line hisgoram algorithm from paper: http://www.jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf ), the partialHistogram should remove elements which store percentiles like `partialHistogram.subList(0, nquantiles+1).clear();`. In the case of empty set, GenericUDAFPercentileApproxEvaluator will not remove percentiles. Consequently, NumericHistogram will merge a list which contains only 2 elements([0, 10000.0]) and throws IndexOutOfBoundsException.

        Attachments

        1. HIVE-15872.2.patch
          1 kB
          Chaozhong Yang
        2. HIVE-15872.3.patch
          2 kB
          Chaozhong Yang
        3. HIVE-15872.4.patch
          2 kB
          Chaozhong Yang
        4. HIVE-15872.patch
          0.7 kB
          Chaozhong Yang

          Activity

            People

            • Assignee:
              debugger87 Chaozhong Yang
              Reporter:
              debugger87 Chaozhong Yang
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: