[SPARK-3891] Support Hive Percentile UDAF with array of percentile values


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version: 1.2.0
    • Fix Version: 1.3.0
    • Component: SQL
    • Labels: None
    • Environment: Spark 1.2.0 trunk (ac302052870a650d56f2d3131c27755bb2960ad7) on
      CDH 5.1.0
      Centos 6.5
      8x 2GHz, 24GB RAM

    Description

      Spark PR 2620 adds support for the Hive percentile UDAF.
      However, the Hive percentile and percentile_approx UDAFs also support returning an array of percentile values, using the syntax
      percentile(BIGINT col, array(p1 [, p2]...)) or
      percentile_approx(DOUBLE col, array(p1 [, p2]...) [, B])
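      As a rough illustration of the expected semantics, a minimal sketch of what an exact percentile over a column and an array of fractions computes, assuming linear interpolation between the two nearest order statistics (the method Hive's exact percentile uses; this standalone snippet is illustrative only, not the actual incremental UDAF implementation):

```python
# Sketch of percentile(col, array(p1, p2, ...)) semantics for exact
# percentiles: sort the values, then for each requested fraction p in
# [0, 1] linearly interpolate between the two nearest order statistics.
def percentiles(values, fractions):
    ordered = sorted(values)
    n = len(ordered)
    results = []
    for p in fractions:
        if not 0.0 <= p <= 1.0:
            raise ValueError("percentile fraction must be in [0, 1]")
        pos = p * (n - 1)           # fractional rank into the sorted list
        lo = int(pos)               # lower order statistic
        hi = min(lo + 1, n - 1)     # upper order statistic
        frac = pos - lo             # interpolation weight
        results.append(ordered[lo] * (1 - frac) + ordered[hi] * frac)
    return results
```

      For example, passing the array of fractions 0, 0.25, 0.5, 0.75, 1 returns one interpolated value per fraction, which is the array-returning behavior the failing query below relies on.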

      Such queries fail with the following error:

      0: jdbc:hive2://dev-uuppala.sfohi.philips.com> select name, percentile(turnaroundtime,array(0,0.25,0.5,0.75,1)) from exam group by name;

      Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 25.0 failed 4 times, most recent failure: Lost task 1.3 in stage 25.0 (TID 305, Dev-uuppala.sfohi.philips.com): java.lang.ClassCastException: scala.collection.mutable.ArrayBuffer cannot be cast to [Ljava.lang.Object;
      org.apache.hadoop.hive.serde2.objectinspector.StandardListObjectInspector.getListLength(StandardListObjectInspector.java:83)
      org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorConverters$ListConverter.convert(ObjectInspectorConverters.java:259)
      org.apache.hadoop.hive.ql.udf.generic.GenericUDFUtils$ConversionHelper.convertIfNecessary(GenericUDFUtils.java:349)
      org.apache.hadoop.hive.ql.udf.generic.GenericUDAFBridge$GenericUDAFBridgeEvaluator.iterate(GenericUDAFBridge.java:170)
      org.apache.spark.sql.hive.HiveUdafFunction.update(hiveUdfs.scala:342)
      org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:167)
      org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:151)
      org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:599)
      org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:599)
      org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
      org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
      org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
      org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
      org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
      org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
      org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
      org.apache.spark.scheduler.Task.run(Task.scala:56)
      org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181)
      java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
      java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
      java.lang.Thread.run(Thread.java:745)
      Driver stacktrace: (state=,code=0)

            People

              Assignee: Venkata Gollamudi (gvramana)
              Reporter: Anand Mohan Tumuluri (chinnitv)
              Michael Armbrust
              Votes: 0
              Watchers: 5
