Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-25125

Spark SQL percentile_approx takes longer than Hive version for large datasets

    XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 2.3.1
    • Fix Version/s: None
    • Component/s: SQL
    • Labels:

      Description

      The percentile_approx function in Spark SQL takes much longer than the previous Hive implementation for large data sets (7B rows grouped into 200k buckets, percentile is on each bucket). Tested with Spark 2.3.1 vs Spark 2.1.0.

      The below code finishes in around 24 minutes on spark 2.1.0, on spark 2.3.1, this does not finish at all in more than 2 hours. Also tried this with different accuracy values 5000,1000,500, the timing does get better with smaller datasets with the new version, but the speed difference is insignificant

       

      Infrastructure used:

      AWS EMR -> Spark 2.1.0

      vs

      AWS EMR  -> Spark 2.3.1

       

      spark-shell --conf spark.driver.memory=12g --conf spark.executor.memory=10g --conf spark.sql.shuffle.partitions=2000 --conf spark.default.parallelism=2000 --num-executors=75 --executor-cores=2

      import org.apache.spark.sql.functions._ 
      import org.apache.spark.sql.types._ 
      
      val df=spark.range(7000000000L).withColumn("some_grouping_id", round(rand()*200000L).cast(LongType)) 
      df.createOrReplaceTempView("tab")   
      val percentile_query = """ select some_grouping_id, percentile_approx(id, array(0,0.25,0.5,0.75,1)) from tab group by some_grouping_id """ 
      
      spark.sql(percentile_query).collect()
      

       

       

       

        Attachments

          Activity

            People

            • Assignee:
              Unassigned
              Reporter:
              myali Mir Ali
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: