[SPARK-23791] Sub-optimal generated code for sum aggregating - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Duplicate
Affects Version/s: 2.2.0, 2.3.0
Fix Version/s: None
Component/s: Optimizer, SQL
Labels:
- performance

Flags: Important

Description

With whole-stage codegen enabled, a simple Spark job that performs a sum aggregation over 50 columns appears to run about 4 times slower than the same job with whole-stage codegen disabled.

See the test case below. Note that the UDF is there only to prevent the elimination optimizations that could otherwise be applied to literals.

import org.apache.spark.sql.functions._
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.internal.SQLConf.WHOLESTAGE_CODEGEN_ENABLED

object SPARK_23791 {

  def main(args: Array[String]): Unit = {

    val spark = SparkSession
      .builder()
      .master("local[4]")
      .appName("test")
      .getOrCreate()

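    // Appends `cnt` columns named s"$prefix$idx", each holding the same `value` expression.
    def addConstColumns(prefix: String, cnt: Int, value: Column)(inputDF: DataFrame) =
      (0 until cnt).foldLeft(inputDF)((df, idx) => df.withColumn(s"$prefix$idx", value))

    // Always returns null; used instead of a literal so the columns are not optimized away.
    val dummy = udf(() => Option.empty[Int])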

    def test(cnt: Int = 50, rows: Int = 5000000, grps: Int = 1000): Double = {
      val t0 = System.nanoTime()
      spark.range(rows).toDF()
        .withColumn("grp", col("id").mod(grps))
        .transform(addConstColumns("null_", cnt, dummy()))
        .groupBy("grp")
        .agg(sum("null_0"), (1 until cnt).map(idx => sum(s"null_$idx")): _*)
        .collect()
      val t1 = System.nanoTime()
      (t1 - t0) / 1e9
    }

    // Three rounds; each round times the query with and without whole-stage codegen.
    val timings = for (_ <- 1 to 3) yield {
      spark.sessionState.conf.setConf(WHOLESTAGE_CODEGEN_ENABLED, true)
      val withWholeStage = test()
      spark.sessionState.conf.setConf(WHOLESTAGE_CODEGEN_ENABLED, false)
      val withoutWholeStage = test()
      (withWholeStage, withoutWholeStage)
    }

    timings.foreach(println)

    println("Press enter ...")
    System.in.read()
  }
}
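
To see what whole-stage codegen actually produces for this query, it may help to dump the generated Java source next to the timings. The following is a minimal sketch, not part of the original report: the object `CodegenInspect` and the helpers `wideSum` and `timeSeconds` are hypothetical names, and it assumes the public config key `spark.sql.codegen.wholeStage` and the `debugCodegen()` helper from `org.apache.spark.sql.execution.debug` are available in the Spark version under test.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.execution.debug._
import org.apache.spark.sql.functions._

object CodegenInspect {

  // Hypothetical helper: builds a query of the same shape as the reproduction
  // above (cnt null-valued UDF columns, summed per group).
  def wideSum(spark: SparkSession, cnt: Int, rows: Long, grps: Int): DataFrame = {
    val dummy = udf(() => Option.empty[Int])
    val base = spark.range(rows).withColumn("grp", col("id").mod(grps))
    val nullCols = (0 until cnt).map(idx => dummy().as(s"null_$idx"))
    val wide = base.select((col("grp") +: nullCols): _*)
    val sums = (0 until cnt).map(idx => sum(s"null_$idx"))
    wide.groupBy("grp").agg(sums.head, sums.tail: _*)
  }

  // Hypothetical helper: wall-clock time of `body` in seconds.
  def timeSeconds[A](body: => A): Double = {
    val t0 = System.nanoTime()
    body
    (System.nanoTime() - t0) / 1e9
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[4]")
      .appName("codegen-inspect")
      .getOrCreate()

    // Print the generated Java source for each whole-stage subtree of a small
    // instance of the query, so the size of the aggregation method can be inspected.
    spark.conf.set("spark.sql.codegen.wholeStage", "true")
    wideSum(spark, 50, 1000L, 10).debugCodegen()

    for (enabled <- Seq("true", "false")) {
      spark.conf.set("spark.sql.codegen.wholeStage", enabled)
      wideSum(spark, 50, 5000000L, 1000).collect() // warm-up run, not timed
      val secs = timeSeconds(wideSum(spark, 50, 5000000L, 1000).collect())
      println(s"wholeStage=$enabled: $secs s")
    }

    spark.stop()
  }
}
```

The untimed warm-up run is a design choice of this sketch; the reproduction above instead repeats the whole comparison three times.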

Issue Links

duplicates: SPARK-21870 Split codegen'd aggregation code into small functions for the HotSpot (Resolved)

People

Assignee: Unassigned
Reporter: Valentin Nikotin
Votes: 1
Watchers: 5

Dates

Created: 24/Mar/18 20:14
Updated: 17/May/20 17:58
Resolved: 03/Apr/18 00:14

Time Tracking

Estimated: 24h
Remaining: 24h
Logged: Not Specified