SPARK-20479: Performance degradation for large number of hash-aggregated columns


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 2.2.0
    • Fix Version/s: None
    • Component/s: SQL

    Description

      In a comment on SPARK-20184, maropu reported that performance degrades when the number of hash-aggregated columns becomes large with whole-stage codegen enabled.

      ./bin/spark-shell --master local[1] --conf spark.driver.memory=2g --conf spark.sql.shuffle.partitions=1 -v
      
      // Runs the block `count` times and prints the per-iteration and mean elapsed time in seconds.
      def timer(f: => Unit): Unit = {
        val count = 9
        val iters = (0 until count).map { i =>
          val t0 = System.nanoTime()
          f
          val t1 = System.nanoTime()
          val elapsed = (t1 - t0).toDouble
          println(s"#$i: ${elapsed / 1000000000.0}")
          elapsed
        }
        println("Elapsed time: " + ((iters.sum / count) / 1000000000.0) + "s")
      }
      
      val numCols = 80
      val t = s"(SELECT id AS key1, id AS key2, ${((0 until numCols).map(i => s"id AS c$i")).mkString(", ")} FROM range(0, 100000, 1, 1))"
      val sqlStr = s"SELECT key1, key2, ${((0 until numCols).map(i => s"SUM(c$i)")).mkString(", ")} FROM $t GROUP BY key1, key2 LIMIT 100"
      
      // Elapsed time: 2.3084404905555553s
      sql("SET spark.sql.codegen.wholeStage=true")
      timer { sql(sqlStr).collect }
      
      // Elapsed time: 0.527486733s
      sql("SET spark.sql.codegen.wholeStage=false")
      timer { sql(sqlStr).collect }
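
      To see where the time goes when whole-stage codegen is enabled, it can help to dump the generated Java source for the aggregate query. The snippet below is a minimal sketch that reuses the sqlStr defined above and Spark's debug helpers in org.apache.spark.sql.execution.debug; the exact output depends on the Spark version.

      // Print the whole-stage codegen subtrees and their generated code for the query,
      // to inspect how the generated methods grow as numCols increases.
      import org.apache.spark.sql.execution.debug._
      sql("SET spark.sql.codegen.wholeStage=true")
      sql(sqlStr).debugCodegen()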
      

          People

            Assignee: Unassigned
            Reporter: Kazuaki Ishizaki (kiszk)
            Votes: 2
            Watchers: 7
