Spark / SPARK-20184

Performance regression for complex/long SQL when whole-stage codegen is enabled

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 1.6.0, 2.1.0
    • Fix Version/s: None
    • Component/s: SQL

    Description

      In Spark 2.x, the following SQL runs much slower with whole-stage codegen enabled than with it disabled.

      SELECT
      sum(COUNTER_57)
      ,sum(COUNTER_71)
      ,sum(COUNTER_3)
      ,sum(COUNTER_70)
      ,sum(COUNTER_66)
      ,sum(COUNTER_75)
      ,sum(COUNTER_69)
      ,sum(COUNTER_55)
      ,sum(COUNTER_63)
      ,sum(COUNTER_68)
      ,sum(COUNTER_56)
      ,sum(COUNTER_37)
      ,sum(COUNTER_51)
      ,sum(COUNTER_42)
      ,sum(COUNTER_43)
      ,sum(COUNTER_1)
      ,sum(COUNTER_76)
      ,sum(COUNTER_54)
      ,sum(COUNTER_44)
      ,sum(COUNTER_46)
      ,DIM_1
      ,DIM_2
      ,DIM_3
      FROM aggtable group by DIM_1, DIM_2, DIM_3 limit 100;

      aggtable has about 35,000,000 rows.

      Whole-stage codegen on (spark.sql.codegen.wholeStage = true): 40s
      Whole-stage codegen off (spark.sql.codegen.wholeStage = false): 6s

      After some analysis, I think this is related to the huge Java method (on the order of a thousand lines) that codegen generates.
      If I set the JVM option -XX:-DontCompileHugeMethods, which lets the JIT compile huge methods, performance improves considerably (about 7s).
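
For anyone trying to reproduce the comparison, here is a minimal sketch of toggling the setting per session. It assumes a spark-sql (or Thrift server) session in which aggtable is already registered. The shortened query is only an illustration, not the full statement from the report, and the timings in the comments are the figures reported above, not new measurements. The JVM flag cannot be changed this way; it has to be passed when the JVMs are launched, for example through spark.driver.extraJavaOptions and spark.executor.extraJavaOptions.

```sql
-- Assumption: aggtable is visible in the current session.

-- Run with whole-stage codegen enabled (reported: about 40s for the full query).
SET spark.sql.codegen.wholeStage=true;
SELECT sum(COUNTER_57), sum(COUNTER_71), DIM_1
FROM aggtable GROUP BY DIM_1 LIMIT 100;

-- Optionally print the generated Java source to see how large the method is.
EXPLAIN CODEGEN
SELECT sum(COUNTER_57), sum(COUNTER_71), DIM_1
FROM aggtable GROUP BY DIM_1 LIMIT 100;

-- Rerun with whole-stage codegen disabled (reported: about 6s for the full query).
SET spark.sql.codegen.wholeStage=false;
SELECT sum(COUNTER_57), sum(COUNTER_71), DIM_1
FROM aggtable GROUP BY DIM_1 LIMIT 100;
```

Both runs should use the same data and a warm cache, otherwise the difference may reflect I/O rather than the generated code.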

          People

            Assignee: Unassigned
            Reporter: Fei Wang (scwf)
            Votes: 0
            Watchers: 10
