Details
Improvement
Status: Resolved
Major
Resolution: Incomplete
1.6.0, 2.1.0
None
Description
The following SQL query runs much slower in Spark 2.x with whole-stage codegen on than with codegen off.
SELECT
sum(COUNTER_57)
,sum(COUNTER_71)
,sum(COUNTER_3)
,sum(COUNTER_70)
,sum(COUNTER_66)
,sum(COUNTER_75)
,sum(COUNTER_69)
,sum(COUNTER_55)
,sum(COUNTER_63)
,sum(COUNTER_68)
,sum(COUNTER_56)
,sum(COUNTER_37)
,sum(COUNTER_51)
,sum(COUNTER_42)
,sum(COUNTER_43)
,sum(COUNTER_1)
,sum(COUNTER_76)
,sum(COUNTER_54)
,sum(COUNTER_44)
,sum(COUNTER_46)
,DIM_1
,DIM_2
,DIM_3
FROM aggtable group by DIM_1, DIM_2, DIM_3 limit 100;
aggtable has about 35,000,000 rows.
Whole-stage codegen on (spark.sql.codegen.wholeStage = true): 40s
Whole-stage codegen off (spark.sql.codegen.wholeStage = false): 6s
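The comparison above could be reproduced with a small timing helper along these lines. This is only a sketch, not the script used for the numbers in this report: it assumes an existing PySpark `SparkSession` is passed in as `spark`, and the helper names `time_query` and `compare_codegen` are made up for illustration.

```python
import time


def time_query(spark, query, whole_stage):
    """Run `query` with whole-stage codegen on or off and return elapsed seconds."""
    # spark.sql.codegen.wholeStage is a runtime SQL conf, so it can be
    # switched on an existing session before each run.
    spark.conf.set("spark.sql.codegen.wholeStage", str(whole_stage).lower())
    start = time.perf_counter()
    spark.sql(query).collect()  # collect() forces the query to execute
    return time.perf_counter() - start


def compare_codegen(spark, query):
    """Time the same query with codegen on (True) and then off (False)."""
    return {flag: time_query(spark, query, flag) for flag in (True, False)}
```

Calling `compare_codegen(spark, sql_text)` returns a dict keyed by `True`/`False`. A single run per setting is noisy, so repeating each measurement a few times would give a fairer comparison.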
After some analysis, I think this is related to the huge Java method (thousands of lines long) that codegen generates. By default HotSpot does not JIT-compile methods whose bytecode exceeds 8000 bytes, so such a method runs in the interpreter.
If I set -XX:-DontCompileHugeMethods, performance gets much better (about 7s).
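One way the JVM flag might be passed to both the driver and the executors is through Spark's `spark.driver.extraJavaOptions` and `spark.executor.extraJavaOptions` settings. The sketch below only builds the `spark-submit` command line. The helper name `spark_submit_command` and the application file `my_job.py` are hypothetical, and how the flag was actually set for the 7s measurement is not stated in this report.

```python
import shlex

# JVM flag from the report: allow HotSpot to JIT-compile huge methods.
JVM_FLAG = "-XX:-DontCompileHugeMethods"


def spark_submit_command(app, extra_conf=None):
    """Build a spark-submit argument list that passes JVM_FLAG to driver and executors."""
    conf = {
        "spark.driver.extraJavaOptions": JVM_FLAG,
        "spark.executor.extraJavaOptions": JVM_FLAG,
    }
    conf.update(extra_conf or {})  # caller-supplied settings override the defaults
    cmd = ["spark-submit"]
    for key, value in conf.items():
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(app)
    return cmd


# "my_job.py" is a placeholder application name.
print(shlex.join(spark_submit_command("my_job.py")))
```

Setting the flag on the executors matters most here, since that is where the generated aggregation code runs.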