Description
When a SortMergeJoin is followed by a Project with aliases, the outputOrdering is not propogated properly and in some cases, it leads to unrequired Sort operation:
spark.range(10).repartition($"id").createTempView("t1") spark.range(20).repartition($"id").createTempView("t2") spark.range(30).repartition($"id").createTempView("t3") val planned = sql( """ |SELECT t2id, t3.id as t3id |FROM ( | SELECT t1.id as t1id, t2.id as t2id | FROM t1, t2 | WHERE t1.id = t2.id |) t12, t3 |WHERE t1id = t3.id """.stripMargin).queryExecution.executedPlan *(8) Project [t2id#1059L, id#1004L AS t3id#1060L] +- *(8) SortMergeJoin [t2id#1059L], [id#1004L], Inner :- *(5) Sort [t2id#1059L ASC NULLS FIRST ], false, 0 <----------------------- : +- *(5) Project [id#1000L AS t2id#1059L] : +- *(5) SortMergeJoin [id#996L], [id#1000L], Inner : :- *(2) Sort [id#996L ASC NULLS FIRST ], false, 0 : : +- Exchange hashpartitioning(id#996L, 5), true, [id=#1426] : : +- *(1) Range (0, 10, step=1, splits=2) : +- *(4) Sort [id#1000L ASC NULLS FIRST ], false, 0 : +- Exchange hashpartitioning(id#1000L, 5), true, [id=#1432] : +- *(3) Range (0, 20, step=1, splits=2) +- *(7) Sort [id#1004L ASC NULLS FIRST ], false, 0 +- Exchange hashpartitioning(id#1004L, 5), true, [id=#1443] +- *(6) Range (0, 30, step=1, splits=2)
The above marked Sort node could have been avoided.
Attachments
Issue Links
- relates to
-
SPARK-33399 Normalize output partitioning and sortorder with respect to aliases to avoid unneeded exchange/sort nodes
- Resolved
- links to