[SPARK-33400] Normalize sameOrderExpressions in SortOrder - ASF JIRA

Attach files

Attach Screenshot

Voters

Watch issue

Watchers

Create sub-task

Link

Clone

Update Comment Author

Replace String in Comment

Update Comment Visibility

Delete Comments

XML

Word

Printable

JSON

Details

Type: Task
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.4.7, 3.0.0, 3.0.1
Fix Version/s: 3.1.0
Component/s: SQL
Labels:
None

Description

When a SortMergeJoin is followed by a Project with aliases, the outputOrdering is not propogated properly and in some cases, it leads to unrequired Sort operation:

spark.range(10).repartition($"id").createTempView("t1")
spark.range(20).repartition($"id").createTempView("t2")
spark.range(30).repartition($"id").createTempView("t3")

val planned = sql(
   """
     |SELECT t2id, t3.id as t3id
     |FROM (
     |    SELECT t1.id as t1id, t2.id as t2id
     |    FROM t1, t2
     |    WHERE t1.id = t2.id
     |) t12, t3
     |WHERE t1id = t3.id
   """.stripMargin).queryExecution.executedPlan



*(8) Project [t2id#1059L, id#1004L AS t3id#1060L]
+- *(8) SortMergeJoin [t2id#1059L], [id#1004L], Inner
   :- *(5) Sort [t2id#1059L ASC NULLS FIRST ], false, 0  <-----------------------
   :  +- *(5) Project [id#1000L AS t2id#1059L]
   :     +- *(5) SortMergeJoin [id#996L], [id#1000L], Inner
   :        :- *(2) Sort [id#996L ASC NULLS FIRST ], false, 0
   :        :  +- Exchange hashpartitioning(id#996L, 5), true, [id=#1426]
   :        :     +- *(1) Range (0, 10, step=1, splits=2)
   :        +- *(4) Sort [id#1000L ASC NULLS FIRST ], false, 0
   :           +- Exchange hashpartitioning(id#1000L, 5), true, [id=#1432]
   :              +- *(3) Range (0, 20, step=1, splits=2)
   +- *(7) Sort [id#1004L ASC NULLS FIRST ], false, 0
      +- Exchange hashpartitioning(id#1004L, 5), true, [id=#1443]
         +- *(6) Range (0, 30, step=1, splits=2)

The above marked Sort node could have been avoided.