Details
-
Bug
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
3.0.3, 3.1.3, 3.3.1, 3.2.3, 3.4.0
Description
The following query should return a single row as all values for id except for the largest will be eliminated by the anti-join:
val ids = Seq(1, 2, 3).toDF("id").distinct() val result = ids.withColumn("id", $"id" + 1).join(ids, Seq("id"), "left_anti").collect() assert(result.length == 1)
Without the distinct(), the assertion is true. With distinct(), the assertion should still hold but is false.
Rule PushDownLeftSemiAntiJoin pushes the Join below the left Aggregate with join condition (id#750 + 1) = id#750, which can never be true.
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.PushDownLeftSemiAntiJoin === !Join LeftAnti, (id#752 = id#750) 'Aggregate [id#750], [(id#750 + 1) AS id#752] !:- Aggregate [id#750], [(id#750 + 1) AS id#752] +- 'Join LeftAnti, ((id#750 + 1) = id#750) !: +- LocalRelation [id#750] :- LocalRelation [id#750] !+- Aggregate [id#750], [id#750] +- Aggregate [id#750], [id#750] ! +- LocalRelation [id#750] +- LocalRelation [id#750]
The optimizer then rightly removes the left-anti join altogether, returning the left child only.
Rule PushDownLeftSemiAntiJoin should not push down predicates that reference left and right child.
Attachments
Issue Links
- links to
(2 links to)