Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Duplicate
- Affects Version/s: 2.2.1, 2.3.0
- None
- None
Description
When two DataFrames derived from the same origin DataFrame are joined and columns are then selected from the join result through the parent DataFrames (for example small("num") and big("num")), Spark cannot distinguish between the columns and returns a wrong (or at least very surprising) result. One can work around this by referencing the columns through expr.
Here is a minimal example:
import org.apache.spark.sql.functions.expr
import spark.implicits._

val edf = Seq((1), (2), (3), (4), (5)).toDF("num")
val big = edf.where(edf("num") > 2).alias("big")
val small = edf.where(edf("num") < 4).alias("small")

// Selecting through the parent DataFrames: both columns resolve to the same attribute.
small.join(big, expr("big.num == (small.num + 1)")).select(small("num"), big("num")).show()
// +---+---+
// |num|num|
// +---+---+
// |  2|  2|
// |  3|  3|
// +---+---+

// Selecting through expr with the alias names: the result is as expected.
small.join(big, expr("big.num == (small.num + 1)")).select(expr("small.num"), expr("big.num")).show()
// +---+---+
// |num|num|
// +---+---+
// |  2|  3|
// |  3|  4|
// +---+---+
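The expr workaround relies on resolving columns by their alias-qualified names rather than through the parent DataFrame objects. As a minimal sketch of the same idea, the snippet below uses col("small.num") and col("big.num") instead of expr. The object name AliasJoinSketch, the method run, the local[1] master and the output column names small_num and big_num are assumptions made for illustration and are not part of the original report. Under these assumptions it should produce the same pairs as the expr variant above.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Hypothetical helper object, for illustration only.
object AliasJoinSketch {
  def run(): Seq[(Int, Int)] = {
    // Assumption: a local single-threaded session is enough for this example.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("alias-join-sketch")
      .getOrCreate()
    import spark.implicits._

    val edf = Seq(1, 2, 3, 4, 5).toDF("num")
    val big = edf.where(col("num") > 2).alias("big")
    val small = edf.where(col("num") < 4).alias("small")

    // Refer to columns by alias-qualified name so that each side of the
    // self-join is resolved against its own alias.
    val joined = small
      .join(big, col("big.num") === col("small.num") + 1)
      .select(col("small.num").as("small_num"), col("big.num").as("big_num"))

    val rows = joined.collect().map(r => (r.getInt(0), r.getInt(1))).toSeq
    spark.stop()
    rows
  }

  def main(args: Array[String]): Unit =
    run().foreach(println)
}
```

Renaming the selected columns with as also avoids ending up with two output columns that are both called num, which makes later references to the join result unambiguous.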
Issue Links
- duplicates
  - SPARK-14948 Exception when joining DataFrames derived form the same DataFrame (In Progress)
  - SPARK-10892 Join with Data Frame returns wrong results (Closed)