Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Duplicate
- Affects Version/s: 2.0.0
- Fix Version/s: None
- Environment: Spark 2.0.0, Mac, Local
Description
test.scala
scala> val df1 = sc.parallelize(Seq((1, 2, 3), (3, 3, 3))).toDF("a", "b", "c")
df1: org.apache.spark.sql.DataFrame = [a: int, b: int ... 1 more field]

scala> val df2 = sc.parallelize(Seq((1, 2, 4), (4, 4, 4))).toDF("a", "b", "d")
df2: org.apache.spark.sql.DataFrame = [a: int, b: int ... 1 more field]

scala> val df3 = df1.join(df2, Seq("a", "b"), "outer")
df3: org.apache.spark.sql.DataFrame = [a: int, b: int ... 2 more fields]

scala> df3.show()
+---+---+----+----+
|  a|  b|   c|   d|
+---+---+----+----+
|  1|  2|   3|   4|
|  3|  3|   3|null|
|  4|  4|null|   4|
+---+---+----+----+

scala> val df4 = sc.parallelize(Seq((1, 2, 5), (3, 3, 5), (4, 4, 5))).toDF("a", "b", "e")
df4: org.apache.spark.sql.DataFrame = [a: int, b: int ... 1 more field]

scala> df4.show()
+---+---+---+
|  a|  b|  e|
+---+---+---+
|  1|  2|  5|
|  3|  3|  5|
|  4|  4|  5|
+---+---+---+

scala> df3.join(df4, Seq("a", "b"), "inner").show()
+---+---+---+---+---+
|  a|  b|  c|  d|  e|
+---+---+---+---+---+
|  1|  2|  3|  4|  5|
+---+---+---+---+---+
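The inner join should return three rows, one per key in df4, since all three keys (1, 2), (3, 3), and (4, 4) are present in df3; only the (1, 2) row survives. For anyone reproducing this outside the shell, a minimal self-contained sketch, assuming a local SparkSession (the object and app names are hypothetical):

repro.scala

import org.apache.spark.sql.SparkSession

object OuterThenInnerJoinRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("outer-then-inner-join-repro") // hypothetical app name
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df1 = Seq((1, 2, 3), (3, 3, 3)).toDF("a", "b", "c")
    val df2 = Seq((1, 2, 4), (4, 4, 4)).toDF("a", "b", "d")

    // Full outer join on the shared keys; a key present in only one
    // side gets null in the other side's columns.
    val df3 = df1.join(df2, Seq("a", "b"), "outer")

    val df4 = Seq((1, 2, 5), (3, 3, 5), (4, 4, 5)).toDF("a", "b", "e")

    // Expected: three rows, one per key in df4.
    // Observed on 2.0.0: only the (1, 2) row.
    df3.join(df4, Seq("a", "b"), "inner").show()

    spark.stop()
  }
}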
If persist is called on df3, the output is correct:
test2.scala
scala> df3.persist
res32: df3.type = [a: int, b: int ... 2 more fields]

scala> df3.join(df4, Seq("a", "b"), "inner").show()
+---+---+----+----+---+
|  a|  b|   c|   d|  e|
+---+---+----+----+---+
|  1|  2|   3|   4|  5|
|  3|  3|   3|null|  5|
|  4|  4|null|   4|  5|
+---+---+----+----+---+
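persist with no arguments uses the default storage level (MEMORY_AND_DISK for Datasets in 2.0). A more explicit form of the same workaround, with cleanup once the join is done, might look like the following sketch (not from the original report):

workaround.scala

import org.apache.spark.storage.StorageLevel

// Persisting swaps df3's logical plan for an InMemoryRelation on
// later queries, which sidesteps whatever rewrite drops the rows.
df3.persist(StorageLevel.MEMORY_AND_DISK)

df3.join(df4, Seq("a", "b"), "inner").show()  // all three rows now appear

df3.unpersist()  // release the cached data when no longer needed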
Issue Links
- duplicates: SPARK-16991 Full outer join followed by inner join produces wrong results (Resolved)