Details
- Bug
- Status: Resolved
- Minor
- Resolution: Duplicate
- 3.2.0
- None
- None
OS: Ubuntu 18.04.5 LTS
Scala version: 2.12.15
Description
When two DataFrames are joined on a column name and the result is aliased, selecting from the resulting Dataset with a qualified star (joined.*) returns the join column twice.
scala> val df1 = Seq((1, 10), (2, 20)).toDF("a", "x")
df1: org.apache.spark.sql.DataFrame = [a: int, x: int]

scala> val df2 = Seq((2, 200), (3, 300)).toDF("a", "y")
df2: org.apache.spark.sql.DataFrame = [a: int, y: int]

scala> val joined = df1.join(df2, "a").alias("joined")
joined: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [a: int, x: int ... 1 more field]

scala> joined.select("*").show()
+---+---+---+
|  a|  x|  y|
+---+---+---+
|  2| 20|200|
+---+---+---+

scala> joined.select("joined.*").show()
+---+---+---+---+
|  a|  a|  x|  y|
+---+---+---+---+
|  2|  2| 20|200|
+---+---+---+---+

scala> joined.select("*").select("joined.*").show()
+---+---+---+
|  a|  x|  y|
+---+---+---+
|  2| 20|200|
+---+---+---+
This appears to have been introduced by SPARK-34527 and leads to surprising behaviour: select("*") and select("joined.*") no longer return the same columns. On an earlier version, such as Spark 3.0.2, all three show() calls produce the same output.
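To make the difference easy to check without reading the tables, here is a minimal sketch of a helper that reports repeated column names. The name duplicateColumns is hypothetical and not part of Spark; the two column lists are copied from the show() output in this report rather than computed.

```scala
// Hypothetical helper (not a Spark API): returns every column name that
// occurs more than once, in order of first repeated appearance.
def duplicateColumns(columns: Seq[String]): Seq[String] =
  columns.diff(columns.distinct).distinct

// Column lists copied from the show() output in this report.
val star          = Seq("a", "x", "y")      // joined.select("*")
val qualifiedStar = Seq("a", "a", "x", "y") // joined.select("joined.*") on 3.2.0

println(duplicateColumns(star))          // List()
println(duplicateColumns(qualifiedStar)) // List(a)
```

In spark-shell the same helper should accept the live column list, for example duplicateColumns(joined.select("joined.*").columns), since Dataset.columns returns an Array[String].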
Issue Links
- duplicates SPARK-39376 Do not output duplicated columns in star expansion of subquery alias of NATURAL/USING JOIN (Resolved)