[SPARK-38603] Qualified star selection produces duplicated common columns after join then alias - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Minor
Resolution: Duplicate
Affects Version/s: 3.2.0
Fix Version/s: None
Component/s: SQL
Labels: None
Environment:

OS: Ubuntu 18.04.5 LTS
Scala version: 2.12.15

Description

When two DataFrames are joined on a common column name and the result is then aliased, selecting from the resulting Dataset with a qualified star (alias.*) returns the join column twice. An unqualified star on the same Dataset returns it only once.

scala> val df1 = Seq((1, 10), (2, 20)).toDF("a", "x")
df1: org.apache.spark.sql.DataFrame = [a: int, x: int]

scala> val df2 = Seq((2, 200), (3, 300)).toDF("a", "y")
df2: org.apache.spark.sql.DataFrame = [a: int, y: int]

scala> val joined = df1.join(df2, "a").alias("joined")
joined: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [a: int, x: int ... 1 more field]

scala> joined.select("*").show()
+---+---+---+
|  a|  x|  y|
+---+---+---+
|  2| 20|200|
+---+---+---+

scala> joined.select("joined.*").show()
+---+---+---+---+
|  a|  a|  x|  y|
+---+---+---+---+
|  2|  2| 20|200|
+---+---+---+---+

scala> joined.select("*").select("joined.*").show()
+---+---+---+
|  a|  x|  y|
+---+---+---+
|  2| 20|200|
+---+---+---+

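This appears to have been introduced by SPARK-34527 and leads to some surprising behaviour: the qualified star returns a different set of columns than the unqualified star on the same Dataset. On an earlier version, such as Spark 3.0.2, all three show() calls produce the same output.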
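As a rough way to check whether a particular Spark build is affected, the reproduction above can be wrapped in a small standalone program that reports which column names come back more than once. This is only a sketch: `QualifiedStarCheck` and `duplicatedNames` are hypothetical names chosen for illustration, and the `local[1]` master is an assumption. The `select("*").select("joined.*")` chain is the one from the transcript above, which returned each column once on 3.2.0.

```scala
import org.apache.spark.sql.SparkSession

object QualifiedStarCheck {
  // Hypothetical helper: names that occur more than once in a list of column names.
  def duplicatedNames(names: Seq[String]): Seq[String] =
    names.groupBy(identity).collect { case (name, hits) if hits.size > 1 => name }.toSeq

  def main(args: Array[String]): Unit = {
    // Assumption: a throwaway local session is acceptable for the check.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("SPARK-38603-check")
      .getOrCreate()
    import spark.implicits._

    // Same data and join as in the report.
    val df1 = Seq((1, 10), (2, 20)).toDF("a", "x")
    val df2 = Seq((2, 200), (3, 300)).toDF("a", "y")
    val joined = df1.join(df2, "a").alias("joined")

    // Qualified star directly on the aliased join (the case reported as duplicating "a").
    val qualified = joined.select("joined.*")
    // Chain from the report: unqualified star first, then the qualified star.
    val chained = joined.select("*").select("joined.*")

    println(s"joined.* columns: ${qualified.columns.mkString(", ")}")
    println(s"duplicated in joined.*: ${duplicatedNames(qualified.columns.toSeq).mkString(", ")}")
    println(s"duplicated after select(*): ${duplicatedNames(chained.columns.toSeq).mkString(", ")}")

    spark.stop()
  }
}
```

On an affected build the second line should list the join column; on an unaffected build both duplicate lists should be empty. Whether the chained select is a suitable workaround elsewhere has only been observed in the transcript above, not verified more broadly.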

Issue Links

duplicates SPARK-39376 Do not output duplicated columns in star expansion of subquery alias of NATURAL/USING JOIN (Resolved)

People

Assignee: Unassigned
Reporter: Yves Li
Votes: 0
Watchers: 1

Dates

Created: 19/Mar/22 00:43
Updated: 30/Aug/22 15:49
Resolved: 30/Aug/22 15:49