Spark / SPARK-38603

Qualified star selection produces duplicated common columns after join then alias

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Duplicate
    • Affects Version/s: 3.2.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None
    • Environment: OS: Ubuntu 18.04.5 LTS
      Scala version: 2.12.15

    Description

      When two DataFrames are joined on a common column and the result is then aliased, selecting from the resulting Dataset with a qualified star (for example "joined.*") returns the join column twice, whereas an unqualified star returns it once.

      scala> val df1 = Seq((1, 10), (2, 20)).toDF("a", "x")
      df1: org.apache.spark.sql.DataFrame = [a: int, x: int]
      
      scala> val df2 = Seq((2, 200), (3, 300)).toDF("a", "y")
      df2: org.apache.spark.sql.DataFrame = [a: int, y: int]
      
      scala> val joined = df1.join(df2, "a").alias("joined")
      joined: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [a: int, x: int ... 1 more field]
      
      scala> joined.select("*").show()
      +---+---+---+
      |  a|  x|  y|
      +---+---+---+
      |  2| 20|200|
      +---+---+---+
      
      scala> joined.select("joined.*").show()
      +---+---+---+---+
      |  a|  a|  x|  y|
      +---+---+---+---+
      |  2|  2| 20|200|
      +---+---+---+---+
      
      scala> joined.select("*").select("joined.*").show()
      +---+---+---+
      |  a|  x|  y|
      +---+---+---+
      |  2| 20|200|
      +---+---+---+ 

      This appears to have been introduced by SPARK-34527 and leads to surprising behaviour: the qualified star returns column a twice on its own, yet only once when it follows select("*"). On an earlier version, such as Spark 3.0.2, all three show() calls produce the same output.
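
      For anyone who wants to check a given Spark version outside the shell, the transcript above can be condensed into a small standalone program. This is only a sketch: the object name QualifiedStarRepro, the helper columnLists, and the local[1] master are illustrative choices, not part of the original report. It compares the column lists of the three selections instead of printing tables, and it treats the select("*") followed by select("joined.*") form as a possible workaround only because the transcript shows it returning three columns on 3.2.0.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical repro harness; names here are illustrative, not from the report.
object QualifiedStarRepro {

  // Builds the same joined-and-aliased Dataset as the transcript and returns
  // the column names produced by each of the three selections.
  def columnLists(spark: SparkSession): Map[String, List[String]] = {
    import spark.implicits._

    val df1 = Seq((1, 10), (2, 20)).toDF("a", "x")
    val df2 = Seq((2, 200), (3, 300)).toDF("a", "y")
    val joined = df1.join(df2, "a").alias("joined")

    Map(
      "*" -> joined.select("*").columns.toList,
      "joined.*" -> joined.select("joined.*").columns.toList,
      // Form observed in the report to return a single copy of the join column.
      "* then joined.*" -> joined.select("*").select("joined.*").columns.toList
    )
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]") // assumption: local run, no cluster needed
      .appName("SPARK-38603-repro")
      .getOrCreate()
    try {
      val cols = columnLists(spark)
      cols.foreach { case (selection, names) =>
        println(s"$selection -> ${names.mkString(", ")}")
      }
      // The issue is present when the qualified star yields a different
      // column list than the unqualified star.
      println(s"affected: ${cols("joined.*") != cols("*")}")
    } finally {
      spark.stop()
    }
  }
}
```

      On the 3.2.0 setup described in this report, the "joined.*" entry would be expected to contain a twice, matching the second show() above; on versions without the regression the three lists should be identical.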

            People

              Assignee: Unassigned
              Reporter: Yves Li (yvesli)