  1. Spark
  2. SPARK-23677

Selecting columns from joined DataFrames with the same origin yields wrong results

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 2.2.1, 2.3.0
    • Fix Version/s: None
    • Component/s: Spark Core, SQL
    • Labels: None

    Description

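      When joining two DataFrames derived from the same origin DataFrame and then selecting columns from the join via the parent DataFrames (e.g. small("num") and big("num")), Spark can't distinguish between the columns and gives a wrong (or at least very surprising) result: both references resolve to the same column. One can work around this by referring to the columns through their aliases using expr.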

      Here is a minimal example:

       

      import org.apache.spark.sql.functions.expr
      import spark.implicits._
      val edf = Seq((1), (2), (3), (4), (5)).toDF("num")
      val big = edf.where(edf("num") > 2).alias("big")
      val small = edf.where(edf("num") < 4).alias("small")
      small.join(big, expr("big.num == (small.num + 1)")).select(small("num"), big("num")).show()
      // +---+---+
      // |num|num|
      // +---+---+
      // |  2|  2|
      // |  3|  3|
      // +---+---+
      small.join(big, expr("big.num == (small.num + 1)")).select(expr("small.num"), expr("big.num")).show()
      // +---+---+
      // |num|num|
      // +---+---+
      // |  2|  3|
      // |  3|  4|
      // +---+---+
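
The same alias-based workaround can probably also be written with col instead of expr, since a qualified name such as "small.num" should be resolved against the alias rather than against the shared parent DataFrame. The following is only a sketch under that assumption; the local SparkSession setup, the app name, and the value names joined and result are illustrative and not part of the original report.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, expr}

// Assumption: a local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("spark-23677-sketch") // hypothetical app name
  .getOrCreate()
import spark.implicits._

val edf = Seq(1, 2, 3, 4, 5).toDF("num")
val big = edf.where(edf("num") > 2).alias("big")
val small = edf.where(edf("num") < 4).alias("small")

// Same join condition as in the report.
val joined = small.join(big, expr("big.num == (small.num + 1)"))

// Refer to the columns by alias-qualified name instead of small("num") / big("num").
val result = joined.select(col("small.num"), col("big.num"))
result.show()
```

If this behaves like the expr variant above, the two selected columns hold different values per row, matching the join condition.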
      

       

            People

              Assignee: Unassigned
              Reporter: Martin Mauch (martin.mauch)
              Votes: 0
              Watchers: 3
