[SPARK-23677] Selecting columns from joined DataFrames with the same origin yields wrong results

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Duplicate
Affects Version/s: 2.2.1, 2.3.0
Fix Version/s: None
Component/s: Spark Core, SQL
Labels: None

Description

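When two DataFrames derived from the same origin DataFrame are joined and columns are then selected from the join through the parent DataFrames (e.g. small("num") and big("num")), Spark can't distinguish between the columns and gives a wrong (or at least very surprising) result: both references resolve to the same column. One can work around this by selecting the columns with expr and their alias-qualified names.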

Here is a minimal example:

import org.apache.spark.sql.functions.expr
import spark.implicits._

val edf = Seq(1, 2, 3, 4, 5).toDF("num")
val big = edf.where(edf("num") > 2).alias("big")
val small = edf.where(edf("num") < 4).alias("small")

// Selecting through the parent DataFrames: both columns come out as small.num (wrong)
small.join(big, expr("big.num == (small.num + 1)")).select(small("num"), big("num")).show()
// +---+---+
// |num|num|
// +---+---+
// |  2|  2|
// |  3|  3|
// +---+---+

// Selecting through expr with alias-qualified names: expected result
small.join(big, expr("big.num == (small.num + 1)")).select(expr("small.num"), expr("big.num")).show()
// +---+---+
// |num|num|
// +---+---+
// |  2|  3|
// |  3|  4|
// +---+---+
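
The same workaround can also be written with col instead of expr. The following is only a sketch, not something taken from the report: it assumes that col("small.num") and col("big.num") are resolved against the aliases in the same way as the expr variant above. The object name SelfJoinWorkaround, the helper joinedPairs and the local SparkSession settings are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object SelfJoinWorkaround {
  // Hypothetical helper: rebuilds the join from the report, but refers to every
  // column by its alias-qualified name instead of via small("num") / big("num").
  def joinedPairs(spark: SparkSession): List[(Int, Int)] = {
    import spark.implicits._

    val edf = Seq(1, 2, 3, 4, 5).toDF("num")
    val big = edf.where(col("num") > 2).alias("big")
    val small = edf.where(col("num") < 4).alias("small")

    small
      .join(big, col("big.num") === col("small.num") + 1)
      .select(col("small.num"), col("big.num"))
      .collect()
      .map(row => (row.getInt(0), row.getInt(1))) // read both columns by position
      .sorted                                     // join output order is not guaranteed
      .toList
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is good enough to try this out.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("spark-23677-sketch")
      .getOrCreate()
    try joinedPairs(spark).foreach(println)
    finally spark.stop()
  }
}
```

If the assumption holds, this should return the same pairs as the expr variant, (2, 3) and (3, 4), because the columns are looked up by alias rather than through the shared parent DataFrame.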

Issue Links

duplicates

SPARK-14948 Exception when joining DataFrames derived form the same DataFrame (In Progress)

SPARK-10892 Join with Data Frame returns wrong results (Closed)

People

Assignee: Unassigned

Reporter: Martin Mauch

Votes: 0

Watchers: 3

Dates

Created: 14/Mar/18 13:04

Updated: 16/Mar/18 17:26

Resolved: 16/Mar/18 17:24