[SPARK-26782] Wrong column resolved when joining twice with the same dataframe - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Duplicate
Affects Version/s: 2.3.1
Fix Version/s: None
Component/s: Spark Core
Labels: None

Description

Execute the following code:

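{
 // In spark-shell these implicits are already in scope; elsewhere they are
 // needed for toDF and the $"..." column syntax (assumes a SparkSession named spark).
 import spark.implicits._

 val events = Seq(("a", 0)).toDF("id", "ts")
 val dim = Seq(("a", 0, 24), ("a", 24, 48)).toDF("id", "start", "end")

 // Two aliases of the same underlying DataFrame.
 val dimOriginal = dim.as("dim")
 val dimShifted = dim.as("dimShifted")

 // First join: keep the dim row whose [start, end) interval contains ts.
 val r = events
  .join(dimOriginal, "id")
  .where(dimOriginal("start") <= $"ts" && $"ts" < dimOriginal("end"))

 // Second join: keep the dim row whose interval contains ts + 24.
 val r2 = r
  .join(dimShifted, "id")
  .where(dimShifted("start") <= $"ts" + 24 && $"ts" + 24 < dimShifted("end"))

 r2.show()
 r2.explain(true)
}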

Expected effect:
- One row is shown
- Logical plan shows two independent joins, one with "dim" and one with "dimShifted"
Observed effect:
- No rows are printed.
- Logical plan shows two filters are applied:
  - 'Filter ((start#17 <= ('ts + 24)) && (('ts + 24) < end#18))'
  - Filter ((start#17 <= ts#6) && (ts#6 < end#18))
- Both filters refer to the same start#17 and end#18 columns, so they are applied to the same dataframe rather than to two different ones.
- It appears that dimShifted("start") is resolved to the same column as dimOriginal("start").

I get the desired effect if I replace the second where with

.where($"dimShifted.start" <= $"ts" + 24 && $"ts" + 24 < $"dimShifted.end")
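For reference, here is a minimal sketch of the whole query rewritten in that alias-qualified style. Only the second filter was reported above as fixed this way; applying the same style to the first filter is an assumption made for consistency. The object name, the local master setting, and the app name are hypothetical and exist only to make the sketch self-contained outside spark-shell.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Hypothetical wrapper object; in spark-shell only the body of main is needed.
object AliasQualifiedJoinSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session is sufficient to try this out.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("SPARK-26782-workaround-sketch")
      .getOrCreate()
    import spark.implicits._

    val events = Seq(("a", 0)).toDF("id", "ts")
    val dim = Seq(("a", 0, 24), ("a", 24, 48)).toDF("id", "start", "end")

    val r2 = events
      .join(dim.as("dim"), "id")
      // Columns are referenced by alias-qualified name instead of dimOriginal("start").
      .where(col("dim.start") <= col("ts") && col("ts") < col("dim.end"))
      .join(dim.as("dimShifted"), "id")
      // Same form as the replacement filter quoted in the report.
      .where(col("dimShifted.start") <= col("ts") + 24 &&
             col("ts") + 24 < col("dimShifted.end"))

    r2.show()
    r2.explain(true)
    spark.stop()
  }
}
```

The design choice is to let the analyzer resolve each column through its alias qualifier, so the lookup does not go through a column reference obtained from the aliased dataframe object, which is the lookup that appears to resolve incorrectly here.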

People

Assignee: Unassigned

Reporter: Vladimir Prus

Votes: 0

Watchers: 2

Dates

Created: 30/Jan/19 10:52

Updated: 30/Jan/19 10:58

Resolved: 30/Jan/19 10:58