Spark / SPARK-26782

Wrong column resolved when joining twice with the same dataframe

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 2.3.1
    • Fix Version/s: None
    • Component/s: Spark Core
    • Labels: None

    Description

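      1. Execute the following code:

      {
        // Run in spark-shell, where spark.implicits._ is already in scope (needed for toDF and $"...").
        val events = Seq(("a", 0)).toDF("id", "ts")
        val dim = Seq(("a", 0, 24), ("a", 24, 48)).toDF("id", "start", "end")

        // Two aliases of the same underlying dataframe.
        val dimOriginal = dim.as("dim")
        val dimShifted = dim.as("dimShifted")

        // First join: match the event to the interval containing ts.
        val r = events
          .join(dimOriginal, "id")
          .where(dimOriginal("start") <= $"ts" && $"ts" < dimOriginal("end"))

        // Second join: match the event to the interval containing ts + 24.
        val r2 = r
          .join(dimShifted, "id")
          .where(dimShifted("start") <= $"ts" + 24 && $"ts" + 24 < dimShifted("end"))

        r2.show()
        r2.explain(true)
      }
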
      2. Expected effect:
        • One row is shown.
        • The logical plan shows two independent joins, one with "dim" and one with "dimShifted".
      3. Observed effect:
        • No rows are printed.
        • The logical plan shows that two filters are applied:
          • 'Filter ((start#17 <= ('ts + 24)) && (('ts + 24) < end#18))'
          • Filter ((start#17 <= ts#6) && (ts#6 < end#18))
        • Both filters refer to the same start#17 and end#18 columns, so they are applied to the same dataframe rather than to two different ones.
        • It appears that dimShifted("start") is resolved to be identical to dimOriginal("start").
      4. I get the desired effect if I replace the second where with
      .where($"dimShifted.start" <= $"ts" + 24 && $"ts" + 24 < $"dimShifted.end")
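
      For reference, the workaround can be written out as a standalone program. This is only a sketch under a few assumptions that are not part of the report: a local SparkSession, the hypothetical object name SelfJoinWorkaround, and alias-qualified column names ($"dim.start", $"dimShifted.start") in both filters instead of only the second one. Qualified names are resolved against the alias at analysis time, which is presumably why they do not collapse onto the same attribute the way dimShifted("start") does.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical standalone wrapper around the snippet from the report.
object SelfJoinWorkaround {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session; in spark-shell `spark` already exists.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("SPARK-26782-workaround")
      .getOrCreate()
    import spark.implicits._ // provides toDF and the $"..." column syntax

    val events = Seq(("a", 0)).toDF("id", "ts")
    val dim = Seq(("a", 0, 24), ("a", 24, 48)).toDF("id", "start", "end")

    // Refer to columns by alias-qualified name rather than via
    // dimOriginal("start") / dimShifted("start").
    val r = events
      .join(dim.as("dim"), "id")
      .where($"dim.start" <= $"ts" && $"ts" < $"dim.end")

    val r2 = r
      .join(dim.as("dimShifted"), "id")
      .where($"dimShifted.start" <= $"ts" + 24 && $"ts" + 24 < $"dimShifted.end")

    r2.show()
    r2.explain(true) // the two filters should now reference different attributes

    spark.stop()
  }
}
```

      With this form the report's expectation is a single row for event "a", and the plan printed by explain(true) can be checked to confirm that the second filter no longer uses start#17 and end#18.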
      

          People

            Assignee: Unassigned
            Reporter: Vladimir Prus (vladimir.prus)