Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-30218

Columns used in inequality conditions for joins not resolved correctly in case of common lineage

    XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 2.3.4, 2.4.4
    • Fix Version/s: None
    • Component/s: PySpark
    • Labels:

      Description

      When columns from different data-frames that have a common lineage are used in inequality conditions in joins, they are not resolved correctly. In particular, both the column from the left DF and the one from the right DF are resolved to the same column, thus making the inequality condition either always satisfied or always not-satisfied.

      Minimal example to reproduce follows.

      import pyspark.sql.functions as F
      
      data = spark.createDataFrame([["id1", "A", 0], ["id1", "A", 1], ["id2", "A", 2], ["id2", "A", 3], ["id1", "B", 1] , ["id1", "B", 5], ["id2", "B", 10]], ["id", "kind", "timestamp"])
      
      df_left = data.where(F.col("kind") == "A").alias("left")
      df_right = data.where(F.col("kind") == "B").alias("right")
      
      conds = [df_left["id"] == df_right["id"]]
      conds.append(df_right["timestamp"].between(df_left["timestamp"], df_left["timestamp"] + 2))
      
      res = df_left.join(df_right, conds, how="left")
      

      The result is:

      id kind timestamp id kind timestamp
      id1 A 0 id1 B 1
      id1 A 0 id1 B 5
      id1 A 1 id1 B 1
      id1 A 1 id1 B 5
      id2 A 2 id2 B 10
      id2 A 3 id2 B 10

      which violates the condition that the timestamp from the right DF should be between df_left["timestamp"] and df_left["timestamp"] + 2.

      The plan shows the problem in the column resolution.

      == Parsed Logical Plan ==
      Join LeftOuter, ((id#0 = id#36) && ((timestamp#2L >= timestamp#2L) && (timestamp#2L <= (timestamp#2L + cast(2 as bigint)))))
      :- SubqueryAlias `left`
      :  +- Filter (kind#1 = A)
      :     +- LogicalRDD [id#0, kind#1, timestamp#2L], false
      +- SubqueryAlias `right`
         +- Filter (kind#37 = B)
            +- LogicalRDD [id#36, kind#37, timestamp#38L], false
      

      Note, the columns used in the equality condition of the join have been correctly resolved.

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                Unassigned
                Reporter:
                FC Francesco Cavrini
              • Votes:
                0 Vote for this issue
                Watchers:
                5 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: