SPARK-26057: Table joining is broken in Spark 2.4


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.4.0
    • Fix Version/s: 2.4.1, 3.0.0
    • Component/s: SQL
    • Labels: None

    Description

      This sample works in spark-shell 2.3.1 but throws an exception in 2.4.0.
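
      Since the behaviour differs between 2.3.1 and 2.4.0, it may help to confirm which version the shell is running before pasting the snippet below. This check is a convenience addition, not part of the original reproduction.

      // Not in the original report: print the Spark version of the running shell.
      println(spark.version)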

      import java.util.Arrays.asList
      import org.apache.spark.sql.Row
      import org.apache.spark.sql.types._
      
      spark.createDataFrame(
        asList(
          Row("1-1", "sp", 6),
          Row("1-1", "pc", 5),
          Row("1-2", "pc", 4),
          Row("2-1", "sp", 3),
          Row("2-2", "pc", 2),
          Row("2-2", "sp", 1)
        ),
        StructType(List(StructField("id", StringType), StructField("layout", StringType), StructField("n", IntegerType)))
      ).createOrReplaceTempView("cc")
      
      spark.createDataFrame(
        asList(
          Row("sp", 1),
          Row("sp", 1),
          Row("sp", 2),
          Row("sp", 3),
          Row("sp", 3),
          Row("sp", 4),
          Row("sp", 5),
          Row("sp", 5),
          Row("pc", 1),
          Row("pc", 2),
          Row("pc", 2),
          Row("pc", 3),
          Row("pc", 4),
          Row("pc", 4),
          Row("pc", 5)
        ),
        StructType(List(StructField("layout", StringType), StructField("ts", IntegerType)))
      ).createOrReplaceTempView("p")
      
      spark.createDataFrame(
        asList(
          Row("1-1", "sp", 1),
          Row("1-1", "sp", 2),
          Row("1-1", "pc", 3),
          Row("1-2", "pc", 3),
          Row("1-2", "pc", 4),
          Row("2-1", "sp", 4),
          Row("2-1", "sp", 5),
          Row("2-2", "pc", 6),
          Row("2-2", "sp", 6)
        ),
        StructType(List(StructField("id", StringType), StructField("layout", StringType), StructField("ts", IntegerType)))
      ).createOrReplaceTempView("c")
      
      spark.sql("""
      SELECT cc.id, cc.layout, count(*) as m
        FROM cc
        JOIN p USING(layout)
        WHERE EXISTS(SELECT 1 FROM c WHERE c.id = cc.id AND c.layout = cc.layout AND c.ts > p.ts)
        GROUP BY cc.id, cc.layout
      """).createOrReplaceTempView("pcc")
      
      spark.sql("SELECT * FROM pcc ORDER BY id, layout").show
      
      spark.sql("""
      SELECT cc.id, cc.layout, n, m
        FROM cc
        LEFT OUTER JOIN pcc ON pcc.id = cc.id AND pcc.layout = cc.layout
      """).createOrReplaceTempView("k")
      
      spark.sql("SELECT * FROM k ORDER BY id, layout").show
      
      

      Actually, I was trying to track down another bug: similar calculations with joins and nested queries produce different results in Spark 2.3.1 and 2.4.0. But when I tried to reduce it to a minimal example, I got this exception instead:

      java.lang.RuntimeException: Couldn't find id#0 in [id#38,layout#39,ts#7,id#10,layout#11,ts#12]
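
      A small diagnostic sketch (an addition, not part of the original report): printing the plans of the failing view with the standard explain API may show the duplicated attribute IDs behind this error, or fail at planning time with the same message.

      // Not in the original report: print the parsed, analyzed, optimized and
      // physical plans for the final query. Depending on where resolution breaks,
      // this either shows the attribute IDs on each side of the joins or raises
      // the same "Couldn't find id#0" error while building the physical plan.
      spark.sql("SELECT * FROM k ORDER BY id, layout").explain(true)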
      

People

    • Assignee: Marco Gaido (mgaido)
    • Reporter: Pavel Parkhomenko (legba)
    • Votes: 0
    • Watchers: 6
