[SPARK-26057] Table joining is broken in Spark 2.4 - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 2.4.0
Fix Version/s: 2.4.1, 3.0.0
Component/s: SQL
Labels:
None

Description

This sample works in spark-shell 2.3.1 but throws an exception in 2.4.0:

import java.util.Arrays.asList
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

spark.createDataFrame(
  asList(
    Row("1-1", "sp", 6),
    Row("1-1", "pc", 5),
    Row("1-2", "pc", 4),
    Row("2-1", "sp", 3),
    Row("2-2", "pc", 2),
    Row("2-2", "sp", 1)
  ),
  StructType(List(StructField("id", StringType), StructField("layout", StringType), StructField("n", IntegerType)))
).createOrReplaceTempView("cc")

spark.createDataFrame(
  asList(
    Row("sp", 1),
    Row("sp", 1),
    Row("sp", 2),
    Row("sp", 3),
    Row("sp", 3),
    Row("sp", 4),
    Row("sp", 5),
    Row("sp", 5),
    Row("pc", 1),
    Row("pc", 2),
    Row("pc", 2),
    Row("pc", 3),
    Row("pc", 4),
    Row("pc", 4),
    Row("pc", 5)
  ),
  StructType(List(StructField("layout", StringType), StructField("ts", IntegerType)))
).createOrReplaceTempView("p")

spark.createDataFrame(
  asList(
    Row("1-1", "sp", 1),
    Row("1-1", "sp", 2),
    Row("1-1", "pc", 3),
    Row("1-2", "pc", 3),
    Row("1-2", "pc", 4),
    Row("2-1", "sp", 4),
    Row("2-1", "sp", 5),
    Row("2-2", "pc", 6),
    Row("2-2", "sp", 6)
  ),
  StructType(List(StructField("id", StringType), StructField("layout", StringType), StructField("ts", IntegerType)))
).createOrReplaceTempView("c")

spark.sql("""
SELECT cc.id, cc.layout, count(*) as m
  FROM cc
  JOIN p USING(layout)
  WHERE EXISTS(SELECT 1 FROM c WHERE c.id = cc.id AND c.layout = cc.layout AND c.ts > p.ts)
  GROUP BY cc.id, cc.layout
""").createOrReplaceTempView("pcc")

spark.sql("SELECT * FROM pcc ORDER BY id, layout").show

spark.sql("""
SELECT cc.id, cc.layout, n, m
  FROM cc
  LEFT OUTER JOIN pcc ON pcc.id = cc.id AND pcc.layout = cc.layout
""").createOrReplaceTempView("k")

spark.sql("SELECT * FROM k ORDER BY id, layout").show

I was actually trying to track down a different bug: similar calculations with joins and nested queries produce different results in Spark 2.3.1 and 2.4.0. While reducing that case to a minimal example, I got this exception instead:

java.lang.RuntimeException: Couldn't find id#0 in [id#38,layout#39,ts#7,id#10,layout#11,ts#12]
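For cross-checking results between Spark versions, it may help to have a reference answer that does not depend on Spark at all. The following is a minimal sketch that evaluates the two queries above over plain Scala collections, using the same rows as the `cc`, `p` and `c` views. The object name `ExpectedResult` and the case classes `Cc`, `P` and `C` are hypothetical helpers introduced for illustration; they are not part of the original report or of any Spark API.

```scala
// Plain Scala collections only; no Spark session is needed.
// ExpectedResult, Cc, P and C are illustrative names, not part of the report.
object ExpectedResult {
  final case class Cc(id: String, layout: String, n: Int)
  final case class P(layout: String, ts: Int)
  final case class C(id: String, layout: String, ts: Int)

  // Same rows as the "cc" view.
  val cc: List[Cc] = List(
    Cc("1-1", "sp", 6), Cc("1-1", "pc", 5), Cc("1-2", "pc", 4),
    Cc("2-1", "sp", 3), Cc("2-2", "pc", 2), Cc("2-2", "sp", 1)
  )

  // Same rows as the "p" view.
  val p: List[P] =
    List(1, 1, 2, 3, 3, 4, 5, 5).map(P("sp", _)) ++
    List(1, 2, 2, 3, 4, 4, 5).map(P("pc", _))

  // Same rows as the "c" view.
  val c: List[C] = List(
    C("1-1", "sp", 1), C("1-1", "sp", 2), C("1-1", "pc", 3),
    C("1-2", "pc", 3), C("1-2", "pc", 4), C("2-1", "sp", 4),
    C("2-1", "sp", 5), C("2-2", "pc", 6), C("2-2", "sp", 6)
  )

  // pcc: cc JOIN p USING(layout), filtered by the EXISTS subquery,
  // then count(*) grouped by (id, layout).
  val pcc: Map[(String, String), Int] = {
    val joined = for {
      a <- cc
      b <- p
      if b.layout == a.layout
      if c.exists(x => x.id == a.id && x.layout == a.layout && x.ts > b.ts)
    } yield (a.id, a.layout)
    joined.groupBy(identity).map { case (key, rows) => key -> rows.size }
  }

  // k: cc LEFT OUTER JOIN pcc; m is None where no pcc row matches.
  val k: List[(String, String, Int, Option[Int])] =
    cc.map(a => (a.id, a.layout, a.n, pcc.get((a.id, a.layout))))
      .sortBy(r => (r._1, r._2))

  def main(args: Array[String]): Unit = k.foreach(println)
}
```

Under these assumptions every `cc` row finds a match in `pcc`, so `m` is never null in the final result; a run that returns different counts or missing columns would point at the planner rather than at the data.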

Issue Links

is related to

SPARK-26041 catalyst cuts out some columns from dataframes: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute

Resolved

links to

[Github] Pull Request #23035 (mgaido91)

People

Assignee: Marco Gaido

Reporter: Pavel Parkhomenko

Votes: 0

Watchers: 6

Dates

Created: 14/Nov/18 08:02

Updated: 05/Apr/19 02:02

Resolved: 15/Nov/18 12:11