[SPARK-23855] Performing a Join after a CrossJoin can lead to data corruption - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Incomplete
Affects Version/s: 2.2.0, 2.2.1
Fix Version/s: None
Component/s: SQL
Labels:
- bulk-closed

Description

The following test produces the wrong result for the join operation. The error only occurs when joining on the first column of the crossed DataFrame. A subsequent select corrects the data, but that is of course not a solution.

The test passes on 2.3.0. It would be nice to get this fixed in the 2.2.x releases, too. Can someone point me to the issue that fixed it? I would like to see the solution in code.

it("should correctly perform a join after a cross") {
  val df1 = sparkSession.createDataFrame(Seq(Tuple1(0L)))
    .toDF("a")

  val df2 = sparkSession.createDataFrame(Seq(Tuple1(1L)))
    .toDF("b")

  val df3 = sparkSession.createDataFrame(Seq(Tuple1(0L)))
    .toDF("c")

  // First table below: the cross join on its own
  val cross = df1.crossJoin(df2)
  cross.show()

  // Second table below: join on the first column of the crossed DataFrame
  val joined = cross
    .join(df3, cross.col("a") === df3.col("c"))

  joined.show()

  // Third table below: the same join followed by a select
  val selected = joined.select("*")
  selected.show()
}

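prints (the second table, from joined.show(), shows b = 0 and c = 1 instead of the expected b = 1 and c = 0):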

+---+---+
|  a|  b|
+---+---+
|  0|  1|
+---+---+

+---+---+---+
|  a|  b|  c|
+---+---+---+
|  0|  0|  1|
+---+---+---+

+---+---+---+
|  a|  b|  c|
+---+---+---+
|  0|  1|  0|
+---+---+---+
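
To check the behaviour without reading the tables by eye, the same steps can be turned into a standalone program with assertions. This is only a sketch under assumptions: the object name JoinAfterCrossCheck, the local[1] master, and the app name are hypothetical and not part of the report, and it is not confirmed that collect() shows the same corruption as show() on the affected versions.

```scala
import org.apache.spark.sql.{Row, SparkSession}

// Hypothetical standalone wrapper around the test from the report.
object JoinAfterCrossCheck {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session stands in for the suite's `sparkSession`.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("SPARK-23855-check")
      .getOrCreate()

    try {
      val df1 = spark.createDataFrame(Seq(Tuple1(0L))).toDF("a")
      val df2 = spark.createDataFrame(Seq(Tuple1(1L))).toDF("b")
      val df3 = spark.createDataFrame(Seq(Tuple1(0L))).toDF("c")

      val cross = df1.crossJoin(df2)
      val joined = cross.join(df3, cross.col("a") === df3.col("c"))
      val selected = joined.select("*")

      // Join semantics imply a single row with a = 0, b = 1, c = 0.
      val expected = Seq(Row(0L, 1L, 0L))

      val joinedRows = joined.collect().toSeq
      val selectedRows = selected.collect().toSeq

      println(s"joined:   $joinedRows")
      println(s"selected: $selectedRows")

      // The report says the selected result is correct and the joined one is not.
      assert(selectedRows == expected, s"select returned $selectedRows")
      assert(joinedRows == expected, s"join returned $joinedRows")
    } finally {
      spark.stop()
    }
  }
}
```

If the report carries over to collect(), the second assertion would be the one to fail on 2.2.0 and 2.2.1, while both should hold on 2.3.0.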

People

Assignee: Unassigned

Reporter: Martin Junghanns

Votes: 1

Watchers: 4

Dates

Created: 03/Apr/18 09:14

Updated: 21/May/19 04:14

Resolved: 21/May/19 04:14