[SPARK-35652] Different Behaviour join vs joinWith in self joining - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: 3.1.2
Fix Version/s: 3.2.0, 3.1.3, 3.0.4
Component/s: SQL
Labels: None
Environment: Spark 3.1.2, Scala 2.12

Description

In a self-join, `joinWith` appears to perform a Cartesian join, whereas `join` with the same condition performs an inner join.

Snippet:

scala> val df = spark.range(0,5) 
df: org.apache.spark.sql.Dataset[Long] = [id: bigint]

scala> df.show 
+---+
| id|
+---+
|  0|
|  1|
|  2|
|  3|
|  4|
+---+

scala> df.join(df, df("id") === df("id")).count 
21/06/04 16:01:39 WARN Column: Constructing trivially true equals predicate, 'id#1649L = id#1649L'. Perhaps you need to use aliases. 
res21: Long = 5

scala> df.joinWith(df, df("id") === df("id")).count
21/06/04 16:01:47 WARN Column: Constructing trivially true equals predicate, 'id#1649L = id#1649L'. Perhaps you need to use aliases. 
res22: Long = 25

According to the comment in the source code, joinWith is expected to handle this case, right?

def joinWith[U](other: Dataset[U], condition: Column, joinType: String): Dataset[(T, U)] = {
    // Creates a Join node and resolve it first, to get join condition resolved, self-join resolved,
    // etc.

I find it odd that join and joinWith do not behave the same way here.
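
The warning in the snippet suggests using aliases. A minimal sketch of that approach follows, assuming a local SparkSession; the object name `SelfJoinWithAliases`, the helper `counts`, and the aliases `l` and `r` are illustrative and not part of the original report. With each side aliased, the condition no longer compares a column to itself, so one would expect `join` and `joinWith` to return the same count.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Hypothetical helper object, for illustration only.
object SelfJoinWithAliases {

  // Returns (join count, joinWith count) for an aliased self-join on `id`.
  def counts(spark: SparkSession): (Long, Long) = {
    val df = spark.range(0, 5)

    // Alias each side so the two `id` columns can be told apart.
    val left = df.as("l")
    val right = df.as("r")
    val cond = col("l.id") === col("r.id")

    (left.join(right, cond).count(), left.joinWith(right, cond).count())
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session; in spark-shell, `spark` already exists.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("self-join-aliases")
      .getOrCreate()

    val (joinCount, joinWithCount) = counts(spark)
    println(s"join: $joinCount, joinWith: $joinWithCount")

    spark.stop()
  }
}
```

This sidesteps the ambiguity rather than fixing it: the un-aliased `df("id") === df("id")` case from the snippet is still the behaviour being reported.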

Issue Links

links to

[Github] Pull Request #32863 (dgd-contributor)

[Github] Pull Request #32899 (dgd-contributor)

People

Assignee: dgd_contributor

Reporter: Wassim Almaaoui

Votes: 0

Watchers: 4

Dates

Created: 04/Jun/21 14:19

Updated: 19/Jun/21 00:52

Resolved: 11/Jun/21 12:42