Details
Description
It seems that a Spark self join performs a cartesian join when using `joinWith` but a proper inner join when using `join`.
Snippet:
scala> val df = spark.range(0,5)
df: org.apache.spark.sql.Dataset[Long] = [id: bigint]

scala> df.show
+---+
| id|
+---+
|  0|
|  1|
|  2|
|  3|
|  4|
+---+

scala> df.join(df, df("id") === df("id")).count
21/06/04 16:01:39 WARN Column: Constructing trivially true equals predicate, 'id#1649L = id#1649L'. Perhaps you need to use aliases.
res21: Long = 5

scala> df.joinWith(df, df("id") === df("id")).count
21/06/04 16:01:47 WARN Column: Constructing trivially true equals predicate, 'id#1649L = id#1649L'. Perhaps you need to use aliases.
res22: Long = 25
According to this comment in the source code, `joinWith` is expected to handle the self-join case, right?
def joinWith[U](other: Dataset[U], condition: Column, joinType: String): Dataset[(T, U)] = {
  // Creates a Join node and resolve it first, to get join condition resolved, self-join resolved,
  // etc.
I find it surprising that `join` and `joinWith` do not have the same behaviour here.
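For what it's worth, the workaround hinted at by the warning ("Perhaps you need to use aliases") does avoid the cartesian result. A minimal sketch, assuming a local SparkSession is available (the `l`/`r` alias names are arbitrary):

```scala
import org.apache.spark.sql.SparkSession

// Sketch of the alias workaround; assumes Spark is on the classpath.
val spark = SparkSession.builder().master("local[*]").appName("selfJoinWith").getOrCreate()
import spark.implicits._

val df = spark.range(0, 5)

// Alias each side so the join condition references two distinct attribute sets
// instead of a trivially true `id#X = id#X` predicate.
val left  = df.as("l")
val right = df.as("r")

val joined = left.joinWith(right, $"l.id" === $"r.id")
println(joined.count()) // 5 rows with aliases, versus 25 without
```

This only sidesteps the issue, though; the question remains why `joinWith` does not resolve the self join the way `join` does.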