Description
Dataset.join(right: Dataset[_], joinExprs: Column, joinType: String) has special logic for resolving trivially-true predicates to both sides of a self-join. It currently handles regular equality (EqualTo) but not null-safe equality (EqualNullSafe); the code should be updated to handle null-safe equality as well.
PySpark example:
df = spark.range(10)
df.join(df, 'id').collect()  # This works.
df.join(df, df['id'] == df['id']).collect()  # This works.
df.join(df, df['id'].eqNullSafe(df['id'])).collect()  # This fails!!!

# This is a workaround that works.
df2 = df.withColumn('id', F.col('id'))
df.join(df2, df['id'].eqNullSafe(df2['id'])).collect()
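For context, null-safe equality (eqNullSafe, SQL's <=> operator) differs from regular equality only when NULLs are involved, which is why it deserves the same trivially-true treatment. A minimal sketch of the two semantics in plain Python, using None to stand in for SQL NULL (the helper names here are hypothetical, for illustration only):

```python
def eq_null_safe(a, b):
    # Null-safe equality (SQL's <=>): two NULLs compare equal,
    # a NULL and a non-NULL compare unequal, otherwise ordinary equality.
    if a is None and b is None:
        return True
    if a is None or b is None:
        return False
    return a == b

def regular_eq(a, b):
    # Regular SQL equality: any comparison involving NULL yields NULL
    # (modeled as None), which a join treats as a non-match.
    if a is None or b is None:
        return None
    return a == b
```

So for a column with no NULLs, `a <=> a` and `a == a` are both trivially true, and the join resolution logic should rewrite both the same way.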
The relevant code in Dataset.join should look like this:
// Otherwise, find the trivially true predicates and automatically resolves them to both sides.
// By the time we get here, since we have already run analysis, all attributes should've been
// resolved and become AttributeReference.
val cond = plan.condition.map { _.transform {
  case catalyst.expressions.EqualTo(a: AttributeReference, b: AttributeReference)
      if a.sameRef(b) =>
    catalyst.expressions.EqualTo(
      withPlan(plan.left).resolve(a.name),
      withPlan(plan.right).resolve(b.name))
  // This case is new!!!
  case catalyst.expressions.EqualNullSafe(a: AttributeReference, b: AttributeReference)
      if a.sameRef(b) =>
    catalyst.expressions.EqualNullSafe(
      withPlan(plan.left).resolve(a.name),
      withPlan(plan.right).resolve(b.name))
}}