Details
- Type: Improvement
- Status: Resolved
- Priority: Major
- Resolution: Won't Fix
- Affects Version/s: 3.1.0
- None
- None
Description
The Dataset.join(Dataset, Seq[String]) method is more convenient than Dataset.join(Dataset, joinExprs: Column) because it does not duplicate the join columns (the Seq[String]) in the result DataFrame. Those columns are compared with ===. When the join columns need to be compared null-safely with <=>, the join condition becomes very verbose and requires extra drop operations:
df1.join(df2, df1("a") <=> df2("a") && df1("b") <=> df2("b")).drop(df2("a")).drop(df2("b")).show()
A more elegant alternative would be the following null-safe join operation:
df1.joinNullSafe(df2, joinColumns)
Possible namings:
- Dataset.joinNullSafe(Dataset[_], Seq[String])
- Dataset.joinWithNulls(Dataset[_], Seq[String])
- Dataset.join(Dataset[_], Seq[String], <=>)
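For illustration, the requested behaviour can be approximated in user code today with an implicit class. This is only a minimal sketch, not the proposed Spark implementation: the names NullSafeJoin and NullSafeJoinOps are hypothetical, joinNullSafe is borrowed from the first naming option above, and only inner joins are covered. It uses nothing beyond the existing Dataset.join(Dataset, Column), Column.<=> and Dataset.drop(Column) calls already shown in the verbose example.

```scala
import org.apache.spark.sql.{DataFrame, Dataset}

// Hypothetical helper object; not part of the Spark API.
object NullSafeJoin {

  implicit class NullSafeJoinOps[T](val left: Dataset[T]) extends AnyVal {

    // Inner join on `joinColumns` using null-safe equality (<=>),
    // keeping a single copy of each join column (the left one).
    def joinNullSafe(right: Dataset[_], joinColumns: Seq[String]): DataFrame = {
      require(joinColumns.nonEmpty, "at least one join column is required")

      // a <=> a && b <=> b && ...
      val condition = joinColumns
        .map(c => left(c) <=> right(c))
        .reduce(_ && _)

      // Drop the right-hand copy of every join column after joining.
      joinColumns.foldLeft(left.join(right, condition)) { (df, c) =>
        df.drop(right(c))
      }
    }
  }
}
```

With import NullSafeJoin._ in scope, df1.joinNullSafe(df2, Seq("a", "b")) should give the same result as the verbose expression above. Outer joins are deliberately left out of the sketch, because dropping the right-hand columns would discard keys that exist only on the right side.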
I am happy to provide a PR if this Dataset API extension is appreciated.
This request has been sent to the Apache Spark user and dev mailing lists by Marcelo Valle.