[SPARK-30957] Null-safe variant of Dataset.join(Dataset[_], Seq[String]) - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Won't Fix
Affects Version/s: 3.1.0
Fix Version/s: None
Component/s: SQL
Labels: None

Description

The Dataset.join(Dataset, Seq[String]) method is more convenient than Dataset.join(Dataset, joinExprs: Column) because it does not duplicate the join columns named in the Seq[String] in the resulting DataFrame. It compares those columns with ===. When the join columns need a null-safe comparison with <=> instead, the join condition becomes very verbose and requires extra drop operations:

df1.join(df2, df1("a") <=> df2("a") && df1("b") <=> df2("b")).drop(df2("a")).drop(df2("b")).show()

A null-safe join operation like the following would be more elegant:

df1.joinNullSafe(df2, joinColumns)

Possible namings:

Dataset.joinNullSafe(Dataset[_], Seq[String])
Dataset.joinWithNulls(Dataset[_], Seq[String])
Dataset.join(Dataset[_], Seq[String], <=>)
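
Until such a method exists, the requested behaviour can be approximated with a small user-side helper. The following is only a minimal sketch under stated assumptions: the object name NullSafeJoin, the function joinNullSafe and its joinType parameter are hypothetical and not part of the Spark API. It builds the <=> condition from the column names and then drops the right-hand copies of the join columns:

```scala
import org.apache.spark.sql.{DataFrame, Dataset}

// Hypothetical helper, NOT part of the Spark API: a user-side approximation
// of the proposed null-safe variant of Dataset.join(Dataset[_], Seq[String]).
object NullSafeJoin {

  def joinNullSafe(
      left: Dataset[_],
      right: Dataset[_],
      joinColumns: Seq[String],
      joinType: String = "inner"): DataFrame = {
    require(joinColumns.nonEmpty, "at least one join column is required")

    // Build left(c) <=> right(c) for every join column and AND them together.
    val condition = joinColumns
      .map(c => left(c) <=> right(c))
      .reduce(_ && _)

    // Join on the null-safe condition, then drop the right-hand copy of each
    // join column so that every join column appears only once in the result.
    joinColumns.foldLeft(left.join(right, condition, joinType)) {
      (df, c) => df.drop(right(c))
    }
  }
}
```

With this helper, the verbose example above shortens to NullSafeJoin.joinNullSafe(df1, df2, Seq("a", "b")).show(). Note that the sketch always keeps the left-hand join columns, so for right or full outer joins it does not reproduce the coalesced join columns that Dataset.join(Dataset[_], Seq[String], joinType) returns, and it assumes that left and right are distinct DataFrames so the column references stay unambiguous.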

I am happy to provide a PR if this Dataset API extension is appreciated.

This request has been sent to the Apache Spark user and dev mailing lists by Marcelo Valle.

People

Assignee: Unassigned

Reporter: Enrico Minack

Votes: 0

Watchers: 1

Dates

Created: 26/Feb/20 09:02

Updated: 12/Dec/22 18:10

Resolved: 03/Mar/20 04:24