  1. Spark
  2. SPARK-30957

Null-safe variant of Dataset.join(Dataset[_], Seq[String])

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 3.1.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None

    Description

      The Dataset.join(Dataset[_], Seq[String]) method is more convenient than Dataset.join(Dataset[_], joinExprs: Column) because it does not duplicate the join columns in the result DataFrame. It compares those columns with ===, so rows whose join columns are null never match. When the join columns need to be compared null-safely with <=>, the join condition becomes verbose and requires extra drop operations:

      df1.join(df2, df1("a") <=> df2("a") && df1("b") <=> df2("b")).drop(df2("a")).drop(df2("b")).show()
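
To make the difference concrete, here is a minimal sketch that compares the two join styles on data with a null key. The object name NullKeyJoinDemo, the column names v1 and v2, and the sample rows are assumptions made for illustration; only join, <=> and drop are existing Spark API.

```scala
import org.apache.spark.sql.SparkSession

object NullKeyJoinDemo {
  // Returns (row count of the Seq[String] join, row count of the <=> join).
  def rowCounts(spark: SparkSession): (Long, Long) = {
    import spark.implicits._

    // Hypothetical sample data: each side has one row with a null key.
    val df1 = Seq[(Option[Int], String)]((Some(1), "x"), (None, "y")).toDF("a", "v1")
    val df2 = Seq[(Option[Int], String)]((Some(1), "p"), (None, "q")).toDF("a", "v2")

    // Keys are compared with ===, so the null-keyed rows do not match.
    val plain = df1.join(df2, Seq("a"))

    // Keys are compared with <=>, so null matches null; the duplicate
    // key column from df2 has to be dropped by hand.
    val nullSafe = df1.join(df2, df1("a") <=> df2("a")).drop(df2("a"))

    (plain.count(), nullSafe.count())
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("null-key-join-demo")
      .getOrCreate()
    try println(rowCounts(spark)) // prints (1,2)
    finally spark.stop()
  }
}
```

The plain join keeps only the row with key 1, while the null-safe join also pairs the two null-keyed rows.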
      

      A more elegant alternative would be a null-safe join operation such as:

      df1.joinNullSafe(df2, joinColumns)
      

      Possible namings:

      • Dataset.joinNullSafe(Dataset[_], Seq[String])
      • Dataset.joinWithNulls(Dataset[_], Seq[String])
      • Dataset.join(Dataset[_], Seq[String], <=>)
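
Until such a method exists, the first naming could be approximated in user code with an extension method. The following is only a sketch under stated assumptions, not Spark API: the object NullSafeJoinSyntax and the class NullSafeJoinOps are hypothetical names, and the sketch covers inner joins only. It assumes the two Datasets have different lineage, because a self-join would make the column references ambiguous.

```scala
import org.apache.spark.sql.{Column, DataFrame, Dataset}

object NullSafeJoinSyntax {
  // Hypothetical extension; joinNullSafe is NOT part of the Spark API.
  implicit class NullSafeJoinOps[T](left: Dataset[T]) {
    def joinNullSafe(right: Dataset[_], usingColumns: Seq[String]): DataFrame = {
      require(usingColumns.nonEmpty, "at least one join column is required")

      // Build left(c1) <=> right(c1) && left(c2) <=> right(c2) && ...
      val condition: Column =
        usingColumns.map(c => left(c) <=> right(c)).reduce(_ && _)

      // Inner join, then drop the right-hand copy of every join column so
      // the result has the same shape as join(right, usingColumns).
      usingColumns.foldLeft(left.join(right, condition)) { (df, c) =>
        df.drop(right(c))
      }
    }
  }
}
```

With import NullSafeJoinSyntax._ in scope, the verbose expression from the description becomes df1.joinNullSafe(df2, Seq("a", "b")). Outer joins are deliberately left out: dropping the right-hand key columns would lose the key values of rows that exist only on the right side.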

      I am happy to provide a PR if this Dataset API extension is appreciated.

      This request was raised on the Apache Spark user and dev mailing lists by Marcelo Valle.

          People

            Assignee: Unassigned
            Reporter: Enrico Minack (enricomi)