Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-27785

Introduce .joinWith() overloads for typed inner joins of 3 or more tables

    XMLWordPrintableJSON

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.1.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels:
      None

      Description

      Today it's rather painful to do a typed dataset join of more than two tables: Dataset[A].joinWith(Dataset[B]) returns Dataset[(A, B)] so chaining on a third inner join requires users to specify a complicated join condition (referencing variables like _1 or _2 in the join condition, AFAIK), resulting a doubly-nested schema like Dataset[((A, B), C)]. Things become even more painful if you want to layer on a fourth join. Using .map() to flatten the data into Dataset[(A, B, C)] has a performance penalty, too.

      To simplify this use case, I propose to introduce a new set of overloads of .joinWith, supporting joins of N > 2 tables for N up to some reasonable number (say, 6). For example:

      Dataset[T].joinWith[T1, T2](
        ds1: Dataset[T1],
        ds2: Dataset[T2],
        condition: Column
      ): Dataset[(T, T1, T2)]
      
      Dataset[T].joinWith[T1, T2](
        ds1: Dataset[T1],
        ds2: Dataset[T2],
        ds3: Dataset[T3],
        condition: Column
      ): Dataset[(T, T1, T2, T3)]

      I propose to do this only for inner joins (consistent with the default join type for joinWith in case joins are not specified).

      I haven't though about this too much yet and am not committed to the API proposed above (it's just my initial idea), so I'm open to suggestions for alternative typed APIs for this.

        Attachments

          Activity

            People

            • Assignee:
              Unassigned
              Reporter:
              joshrosen Josh Rosen
            • Votes:
              1 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated: