Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-28166

Query optimization for symmetric difference / disjunctive union of Datasets

    XMLWordPrintableJSON

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 3.1.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels:
      None

      Description

      The symmetric difference (a.k.a. disjunctive union) of two sets is their set union minus their set intersection: it returns tuples which are in only one of the sets and omits tuples which are present in both sets (see https://en.wikipedia.org/wiki/Symmetric_difference).

      With the Datasets API, we can express this as either

      a.union(b).except(a.intersect(b))

      or

      a.except(b).union(b.except(a))

      Spark currently plan this query with two joins. However, it may be more efficient to represent this as a full outer join followed by a filter and a distinct (and, depending on the number of duplicates, we might want to push additional distinct clauses beneath the join, but I think that's a separate optimization). It would cool if the optimizer could automatically perform this rewrite.

      This is a very low priority: I'm filing this ticket mostly for tracking / reference purposes (so searches for 'symmetric difference' turn up something useful in Spark's JIRA).

        Attachments

          Activity

            People

            • Assignee:
              Unassigned
              Reporter:
              joshrosen Josh Rosen
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated: