Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-28166

Query optimization for symmetric difference / disjunctive union of Datasets

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Open
    • Minor
    • Resolution: Unresolved
    • 3.1.0
    • None
    • SQL
    • None

    Description

      The symmetric difference (a.k.a. disjunctive union) of two sets is their set union minus their set intersection: it returns tuples which are in only one of the sets and omits tuples which are present in both sets (see https://en.wikipedia.org/wiki/Symmetric_difference).

      With the Datasets API, we can express this as either

      a.union(b).except(a.intersect(b))

      or

      a.except(b).union(b.except(a))

      Spark currently plan this query with two joins. However, it may be more efficient to represent this as a full outer join followed by a filter and a distinct (and, depending on the number of duplicates, we might want to push additional distinct clauses beneath the join, but I think that's a separate optimization). It would cool if the optimizer could automatically perform this rewrite.

      This is a very low priority: I'm filing this ticket mostly for tracking / reference purposes (so searches for 'symmetric difference' turn up something useful in Spark's JIRA).

      Attachments

        Activity

          People

            Unassigned Unassigned
            joshrosen Josh Rosen
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated: