Details
-
Improvement
-
Status: Open
-
Minor
-
Resolution: Unresolved
-
3.1.0
-
None
-
None
Description
The symmetric difference (a.k.a. disjunctive union) of two sets is their set union minus their set intersection: it returns tuples which are in only one of the sets and omits tuples which are present in both sets (see https://en.wikipedia.org/wiki/Symmetric_difference).
With the Datasets API, we can express this as either
a.union(b).except(a.intersect(b))
or
a.except(b).union(b.except(a))
Spark currently plan this query with two joins. However, it may be more efficient to represent this as a full outer join followed by a filter and a distinct (and, depending on the number of duplicates, we might want to push additional distinct clauses beneath the join, but I think that's a separate optimization). It would cool if the optimizer could automatically perform this rewrite.
This is a very low priority: I'm filing this ticket mostly for tracking / reference purposes (so searches for 'symmetric difference' turn up something useful in Spark's JIRA).