[SPARK-27829] In Dataset.joinWith inner joins, don't nest data before shuffling - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.4.0
Fix Version/s: 3.0.0
Component/s: SQL
Labels: None

Description

To support outer joins with null top-level objects, SPARK-15441 modified Dataset.joinWith to project both inputs into single-column structs before the join.

For inner joins, however, this step is unnecessary and harms performance: nesting the data before the join increases the shuffled data size. As an optimization for inner joins only, we can move the nesting to after the join (effectively switching back to the pre-SPARK-15441 behavior).
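
To make the two plan shapes concrete, here is a minimal sketch that mimics them with the public DataFrame API. It is not the internal Dataset.joinWith implementation. The helper names `nestThenJoin` and `joinThenNest`, the sample data, and the `id` join key are assumptions made for illustration only.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, struct}

object JoinWithNestingSketch {
  // Hypothetical helper: wrap each side in a single-column struct first,
  // then join on a field inside the structs (nesting happens before the join).
  def nestThenJoin(left: DataFrame, right: DataFrame): DataFrame = {
    val l = left.select(struct(left.columns.map(c => col(c)): _*).as("_1"))
    val r = right.select(struct(right.columns.map(c => col(c)): _*).as("_2"))
    l.join(r, col("_1.id") === col("_2.id"), "inner")
  }

  // Hypothetical helper: join the flat rows first, then wrap each side's
  // columns in a struct (nesting happens after the join).
  def joinThenNest(left: DataFrame, right: DataFrame): DataFrame = {
    val l = left.as("l")
    val r = right.as("r")
    l.join(r, col("l.id") === col("r.id"), "inner")
      .select(
        struct(left.columns.map(c => col(s"l.$c")): _*).as("_1"),
        struct(right.columns.map(c => col(s"r.$c")): _*).as("_2"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("joinWith-nesting-sketch")
      // Disable broadcast joins so the tiny sample inputs are not broadcast.
      .config("spark.sql.autoBroadcastJoinThreshold", "-1")
      .getOrCreate()
    import spark.implicits._

    // Assumed sample data; both sides share an "id" column.
    val left = Seq((1, "a"), (2, "b")).toDF("id", "name")
    val right = Seq((1, 10.0), (3, 30.0)).toDF("id", "score")

    // Both variants produce rows of shape (_1: struct, _2: struct).
    nestThenJoin(left, right).show(truncate = false)
    joinThenNest(left, right).show(truncate = false)

    // Print the physical plans to compare where the struct projection sits.
    nestThenJoin(left, right).explain()
    joinThenNest(left, right).explain()

    spark.stop()
  }
}
```

Both helpers return the same rows for an inner join; they differ only in whether the struct projection sits before or after the join. The outer-join case is deliberately left out, since the description says the pre-join nesting is still needed there to represent null top-level objects.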

Issue Links

links to

GitHub Pull Request #24693

People

Assignee: Josh Rosen

Reporter: Josh Rosen

Votes: 0

Watchers: 2

Dates

Created: 24/May/19 03:31

Updated: 29/May/19 08:15

Resolved: 29/May/19 08:14