[SPARK-32461] Shuffled hash join improvement - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 3.1.0, 3.2.0
Fix Version/s: None
Component/s: SQL
Labels:
- release-notes

Description

Shuffled hash join avoids the sort that sort merge join requires. The advantage is most visible when joining large tables, where it saves CPU and, if an external sort happens, IO. On the latest master, shuffled hash join is disabled by default by the config "spark.sql.join.preferSortMergeJoin"=true, in favor of reducing the risk of OOM. However, shuffled hash join can be improved to a better state (validated in our internal fork). This Jira tracks the overall progress.
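
To make the trade-off concrete, here is a minimal pure-Python sketch of the two strategies. It is an illustration only, not Spark's implementation: the function names, the row-as-dict representation, and the partition count are all assumptions made for the example. The hash variant never sorts, but it holds one side of each partition in memory, which is where the OOM risk comes from.

```python
from collections import defaultdict


def shuffle(rows, key, num_partitions):
    """Route each row to a partition by the hash of its join key."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[hash(row[key]) % num_partitions].append(row)
    return partitions


def shuffled_hash_join(stream, build, key, num_partitions=4):
    """Inner join sketch: per partition, hash the build side and probe with the stream side."""
    out = []
    stream_parts = shuffle(stream, key, num_partitions)
    build_parts = shuffle(build, key, num_partitions)
    for stream_part, build_part in zip(stream_parts, build_parts):
        # The whole build partition is held in memory (hypothetical layout).
        table = defaultdict(list)
        for row in build_part:
            table[row[key]].append(row)
        # No sort is needed: each stream row is a single hash lookup.
        for row in stream_part:
            for match in table.get(row[key], ()):
                out.append({**match, **row})
    return out


def sort_merge_join(left, right, key):
    """Inner join sketch: sort both sides, then merge runs of equal keys."""
    left = sorted(left, key=lambda r: r[key])
    right = sorted(right, key=lambda r: r[key])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of right rows sharing this key.
            j_end = j
            while j_end < len(right) and right[j_end][key] == lk:
                j_end += 1
            # Pair every left row in the run with every right row in the run.
            while i < len(left) and left[i][key] == lk:
                for match in right[j:j_end]:
                    out.append({**match, **left[i]})
                i += 1
            j = j_end
    return out
```

Both functions return the same set of joined rows; only the output order and the resource profile (memory for the hash table versus CPU and possible spill for the sort) differ.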

Sub-Tasks

1.	Coalesce bucketed tables for shuffled hash join if applicable	Resolved	Cheng Su
2.	Preserve shuffled hash join build side partitioning	Resolved	Cheng Su
3.	Preserve hash join (BHJ and SHJ) stream side ordering	Resolved	Cheng Su
4.	Add handling for unique key in non-codegen hash join	Resolved	Cheng Su
5.	Add code-gen for shuffled hash join	Resolved	Cheng Su
6.	Support full outer join in shuffled hash join	Resolved	Cheng Su
7.	Code-gen for full outer shuffled hash join	Resolved	Cheng Su
8.	Fix the config value for shuffled hash join in test in-joins.sql	Resolved	Cheng Su
9.	Record metrics of extra BitSet/HashSet in full outer shuffled hash join	Resolved	Cheng Su
10.	Optimize BHJ/SHJ inner and semi join with empty hashed relation	Resolved	Cheng Su
11.	Exercise code-gen enable/disable code paths for SHJ in join test suites	Resolved	Cheng Su
12.	Ignore duplicated join keys when building relation for SEMI/ANTI shuffle hash join	Resolved	Cheng Su
13.	A dynamic join operator to improve the join reliability	Resolved	Unassigned
14.	Add hash probes metrics for shuffled hash join	In Progress	Unassigned
15.	Introduce sort-based fallback mechanism for shuffled hash join	In Progress	Unassigned
16.	Only codegen build side separately for shuffled hash join	Open	Unassigned
17.	Introduce hybrid join for sort merge join and shuffled hash join in AQE	Open	Unassigned
18.	Support left outer join build left or right outer join build right in shuffled hash join	Resolved	Szehon Ho
19.	Code-gen for build side outer shuffled hash join	Resolved	Szehon Ho
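
Sub-tasks 6 and 9 above concern full outer join and the extra BitSet/HashSet it records. The sketch below shows one way such a join can work in principle: build rows that were matched during probing are remembered, and the unmatched ones are emitted afterwards with nulls. This is an assumption inferred from the sub-task titles, not the code that was merged; the function name, the column-list parameters, and the use of a Python `set` in place of a bit set are hypothetical, and join keys are assumed to be non-null.

```python
def full_outer_hash_join(stream, build, key, stream_cols, build_cols):
    """Full outer join sketch for one partition (hypothetical, not Spark's code)."""
    # Hash the build side: join key -> indices of build rows with that key.
    table = {}
    for idx, row in enumerate(build):
        table.setdefault(row[key], []).append(idx)

    matched = set()  # stands in for the extra BitSet/HashSet of matched build rows
    null_build = {c: None for c in build_cols}
    null_stream = {c: None for c in stream_cols}
    out = []

    # Probe phase: emit matches, or the stream row padded with nulls.
    for row in stream:
        hits = table.get(row[key])
        if hits:
            for idx in hits:
                matched.add(idx)
                out.append({**build[idx], **row})
        else:
            out.append({**null_build, **row})

    # Final phase: emit build rows that no stream row matched.
    for idx, row in enumerate(build):
        if idx not in matched:
            out.append({**null_stream, **row})
    return out
```

The `matched` structure grows with the build side, which is presumably why its size is worth recording as a metric.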

People

Assignee: Unassigned
Reporter: Cheng Su
Votes: 0
Watchers: 14

Dates

Created: 27/Jul/20 20:10
Updated: 26/Apr/22 23:14