Spark / SPARK-12394

Support writing out pre-hash-partitioned data and exploit that in join optimizations to avoid shuffle (i.e. bucketing in Hive)


Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.0.0
    • Component/s: SQL
    • Labels: None

    Description

  In many cases users know ahead of time the columns that they will be joining or aggregating on. Ideally, they should be able to leverage this information and pre-shuffle the data so that subsequent queries do not require a shuffle. Hive supports this functionality by allowing the user to define buckets, which hash-partition the data based on some key.

      • Allow the user to specify a set of columns when caching or writing out data
      • Allow the user to specify the parallelism
      • Shuffle the data when writing/caching so that it is distributed by these columns
      • When planning/executing a query, use this distribution to avoid another shuffle when reading, assuming the join or aggregation is compatible with the specified columns
      • Should work with existing save modes: append, overwrite, etc.
      • Should work at least with all Hadoop FS data sources
      • Should work with any data source when caching
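The workflow described above can be sketched with the `DataFrameWriter` bucketing API that shipped in Spark 2.0 (`bucketBy`/`sortBy`/`saveAsTable`); the table and column names here are illustrative only.

```scala
import org.apache.spark.sql.SparkSession

object BucketedWriteExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("bucketing-sketch")
      .enableHiveSupport() // bucketed tables are persisted via the metastore
      .getOrCreate()

    import spark.implicits._

    val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("key", "value")

    // Pre-shuffle on "key" into 8 buckets at write time; a later join or
    // aggregation on "key" can then avoid a shuffle on the read side.
    df.write
      .bucketBy(8, "key")
      .sortBy("key")
      .mode("overwrite")          // works with the existing save modes
      .saveAsTable("bucketed_events")

    spark.stop()
  }
}
```

Note that bucketed output must go through `saveAsTable` rather than a plain `save`, since the bucketing metadata lives in the table catalog.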

          People

            Assignee: Nong Li (nongli)
            Reporter: Reynold Xin (rxin)
            Votes: 1
            Watchers: 21
