Description
In many cases users know ahead of time which columns they will join or aggregate on. Ideally they should be able to leverage this information and pre-shuffle the data so that subsequent queries do not require a shuffle. Hive supports this functionality by allowing the user to define buckets, which are hash partitions of the data based on some key.
- Allow the user to specify a set of columns when caching or writing out data
- Allow the user to specify some parallelism
- Shuffle the data when writing / caching such that it is distributed by these columns
- When planning/executing a query, use this distribution to avoid another shuffle when reading, assuming the join or aggregation is compatible with the columns specified
- Should work with existing save modes: append, overwrite, etc.
- Should work at least with all Hadoop FS data sources
- Should work with any data source when caching
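The mechanism the points above describe can be sketched in plain Python (this is illustrative pseudologic, not Spark code; the function and variable names are hypothetical). If both sides of a join were written with the same hash function and the same bucket count, matching keys are guaranteed to land in the same bucket, so each bucket pair can be joined independently with no further shuffle:

```python
NUM_BUCKETS = 4  # the user-specified parallelism (assumed name)

def bucket_of(key, num_buckets=NUM_BUCKETS):
    # Hash-partition a record by its key; keys here are ints, whose
    # Python hash is stable across runs.
    return hash(key) % num_buckets

def write_bucketed(rows, key_fn, num_buckets=NUM_BUCKETS):
    # Simulates the one-time shuffle performed at write/cache time.
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[bucket_of(key_fn(row), num_buckets)].append(row)
    return buckets

def bucketed_join(left_buckets, right_buckets, left_key, right_key):
    # Join two compatibly bucketed datasets bucket-by-bucket:
    # no re-shuffle is needed because matching keys share a bucket index.
    out = []
    for lb, rb in zip(left_buckets, right_buckets):
        index = {}
        for r in rb:
            index.setdefault(right_key(r), []).append(r)
        for l in lb:
            for r in index.get(left_key(l), []):
                out.append((l, r))
    return out

users = [(1, "ann"), (2, "bob"), (3, "cat")]
orders = [(1, "book"), (1, "pen"), (3, "mug")]

u_buckets = write_bucketed(users, key_fn=lambda r: r[0])
o_buckets = write_bucketed(orders, key_fn=lambda r: r[0])
joined = bucketed_join(u_buckets, o_buckets, lambda r: r[0], lambda r: r[0])
```

In the DataFrameWriter API that eventually shipped via the linked SPARK-12538, this corresponds roughly to `df.write.bucketBy(n, "col").saveAsTable(...)`, after which joins and aggregations on `col` can skip the exchange.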
Attachments
Issue Links
- duplicates:
  - SPARK-12538 bucketed table support (Resolved)
- is duplicated by:
  - SPARK-11512 Bucket Join (Closed)
  - SPARK-5292 optimize join for table that are already sharded/support for hive bucket (Closed)