Description
In many cases users know ahead of time which columns they will join or aggregate on. Ideally they should be able to leverage this information and pre-shuffle the data so that subsequent queries do not require a shuffle. Hive supports this functionality by allowing the user to define buckets, which are hash partitions of the data based on some key.
- Allow the user to specify a set of columns when caching or writing out data
- Allow the user to specify some parallelism
- Shuffle the data when writing / caching such that it is distributed by these columns
- When planning/executing a query, use this distribution to avoid another shuffle when reading, assuming the join or aggregation is compatible with the columns specified
- Should work with existing save modes: append, overwrite, etc.
- Should work at least with all Hadoop FS data sources
- Should work with any data source when caching
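The mechanism the points above describe can be sketched in plain Python (this is illustrative pseudologic, not Spark code; the function and variable names are hypothetical). If both sides of a join were written with the same hash function and the same bucket count, matching keys are guaranteed to land in the same bucket, so each bucket pair can be joined independently with no further shuffle:

```python
NUM_BUCKETS = 4  # the user-specified parallelism (assumed name)

def bucket_of(key, num_buckets=NUM_BUCKETS):
    # Hash-partition a record by its key; keys here are ints, whose
    # Python hash is stable across runs.
    return hash(key) % num_buckets

def write_bucketed(rows, key_fn, num_buckets=NUM_BUCKETS):
    # Simulates the one-time shuffle performed at write/cache time.
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[bucket_of(key_fn(row), num_buckets)].append(row)
    return buckets

def bucketed_join(left_buckets, right_buckets, left_key, right_key):
    # Join two compatibly bucketed datasets bucket-by-bucket:
    # no re-shuffle is needed because matching keys share a bucket index.
    out = []
    for lb, rb in zip(left_buckets, right_buckets):
        index = {}
        for r in rb:
            index.setdefault(right_key(r), []).append(r)
        for l in lb:
            for r in index.get(left_key(l), []):
                out.append((l, r))
    return out

users = [(1, "ann"), (2, "bob"), (3, "cat")]
orders = [(1, "book"), (1, "pen"), (3, "mug")]

u_buckets = write_bucketed(users, key_fn=lambda r: r[0])
o_buckets = write_bucketed(orders, key_fn=lambda r: r[0])
joined = bucketed_join(u_buckets, o_buckets, lambda r: r[0], lambda r: r[0])
```

In the DataFrameWriter API that eventually shipped via the linked SPARK-12538, this corresponds roughly to `df.write.bucketBy(n, "col").saveAsTable(...)`, after which joins and aggregations on `col` can skip the exchange.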
Attachments
Issue Links
- duplicates:
  - SPARK-12538 bucketed table support (Resolved)
- is duplicated by:
  - SPARK-11512 Bucket Join (Closed)
  - SPARK-5292 optimize join for table that are already sharded/support for hive bucket (Closed)