Description
Sort merge join on two datasets on the file system that have already been partitioned the same with the same number of partitions and sorted within each partition, and we don't need to sort it again while join with the sorted/partitioned keys
This functionality exists in
- Hive (hive.optimize.bucketmapjoin.sortedmerge)
- Pig (USING 'merge')
- MapReduce (CompositeInputFormat)
Attachments
Issue Links
- contains
-
SPARK-5292 optimize join for table that are already sharded/support for hive bucket
- Closed
- duplicates
-
SPARK-12394 Support writing out pre-hash-partitioned data and exploit that in join optimizations to avoid shuffle (i.e. bucketing in Hive)
- Resolved