Details
-
New Feature
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
None
-
None
Description
Building on the pluggability in SPARK-2044, a sort-based shuffle implementation that takes advantage of an Ordering for keys (or just sorts by hashcode for keys that don't have it) would likely improve performance and memory usage in very large shuffles. Our current hash-based shuffle needs an open file for each reduce task, which can fill up a lot of memory for compression buffers and cause inefficient IO. This would avoid both of those issues.
Attachments
Attachments
Issue Links
- depends upon
-
SPARK-2044 Pluggable interface for shuffles
- Resolved
- is depended upon by
-
SPARK-2213 Sort Merge Join
- Resolved
- relates to
-
SPARK-3655 Support sorting of values in addition to keys (i.e. secondary sort)
- Resolved
- links to