Details
- Improvement
- Status: Resolved
- Critical
- Resolution: Fixed
- 1.2.0
- None
Description
One drawback of sort-based shuffle compared to hash-based shuffle is that it ends up storing many more Java objects in memory. If Spark could store map outputs in serialized form, it could:
- spill less often because the serialized form is more compact
- reduce GC pressure
This will only work when each object's serialized representation is independent of the others and occupies a contiguous segment of memory. For example, when Kryo reference tracking is left on, an object may contain pointers to objects farther back in the stream, so the sort can't relocate objects without corrupting them.
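To make the relocation requirement concrete, here is a minimal sketch in plain Java. It is not Spark's implementation: the class `SerializedSortSketch`, its methods, and the record layout (`[partitionId:int][length:int][payload]`) are all hypothetical, chosen only for illustration. It assumes each record is a self-contained, contiguous segment, so the sort can reorder records by copying whole segments without deserializing any payload.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; names and layout are assumptions, not Spark code.
final class SerializedSortSketch {

    // Serializes records as independent segments: [partitionId:int][length:int][payload].
    static byte[] serialize(int[] partitionIds, String[] values) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            for (int i = 0; i < partitionIds.length; i++) {
                byte[] payload = values[i].getBytes(StandardCharsets.UTF_8);
                out.writeInt(partitionIds[i]);
                out.writeInt(payload.length);
                out.write(payload);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Sorts records by partition id by moving whole segments.
    // Payload bytes are copied as-is and never deserialized.
    static byte[] sortByPartition(byte[] buffer) {
        ByteBuffer view = ByteBuffer.wrap(buffer);
        List<Integer> offsets = new ArrayList<>();
        int pos = 0;
        while (pos < buffer.length) {
            offsets.add(pos);
            int length = view.getInt(pos + 4);
            pos += 8 + length; // 8 bytes of header, then the payload
        }
        // List.sort is stable, so records within a partition keep their order.
        offsets.sort((a, b) -> Integer.compare(view.getInt(a), view.getInt(b)));

        ByteBuffer sorted = ByteBuffer.allocate(buffer.length);
        for (int off : offsets) {
            int length = view.getInt(off + 4);
            sorted.put(buffer, off, 8 + length);
        }
        return sorted.array();
    }

    // Decodes a buffer into "partitionId:value" strings, for inspection only.
    static List<String> decode(byte[] buffer) {
        ByteBuffer view = ByteBuffer.wrap(buffer);
        List<String> records = new ArrayList<>();
        while (view.hasRemaining()) {
            int partitionId = view.getInt();
            byte[] payload = new byte[view.getInt()];
            view.get(payload);
            records.add(partitionId + ":" + new String(payload, StandardCharsets.UTF_8));
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] buffer = serialize(new int[] {2, 0, 1}, new String[] {"c", "a", "b"});
        System.out.println(decode(sortByPartition(buffer))); // prints [0:a, 1:b, 2:c]
    }
}
```

The copy in `sortByPartition` is only safe because no segment refers to bytes outside itself. If a payload held a back-reference to an earlier position in the stream, as can happen with reference tracking enabled, moving the segment would leave that reference pointing at the wrong bytes.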
Attachments
Issue Links
- blocks: SPARK-6026 Eliminate the bypassMergeThreshold parameter and associated hash-ish shuffle within the Sort shuffle code (Resolved)
- is duplicated by: SPARK-2114 groupByKey and joins on raw data (Resolved)