Spark / SPARK-4550

In sort-based shuffle, store map outputs in serialized form

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.2.0
    • Fix Version/s: 1.4.0
    • Component/s: Shuffle, Spark Core
    • Labels:
      None

      Description

      One drawback of sort-based shuffle compared to hash-based shuffle is that it ends up storing many more Java objects in memory. If Spark could store map outputs in serialized form, it could

      • spill less often because the serialized form is more compact
      • reduce GC pressure

      This will only work when the serialized representations of objects are independent of each other and occupy contiguous segments of memory. For example, when Kryo reference tracking is left on, objects may contain pointers to objects farther back in the stream, which means the sort can't relocate objects without corrupting them.
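
      As an illustration (not part of the issue itself): per-object independence is exactly what Kryo reference tracking trades away, and it can be turned off either on a raw Kryo instance or through Spark's existing spark.kryo.referenceTracking setting. A minimal sketch:

          import com.esotericsoftware.kryo.Kryo
          import org.apache.spark.SparkConf

          // Disable reference tracking directly on a Kryo instance: each object
          // is then written standalone, so its bytes can be relocated safely.
          val kryo = new Kryo()
          kryo.setReferences(false)

          // ...or through Spark's existing configuration knob.
          val conf = new SparkConf().set("spark.kryo.referenceTracking", "false")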

      Attachments

      1. kryo-flush-benchmark.scala
        2 kB
        Sandy Ryza
      2. SPARK-4550-design-v1.pdf
        145 kB
        Sandy Ryza


          Activity

          pwendell Patrick Wendell added a comment -

          Not an expert on the internals of this component, but do we need a way of ordering/comparing serialized objects for this to work?

          sandyr Sandy Ryza added a comment -

          We don't, though it would allow us to be much more efficient in certain situations.

          The way sort-based shuffle works right now, the map side only sorts by the partition, so we can store this number alongside the serialized record and not need to compare keys at all.

          SPARK-2926 proposes sorting by keys on the map side. For that, we'd need to deserialize keys before comparing them. There might be situations where this is slower than not serializing them in the first place. But even in those situations, we'd get more reliability by stressing GC less. It would probably be good to define raw comparators for common raw-comparable key types like ints and strings.
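
          A toy sketch (not the actual patch) of the partition-only sort described above: the partition id is packed into the upper bits of a Long next to an offset into the buffer holding the serialized bytes, so the sort never touches keys at all.

          // Pack a non-negative partition id and a buffer offset into one Long.
          def pack(partitionId: Int, offset: Int): Long =
            (partitionId.toLong << 32) | (offset & 0xffffffffL)

          def partitionOf(packed: Long): Int = (packed >>> 32).toInt
          def offsetOf(packed: Long): Int = packed.toInt

          // Because the partition id occupies the high bits, a plain Long sort
          // orders the pointers by partition without deserializing anything.
          val pointers = Array(pack(2, 0), pack(0, 64), pack(1, 128))
          java.util.Arrays.sort(pointers)
          assert(partitionOf(pointers(0)) == 0 && offsetOf(pointers(0)) == 64)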

          sandyr Sandy Ryza added a comment -

          Just posted a design doc. Would love to get feedback from Aaron Davidson, Matei Zaharia, and Saisai Shao.

          pwendell Patrick Wendell added a comment -

          Yeah, this is a good idea. I don't see why we don't serialize these immediately.

          pwendell Patrick Wendell added a comment -

          The doc alludes to having to (at some point) deal with comparing serialized objects. In the future, one approach would be to restrict this to SchemaRDDs, where we can have more control over the serialized format. This is effectively what Flink and other systems do (they basically only have SchemaRDDs).
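
          A toy illustration (not Spark or Flink code) of why a controlled serialized format enables this: if ints are encoded big-endian with the sign bit flipped, unsigned lexicographic byte comparison matches numeric order, so records can be compared without deserializing.

          import java.nio.ByteBuffer

          // Encode an Int so byte-wise unsigned comparison agrees with numeric order:
          // flip the sign bit, then write big-endian.
          def encode(i: Int): Array[Byte] =
            ByteBuffer.allocate(4).putInt(i ^ Int.MinValue).array()

          // Compare two encoded keys byte by byte, never deserializing them.
          def rawCompare(a: Array[Byte], b: Array[Byte]): Int = {
            var i = 0
            while (i < a.length && i < b.length) {
              val c = (a(i) & 0xff) - (b(i) & 0xff)
              if (c != 0) return c
              i += 1
            }
            a.length - b.length
          }

          assert(rawCompare(encode(-5), encode(3)) < 0)
          assert(rawCompare(encode(42), encode(42)) == 0)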

          sandyr Sandy Ryza added a comment -

          WIP branch: https://github.com/sryza/spark/tree/sandy-spark-4550
          rxin Reynold Xin added a comment -

          Sandy,

          The proposal seems to assume that objects can be individually serialized efficiently. That is not the case with Kryo, and I'm not even sure if it is safe to do that in Java. Do you have any thoughts on this?

          sandyr Sandy Ryza added a comment - edited

          I had heard rumors to that effect, so I ran some experiments and didn't find that to be the case:

          import org.apache.spark.serializer.KryoSerializer
          import org.apache.spark.SparkConf
          import java.io.ByteArrayOutputStream
          import java.nio.ByteBuffer
          
          val ser1 = new KryoSerializer(new SparkConf)
          
          // Serialize a sequence of objects into one byte array through a single stream.
          def serialize(objs: Array[AnyRef], ser: KryoSerializer): Array[Byte] = {
            val instance = ser.newInstance
            val baos = new ByteArrayOutputStream()
            val stream = instance.serializeStream(baos)
            objs.foreach(obj => stream.writeObject(obj))
            stream.close()
            baos.toByteArray
          }
          
          // Write the same array object twice...
          val inner = (0 until 100000).toArray
          val bytes1 = serialize(Array((1, inner), (2, inner)), ser1)
          
          // ...and two distinct-but-equal arrays, to see whether references are tracked.
          val inner1 = (0 until 100000).toArray
          val inner2 = (0 until 100000).toArray
          val bytes2 = serialize(Array((1, inner1), (2, inner2)), ser1)
          
          // Slice out the second record's bytes and deserialize them in isolation.
          val secondHalf = new Array[Byte](bytes1.size / 2)
          System.arraycopy(bytes1, bytes1.size / 2, secondHalf, 0, bytes1.size / 2)
          
          ser1.newInstance.deserialize[AnyRef](ByteBuffer.wrap(secondHalf))
          

          A couple observations:

          • "bytes1" ends up the same size as "bytes2", implying that "inner" is not being reference-tracked between the two writeObject calls
          • The last line is able to successfully reproduce the second object, implying that there's no information written at the beginning of the stream needed to deserialize objects later down.

          Are there cases or Kryo versions I'm not thinking about?

          sandyr Sandy Ryza added a comment -

          I also just tried this out using an object that's not registered with Kryo (java.awt.Color) instead of an Int array and observed things work fine as well.
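
          A sketch of that variant, reusing the serialize helper and ser1 from the earlier comment (the exact construction is illustrative):

          // Same experiment with a class that isn't registered with Kryo.
          val color = new java.awt.Color(255, 0, 0)
          val colorBytes = serialize(Array((1, color), (2, color)), ser1)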

          sandyr Sandy Ryza added a comment - edited

          I got a working prototype and benchmarked the ExternalSorter changes on my laptop.

          Each run inserts a bunch of records, each an (Int, (10-character String, Int)) tuple, into an ExternalSorter and then calls writePartitionedFile. The reported memory size is the sum of the shuffle bytes spilled (mem) metric and the remaining size of the collection after insertion completes. Results are averaged over three runs.

          Keep in mind that the primary goal here is to reduce GC pressure, so any speed improvements are icing.

          Number of Records | Storing as Serialized | Memory Size (bytes) | Number of Spills | Insert Time (ms) | Write Time (ms) | Total Time (ms)
          1 million         | false                 | 194923217           | 0                | 1123             | 3442            | 4566
          1 million         | true                  | 48694072            | 0                | 1315             | 2652            | 3967
          10 million        | false                 | 2050514159          | 3                | 26723            | 17418           | 44141
          10 million        | true                  | 613614392           | 1                | 16501            | 17151           | 33652
          50 million        | false                 | 10166122563         | 17               | 101831           | 89960           | 191791
          50 million        | true                  | 3067937592          | 5                | 76801            | 78361           | 155161
          apachespark Apache Spark added a comment -

          User 'sryza' has created a pull request for this issue:
          https://github.com/apache/spark/pull/4450

          sandyr Sandy Ryza added a comment -

          I spoke briefly with Reynold about this offline, and he pointed out that, with the patch, we now flush the Kryo serialization stream after every object we write. I put together a micro-benchmark to stress this that writes a bunch of small records to a Kryo serialization stream with and without flushing:

          runs without flush: (count: 30, mean: 226.400000, stdev: 3.929377, max: 241.000000, min: 222.000000)
          runs with flush: (count: 30, mean: 226.300000, stdev: 2.084067, max: 234.000000, min: 224.000000)

          There doesn't appear to be a significant difference. The benchmark code is attached.
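
          The attachment has the actual benchmark; the sketch below just shows the shape of the idea (record type and count here are illustrative, not the attachment's):

          import java.io.ByteArrayOutputStream
          import org.apache.spark.SparkConf
          import org.apache.spark.serializer.KryoSerializer

          // Time writing many small records, optionally flushing after each one.
          def timeWrites(flushEach: Boolean, records: Int = 1000000): Long = {
            val instance = new KryoSerializer(new SparkConf).newInstance
            val stream = instance.serializeStream(new ByteArrayOutputStream())
            val start = System.currentTimeMillis()
            var i = 0
            while (i < records) {
              stream.writeObject((i, "aaaaaaaaaa"))
              if (flushEach) stream.flush()
              i += 1
            }
            stream.close()
            System.currentTimeMillis() - start
          }

          println(s"no flush: ${timeWrites(flushEach = false)} ms")
          println(s"flush:    ${timeWrites(flushEach = true)} ms")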

          rxin Reynold Xin added a comment -

          Sandy Ryza, can you investigate what's happening with Java serialization? Would be great to get that working too.

          sandyr Sandy Ryza added a comment -

          Java serialization appears to write out the full class name the first time an object is written and then refer to it by an identifier afterwards:

          scala> import java.io.{ByteArrayOutputStream, ObjectOutputStream}
          
          scala> val baos = new ByteArrayOutputStream()
          scala> val oos = new ObjectOutputStream(baos)
          scala> oos.writeObject(new java.util.Date())
          scala> oos.flush()
          
          scala> baos.toString
          res8: String = ��??sr??java.util.Datehj�?KYt????xpw????LY6: x 
          scala> baos.toByteArray.length
          res9: Int = 46
          
          scala> oos.writeObject(new java.util.Date())
          scala> oos.flush()
          
          scala> baos.toString
          res14: String = ��??sr??java.util.Datehj�?KYt????xpw????LY6: xsq?~??w????LY6�Dx 
          scala> baos.toByteArray.length
          res13: Int = 63
          
          scala> oos.writeObject(new java.util.Date())
          scala> oos.flush()
          
          scala> baos.toString
          res17: String = ��??sr??java.util.Datehj�?KYt????xpw????LY6: xsq?~??w????LY6�Dxsq?~??w????LY8?�x 
          scala> baos.toByteArray.length
          res18: Int = 80
          

          There might be some fancy way to listen for the class name being written out and relocate that segment to the front of the stream. However, this seems fairly involved and bug-prone; my opinion is that it's not worth it, given that Java serialization is already a severely performance-impaired option. Another option, of course, would be to write the class name in front of every record, but this would bloat the serialized representation considerably.
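
          Incidentally, ObjectOutputStream.reset() shows what that second option would cost: it clears the stream's handle table, so the next write re-sends the full class descriptor. Continuing the session above (exact byte counts will vary):

          scala> oos.reset()   // handle table cleared; descriptors get re-sent
          
          scala> oos.writeObject(new java.util.Date())
          scala> oos.flush()
          
          scala> baos.toByteArray.length  // grows by roughly the first write's size, not ~17 bytes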

          apachespark Apache Spark added a comment -

          User 'sryza' has created a pull request for this issue:
          https://github.com/apache/spark/pull/5916


            People

            • Assignee:
              sandyr Sandy Ryza
              Reporter:
              sandyr Sandy Ryza
             • Votes:
               1
             • Watchers:
               19
