[SPARK-2114] groupByKey and joins on raw data - ASF JIRA

XML

Word

Printable

JSON

Details

Type: New Feature
Status: Resolved
Priority: Major
Resolution: Duplicate
Affects Version/s: 1.0.0
Fix Version/s: None
Component/s: Shuffle, Spark Core
Labels:
None

Description

For groupByKey and join transformations, Spark tasks on the reduce side deserialize every record into a Java object before calling any user function.

This causes all kinds of problems for garbage collection - when aggregating enough data, objects can escape the young gen and trigger full GCs down the line. Additionally, when records are spilled, they must be serialized and deserialized multiple times.

It would be helpful to allow aggregations on serialized data - using some sort of RawHasher interface that could implement hashCode and equals for serialized records. This would also require encoding record boundaries in the serialized format, which I'm not sure we currently do.

Attachments

Issue Links

duplicates

SPARK-4550 In sort-based shuffle, store map outputs in serialized form

Resolved

SPARK-2926 Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle

Resolved

is related to

SPARK-2044 Pluggable interface for shuffles

Resolved

Activity

People

Assignee:: Unassigned

Reporter:: Sandy Ryza

Votes:: 1 Vote for this issue

Watchers:: 12 Start watching this issue

Dates

Created:: 11/Jun/14 18:54

Updated:: 10/Mar/15 18:20

Resolved:: 10/Mar/15 18:20