Description
RDD.partitionBy fails with an OOM in the PySpark daemon process when given a relatively large dataset. The use of BatchedSerializer(UNLIMITED_BATCH_SIZE) looks suspect; most other RDD methods use self._jrdd_deserializer.
y = x.keyBy(...)
z = y.partitionBy(512)   # fails
z = y.repartition(512)   # succeeds
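
For reference, a minimal end-to-end reproduction sketch, assuming a local PySpark installation. The dataset size, key function, and app name are arbitrary stand-ins; the exact point at which the daemon OOMs will depend on available memory.

from pyspark import SparkContext

sc = SparkContext("local[4]", "partitionBy-oom-repro")

# Build a moderately large keyed RDD; 10 million rows is an arbitrary
# stand-in for "relatively large".
x = sc.parallelize(range(10000000))
y = x.keyBy(lambda i: i % 1000)

z = y.repartition(512)    # succeeds
z = y.partitionBy(512)    # fails: OOM in the PySpark daemon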
Issue Links
- relates to: SPARK-2538 External aggregation in Python (Resolved)