Description
RDD.partitionBy fails with an OOM in the PySpark daemon process when given a relatively large dataset. The use of BatchedSerializer(UNLIMITED_BATCH_SIZE) looks suspect; most other RDD methods use self._jrdd_deserializer.
y = x.keyBy(...)
z = y.partitionBy(512)   # fails
z = y.repartition(512)   # succeeds
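
For reference, a minimal end-to-end reproduction sketch, assuming a local PySpark installation. The dataset size, key function, and app name are arbitrary stand-ins; the exact point at which the daemon OOMs will depend on available memory.

from pyspark import SparkContext

sc = SparkContext("local[4]", "partitionBy-oom-repro")

# Build a moderately large keyed RDD; 10 million rows is an arbitrary
# stand-in for "relatively large".
x = sc.parallelize(range(10000000))
y = x.keyBy(lambda i: i % 1000)

z = y.repartition(512)    # succeeds
z = y.partitionBy(512)    # fails: OOM in the PySpark daemon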
Issue Links
- relates to: SPARK-2538 External aggregation in Python (Resolved)