[SPARK-1065] PySpark runs out of memory with large broadcast variables - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 0.7.3, 0.8.1, 0.9.0
Fix Version/s: 1.1.0
Component/s: PySpark
Labels: None

Description

PySpark's driver components may run out of memory when broadcasting large variables (say 1 gigabyte).

Because PySpark's broadcast is implemented on top of Java Spark's broadcast by broadcasting a pickled Python object as a byte array, we may be retaining multiple copies of the large object: a pickled copy in the JVM and a deserialized copy in the Python driver.
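The copies described above can be illustrated with the standard library alone. This is a sketch under assumptions, not PySpark's code: `value` is a small hypothetical stand-in for a large broadcast variable, and `jvm_side_copy` is a `bytearray` that only plays the role of the byte array held in the JVM.

```python
import pickle

# Hypothetical stand-in for a large broadcast value (kept small here).
value = list(range(100_000))

# Step 1: the Python object is pickled into a byte string.
pickled = pickle.dumps(value)

# Step 2: the bytes are handed to the JVM. A bytearray stands in for
# that JVM-side copy (assumption: no real JVM is involved here).
jvm_side_copy = bytearray(pickled)

# Step 3: reading the broadcast value back yields a deserialized object,
# which is a separate copy from the original.
driver_copy = pickle.loads(pickled)

# Several representations of the same data are now alive at once.
print("pickled bytes:      ", len(pickled))
print("JVM-side stand-in:  ", len(jvm_side_copy))
print("deserialized items: ", len(driver_copy))
```

Each representation scales with the size of the original value, so a 1 gigabyte variable could plausibly occupy several gigabytes in total while all copies are retained.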

The problem could also be due to memory requirements during pickling.
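One way to check whether pickling itself is the bottleneck is to measure peak allocation around the `pickle.dumps` call. The following is a minimal sketch using the standard `tracemalloc` module; `value` is a hypothetical small payload, and the measurement covers only Python-side allocations, not anything held by the JVM.

```python
import pickle
import tracemalloc

# Hypothetical payload, built before tracing starts so that only the
# allocations made during pickling are counted.
value = [str(i) for i in range(200_000)]

tracemalloc.start()
pickled = pickle.dumps(value)
current_bytes, peak_bytes = tracemalloc.get_traced_memory()
tracemalloc.stop()

# The peak includes the output buffer plus any temporary working memory
# the pickler needed, on top of the original object.
print("pickled size:", len(pickled))
print("peak traced during pickling:", peak_bytes)
```

If the peak is well above the pickled size for a realistic payload, the extra memory needed during pickling would be a likely contributor to the failure.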

PySpark is also affected by broadcast variables not being garbage collected. Adding an unpersist() method to broadcast variables may fix this: https://github.com/apache/incubator-spark/pull/543.
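The effect an `unpersist()` method is meant to have can be sketched without Spark. `ToyBroadcast` and `Payload` below are hypothetical classes written only for this illustration; they are not PySpark's `Broadcast` implementation and say nothing about how the linked pull request works.

```python
import gc
import weakref


class Payload:
    """Hypothetical large value; a plain class so it can be weakly referenced."""

    def __init__(self, size_bytes):
        self.data = bytearray(size_bytes)


class ToyBroadcast:
    """Hypothetical holder that keeps its value alive until unpersist()."""

    def __init__(self, value):
        self._value = value

    @property
    def value(self):
        if self._value is None:
            raise ValueError("broadcast value was unpersisted")
        return self._value

    def unpersist(self):
        # Drop the only strong reference so the value can be reclaimed.
        self._value = None


payload = Payload(1024 * 1024)
tracker = weakref.ref(payload)
bcast = ToyBroadcast(payload)

# The caller drops its reference, but the broadcast still pins the value.
del payload
gc.collect()
alive_before = tracker() is not None

# After unpersist() nothing references the value any more.
bcast.unpersist()
gc.collect()
alive_after = tracker() is not None

print("alive before unpersist:", alive_before)
print("alive after unpersist: ", alive_after)
```

Without an explicit release like this, every broadcast created by a long-running driver would stay pinned for the life of the process.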

As a first step to fixing this, we should write a failing test to reproduce the error.
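A reproduction test might look roughly like the sketch below. It assumes `pyspark` is importable and that a local master is acceptable; the test class, the app name, `make_payload`, and the 1 GiB payload size are all hypothetical choices for illustration, not the test that was eventually added to Spark. The test is skipped when `pyspark` is not installed.

```python
import unittest

try:
    from pyspark import SparkContext
except ImportError:  # pyspark is optional for reading this sketch
    SparkContext = None


def make_payload(size_bytes):
    """Build a zero-filled payload of the requested size (hypothetical helper)."""
    return bytearray(size_bytes)


@unittest.skipIf(SparkContext is None, "pyspark is not installed")
class LargeBroadcastTest(unittest.TestCase):
    # Assumption: roughly the size mentioned in the description.
    PAYLOAD_BYTES = 1 << 30

    def test_large_broadcast_round_trip(self):
        sc = SparkContext("local[1]", "spark-1065-repro")
        try:
            payload = make_payload(self.PAYLOAD_BYTES)
            bcast = sc.broadcast(payload)
            # Read the broadcast value inside a task and report its length.
            lengths = (
                sc.parallelize([0], 1)
                .map(lambda _: len(bcast.value))
                .collect()
            )
            self.assertEqual(lengths, [self.PAYLOAD_BYTES])
        finally:
            sc.stop()
```

On an affected version the expectation is that this test fails with an out-of-memory error rather than an assertion failure; the exact failure mode would need to be confirmed by running it.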

This was discovered by sandy: "trouble with broadcast variables on pyspark".

Issue Links

links to

[Github] Pull Request #1912 (davies)

People

Assignee: Davies Liu

Reporter: Josh Rosen

Votes: 2

Watchers: 6

Dates

Created: 07/Feb/14 11:41

Updated: 17/Aug/14 00:00

Resolved: 17/Aug/14 00:00