Details
Description
It could be reproduced by:
for i in range(40):
sc.parallelize(range(5000), 10).flatMap(lambda i: range(10000)).collect()
It will fail after 2 or 3 jobs, and run totally successfully if I add
`gc.collect()` after each job.
We could call _detach() for the JavaList returned by collect
in Java, will send out a PR for this.
Reported by Michael and commented by Josh:
On Thu, Mar 5, 2015 at 2:39 PM, Josh Rosen <joshrosen@databricks.com> wrote:
> Based on Py4J's Memory Model page
> (http://py4j.sourceforge.net/advanced_topics.html#py4j-memory-model):
>
>> Because Java objects on the Python side are involved in a circular
>> reference (JavaObject and JavaMember reference each other), these objects
>> are not immediately garbage collected once the last reference to the object
>> is removed (but they are guaranteed to be eventually collected if the Python
>> garbage collector runs before the Python program exits).
>
>
>>
>> In doubt, users can always call the detach function on the Python gateway
>> to explicitly delete a reference on the Java side. A call to gc.collect()
>> also usually works.
>
>
> Maybe we should be manually calling detach() when the Python-side has
> finished consuming temporary objects from the JVM. Do you have a small
> workload / configuration that reproduces the OOM which we can use to test a
> fix? I don't think that I've seen this issue in the past, but this might be
> because we mistook Java OOMs as being caused by collecting too much data
> rather than due to memory leaks.
>
> On Thu, Mar 5, 2015 at 10:41 AM, Michael Nazario <mnazario@palantir.com>
> wrote:
>>
>> Hi Josh,
>>
>> I have a question about how PySpark does memory management in the Py4J
>> bridge between the Java driver and the Python driver. I was wondering if
>> there have been any memory problems in this system because the Python
>> garbage collector does not collect circular references immediately and Py4J
>> has circular references in each object it receives from Java.
>>
>> When I dug through the PySpark code, I seemed to find that most RDD
>> actions return by calling collect. In collect, you end up calling the Java
>> RDD collect and getting an iterator from that. Would this be a possible
>> cause for a Java driver OutOfMemoryException because there are resources in
>> Java which do not get freed up immediately?
>>
>> I have also seen that trying to take a lot of values from a dataset twice
>> in a row can cause the Java driver to OOM (while just once works). Are there
>> some other memory considerations that are relevant in the driver?
>>
>> Thanks,
>> Michael
Attachments
Issue Links
- links to