It could be reproduced by:
It fails after 2 or 3 jobs, but runs successfully if I add
`gc.collect()` after each job.
We could call `_detach()` on the JavaList returned by `collect`
in Java. I will send out a PR for this.
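
Until a fix lands, the `gc.collect()` workaround can be sketched roughly as below. This is only an illustration: `run_jobs_with_gc` and the stand-in jobs are hypothetical names, not part of PySpark, and in a real driver each job would be something like `lambda: rdd.collect()`.

```python
import gc


def run_jobs_with_gc(jobs):
    """Run each zero-argument job and force a GC pass after it.

    Hypothetical helper: forcing a collection after every job lets Python
    reclaim unreachable reference cycles right away instead of waiting
    for the next automatic collection.
    """
    results = []
    for job in jobs:
        results.append(job())
        # Collect cyclic garbage left behind by the job that just finished.
        gc.collect()
    return results


# Stand-in jobs (assumption); a PySpark driver would pass real actions here.
jobs = [lambda n=n: list(range(n)) for n in (1, 2, 3)]
results = run_jobs_with_gc(jobs)
```

A forced collection after every job costs some time, so this is a stopgap rather than a replacement for detaching the Java objects explicitly.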
Reported by Michael, with comments from Josh:
On Thu, Mar 5, 2015 at 2:39 PM, Josh Rosen <firstname.lastname@example.org> wrote:
> Based on Py4J's Memory Model page:
>> Because Java objects on the Python side are involved in a circular
>> reference (JavaObject and JavaMember reference each other), these objects
>> are not immediately garbage collected once the last reference to the object
>> is removed (but they are guaranteed to be eventually collected if the Python
>> garbage collector runs before the Python program exits).
>> In doubt, users can always call the detach function on the Python gateway
>> to explicitly delete a reference on the Java side. A call to gc.collect()
>> also usually works.
> Maybe we should be manually calling detach() when the Python-side has
> finished consuming temporary objects from the JVM. Do you have a small
> workload / configuration that reproduces the OOM which we can use to test a
> fix? I don't think that I've seen this issue in the past, but this might be
> because we mistook Java OOMs as being caused by collecting too much data
> rather than due to memory leaks.
> On Thu, Mar 5, 2015 at 10:41 AM, Michael Nazario <email@example.com> wrote:
>> Hi Josh,
>> I have a question about how PySpark does memory management in the Py4J
>> bridge between the Java driver and the Python driver. I was wondering if
>> there have been any memory problems in this system because the Python
>> garbage collector does not collect circular references immediately and Py4J
>> has circular references in each object it receives from Java.
>> When I dug through the PySpark code, I seemed to find that most RDD
>> actions return by calling collect. In collect, you end up calling the Java
>> RDD collect and getting an iterator from that. Would this be a possible
>> cause for a Java driver OutOfMemoryException because there are resources in
>> Java which do not get freed up immediately?
>> I have also seen that trying to take a lot of values from a dataset twice
>> in a row can cause the Java driver to OOM (while just once works). Are there
>> some other memory considerations that are relevant in the driver?
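
The delayed collection discussed in this thread can be seen in plain Python, without Py4J. The sketch below assumes CPython; `Obj` and `Member` are hypothetical stand-ins that only mimic the way JavaObject and JavaMember reference each other.

```python
import gc
import weakref


class Member:
    """Stand-in for a member object that points back at its owner."""

    def __init__(self, owner):
        self.owner = owner


class Obj:
    """Stand-in for an object that owns a member, forming a cycle."""

    def __init__(self):
        self.member = Member(self)


def cycle_survives_del():
    was_enabled = gc.isenabled()
    gc.disable()  # keep automatic collection from interfering with the demo
    try:
        obj = Obj()
        probe = weakref.ref(obj)  # lets us check whether obj still exists
        del obj  # drop the last external reference
        alive_after_del = probe() is not None
        gc.collect()  # an explicit pass breaks the cycle
        alive_after_collect = probe() is not None
    finally:
        if was_enabled:
            gc.enable()
    return alive_after_del, alive_after_collect


result = cycle_survives_del()
print(result)  # (True, False) on CPython
```

Reference counting alone cannot free the pair because each object keeps the other's count above zero, so anything the pair holds on to stays alive until the cycle collector runs.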