Spark / SPARK-6194

collect() in PySpark will cause memory leak in JVM


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.0.2, 1.1.1, 1.2.1, 1.3.0
    • Fix Version/s: 1.2.2, 1.3.1, 1.4.0
    • Component/s: PySpark
    • Labels: None

    Description

      It can be reproduced by:

      for i in range(40):
          sc.parallelize(range(5000), 10).flatMap(lambda i: range(10000)).collect()
      

      It will fail after 2 or 3 jobs, but runs successfully if I add
      `gc.collect()` after each job.

      We could call _detach() on the JavaList returned by the Java-side collect();
      I will send out a PR for this.
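
      For context, here is a minimal sketch of the Py4J pattern such a fix would rely on.
      It is illustrative only, not the actual PySpark internals or the eventual patch;
      `gateway` and `java_rdd` are assumed placeholders.

      # Illustrative sketch only (not PySpark internals): once the Python side has
      # consumed a Java-side object, detach it explicitly through the Py4J gateway so
      # the driver JVM can free it without waiting for Python's cyclic GC.
      java_list = java_rdd.collect()     # Py4J proxy for a java.util.List in the driver JVM
      try:
          rows = list(java_list)         # consume the results on the Python side
      finally:
          gateway.detach(java_list)      # drop the Java-side reference immediately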

      Reported by Michael, with comments from Josh:

      On Thu, Mar 5, 2015 at 2:39 PM, Josh Rosen <joshrosen@databricks.com> wrote:
      > Based on Py4J's Memory Model page
      > (http://py4j.sourceforge.net/advanced_topics.html#py4j-memory-model):
      >
      >> Because Java objects on the Python side are involved in a circular
      >> reference (JavaObject and JavaMember reference each other), these objects
      >> are not immediately garbage collected once the last reference to the object
      >> is removed (but they are guaranteed to be eventually collected if the Python
      >> garbage collector runs before the Python program exits).
      >
      >
      >>
      >> In doubt, users can always call the detach function on the Python gateway
      >> to explicitly delete a reference on the Java side. A call to gc.collect()
      >> also usually works.
      >
      >
      > Maybe we should be manually calling detach() when the Python-side has
      > finished consuming temporary objects from the JVM. Do you have a small
      > workload / configuration that reproduces the OOM which we can use to test a
      > fix? I don't think that I've seen this issue in the past, but this might be
      > because we mistook Java OOMs as being caused by collecting too much data
      > rather than due to memory leaks.
      >
      > On Thu, Mar 5, 2015 at 10:41 AM, Michael Nazario <mnazario@palantir.com>
      > wrote:
      >>
      >> Hi Josh,
      >>
      >> I have a question about how PySpark does memory management in the Py4J
      >> bridge between the Java driver and the Python driver. I was wondering if
      >> there have been any memory problems in this system because the Python
      >> garbage collector does not collect circular references immediately and Py4J
      >> has circular references in each object it receives from Java.
      >>
      >> When I dug through the PySpark code, I seemed to find that most RDD
      >> actions return by calling collect. In collect, you end up calling the Java
      >> RDD collect and getting an iterator from that. Would this be a possible
      >> cause for a Java driver OutOfMemoryException because there are resources in
      >> Java which do not get freed up immediately?
      >>
      >> I have also seen that trying to take a lot of values from a dataset twice
      >> in a row can cause the Java driver to OOM (while just once works). Are there
      >> some other memory considerations that are relevant in the driver?
      >>
      >> Thanks,
      >> Michael
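
      The behavior Josh quotes from the Py4J docs can be demonstrated without Spark at
      all. Below is a small self-contained sketch (the class names are made up) showing
      why a reference cycle like the one between JavaObject and JavaMember is only
      reclaimed once Python's cyclic garbage collector runs, not when the last external
      reference is dropped:

      import gc
      import weakref

      class JavaObjectLike:
          # Stand-in for a Py4J JavaObject: it owns a member that points back at it.
          def __init__(self):
              self.member = JavaMemberLike(self)

      class JavaMemberLike:
          # Stand-in for a Py4J JavaMember: it keeps a back-reference to its owner.
          def __init__(self, owner):
              self.owner = owner

      gc.disable()                      # make the effect deterministic for the demo
      obj = JavaObjectLike()
      collected = []
      ref = weakref.ref(obj, lambda _: collected.append(True))

      del obj                           # refcounting alone cannot free the cycle
      print(bool(collected))            # False: the pair is still alive
      gc.collect()                      # the cyclic GC breaks the cycle
      print(bool(collected))            # True: only now is it released
      gc.enable()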


    People

      Assignee: Davies Liu (davies)
      Reporter: Davies Liu (davies)
      Votes: 0
      Watchers: 5
