SPARK-11655

SparkLauncherBackendSuite leaks child processes

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 1.6.0
    • Fix Version/s: 1.6.0
    • Component/s: Tests
    • Labels: None

    Description

      We've been fighting an orphaned-process issue on AMPLab Jenkins since October, and I was finally able to dig in and figure out what's going on.

      After some sleuthing and working around OS limits and JDK bugs, I was able to get the full launch commands for the hanging orphaned processes. It looks like they're all running spark-submit:

      org.apache.spark.deploy.SparkSubmit --master local-cluster[1,1,1024] --conf spark.driver.extraClassPath=/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/core/target/scala-2.10/test-classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/core/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/launcher/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/network/common/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/network/shuffle/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/unsafe/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/tags/target/scala-2.10/ -Xms1g -Xmx1g -Dtest.appender=console -XX:MaxPermSize=256m
      

      Based on the output of some Ganglia graphs, I was able to figure out that these leaks started around October 9.

      This roughly lines up with when https://github.com/apache/spark/pull/7052 was merged, which added LauncherBackendSuite. The launch arguments used in this suite seem to line up with the arguments that I observe in the hanging processes' jps output: https://github.com/apache/spark/blame/1bc41125ee6306e627be212969854f639969c440/core/src/test/scala/org/apache/spark/launcher/LauncherBackendSuite.scala#L46

      Interestingly, Jenkins doesn't show test timing or output for this suite! My guess is that, because this is a mixed Scala/Java package, the two test runners' XML reports aren't being merged properly: https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.5-SBT/746/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=spark-test/testReport/org.apache.spark.launcher/

      Whenever I run this suite locally, it leaves behind an orphaned SparkSubmit process! I think the launcher's handle.kill() call destroys only the bash spark-submit subprocess, so its child process (a JVM) is leaked.
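
      To illustrate the suspected failure mode outside of Spark, here is a minimal POSIX-only sketch in Python. It is not the launcher's code: the names pid_alive and leak_demo are hypothetical, and a shell running `sleep` stands in for the bash spark-submit wrapper and its JVM child. It only shows that killing a wrapper process does not, by itself, kill the process the wrapper started.

```python
import os
import signal
import subprocess


def pid_alive(pid):
    """Return True if a process with this pid still exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 checks for existence without sending anything
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, but is owned by another user
    return True


def leak_demo():
    """Kill a wrapper shell and report whether its child outlived it."""
    # The shell starts a long-running child in the background, prints the
    # child's pid, then waits (stand-in for a wrapper script and its JVM).
    wrapper = subprocess.Popen(
        ["sh", "-c", "sleep 30 & echo $!; wait"],
        stdout=subprocess.PIPE,
        text=True,
    )
    child_pid = int(wrapper.stdout.readline())

    wrapper.kill()  # kills only the wrapper shell
    wrapper.wait()
    wrapper.stdout.close()

    survived = pid_alive(child_pid)
    if survived:
        os.kill(child_pid, signal.SIGTERM)  # clean up the orphan we created
    return survived
```

      On a typical Linux or macOS machine I would expect leak_demo() to report that the child survived, which matches the orphaned processes seen on Jenkins.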

      I think that we'll have to do something similar to what we do in PySpark when launching a child JVM from a Python / Bash process: connect it to a socket or stream such that it can detect its parent's death and clean up after itself appropriately.
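
      A rough sketch of that idea, in Python for brevity: the child watches a stream that its parent holds open and cleans up when the stream reaches end-of-file, which happens when the parent closes it or dies. The name watch_parent and the callback are hypothetical, and the mechanism PySpark actually uses may differ in detail.

```python
import threading


def watch_parent(stream, on_parent_exit):
    """Start a daemon thread that calls on_parent_exit once `stream` hits EOF.

    `stream` is assumed to be a binary stream whose write end is held by the
    parent process (for example the child's stdin pipe), so EOF means the
    parent closed it or exited.
    """
    def _drain():
        # Discard whatever the parent writes; an empty read means EOF.
        while stream.read(1024):
            pass
        on_parent_exit()

    watcher = threading.Thread(target=_drain, name="parent-watchdog", daemon=True)
    watcher.start()
    return watcher
```

      In a real child process this might be wired up as watch_parent(sys.stdin.buffer, lambda: os._exit(1)), so the child exits as soon as its stdin pipe is closed. The equivalent for a child JVM would be a thread that reads System.in until it returns -1.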

      /cc shaneknapp and vanzin.

      Attachments

        1. month_of_doom.png
          43 kB
          Shane Knapp
        2. screenshot-1.png
          190 kB
          Josh Rosen
        3. stack.log
          104 kB
          Marcelo Masiero Vanzin
        4. year_or_doom.png
          35 kB
          Shane Knapp

          People

            Assignee: Marcelo Masiero Vanzin (vanzin)
            Reporter: Josh Rosen (joshrosen)