
SPARK-6506: Python support in yarn-cluster mode requires SPARK_HOME to be set

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.3.0
    • Fix Version/s: 1.3.2, 1.4.0
    • Component/s: YARN
    • Labels: None

      Description

      We added support for running Python applications in yarn-cluster mode in https://issues.apache.org/jira/browse/SPARK-5173, but it requires SPARK_HOME to be set in the environment of both the application master and the executors. It doesn't have to point to anything real, but the job fails if it's not set. See the command at the end of https://github.com/apache/spark/pull/3976.
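
      The workaround discussed in the comments below is to point both spark.yarn.appMasterEnv.SPARK_HOME and spark.executorEnv.SPARK_HOME at a dummy value such as /bogus. A small diagnostic sketch (hypothetical, not from the ticket) that can be added to the driver script to check whether the variable actually reaches the YARN application master:

      import os

      # In yarn-cluster mode the driver runs inside the YARN application master,
      # so this shows whether spark.yarn.appMasterEnv.SPARK_HOME made it into
      # the AM's environment (any value, even /bogus, is enough on 1.3.0).
      print("SPARK_HOME on the AM: " + os.environ.get("SPARK_HOME", "<not set>"))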


          Activity

          Lianhui Wang added a comment - edited

          Hi Thomas Graves, I am running 1.3.0. If I do not set SPARK_HOME on every node, I get the following exception in every executor:
          Error from python worker:
          /usr/bin/python: No module named pyspark
          PYTHONPATH was:
          /data/yarnenv/local/usercache/lianhui/filecache/296/spark-assembly-1.3.0-hadoop2.2.0.jar/python
          java.io.EOFException
          at java.io.DataInputStream.readInt(DataInputStream.java:392)
          at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:164)
          at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:86)
          at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
          at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:105)
          at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:69)
          at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
          at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)

          From the exception I can see that the pyspark modules inside the Spark assembly jar on the NodeManager cannot be used, and I do not know why. Andrew Or, can you help me?
          So I think we now need to add the Spark directories to PYTHONPATH, or set SPARK_HOME, on every node.

          Thomas Graves added a comment -

          If you are running on yarn you just have to set SPARK_HOME like this:
          spark.yarn.appMasterEnv.SPARK_HOME /bogus
          spark.executorEnv.SPARK_HOME /bogus

          But the error you pasted above isn't about that. I've seen it when the assembly is built with JDK 7 or JDK 8, because the Python files don't get packaged properly into the assembly jar; I have to use JDK 6 to package it. See https://issues.apache.org/jira/browse/SPARK-1920.
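
          One way to check the packaging theory is to try importing pyspark straight out of the assembly jar, the same way the executor's PYTHONPATH above does. A rough diagnostic sketch (hypothetical; the jar path is copied from the error in the earlier comment and is specific to that cluster):

          import sys

          # Hypothetical check: the path below is the PYTHONPATH entry from the
          # earlier error report; zipimport should be able to load pyspark from
          # inside the jar if the python/ tree was packaged correctly.
          sys.path.insert(0, "/data/yarnenv/local/usercache/lianhui/filecache/296/"
                             "spark-assembly-1.3.0-hadoop2.2.0.jar/python")
          try:
              import pyspark
              print("pyspark loads from the assembly: " + pyspark.__file__)
          except ImportError as e:
              # A failure here points at the assembly itself, e.g. the JDK 7/8
              # packaging problem tracked in SPARK-1920.
              print("pyspark cannot be loaded from the assembly: " + str(e))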

          Marcelo Vanzin added a comment -

          Maybe you're running into SPARK-5808?

          Thomas Graves added a comment -

          No, it was built with Maven and the pyspark artifacts are there. You just have to set SPARK_HOME to something even though it isn't used for anything. Something is looking at SPARK_HOME and blows up if it's not set.

          Kostas Sakellis added a comment -

          I ran into this issue too by running:

          spark-submit --master yarn-cluster examples/pi.py 4

          It looks like I only had to set spark.yarn.appMasterEnv.SPARK_HOME=/bogus to get it going:

          spark-submit --conf spark.yarn.appMasterEnv.SPARK_HOME=/bogus --master yarn-cluster pi.py 4

          Kostas Sakellis added a comment -

          Here is the exception I saw when I ran the above job:

          Traceback (most recent call last):
            File "pi.py", line 29, in <module>
              sc = SparkContext(appName="PythonPi")
            File "/opt/cloudera/parcels/CDH-5.4.0-1.cdh5.4.0.p0.23/jars/spark-assembly-1.3.0-cdh5.4.0-hadoop2.6.0-cdh5.4.0.jar/pyspark/context.py", line 108, in __init__
            File "/opt/cloudera/parcels/CDH-5.4.0-1.cdh5.4.0.p0.23/jars/spark-assembly-1.3.0-cdh5.4.0-hadoop2.6.0-cdh5.4.0.jar/pyspark/context.py", line 222, in _ensure_initialized
            File "/opt/cloudera/parcels/CDH-5.4.0-1.cdh5.4.0.p0.23/jars/spark-assembly-1.3.0-cdh5.4.0-hadoop2.6.0-cdh5.4.0.jar/pyspark/java_gateway.py", line 32, in launch_gateway
            File "/usr/lib64/python2.6/UserDict.py", line 22, in __getitem__
              raise KeyError(key)
          KeyError: 'SPARK_HOME'
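
          What the traceback boils down to is a hard lookup of os.environ["SPARK_HOME"] in java_gateway.py, which raises KeyError when the application master's environment does not define the variable at all. A minimal sketch of the difference (illustrative only, not the actual PySpark code):

          import os

          # Hard lookup, as in the traceback above: raises KeyError: 'SPARK_HOME'
          # when the variable is completely absent from the environment.
          #   spark_home = os.environ["SPARK_HOME"]
          #
          # Guarded lookup: never raises, which is why pointing SPARK_HOME at any
          # bogus value (or defaulting it) is enough to get past this point.
          spark_home = os.environ.get("SPARK_HOME", "/bogus")
          print(spark_home)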
          
          Marcelo Vanzin added a comment - edited

          Ah, makes sense. In yarn-cluster mode, that code should probably figure out what SPARK_HOME is based on what __file__ returns instead. Not sure what that would be when the file is inside a jar, though.
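
          A rough sketch of that idea (hypothetical; as the later comments note, this is not what the eventual fix does): derive a SPARK_HOME candidate from where the pyspark package was loaded.

          import os
          import pyspark

          # pyspark.__file__ is .../python/pyspark/__init__.py, so going up two
          # directories from the package directory gives a SPARK_HOME candidate.
          pkg_dir = os.path.dirname(os.path.abspath(pyspark.__file__))
          spark_home_guess = os.path.dirname(os.path.dirname(pkg_dir))

          # Caveat from the comment above: when pyspark is imported straight out
          # of the assembly jar, this "home" is a path inside the jar, not a real
          # directory on disk.
          print(spark_home_guess)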

          Marcelo Vanzin added a comment -

          Actually, that may not be enough... what if there really isn't a SPARK_HOME on the host launching the driver? What if there's no "spark-submit" to run?

          Apache Spark added a comment -

          User 'vanzin' has created a pull request for this issue:
          https://github.com/apache/spark/pull/5405

          Marcelo Vanzin added a comment -

          Ignore my previous comments, the fix is much, much simpler than that... :-/

          Josh Rosen added a comment -

          Issue resolved by pull request 5405
          https://github.com/apache/spark/pull/5405
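
          One plausible reading of the "much simpler" fix is to require SPARK_HOME only on the code path that actually needs it. A rough, hypothetical sketch of that kind of guard (see the pull request above for the real change; the PYSPARK_GATEWAY_PORT check is an assumption about how an already-running gateway is detected):

          import os

          def required_spark_home():
              # Hypothetical sketch, not the literal diff from PR 5405.
              if "PYSPARK_GATEWAY_PORT" in os.environ:
                  # The JVM side has already started the Py4J gateway (as in
                  # yarn-cluster mode), so the Python driver only connects to it
                  # and never needs SPARK_HOME.
                  return None
              # Only the path that launches spark-submit locally needs SPARK_HOME.
              return os.environ["SPARK_HOME"]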


            People

            • Assignee: Marcelo Vanzin
            • Reporter: Thomas Graves
            • Votes: 0
            • Watchers: 9
