SPARK-10795: FileNotFoundException while deploying pyspark job on cluster

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: PySpark
    • Labels: None
    • Environment: EMR

      Description
      I am trying to run a simple Spark job using PySpark. It works standalone, but when I deploy it over the cluster it fails.

      Events:

      2015-09-24 10:38:49,602 INFO [main] yarn.Client (Logging.scala:logInfo(59)) - Uploading resource file:/usr/lib/spark/python/lib/pyspark.zip -> hdfs://ip-xxxx.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip

      The resource upload above is successful; I manually checked that the file is present at the specified path, but after a while I see the following error:

      Diagnostics: File does not exist: hdfs://ip-xxx.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip
      java.io.FileNotFoundException: File does not exist: hdfs://ip-1xxx.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip

        Activity

        Carlos Bribiescas added a comment - edited

        I'm using this command to submit the job:

        spark-submit --master yarn-cluster --num-executors 1 --driver-memory 1g --executor-memory 1g --executor-cores 1 MyPythonFile.py

        If MyPythonFile.py looks like this:

        from pyspark import SparkContext

        jobName = "My Name"
        sc = SparkContext(appName=jobName)

        Then everything is fine. If MyPythonFile.py does not create a SparkContext (as one would skip in the interactive shell, where one is provided for you), it gives the error you describe in your bug. Using the following file instead, I'm able to reproduce it.

        from pyspark import SparkContext

        jobName = "My Name"
        # sc = SparkContext(appName=jobName)

        So I suspect you just didn't define a SparkContext properly for cluster mode. Hope this helps.
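
        For example, a minimal sketch of what I mean (the app name is arbitrary and the master is intentionally left for spark-submit --master yarn-cluster to set):

        # MyPythonFile.py - minimal sketch of a file that creates its own SparkContext.
        # The master is deliberately not set here; spark-submit supplies it.
        from pyspark import SparkConf, SparkContext

        conf = SparkConf().setAppName("My Name")
        sc = SparkContext(conf=conf)

        # ... job logic would go here ...

        sc.stop()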

        Cluster Configuration
        Release label: emr-4.2.0
        Hadoop distribution: Amazon 2.6.0
        Applications: SPARK 1.5.2, HIVE 1.0.0, HUE 3.7.1
        Daniel Jouany added a comment - edited

        Hi - I am facing the exact same problem.

        However

        • I do initialize my SparkContext correctly, as the first statement in my main method.
        • I have spark-submitted the job with your exact command line: spark-submit --master yarn-cluster --num-executors 1 --driver-memory 1g --executor-memory 1g --executor-cores 1

        Could it be a configuration problem? The user that launches spark-submit does have sufficient rights on the given HDFS directory (/user/$USERNAME/.sparkStaging/...).

        Thanks in advance!

        [EDIT!]
        My .py file works just fine if launched simply through pyspark myfile.py.
        But as soon as I add --master yarn-cluster to the command line, it fails...

        Jeff Zhang added a comment

        What's your Spark version? And is it possible for you to attach your code, if it is simple and not sensitive?

        Daniel Jouany added a comment

        I am using Spark 1.4.1 on HDP 2.3.2.

        My code is a bit complex; I'll try to reduce it to the minimal failing code and then post it!

        Carlos Bribiescas added a comment

        Have you tried just creating the SparkContext and nothing else? For example, if you specify a master via the SparkContext but also do so on the command line, I don't know what the expected behavior is. I suggest trying that before cutting up your code too much.

        I do realize that there may be many other causes of this issue, so I don't mean to suggest that not initializing your SparkContext is the only one. Just trying to rule this cause out.
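
        Something like the following is what I have in mind; a minimal sketch, with a made-up file name and app name:

        # smoke_test.py (hypothetical name) - creates and stops a SparkContext and nothing else,
        # just to check whether the yarn-cluster submission itself succeeds.
        from pyspark import SparkContext

        sc = SparkContext(appName="SparkContext smoke test")
        print(sc.version)  # confirms the context actually came up
        sc.stop()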

        Daniel Jouany added a comment - edited

        Hi there,
        If I follow your suggestions, it works.

        Our code was like this:

        import numpy as np
        from pyspark import SparkContext

        foo = np.genfromtxt(xxxxx)   # heavy module-level init, runs before the context exists
        sc = SparkContext(...)
        # compute

        ===> It fails

        We have just moved the global variable initialization to after the context init:

        import numpy as np
        from pyspark import SparkContext

        global foo
        sc = SparkContext(...)
        foo = np.genfromtxt(xxxxx)   # the same init, now after the context is created
        # compute

        ===> It works perfectly

        Note that you can reproduce this behaviour with something other than a numpy call, even though not every statement triggers the crash.
        The question is: why does this non-Spark variable initialization interfere with the SparkContext?
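
        For reference, the shape of our workaround is roughly the following; this is only a sketch, and the main() wrapper and the input path are illustrative:

        # Sketch of the workaround described above: create the SparkContext first, then do the
        # heavy module-level initialization. The main() wrapper and input path are illustrative.
        import numpy as np
        from pyspark import SparkContext

        foo = None  # filled in only after the context exists

        def main():
            global foo
            sc = SparkContext(appName="MyJob")
            foo = np.genfromtxt("data.csv")  # hypothetical input, stands in for the real call
            # compute ...
            sc.stop()

        if __name__ == "__main__":
            main()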

        HackerWilson added a comment

        Hi all, I am facing the same problem too. I have tested `spark-2.0.0-bin-hadoop2.6` and `spark-1.6.2-bin-hadoop2.6` with `hadoop2.6`; the two exceptions are different.

        for `spark-2.0.0-bin-hadoop2.6`:
        Diagnostics: File file:/tmp/spark-d7b81767-66bb-431f-9817-c623787fe2ac/__spark_libs__1242224443873929949.zip does not exist

        for `spark-1.6.2-bin-hadoop2.6`:
        Diagnostics: File file:/home/platform/services/spark/spark-1.6.2-bin-hadoop2.6/python/lib/pyspark.zip does not exist

        The YARN client didn't copy them because of `yarn.Client: Source and destination file systems are the same. Not copying file...`,
        but these two files can be found in `nm-local-dir/usercache/$user/filecache`.

        The task I run on the YARN cluster is simply `examples/src/main/python/pi.py`; sometimes it completes successfully and sometimes it does not. Actually, `spark-2.0.0-bin-hadoop2.7` with `hadoop2.7` has the same problem.

        Nico Pappagianis added a comment

        HackerWilson Were you able to resolve this? I'm hitting the same thing running Spark 2.0.1 and Hadoop 2.7.2.

        My Python code is just creating a SparkContext and then calling sc.stop().

        In the YARN logs I see:

        INFO: 2017-06-08 22:16:24,462 INFO [main] yarn.Client - Uploading resource file:/home/.../python/lib/py4j-0.10.1-src.zip -> hdfs://.../.sparkStaging/application_1494012577752_1403/py4j-0.10.1-src.zip

        When I do an fs -ls on the above HDFS directory, it shows the py4j file, but the job fails with a FileNotFoundException for that same file:

        File does not exist: hdfs://.../.sparkStaging/application_1494012577752_1403/py4j-0.10.1-src.zip
        (stack trace here: https://gist.github.com/anonymous/5506654b88e19e6f51ffbd85cd3f25ee)

        One thing to note is that I am launching a Map-only job that launches the Spark application on the cluster. The launcher job uses SparkLauncher (Java). Master and deploy mode are set to "yarn" and "cluster", respectively.

        When I submit the Python job via spark-submit it runs successfully (I set HADOOP_CONF_DIR and HADOOP_JAVA_HOME to the same values as in the launcher job).


          People

          • Assignee: Unassigned
          • Reporter: Harshit (harshit.sharma)
          • Votes: 7
          • Watchers: 14
