Spark / SPARK-10795

FileNotFoundException while deploying pyspark job on cluster

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: PySpark
    • Labels:
      None
    • Environment:

      EMR

      Description

      I am trying to run a simple Spark job using PySpark. It works standalone, but it fails when I deploy it over the cluster.

      Events:

      2015-09-24 10:38:49,602 INFO [main] yarn.Client (Logging.scala:logInfo(59)) - Uploading resource file:/usr/lib/spark/python/lib/pyspark.zip -> hdfs://ip-xxxx.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip

      The resource upload above is successful; I manually checked that the file is present at the specified path. But after a while I get the following error:

      Diagnostics: File does not exist: hdfs://ip-xxx.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip
      java.io.FileNotFoundException: File does not exist: hdfs://ip-1xxx.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip

        Activity

        Carlos Bribiescas added a comment - edited

        Using this command to submit the job:

        spark-submit --master yarn-cluster --num-executors 1 --driver-memory 1g --executor-memory 1g --executor-cores 1 MyPythonFile.py

        If MyPythonFile.py looks like this:

        from pyspark import SparkContext
        
        jobName="My Name"
        sc = SparkContext(appName=jobName)
        
        

        Then everything is fine. If MyPythonFile.py does not specify a SparkContext (as one would in the interactive shell), then it gives the error described in this bug. Using the following file instead, I'm able to reproduce the bug.

        from pyspark import SparkContext
        
        jobName="My Name"
        # sc = SparkContext(appName=jobName)
        
        

        So I suspect you just didn't define a SparkContext properly for the cluster. Hope this helps.

        Cluster Configuration
        Release label: emr-4.2.0
        Hadoop distribution: Amazon 2.6.0
        Applications: SPARK 1.5.2, HIVE 1.0.0, HUE 3.7.1
        Daniel Jouany added a comment - edited

        Hi - I am facing the exact same problem.

        However

        • I do initialize my SparkContext correctly, as the first statement in my main method.
        • I have spark-submitted the job with your exact command line: spark-submit --master yarn-cluster --num-executors 1 --driver-memory 1g --executor-memory 1g --executor-cores 1

        Could it be a configuration problem? The user that launches spark-submit does have sufficient rights on the HDFS directory in question (/user/$USERNAME/.sparkStaging/...).
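
        To rule the permissions theory in or out, the staging directory can be inspected directly. These are hypothetical diagnostic commands, not something run in this thread; the path follows the convention quoted above:

        ```shell
        # Show owner and permission bits of the staging root itself
        hdfs dfs -ls -d /user/$USER/.sparkStaging
        # List the per-application staging dirs; pyspark.zip should appear
        # under the directory named after the failing application id
        hdfs dfs -ls /user/$USER/.sparkStaging
        ```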

        Thanks in advance!

        [EDIT!]
        My .py file works just fine if launched simply through pyspark myfile.py.
        But as soon as I add --master yarn-cluster to the command line it fails...

        Jeff Zhang added a comment

        What's your Spark version? And is it possible for you to attach your code, if it is simple and not sensitive?

        Daniel Jouany added a comment

        I am using Spark 1.4.1 on HDP 2.3.2.

        My code is a bit complex; I'll try to reduce it to the minimal failing code and then post it!

        Carlos Bribiescas added a comment

        Have you tried just specifying the SparkContext and nothing else? For example, if you specified a master via the SparkContext but also did so on the command line, I don't know what the expected behaviour would be. I suggest ruling that out before trying to cut up your code too much.

        I do realize that there may be many other causes of this issue, so I don't mean to suggest that not initializing your SparkContext is the only one. Just trying to rule this cause out.

        Daniel Jouany added a comment - edited

        Hi there,
        If I follow your suggestions, it works.

        Our code was like this:

        import numpy as np
        from pyspark import SparkContext

        foo = np.genfromtxt(xxxxx)
        sc = SparkContext(...)
        # compute

        ===> It fails

        We have just moved the global variable initialization after the context init:

        import numpy as np
        from pyspark import SparkContext

        global foo
        sc = SparkContext(...)
        foo = np.genfromtxt(xxxxx)
        # compute

        ===> It works perfectly

        Note that you can reproduce this behaviour with something other than a numpy call, even though not every statement triggers the crash.
        The question is: why does this non-Spark variable initialization interfere with the SparkContext?
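
        A plausible mechanism (an assumption on my part, not confirmed in this thread) is that expensive work executed at module import time delays SparkContext creation on the cluster, and whatever goes wrong in that early phase surfaces as the staging-file error. The working version above amounts to deferring such work. A minimal pure-Python sketch of the deferral pattern, with the illustrative placeholder `load_data` standing in for the `np.genfromtxt` call:

        ```python
        # Sketch of the pattern described above: no heavy work at module
        # import time; everything runs from the entry point instead.

        def load_data():
            # stands in for np.genfromtxt(...) or other eager module-level work
            return [1.0, 2.0, 3.0]

        def main():
            # in the real job, sc = SparkContext(appName=...) would come first,
            # and only then the heavy initialization
            foo = load_data()
            return sum(foo)

        if __name__ == "__main__":
            main()
        ```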

        HackerWilson added a comment

        Hi all, I am facing the same problem too. I have tested `spark-2.0.0-bin-hadoop2.6` & `spark-1.6.2-bin-hadoop2.6` with `hadoop2.6`; the two exceptions are different.

        For `spark-2.0.0-bin-hadoop2.6`:
        Diagnostics: File file:/tmp/spark-d7b81767-66bb-431f-9817-c623787fe2ac/__spark_libs__1242224443873929949.zip does not exist

        For `spark-1.6.2-bin-hadoop2.6`:
        Diagnostics: File file:/home/platform/services/spark/spark-1.6.2-bin-hadoop2.6/python/lib/pyspark.zip does not exist

        The YARN client didn't copy them because of `yarn.Client: Source and destination file systems are the same. Not copying file...`,
        but the two files can be found in `nm-local-dir/usercache/$user/filecache`.

        The task I run on the YARN cluster is simply `examples/src/main/python/pi.py`; sometimes it completes successfully and sometimes it does not. `spark-2.0.0-bin-hadoop2.7` with `hadoop2.7` did have the same problem as well.
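
        For the "Source and destination file systems are the same. Not copying" variant, one thing worth checking (an assumption, not something confirmed in this thread) is `fs.defaultFS` in `core-site.xml` on the submitting host: if it resolves to the local filesystem, the YARN client treats source and destination as the same filesystem and skips the upload, and containers on other nodes then cannot find the file. The expected shape of the setting, with a placeholder hostname:

        ```xml
        <!-- core-site.xml (hostname is a placeholder) -->
        <property>
          <name>fs.defaultFS</name>
          <value>hdfs://namenode-host:8020</value>
        </property>
        ```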


          People

          • Assignee:
            Unassigned
            Reporter:
            harshit.sharma Harshit
          • Votes:
            7 Vote for this issue
            Watchers:
            13 Start watching this issue
