Hadoop YARN / YARN-3606

Spark container fails to launch if spark-assembly.jar file has different timestamp



    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 2.6.0
    • Fix Version/s: None
    • Component/s: yarn
    • Labels: None
    • Environment: YARN 2.6.0, Spark 1.3.1


      When a Spark job is submitted to a YARN cluster, the job fails because YARN cannot launch containers on nodes other than the one where the job was submitted.

      YARN decides whether the spark-assembly.jar on a node is the expected file by comparing modification timestamps. This check fails when the jar has identical content but was copied to each node at a different time.
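The failure mode can be illustrated with a minimal sketch. This is not the Hadoop source: the class `TimestampCheckSketch` and the helper `verifyUnchanged` are hypothetical names, and the message format is copied from the diagnostic quoted in this report. The point is only that an equality test on modification times rejects a byte-identical file whose mtime differs.

```java
import java.io.IOException;

// Illustrative only: a stand-in for a timestamp-based "has this resource changed?" check.
final class TimestampCheckSketch {

    /**
     * Hypothetical helper. Throws when the modification time recorded at
     * submission differs from the one observed on the node, regardless of
     * whether the file contents are the same.
     */
    static void verifyUnchanged(String resource, long expectedMillis, long actualMillis)
            throws IOException {
        if (actualMillis != expectedMillis) {
            // Message shape mirrors the diagnostic in this report (no closing parenthesis).
            throw new IOException("Resource " + resource
                    + " changed on src filesystem (expected " + expectedMillis
                    + ", was " + actualMillis);
        }
    }

    public static void main(String[] args) {
        String jar = "file:/home/spark/spark-1.3.1-bin-hadoop2.6/lib/spark-assembly-1.3.1-hadoop2.6.0.jar";
        try {
            // Timestamps taken from the log in this report: same jar, copied at different times.
            verifyUnchanged(jar, 1430944255000L, 1430944249000L);
            System.out.println("resource accepted");
        } catch (IOException e) {
            System.out.println("resource rejected: " + e.getMessage());
        }
    }
}
```

With the two timestamps from the log, the sketch takes the rejection branch even though nothing about the file contents was consulted.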

      YARN throws this exception:

      15/05/07 20:13:22 INFO yarn.ExecutorRunnable: Setting up executor with commands: List(JAVA_HOME/bin/java, -server, -XX:OnOutOfMemoryError='kill %p', -Xms1024m, -Xmx1024m, -Djava.io.tmpdir=PWD/tmp, '-Dspark.driver.port=52357', -Dspark.yarn.app.container.log.dir=<LOG_DIR>, org.apache.spark.executor.CoarseGrainedExecutorBackend, --driver-url, akka.tcp://sparkDriver@xxx:52357/user/CoarseGrainedScheduler, --executor-id, 4, --hostname, xxx, --cores, 1, --app-id, application_1431047540996_0001, --user-class-path, file:$PWD/__app__.jar, 1>, <LOG_DIR>/stdout, 2>, <LOG_DIR>/stderr)
      15/05/07 20:13:22 INFO impl.ContainerManagementProtocolProxy: Opening proxy : xxx:34165
      15/05/07 20:13:27 INFO yarn.YarnAllocator: Completed container container_1431047540996_0001_02_000005 (state: COMPLETE, exit status: -1000)
      15/05/07 20:13:27 INFO yarn.YarnAllocator: Container marked as failed: container_1431047540996_0001_02_000005. Exit status: -1000. Diagnostics: Resource file:/home/spark/spark-1.3.1-bin-hadoop2.6/lib/spark-assembly-1.3.1-hadoop2.6.0.jar changed on src filesystem (expected 1430944255000, was 1430944249000
      java.io.IOException: Resource file:/home/spark/spark-1.3.1-bin-hadoop2.6/lib/spark-assembly-1.3.1-hadoop2.6.0.jar changed on src filesystem (expected 1430944255000, was 1430944249000
      at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:253)
      at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:61)
      at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
      at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:357)
      at java.security.AccessController.doPrivileged(Native Method)
      at javax.security.auth.Subject.doAs(Subject.java:415)
      at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
      at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:356)
      at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:60)
      at java.util.concurrent.FutureTask.run(FutureTask.java:262)
      at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
      at java.util.concurrent.FutureTask.run(FutureTask.java:262)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
      at java.lang.Thread.run(Thread.java:745)
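To see how far apart the two copies are, the timestamps in the diagnostic can be subtracted. The sketch below is a small, hypothetical parser (`TimestampDeltaSketch` and `deltaMillis` are illustrative names, not part of YARN or Spark) that assumes the "expected N, was M" wording shown in the log above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: extracts the two epoch-millisecond values from the diagnostic text.
final class TimestampDeltaSketch {

    private static final Pattern STAMPS = Pattern.compile("expected (\\d+), was (\\d+)");

    /** Returns expected minus observed modification time, in milliseconds. */
    static long deltaMillis(String diagnostic) {
        Matcher m = STAMPS.matcher(diagnostic);
        if (!m.find()) {
            throw new IllegalArgumentException("no timestamps found in: " + diagnostic);
        }
        return Long.parseLong(m.group(1)) - Long.parseLong(m.group(2));
    }

    public static void main(String[] args) {
        String diagnostic = "Resource file:/home/spark/spark-1.3.1-bin-hadoop2.6/lib/"
                + "spark-assembly-1.3.1-hadoop2.6.0.jar changed on src filesystem "
                + "(expected 1430944255000, was 1430944249000";
        System.out.println(deltaMillis(diagnostic) + " ms"); // prints "6000 ms"
    }
}
```

For this report the difference is 6000 ms, which is consistent with the same jar having been copied to the two nodes a few seconds apart.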

      The problem is easy to reproduce: set up two nodes, copy spark-assembly.jar to each, and change the file's timestamp on one of them. Then run spark-shell --master yarn-client and check the NodeManager log on the other node for the error.
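One way to change a file's timestamp for the reproduction step is sketched below with the standard `java.nio.file` API. The class `ShiftMtimeSketch`, the helper `shiftBack`, and the 6000 ms offset are illustrative assumptions; the demo runs against a temporary stand-in file rather than a real spark-assembly.jar.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;

// Illustrative only: moves a file's modification time without touching its contents.
final class ShiftMtimeSketch {

    /** Moves the file's mtime back by the given milliseconds and returns the new mtime. */
    static long shiftBack(Path file, long millis) throws IOException {
        long shifted = Files.getLastModifiedTime(file).toMillis() - millis;
        Files.setLastModifiedTime(file, FileTime.fromMillis(shifted));
        return Files.getLastModifiedTime(file).toMillis();
    }

    public static void main(String[] args) throws IOException {
        // Temporary stand-in; on a real node this would be the path to spark-assembly.jar.
        Path jar = Files.createTempFile("spark-assembly-stand-in", ".jar");
        try {
            long before = Files.getLastModifiedTime(jar).toMillis();
            long after = shiftBack(jar, 6000L);
            System.out.println("mtime before: " + before + ", after: " + after);
        } finally {
            Files.deleteIfExists(jar);
        }
    }
}
```

The same `Files.setLastModifiedTime` call, given one agreed value on every node, would also bring the copies back into agreement.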

      The workaround is to make sure the jar file has the same timestamp on every node. However, the function that copies and checks the jar file (org.apache.hadoop.yarn.util.FSDownload.copy, FSDownload.java:253) should perhaps check for file similarity using a checksum rather than a timestamp.
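A rough sketch of the checksum idea follows. It is an assumption about how such a comparison could look, not a proposed patch to FSDownload: `ChecksumCompareSketch`, `sha256`, and `sameContent` are hypothetical names, SHA-256 is an arbitrary choice of digest, and the demo uses two temporary files with identical bytes but different modification times.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustrative only: compares two files by content digest instead of by mtime.
final class ChecksumCompareSketch {

    /** Streams the file through SHA-256 and returns the digest bytes. */
    static byte[] sha256(Path file) throws IOException {
        MessageDigest digest;
        try {
            digest = MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this is not expected.
            throw new IllegalStateException(e);
        }
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                digest.update(buffer, 0, read);
            }
        }
        return digest.digest();
    }

    /** True when both files have the same SHA-256 digest, whatever their timestamps. */
    static boolean sameContent(Path a, Path b) throws IOException {
        return MessageDigest.isEqual(sha256(a), sha256(b));
    }

    public static void main(String[] args) throws IOException {
        byte[] payload = "stand-in for jar bytes".getBytes(StandardCharsets.UTF_8);
        Path first = Files.createTempFile("assembly-a", ".jar");
        Path second = Files.createTempFile("assembly-b", ".jar");
        try {
            Files.write(first, payload);
            Files.write(second, payload);
            // Give the copies the two different mtimes seen in the log.
            Files.setLastModifiedTime(first, FileTime.fromMillis(1430944255000L));
            Files.setLastModifiedTime(second, FileTime.fromMillis(1430944249000L));
            System.out.println("same content: " + sameContent(first, second));
        } finally {
            Files.deleteIfExists(first);
            Files.deleteIfExists(second);
        }
    }
}
```

The trade-off is cost: a digest requires reading the whole file on each check, whereas a timestamp comparison only needs file metadata.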


      Assignee: Unassigned
      Reporter: Michael Le (mvle)

