Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-15937

Spark declares a succeeding job to be failed in yarn-cluster mode if the job takes very small time (~ < 10 seconds) to finish

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Critical
    • Resolution: Not A Problem
    • 1.6.1
    • None
    • None
    • None

    Description

      Problem:

      Spark Job fails in yarn-cluster mode if the job takes less time than 10 seconds. The job execution here is successful but, spark framework declares it failed.

      16/06/13 10:50:29 INFO yarn.ApplicationMaster: Registered signal handlers for [TERM, HUP, INT]
      16/06/13 10:50:30 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
      16/06/13 10:50:31 INFO yarn.ApplicationMaster: ApplicationAttemptId: appattempt_1465791692084_0078_000001
      16/06/13 10:50:32 INFO spark.SecurityManager: Changing view acls to: subroto
      16/06/13 10:50:32 INFO spark.SecurityManager: Changing modify acls to: subroto
      16/06/13 10:50:32 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(subroto); users with modify permissions: Set(subroto)
      16/06/13 10:50:32 INFO yarn.ApplicationMaster: Starting the user application in a separate Thread
      16/06/13 10:50:32 INFO yarn.ApplicationMaster: Waiting for spark context initialization
      16/06/13 10:50:32 INFO yarn.ApplicationMaster: Waiting for spark context initialization ... 
      16/06/13 10:50:33 INFO graphv2.ClusterTaskRuntime: Initializing plugin registry on cluster...
      16/06/13 10:50:33 INFO util.DefaultTimeZone: Loading default time zone of US/Eastern
      16/06/13 10:50:33 INFO graphv2.ClusterTaskRuntime: Setting system property das.big-decimal.precision=32
      16/06/13 10:50:33 INFO graphv2.ClusterTaskRuntime: Setting system property das.default-timezone=US/Eastern
      16/06/13 10:50:33 INFO graphv2.ClusterTaskRuntime: Setting system property das.security.conductor.properties.keysLocation=etc/securePropertiesKeys
      16/06/13 10:50:33 INFO util.DefaultTimeZone: Changing default time zone of from US/Eastern to US/Eastern
      16/06/13 10:50:34 INFO job.PluginRegistryImpl: --- JVM Information ---
      16/06/13 10:50:34 INFO job.PluginRegistryImpl: JVM: Java HotSpot(TM) 64-Bit Server VM, 1.7 (Oracle Corporation)
      16/06/13 10:50:34 INFO job.PluginRegistryImpl: JVM arguments: -Xmx1024m -Djava.io.tmpdir=/mnt/hadoop/yarn/usercache/subroto/appcache/application_1465791692084_0078/container_1465791692084_0078_01_000001/tmp -Dspark.yarn.app.container.log.dir=/var/log/hadoop/yarn/application_1465791692084_0078/container_1465791692084_0078_01_000001 -XX:MaxPermSize=256m 
      16/06/13 10:50:34 INFO job.PluginRegistryImpl: Log4j:  'file:/mnt/hadoop/yarn/usercache/subroto/filecache/103/__spark_conf__6826322497897602970.zip/log4j.properties' (default classpath)
      16/06/13 10:50:34 INFO job.PluginRegistryImpl: Max memory : 910.5 MB
      16/06/13 10:50:34 INFO job.PluginRegistryImpl: Free memory: 831.8 MB, before Plugin Registry start-up: 847.5 MB
      16/06/13 10:50:34 INFO job.PluginRegistryImpl: ---------------------
      16/06/13 10:50:34 INFO graphv2.ClusterTaskRuntime: Initializing cluster task configuration...
      16/06/13 10:50:34 INFO util.LoggingUtil: Setting root logger level for hadoop task to DEBUG: 
      16/06/13 10:50:35 INFO cluster.JobProcessor: Processing JobInput{_jobName=Import job (76): BookorderHS2ImportJob_SparkCluster#import(Identity)}
      16/06/13 10:50:35 DEBUG security.UserGroupInformation: hadoop login
      16/06/13 10:50:35 INFO cluster.JobProcessor: Writing job output to hdfs://ip-10-195-43-46.eu-west-1.compute.internal:8020/user/subroto/dap1/temp/Output-19017846-059d-4bf1-a95d-1063fe6c1827.
      16/06/13 10:50:35 DEBUG hdfs.DFSClient: /user/subroto/dap1/temp/Output-19017846-059d-4bf1-a95d-1063fe6c1827: masked=rw-r--r--
      16/06/13 10:50:35 DEBUG ipc.Client: IPC Client (841703792) connection to ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto sending #2
      16/06/13 10:50:35 DEBUG ipc.Client: IPC Client (841703792) connection to ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto got value #2
      16/06/13 10:50:35 DEBUG ipc.ProtobufRpcEngine: Call: create took 2ms
      16/06/13 10:50:35 DEBUG hdfs.DFSClient: computePacketChunkSize: src=/user/subroto/dap1/temp/Output-19017846-059d-4bf1-a95d-1063fe6c1827, chunkSize=516, chunksPerPacket=127, packetSize=65532
      16/06/13 10:50:35 DEBUG hdfs.LeaseRenewer: Lease renewer daemon for [DFSClient_NONMAPREDUCE_1004172348_1] with renew id 1 started
      16/06/13 10:50:35 DEBUG hdfs.DFSClient: DFSClient writeChunk allocating new packet seqno=0, src=/user/subroto/dap1/temp/Output-19017846-059d-4bf1-a95d-1063fe6c1827, packetSize=65532, chunksPerPacket=127, bytesCurBlock=0
      16/06/13 10:50:35 DEBUG hdfs.DFSClient: Queued packet 0
      16/06/13 10:50:35 DEBUG hdfs.DFSClient: Queued packet 1
      16/06/13 10:50:35 DEBUG hdfs.DFSClient: Allocating new block
      16/06/13 10:50:35 DEBUG hdfs.DFSClient: Waiting for ack for: 1
      16/06/13 10:50:35 DEBUG ipc.Client: IPC Client (841703792) connection to ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto sending #3
      16/06/13 10:50:35 DEBUG ipc.Client: IPC Client (841703792) connection to ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto got value #3
      16/06/13 10:50:35 DEBUG ipc.ProtobufRpcEngine: Call: addBlock took 1ms
      16/06/13 10:50:35 DEBUG hdfs.DFSClient: pipeline = 10.126.43.144:1004
      16/06/13 10:50:35 DEBUG hdfs.DFSClient: pipeline = 10.195.43.46:1004
      16/06/13 10:50:35 DEBUG hdfs.DFSClient: Connecting to datanode 10.126.43.144:1004
      16/06/13 10:50:35 DEBUG hdfs.DFSClient: Send buf size 131071
      16/06/13 10:50:35 DEBUG hdfs.DFSClient: DataStreamer block BP-88063502-10.195.43.46-1465791630789:blk_1073743637_2813 sending packet packet seqno:0 offsetInBlock:0 lastPacketInBlock:false lastByteOffsetInBlock: 330
      16/06/13 10:50:35 DEBUG hdfs.DFSClient: DFSClient seqno: 0 status: SUCCESS status: SUCCESS downstreamAckTimeNanos: 510884
      16/06/13 10:50:35 DEBUG hdfs.DFSClient: DataStreamer block BP-88063502-10.195.43.46-1465791630789:blk_1073743637_2813 sending packet packet seqno:1 offsetInBlock:330 lastPacketInBlock:true lastByteOffsetInBlock: 330
      16/06/13 10:50:35 DEBUG hdfs.DFSClient: DFSClient seqno: 1 status: SUCCESS status: SUCCESS downstreamAckTimeNanos: 1449877
      16/06/13 10:50:35 DEBUG ipc.Client: IPC Client (841703792) connection to ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto sending #4
      16/06/13 10:50:35 DEBUG ipc.Client: IPC Client (841703792) connection to ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto got value #4
      16/06/13 10:50:35 DEBUG ipc.ProtobufRpcEngine: Call: complete took 1ms
      16/06/13 10:50:35 INFO yarn.ApplicationMaster: Final app status: SUCCEEDED, exitCode: 0
      16/06/13 10:50:35 DEBUG yarn.ApplicationMaster: Done running users class
      16/06/13 10:50:42 ERROR yarn.ApplicationMaster: SparkContext did not initialize after waiting for 500000 ms. Please check earlier log output for errors. Failing the application.
      16/06/13 10:50:42 INFO yarn.ApplicationMaster: Unregistering ApplicationMaster with SUCCEEDED
      16/06/13 10:50:42 INFO yarn.ApplicationMaster: Deleting staging directory .sparkStaging/application_1465791692084_0078
      16/06/13 10:50:42 DEBUG ipc.Client: IPC Client (841703792) connection to ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto sending #5
      16/06/13 10:50:42 DEBUG ipc.Client: IPC Client (841703792) connection to ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto got value #5
      16/06/13 10:50:42 DEBUG ipc.ProtobufRpcEngine: Call: delete took 2ms
      16/06/13 10:50:42 INFO util.ShutdownHookManager: Shutdown hook called
      16/06/13 10:50:42 DEBUG ipc.Client: Stopping client
      16/06/13 10:50:42 DEBUG ipc.Client: IPC Client (841703792) connection to ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto: closed
      16/06/13 10:50:42 DEBUG ipc.Client: IPC Client (841703792) connection to ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto: stopped, remaining connections 0
      
      Root Cause

      Race condition in the logic of method org.apache.spark.deploy.yarn.ApplicationMaster.waitForSparkContextInitialized() to figure out if the SparkContext is available or not is not sufficient and it doesn't respect the fact job can be finished within 10 seconds.
      From the above log snippet (log of App Master) we can see that user class is done executing and marks as finished with App status as SUCCEEDED and exitCode as 0 but, immediately after that there is an error mentioning that SparkContext was't initialised after 500 seconds(the job here actually takes little less than 5 seconds to complete)

      Attachments

        Activity

          People

            Unassigned Unassigned
            subrotosanyal Subroto Sanyal
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: