Details
-
Bug
-
Status: Resolved
-
Critical
-
Resolution: Not A Problem
-
1.6.1
-
None
-
None
-
None
Description
Problem:
Spark Job fails in yarn-cluster mode if the job takes less time than 10 seconds. The job execution here is successful but, spark framework declares it failed.
16/06/13 10:50:29 INFO yarn.ApplicationMaster: Registered signal handlers for [TERM, HUP, INT] 16/06/13 10:50:30 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 16/06/13 10:50:31 INFO yarn.ApplicationMaster: ApplicationAttemptId: appattempt_1465791692084_0078_000001 16/06/13 10:50:32 INFO spark.SecurityManager: Changing view acls to: subroto 16/06/13 10:50:32 INFO spark.SecurityManager: Changing modify acls to: subroto 16/06/13 10:50:32 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(subroto); users with modify permissions: Set(subroto) 16/06/13 10:50:32 INFO yarn.ApplicationMaster: Starting the user application in a separate Thread 16/06/13 10:50:32 INFO yarn.ApplicationMaster: Waiting for spark context initialization 16/06/13 10:50:32 INFO yarn.ApplicationMaster: Waiting for spark context initialization ... 16/06/13 10:50:33 INFO graphv2.ClusterTaskRuntime: Initializing plugin registry on cluster... 16/06/13 10:50:33 INFO util.DefaultTimeZone: Loading default time zone of US/Eastern 16/06/13 10:50:33 INFO graphv2.ClusterTaskRuntime: Setting system property das.big-decimal.precision=32 16/06/13 10:50:33 INFO graphv2.ClusterTaskRuntime: Setting system property das.default-timezone=US/Eastern 16/06/13 10:50:33 INFO graphv2.ClusterTaskRuntime: Setting system property das.security.conductor.properties.keysLocation=etc/securePropertiesKeys 16/06/13 10:50:33 INFO util.DefaultTimeZone: Changing default time zone of from US/Eastern to US/Eastern 16/06/13 10:50:34 INFO job.PluginRegistryImpl: --- JVM Information --- 16/06/13 10:50:34 INFO job.PluginRegistryImpl: JVM: Java HotSpot(TM) 64-Bit Server VM, 1.7 (Oracle Corporation) 16/06/13 10:50:34 INFO job.PluginRegistryImpl: JVM arguments: -Xmx1024m -Djava.io.tmpdir=/mnt/hadoop/yarn/usercache/subroto/appcache/application_1465791692084_0078/container_1465791692084_0078_01_000001/tmp -Dspark.yarn.app.container.log.dir=/var/log/hadoop/yarn/application_1465791692084_0078/container_1465791692084_0078_01_000001 -XX:MaxPermSize=256m 16/06/13 10:50:34 INFO job.PluginRegistryImpl: Log4j: 'file:/mnt/hadoop/yarn/usercache/subroto/filecache/103/__spark_conf__6826322497897602970.zip/log4j.properties' (default classpath) 16/06/13 10:50:34 INFO job.PluginRegistryImpl: Max memory : 910.5 MB 16/06/13 10:50:34 INFO job.PluginRegistryImpl: Free memory: 831.8 MB, before Plugin Registry start-up: 847.5 MB 16/06/13 10:50:34 INFO job.PluginRegistryImpl: --------------------- 16/06/13 10:50:34 INFO graphv2.ClusterTaskRuntime: Initializing cluster task configuration... 16/06/13 10:50:34 INFO util.LoggingUtil: Setting root logger level for hadoop task to DEBUG: 16/06/13 10:50:35 INFO cluster.JobProcessor: Processing JobInput{_jobName=Import job (76): BookorderHS2ImportJob_SparkCluster#import(Identity)} 16/06/13 10:50:35 DEBUG security.UserGroupInformation: hadoop login 16/06/13 10:50:35 INFO cluster.JobProcessor: Writing job output to hdfs://ip-10-195-43-46.eu-west-1.compute.internal:8020/user/subroto/dap1/temp/Output-19017846-059d-4bf1-a95d-1063fe6c1827. 16/06/13 10:50:35 DEBUG hdfs.DFSClient: /user/subroto/dap1/temp/Output-19017846-059d-4bf1-a95d-1063fe6c1827: masked=rw-r--r-- 16/06/13 10:50:35 DEBUG ipc.Client: IPC Client (841703792) connection to ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto sending #2 16/06/13 10:50:35 DEBUG ipc.Client: IPC Client (841703792) connection to ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto got value #2 16/06/13 10:50:35 DEBUG ipc.ProtobufRpcEngine: Call: create took 2ms 16/06/13 10:50:35 DEBUG hdfs.DFSClient: computePacketChunkSize: src=/user/subroto/dap1/temp/Output-19017846-059d-4bf1-a95d-1063fe6c1827, chunkSize=516, chunksPerPacket=127, packetSize=65532 16/06/13 10:50:35 DEBUG hdfs.LeaseRenewer: Lease renewer daemon for [DFSClient_NONMAPREDUCE_1004172348_1] with renew id 1 started 16/06/13 10:50:35 DEBUG hdfs.DFSClient: DFSClient writeChunk allocating new packet seqno=0, src=/user/subroto/dap1/temp/Output-19017846-059d-4bf1-a95d-1063fe6c1827, packetSize=65532, chunksPerPacket=127, bytesCurBlock=0 16/06/13 10:50:35 DEBUG hdfs.DFSClient: Queued packet 0 16/06/13 10:50:35 DEBUG hdfs.DFSClient: Queued packet 1 16/06/13 10:50:35 DEBUG hdfs.DFSClient: Allocating new block 16/06/13 10:50:35 DEBUG hdfs.DFSClient: Waiting for ack for: 1 16/06/13 10:50:35 DEBUG ipc.Client: IPC Client (841703792) connection to ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto sending #3 16/06/13 10:50:35 DEBUG ipc.Client: IPC Client (841703792) connection to ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto got value #3 16/06/13 10:50:35 DEBUG ipc.ProtobufRpcEngine: Call: addBlock took 1ms 16/06/13 10:50:35 DEBUG hdfs.DFSClient: pipeline = 10.126.43.144:1004 16/06/13 10:50:35 DEBUG hdfs.DFSClient: pipeline = 10.195.43.46:1004 16/06/13 10:50:35 DEBUG hdfs.DFSClient: Connecting to datanode 10.126.43.144:1004 16/06/13 10:50:35 DEBUG hdfs.DFSClient: Send buf size 131071 16/06/13 10:50:35 DEBUG hdfs.DFSClient: DataStreamer block BP-88063502-10.195.43.46-1465791630789:blk_1073743637_2813 sending packet packet seqno:0 offsetInBlock:0 lastPacketInBlock:false lastByteOffsetInBlock: 330 16/06/13 10:50:35 DEBUG hdfs.DFSClient: DFSClient seqno: 0 status: SUCCESS status: SUCCESS downstreamAckTimeNanos: 510884 16/06/13 10:50:35 DEBUG hdfs.DFSClient: DataStreamer block BP-88063502-10.195.43.46-1465791630789:blk_1073743637_2813 sending packet packet seqno:1 offsetInBlock:330 lastPacketInBlock:true lastByteOffsetInBlock: 330 16/06/13 10:50:35 DEBUG hdfs.DFSClient: DFSClient seqno: 1 status: SUCCESS status: SUCCESS downstreamAckTimeNanos: 1449877 16/06/13 10:50:35 DEBUG ipc.Client: IPC Client (841703792) connection to ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto sending #4 16/06/13 10:50:35 DEBUG ipc.Client: IPC Client (841703792) connection to ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto got value #4 16/06/13 10:50:35 DEBUG ipc.ProtobufRpcEngine: Call: complete took 1ms 16/06/13 10:50:35 INFO yarn.ApplicationMaster: Final app status: SUCCEEDED, exitCode: 0 16/06/13 10:50:35 DEBUG yarn.ApplicationMaster: Done running users class 16/06/13 10:50:42 ERROR yarn.ApplicationMaster: SparkContext did not initialize after waiting for 500000 ms. Please check earlier log output for errors. Failing the application. 16/06/13 10:50:42 INFO yarn.ApplicationMaster: Unregistering ApplicationMaster with SUCCEEDED 16/06/13 10:50:42 INFO yarn.ApplicationMaster: Deleting staging directory .sparkStaging/application_1465791692084_0078 16/06/13 10:50:42 DEBUG ipc.Client: IPC Client (841703792) connection to ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto sending #5 16/06/13 10:50:42 DEBUG ipc.Client: IPC Client (841703792) connection to ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto got value #5 16/06/13 10:50:42 DEBUG ipc.ProtobufRpcEngine: Call: delete took 2ms 16/06/13 10:50:42 INFO util.ShutdownHookManager: Shutdown hook called 16/06/13 10:50:42 DEBUG ipc.Client: Stopping client 16/06/13 10:50:42 DEBUG ipc.Client: IPC Client (841703792) connection to ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto: closed 16/06/13 10:50:42 DEBUG ipc.Client: IPC Client (841703792) connection to ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto: stopped, remaining connections 0
Root Cause
Race condition in the logic of method org.apache.spark.deploy.yarn.ApplicationMaster.waitForSparkContextInitialized() to figure out if the SparkContext is available or not is not sufficient and it doesn't respect the fact job can be finished within 10 seconds.
From the above log snippet (log of App Master) we can see that user class is done executing and marks as finished with App status as SUCCEEDED and exitCode as 0 but, immediately after that there is an error mentioning that SparkContext was't initialised after 500 seconds(the job here actually takes little less than 5 seconds to complete)