Spark / SPARK-33085

"Master removed our application" error leads to FAILED driver status instead of KILLED driver status


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.4.6
    • Fix Version/s: None
    • Component/s: Scheduler, Spark Core
    • Labels: None

    Description

       

      driver-20200930160855-0316 exited with status FAILED

       

      I am using the Spark Standalone scheduler with spot EC2 workers. I confirmed that the myip.87 EC2 instance was terminated at 2020-09-30 16:16.

       

      I would expect the overall driver status to be KILLED, but instead it was FAILED. My goal is to interpret a FAILED status as 'don't rerun, a non-transient error was hit' and a KILLED/ERROR status as 'yes, rerun, the error was transient' (see the rerun-policy sketch after the driver logs below). But it looks like FAILED is being set in the following case, where the error is transient:

        

      Below are the driver logs:

      2020-09-30 16:12:41,183 [main] INFO  com.yotpo.metorikku.output.writers.file.FileOutputWriter - Writing file to s3a://redacted
      2020-09-30 16:12:41,183 [main] INFO  com.yotpo.metorikku.output.writers.file.FileOutputWriter - Writing file to s3a://redacted
      2020-09-30 16:16:40,366 [dispatcher-event-loop-15] ERROR org.apache.spark.scheduler.TaskSchedulerImpl - Lost executor 0 on myip.87: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
      2020-09-30 16:16:40,372 [dispatcher-event-loop-15] WARN  org.apache.spark.scheduler.TaskSetManager - Lost task 0.0 in stage 6.0 (TID 6, myip.87, executor 0): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
      2020-09-30 16:16:40,376 [dispatcher-event-loop-13] WARN  org.apache.spark.storage.BlockManagerMasterEndpoint - No more replicas available for rdd_3_0 !
      2020-09-30 16:16:40,398 [dispatcher-event-loop-2] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Executor app-20200930160902-0895/0 removed: Worker shutting down
      2020-09-30 16:16:40,399 [dispatcher-event-loop-2] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Granted executor ID app-20200930160902-0895/1 on hostPort myip.87:11647 with 2 core(s), 5.0 GB RAM
      2020-09-30 16:16:40,401 [dispatcher-event-loop-5] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Executor app-20200930160902-0895/1 removed: java.lang.IllegalStateException: Shutdown hooks cannot be modified during shutdown.
      2020-09-30 16:16:40,402 [dispatcher-event-loop-5] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Granted executor ID app-20200930160902-0895/2 on hostPort myip.87:11647 with 2 core(s), 5.0 GB RAM
      2020-09-30 16:16:40,403 [dispatcher-event-loop-11] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Executor app-20200930160902-0895/2 removed: java.lang.IllegalStateException: Shutdown hooks cannot be modified during shutdown.
      2020-09-30 16:16:40,404 [dispatcher-event-loop-11] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Granted executor ID app-20200930160902-0895/3 on hostPort myip.87:11647 with 2 core(s), 5.0 GB RAM
      2020-09-30 16:16:40,405 [dispatcher-event-loop-1] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Executor app-20200930160902-0895/3 removed: java.lang.IllegalStateException: Shutdown hooks cannot be modified during shutdown.
      2020-09-30 16:16:40,406 [dispatcher-event-loop-1] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Granted executor ID app-20200930160902-0895/4 on hostPort myip.87:11647 with 2 core(s), 5.0 GB RAM
      2020-09-30 16:16:40,407 [dispatcher-event-loop-12] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Executor app-20200930160902-0895/4 removed: java.lang.IllegalStateException: Shutdown hooks cannot be modified during shutdown.
      2020-09-30 16:16:40,408 [dispatcher-event-loop-12] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Granted executor ID app-20200930160902-0895/5 on hostPort myip.87:11647 with 2 core(s), 5.0 GB RAM
      2020-09-30 16:16:40,409 [dispatcher-event-loop-4] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Executor app-20200930160902-0895/5 removed: java.lang.IllegalStateException: Shutdown hooks cannot be modified during shutdown.
      2020-09-30 16:16:40,410 [dispatcher-event-loop-5] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Granted executor ID app-20200930160902-0895/6 on hostPort myip.87:11647 with 2 core(s), 5.0 GB RAM
      2020-09-30 16:16:40,420 [dispatcher-event-loop-9] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Executor app-20200930160902-0895/6 removed: java.lang.IllegalStateException: Shutdown hooks cannot be modified during shutdown.
      2020-09-30 16:16:40,421 [dispatcher-event-loop-9] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Granted executor ID app-20200930160902-0895/7 on hostPort myip.87:11647 with 2 core(s), 5.0 GB RAM
      2020-09-30 16:16:40,423 [dispatcher-event-loop-15] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Executor app-20200930160902-0895/7 removed: java.lang.IllegalStateException: Shutdown hooks cannot be modified during shutdown.
      2020-09-30 16:16:40,424 [dispatcher-event-loop-15] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Granted executor ID app-20200930160902-0895/8 on hostPort myip.87:11647 with 2 core(s), 5.0 GB RAM
      2020-09-30 16:16:40,425 [dispatcher-event-loop-2] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Executor app-20200930160902-0895/8 removed: java.lang.IllegalStateException: Shutdown hooks cannot be modified during shutdown.
      2020-09-30 16:16:40,425 [dispatcher-event-loop-2] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Granted executor ID app-20200930160902-0895/9 on hostPort myip.87:11647 with 2 core(s), 5.0 GB RAM
      2020-09-30 16:16:40,427 [dispatcher-event-loop-14] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Executor app-20200930160902-0895/9 removed: java.lang.IllegalStateException: Shutdown hooks cannot be modified during shutdown.
      2020-09-30 16:16:40,429 [dispatcher-event-loop-5] ERROR org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Application has been killed. Reason: Master removed our application: FAILED
      2020-09-30 16:16:40,438 [main] ERROR org.apache.spark.sql.execution.datasources.FileFormatWriter - Aborting job 564822f2-f2fd-42cd-8d57-b6d5dff145f6.
      org.apache.spark.SparkException: Job aborted due to stage failure: Master removed our application: FAILED
          at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1891)
          at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1879)
          at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1878)
          at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
          at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
          at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1878)
          at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:927)
          at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:927)
          at scala.Option.foreach(Option.scala:257)
          at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:927)
          at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2112)
          at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2061)
          at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2050)
          at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
          at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:738)
          at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
          at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:167)
          at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
          at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
          at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
          at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
          at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
          at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
          at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
          at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
          at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
          at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
          at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:83)
          at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:81)
          at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:677)
          at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:677)
          at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80)
          at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127)
          at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75)
          at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:677)
          at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:286)
          at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:272)
          at com.yotpo.metorikku.output.writers.file.FileOutputWriter.save(FileOutputWriter.scala:134)
          at com.yotpo.metorikku.output.writers.file.FileOutputWriter.write(FileOutputWriter.scala:65)
          at com.yotpo.metorikku.metric.Metric.com$yotpo$metorikku$metric$Metric$$writeBatch(Metric.scala:97)
          at com.yotpo.metorikku.metric.Metric$$anonfun$write$1.apply(Metric.scala:136)
          at com.yotpo.metorikku.metric.Metric$$anonfun$write$1.apply(Metric.scala:125)
          at scala.collection.immutable.List.foreach(List.scala:392)
          at com.yotpo.metorikku.metric.Metric.write(Metric.scala:125)
          at com.yotpo.metorikku.metric.MetricSet$$anonfun$run$1.apply(MetricSet.scala:44)
          at com.yotpo.metorikku.metric.MetricSet$$anonfun$run$1.apply(MetricSet.scala:39)
          at scala.collection.immutable.List.foreach(List.scala:392)
          at com.yotpo.metorikku.metric.MetricSet.run(MetricSet.scala:39)
          at com.yotpo.metorikku.Metorikku$$anonfun$runMetrics$1.apply(Metorikku.scala:17)
          at com.yotpo.metorikku.Metorikku$$anonfun$runMetrics$1.apply(Metorikku.scala:15)
          at scala.collection.immutable.List.foreach(List.scala:392)
          at com.yotpo.metorikku.Metorikku$.runMetrics(Metorikku.scala:15)
          at com.yotpo.metorikku.Metorikku$.delayedEndpoint$com$yotpo$metorikku$Metorikku$1(Metorikku.scala:11)
          at com.yotpo.metorikku.Metorikku$delayedInit$body.apply(Metorikku.scala:7)
          at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
          at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
          at scala.App$$anonfun$main$1.apply(App.scala:76)
          at scala.App$$anonfun$main$1.apply(App.scala:76)
          at scala.collection.immutable.List.foreach(List.scala:392)
          at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
          at scala.App$class.main(App.scala:76)
          at com.yotpo.metorikku.Metorikku$.main(Metorikku.scala:7)
          at com.yotpo.metorikku.Metorikku.main(Metorikku.scala)
          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
          at java.lang.reflect.Method.invoke(Method.java:498)
          at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:65)
          at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
      2020-09-30 16:16:40,457 [stop-spark-context] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Shutting down all executors
      2020-09-30 16:16:40,461 [stop-spark-context] ERROR org.apache.spark.util.Utils - Uncaught exception in thread stop-spark-context
      org.apache.spark.SparkException: Exception thrown in awaitResult:
          at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
          at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
          at org.apache.spark.deploy.client.StandaloneAppClient.stop(StandaloneAppClient.scala:283)
          at org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend.org$apache$spark$scheduler$cluster$StandaloneSchedulerBackend$$stop(StandaloneSchedulerBackend.scala:227)
          at org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend.stop(StandaloneSchedulerBackend.scala:124)
          at org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:669)
          at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:2044)
          at org.apache.spark.SparkContext$$anonfun$stop$6.apply$mcV$sp(SparkContext.scala:1949)
          at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1340)
          at org.apache.spark.SparkContext.stop(SparkContext.scala:1948)
          at org.apache.spark.SparkContext$$anon$3.run(SparkContext.scala:1903)
      Caused by: org.apache.spark.SparkException: Could not find AppClient.
          at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:160)
          at org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:135)
          at org.apache.spark.rpc.netty.NettyRpcEnv.ask(NettyRpcEnv.scala:229)
          at org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:523)
          at org.apache.spark.rpc.RpcEndpointRef.ask(RpcEndpointRef.scala:63)
          ... 9 more
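
      The pattern above (executor 0 lost when the worker shut down, then executors 1-9 granted and immediately removed, then "Master removed our application: FAILED") looks consistent with the standalone Master's executor-retry limit, spark.deploy.maxExecutorRetries (default 10). My reading of that code path, paraphrased from Master.scala on the 2.4 branch (a rough sketch from memory, not the verbatim source):

          // Sketch of the Master's ExecutorStateChanged handling (standalone, 2.4.x);
          // names and structure are approximate.
          if (ExecutorState.isFinished(state)) {
            appInfo.removeExecutor(exec)
            exec.worker.removeExecutor(exec)
            val normalExit = exitStatus == Some(0)
            // spark.deploy.maxExecutorRetries defaults to 10; a negative value disables this path
            if (!normalExit
                && appInfo.incrementRetryCount() >= maxExecutorRetries
                && maxExecutorRetries >= 0) {
              val execs = appInfo.executors.values
              if (!execs.exists(_.state == ExecutorState.RUNNING)) {
                // The application is removed with state FAILED even when the underlying
                // cause (here, a spot worker terminating) is transient; the driver then
                // aborts and is reported as FAILED as well.
                removeApplication(appInfo, ApplicationState.FAILED)
              }
            }
          }

      If that is the path being hit, the resulting FAILED state does not distinguish an infrastructure-level loss (spot termination) from an application-level failure, which is exactly the distinction I need.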
      
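      For reference, the rerun decision I want to make on top of the reported driver state is roughly the following (a simplified sketch; the type and method names are illustrative, not an existing API or a proposed change to Spark). I am assuming the driver state is read from the standalone Master's REST submission endpoint, e.g. GET http://<master>:6066/v1/submissions/status/driver-20200930160855-0316, whose JSON response carries a driverState field:

          // Illustrative sketch only: map the terminal driverState reported by the
          // standalone Master to a rerun decision in our job orchestrator.
          sealed trait Decision
          case object Rerun extends Decision      // transient failure, safe to resubmit
          case object DoNotRerun extends Decision // non-transient, needs investigation

          def decide(driverState: String): Decision = driverState match {
            case "KILLED" | "ERROR" => Rerun       // e.g. worker lost to a spot termination
            case "FAILED"           => DoNotRerun  // expected to mean the job itself is broken
            case "FINISHED"         => DoNotRerun  // completed normally
            case _                  => DoNotRerun  // unknown state: stay conservative
          }

      With the current behavior, the spot termination above surfaces as FAILED, so this policy wrongly treats it as non-transient and never resubmits the job.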

      Attachments

        Activity


          People

            Assignee: Unassigned
            Reporter: toopt4

            Dates

              Created:
              Updated:
