[SPARK-44478] Executor decommission causes stage failure


Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 3.4.0, 3.4.1
    • Fix Version/s: None
    • Component/s: Scheduler
    • Labels: None

    Description

      During Spark execution, a DataFrame save fails because an executor is decommissioned. The issue is not present in 3.3.0.

      Sample error:


      An error occurred while calling o8948.save.
      : org.apache.spark.SparkException: Job aborted due to stage failure: Authorized committer (attemptNumber=0, stage=170, partition=233) failed; but task commit success, data duplication may happen. reason=ExecutorLostFailure(1,false,Some(Executor decommission: Executor 1 is decommissioned.))
              at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2785)
              at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2721)
              at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2720)
              at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
              at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
              at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
              at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2720)
              at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleStageFailed$1(DAGScheduler.scala:1199)
              at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleStageFailed$1$adapted(DAGScheduler.scala:1199)
              at scala.Option.foreach(Option.scala:407)
              at org.apache.spark.scheduler.DAGScheduler.handleStageFailed(DAGScheduler.scala:1199)
              at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2981)
              at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2923)
              at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2912)
              at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
              at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:971)
              at org.apache.spark.SparkContext.runJob(SparkContext.scala:2263)
              at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeWrite$4(FileFormatWriter.scala:307)
              at org.apache.spark.sql.execution.datasources.FileFormatWriter$.writeAndCommit(FileFormatWriter.scala:271)
              at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeWrite(FileFormatWriter.scala:304)
              at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:190)
              at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:190)
              at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:113)
              at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:111)
              at org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:125)
              at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:98)
              at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:118)
              at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:195)
              at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:103)
              at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:827)
              at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
              at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98)
              at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:94)
              at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:512)
              at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:104)
              at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:512)
              at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:31)
              at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
              at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
              at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:31)
              at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:31)
              at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:488)
              at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:94)
              at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:81)
              at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:79)
              at org.apache.spark.sql.execution.QueryExecution.assertCommandExecuted(QueryExecution.scala:133)
              at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:856)
              at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:387)
              at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:360)
              at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:239)
              at jdk.internal.reflect.GeneratedMethodAccessor497.invoke(Unknown Source)
              at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
              at java.base/java.lang.reflect.Method.invoke(Unknown Source)
              at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
              at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
              at py4j.Gateway.invoke(Gateway.java:282)
              at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
              at py4j.commands.CallCommand.execute(CallCommand.java:79)
              at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
              at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
              at java.base/java.lang.Thread.run(Unknown Source)

      This occurred while running our production Kubernetes Spark jobs (Spark 3.3.0) in a duplicate test environment. The only change was the image, which used Spark 3.4.0 or 3.4.1; the only jar version changes were the requisite dependencies.
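      For context, executor decommissioning on Kubernetes is controlled by Spark configuration along these lines (a sketch of the commonly used settings, not necessarily the exact configuration of the affected jobs):

      ```properties
      # Enable graceful executor decommissioning (available since Spark 3.1)
      spark.decommission.enabled=true
      # Migrate cached RDD and shuffle blocks off a decommissioning executor
      spark.storage.decommission.enabled=true
      spark.storage.decommission.rddBlocks.enabled=true
      spark.storage.decommission.shuffleBlocks.enabled=true
      ```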

      The current workaround is to retry the job, but this can cause substantial slowdowns when the failure occurs late in a long job.  Possibly related to https://issues.apache.org/jira/browse/SPARK-44389 ?
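      Until the root cause is fixed, the retry can be automated. A minimal sketch of a bounded-retry wrapper (plain Python; `job` here is a hypothetical stand-in for the failing `save` call, not an API from this report):

      ```python
      import time

      def retry(job, attempts=3, backoff_s=30):
          """Run `job` up to `attempts` times, sleeping between tries.

          Intended for transient failures such as the decommission-related
          stage failure above; re-raises the last error if every attempt fails.
          """
          last_exc = None
          for i in range(attempts):
              try:
                  return job()
              except Exception as exc:  # in practice, narrow this to the Spark/Py4J error
                  last_exc = exc
                  if i < attempts - 1:
                      time.sleep(backoff_s)
          raise last_exc
      ```

      In practice the except clause should match only the decommission failure, since retrying a deterministic error just wastes cluster time.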

      Attachments

        Activity


          People

            Assignee: Unassigned
            Reporter: Dale Huettenmoser (dhuett)

            Dates

              Created:
              Updated:
