Kudu / KUDU-3099

KuduBackup/KuduRestore System.exit(0) results in Spark on YARN failure with exitCode: 16


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.10.0, 1.11.0
    • Fix Version/s: 1.12.0
    • Component/s: backup, spark
    • Labels: None

    Description

      When running KuduBackup/KuduRestore on YARN, the underlying Spark application can fail even when the backup/restore tasks themselves complete successfully. The following excerpt is from the Spark driver log:

      INFO spark.SparkContext: Submitted application: Kudu Table Backup
      ..
      INFO spark.SparkContext: Starting job: save at KuduBackup.scala:90
      INFO scheduler.DAGScheduler: Got job 0 (save at KuduBackup.scala:90) with 200 output partitions
      scheduler.DAGScheduler: Final stage: ResultStage 0 (save at KuduBackup.scala:90)
      ..
      INFO scheduler.DAGScheduler: Submitting 200 missing tasks from ResultStage 0 (MapPartitionsRDD[2] at save at KuduBackup.scala:90) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
      INFO cluster.YarnClusterScheduler: Adding task set 0.0 with 200 tasks
      ..
      
      INFO cluster.YarnClusterScheduler: Removed TaskSet 0.0, whose tasks have all completed, from pool 
      INFO scheduler.DAGScheduler: Job 0 finished: save at KuduBackup.scala:90, took 20.007488 s
      ..
      INFO spark.SparkContext: Invoking stop() from shutdown hook
      ..
      INFO cluster.YarnClusterSchedulerBackend: Shutting down all executors
      ..
      INFO spark.SparkContext: Successfully stopped SparkContext
      INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 16, (reason: Shutdown hook called before final status was reported.)
      INFO util.ShutdownHookManager: Shutdown hook called

      Spark explicitly adds this shutdown hook to catch System.exit() calls; if the hook runs before the SparkContext has stopped and the final status has been reported, the application status is considered a failure:
      https://github.com/apache/spark/blob/branch-2.3/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L299

      The System.exit() call added as part of KUDU-2787 can trigger this race condition; that change was merged into the 1.10.x and 1.11.x branches.
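      The race can be reproduced outside Spark with a plain JVM shutdown hook. The sketch below is a minimal analogue of the behavior described above, not Spark's actual ApplicationMaster code: ShutdownRace, finalStatusReported, hookStatus, and reportFinalStatus are hypothetical names standing in for the AM's internal state.

      ```scala
      object ShutdownRace {
        // Hypothetical flag standing in for the ApplicationMaster's
        // "final status was reported" state.
        @volatile private var finalStatusReported = false

        // The hook's decision: if the JVM starts shutting down before the
        // final status was reported, the application is marked FAILED.
        def hookStatus(): String =
          if (finalStatusReported) "SUCCEEDED" else "FAILED, exitCode: 16"

        // Normally runs on the success path before the JVM exits.
        def reportFinalStatus(): Unit = finalStatusReported = true

        def main(args: Array[String]): Unit = {
          // Analogue of the shutdown hook Spark installs to catch
          // System.exit() calls (Scala 2.12+ SAM conversion for Runnable).
          Runtime.getRuntime.addShutdownHook(new Thread(() =>
            println(s"Final app status: ${hookStatus()}")))

          // A success-path System.exit(0), as added by KUDU-2787, fires the
          // hook before reportFinalStatus() ever runs: the hook then reports
          // FAILED even though all tasks completed.
          System.exit(0)
        }
      }
      ```

      The fix direction implied by the description is ordering: ensure the SparkContext is stopped (and the final status reported) before any explicit exit, rather than letting System.exit() race the shutdown hook.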


          People

            Assignee: Waleed Fateem
            Reporter: Waleed Fateem
