Kudu / KUDU-3099

KuduBackup/KuduRestore System.exit(0) results in Spark on YARN failure with exitCode: 16


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.10.0, 1.11.0
    • Fix Version/s: 1.12.0
    • Component/s: backup, spark
    • Labels: None

    Description

      When KuduBackup or KuduRestore runs on YARN, the underlying Spark application can be reported as failed even though the backup/restore tasks completed successfully. The following is from the Spark driver log:

      INFO spark.SparkContext: Submitted application: Kudu Table Backup
      ..
      INFO spark.SparkContext: Starting job: save at KuduBackup.scala:90
      INFO scheduler.DAGScheduler: Got job 0 (save at KuduBackup.scala:90) with 200 output partitions
      scheduler.DAGScheduler: Final stage: ResultStage 0 (save at KuduBackup.scala:90)
      ..
      INFO scheduler.DAGScheduler: Submitting 200 missing tasks from ResultStage 0 (MapPartitionsRDD[2] at save at KuduBackup.scala:90) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
      INFO cluster.YarnClusterScheduler: Adding task set 0.0 with 200 tasks
      ..
      
      INFO cluster.YarnClusterScheduler: Removed TaskSet 0.0, whose tasks have all completed, from pool 
      INFO scheduler.DAGScheduler: Job 0 finished: save at KuduBackup.scala:90, took 20.007488 s
      ..
      INFO spark.SparkContext: Invoking stop() from shutdown hook
      ..
      INFO cluster.YarnClusterSchedulerBackend: Shutting down all executors
      ..
      INFO spark.SparkContext: Successfully stopped SparkContext
      INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 16, (reason: Shutdown hook called before final status was reported.)
      INFO util.ShutdownHookManager: Shutdown hook called

      Spark explicitly added this shutdown hook to catch System.exit() calls. If such a call occurs before the SparkContext stops, the application status is considered a failure:
      https://github.com/apache/spark/blob/branch-2.3/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L299
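
The ordering problem described above can be illustrated with a small model. This is only a sketch under stated assumptions, not Spark's actual code: the class `AmShutdownModel` and its methods are hypothetical, and the only values taken from the log are the FAILED status and exit code 16. It assumes the hook marks the application as failed whenever it runs before a final status has been recorded.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical, simplified model of the shutdown-hook check; not Spark's implementation. */
class AmShutdownModel {
    // Exit code reported in the driver log above.
    static final int EXIT_EARLY = 16;

    private final AtomicBoolean finished = new AtomicBoolean(false);
    private String finalStatus = "UNDEFINED";
    private int exitCode = 0;

    /** The application returned normally and its final status is recorded. */
    void reportSuccess() {
        if (finished.compareAndSet(false, true)) {
            finalStatus = "SUCCEEDED";
        }
    }

    /** The JVM shutdown hook fires, for example because System.exit() was called. */
    void onShutdownHook() {
        // Only the first caller decides the status; a later call is a no-op.
        if (finished.compareAndSet(false, true)) {
            finalStatus = "FAILED";
            exitCode = EXIT_EARLY;
        }
    }

    String finalStatus() { return finalStatus; }

    int exitCode() { return exitCode; }

    public static void main(String[] args) {
        // Status recorded first, hook second: the hook changes nothing.
        AmShutdownModel clean = new AmShutdownModel();
        clean.reportSuccess();
        clean.onShutdownHook();
        System.out.println(clean.finalStatus() + " " + clean.exitCode());

        // Hook first (early System.exit()): the application is marked failed.
        AmShutdownModel early = new AmShutdownModel();
        early.onShutdownHook();
        early.reportSuccess();
        System.out.println(early.finalStatus() + " " + early.exitCode());
    }
}
```

In this model the outcome depends only on which of the two calls happens first, which is why a job whose tasks all completed can still end up with a failed status.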

      The System.exit() call added as part of KUDU-2787 can trigger this race condition. That change was merged into the 1.10.x and 1.11.x branches.
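
One way to avoid this kind of race is for the entry point to return normally on success and to signal failure with an exception instead of an explicit exit. The sketch below only illustrates that idea and is not necessarily the change applied for this issue; `BackupLauncherSketch`, `runJob`, and `finish` are hypothetical names, and `runJob` is a stand-in for the real backup or restore work.

```java
/** Hypothetical entry point that never calls System.exit(0) on success. */
class BackupLauncherSketch {

    /** Stand-in for the real job; returns true when every task succeeded. */
    static boolean runJob(String[] args) {
        return true;
    }

    /** Turns the job result into normal return (success) or an exception (failure). */
    static void finish(boolean success) {
        if (!success) {
            // An uncaught exception lets the caller report the failure itself.
            throw new IllegalStateException("backup job failed");
        }
        // Success: fall through and return, so shutdown happens in the usual order.
    }

    public static void main(String[] args) {
        finish(runJob(args));
    }
}
```

With this shape, a successful run leaves the JVM through the normal return path of `main`, so no shutdown hook is triggered by the application's own code before the run has been accounted for.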

       

          People

            Assignee: Waleed Fateem (waleedfateem)
            Reporter: Waleed Fateem (waleedfateem)