  Spark / SPARK-44542

eagerly load SparkExitCode class in SparkUncaughtExceptionHandler


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Trivial
    • Resolution: Fixed
    • Affects Version/s: 3.1.3, 3.3.2, 3.4.1
    • Fix Version/s: 3.5.0
    • Component/s: Spark Core
    • Labels: None

    Description

      There are two pieces of background for this improvement proposal:

      1. When running Spark on YARN, a disk may become corrupted while the application is running. The corrupted disk may hold the Spark jars (the cached archive from spark.yarn.archive). In that case, the executor JVM can no longer load any Spark-related classes.

      2. Spark uses the OutputCommitCoordinator to avoid data races between speculative tasks, so that no two tasks can commit the same partition at the same time. In other words, once a task's commit request is granted, other commit requests are denied until the committing task fails (a minimal sketch of this rule follows).
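
      The rule above can be illustrated with a minimal sketch. This is not Spark's actual OutputCommitCoordinator, only the decision rule it enforces as described here; the class and method names are illustrative:

      {code:scala}
      // Minimal sketch of the "first commit request wins" rule described above.
      // Not Spark's OutputCommitCoordinator; names and types are illustrative only.
      class CommitRuleSketch {
        // partition id -> task attempt currently authorized to commit
        private val authorized = scala.collection.mutable.Map[Int, Long]()

        def canCommit(partition: Int, attempt: Long): Boolean = synchronized {
          authorized.get(partition) match {
            case None            => authorized(partition) = attempt; true // first request wins
            case Some(`attempt`) => true                                  // holder may proceed
            case Some(_)         => false                                 // everyone else is denied
          }
        }

        // The authorization is only released once the holder is reported as failed; a hung
        // (but never failed) holder therefore blocks every speculative attempt indefinitely.
        def attemptFailed(partition: Int, attempt: Long): Unit = synchronized {
          if (authorized.get(partition).contains(attempt)) authorized -= partition
        }
      }
      {code}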

       

      We encountered a corner case combining the two situations above, which caused the Spark job to hang. A short timeline is described below:

      1. Task 5372 (TID 21662) starts running at 21:55.
      2. Around 22:00, the disk holding the Spark archive for that task/executor becomes corrupted, making the archive inaccessible from the executor JVM's perspective.
      3. The task keeps running; at 22:05 it requests commit permission from the coordinator and performs the commit.
      4. However, due to the corrupted disk, an exception is raised in the executor JVM.
      5. The SparkUncaughtExceptionHandler kicks in; but because the jar/disk is corrupted, the handler itself throws an exception, and the halt path throws an exception as well (see the sketch after this list).
      6. The executor hangs there and runs no more tasks, yet the authorized commit request is still considered valid on the driver side.
      7. Speculative tasks start to kick in, but since none of them can obtain commit permission, they are all denied/killed.
      8. The job hangs until our SRE kills the container from outside.
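
      Step 5 is the crux: JVM classes are loaded lazily on first reference, so if SparkExitCode has never been touched before the failure and its jar is no longer readable, the reference inside the handler's fallback path throws NoClassDefFoundError and the halt never runs. Below is a condensed sketch of the handler's shape (not the exact Spark source; ExitCodes is a stand-in for org.apache.spark.util.SparkExitCode with illustrative values):

      {code:scala}
      // Stand-in for org.apache.spark.util.SparkExitCode; the values are illustrative.
      object ExitCodes {
        val UNCAUGHT_EXCEPTION = 50
        val UNCAUGHT_EXCEPTION_TWICE = 51
        val OOM = 52
      }

      // Condensed sketch of the handler's shape, not the exact Spark source.
      class HandlerSketch extends Thread.UncaughtExceptionHandler {
        override def uncaughtException(thread: Thread, exception: Throwable): Unit = {
          try {
            // Normal path: exit with a well-known code. On the JVM, the first reference
            // to ExitCodes is what triggers loading that class from disk.
            exception match {
              case _: OutOfMemoryError => System.exit(ExitCodes.OOM)
              case _                   => System.exit(ExitCodes.UNCAUGHT_EXCEPTION)
            }
          } catch {
            // Fallback path: halt the JVM. If the class was never loaded and its jar is now
            // unreadable (corrupted disk), this reference throws NoClassDefFoundError instead,
            // the halt never executes, and the executor JVM lingers without exiting.
            case _: Throwable => Runtime.getRuntime.halt(ExitCodes.UNCAUGHT_EXCEPTION_TWICE)
          }
        }
      }
      {code}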

      Some screenshots are provided below.

      For this specific case, I'd like to propose eagerly loading the SparkExitCode class in SparkUncaughtExceptionHandler, so that the halt process can be executed rather than throwing an exception because SparkExitCode is not loadable in the scenario above.
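
      A minimal sketch of the proposed change (one possible shape, not necessarily the exact patch; it reuses the ExitCodes stand-in from the sketch above): resolve the exit codes when the handler is constructed, i.e. while the jars are still readable, so the fallback halt no longer needs to trigger class loading.

      {code:scala}
      // Sketch of the proposal: force the exit-code class to load eagerly at handler
      // construction time (executor startup), so the catch block only touches fields
      // that are already resolved. Reuses the ExitCodes stand-in from the sketch above.
      class EagerHandlerSketch extends Thread.UncaughtExceptionHandler {
        private val oomCode = ExitCodes.OOM                                 // eager load happens here,
        private val uncaughtCode = ExitCodes.UNCAUGHT_EXCEPTION             // while the jars are
        private val uncaughtTwiceCode = ExitCodes.UNCAUGHT_EXCEPTION_TWICE  // still readable

        override def uncaughtException(thread: Thread, exception: Throwable): Unit = {
          try {
            exception match {
              case _: OutOfMemoryError => System.exit(oomCode)
              case _                   => System.exit(uncaughtCode)
            }
          } catch {
            // Only already-resolved local fields are touched here, so the halt can run
            // even if the Spark jars have become unreadable in the meantime.
            case _: Throwable => Runtime.getRuntime.halt(uncaughtTwiceCode)
          }
        }
      }
      {code}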

      Attachments

        Activity


          People

            Assignee: advancedxy YE
            Reporter: advancedxy YE
            Votes: 0
            Watchers: 2

            Dates

              Created:
              Updated:
              Resolved:
