Spark / SPARK-44542

eagerly load SparkExitCode class in SparkUncaughtExceptionHandler



    • Type: Improvement
    • Status: Resolved
    • Priority: Trivial
    • Resolution: Fixed
    • Affects Version/s: 3.1.3, 3.3.2, 3.4.1
    • Fix Version/s: 3.5.0
    • Component/s: Spark Core
    • Labels: None


      There are two pieces of background for this improvement proposal:

      1. When running Spark on YARN, a disk may become corrupted while the application is running. The corrupted disk might hold the Spark jars (the cached archive from spark.yarn.archive). In that case, the executor JVM can no longer load any Spark-related classes.

      2. Spark leverages the OutputCommitCoordinator to avoid data races between speculative tasks, so that no two task attempts can commit the same partition at the same time. In other words, once one task's commit request is authorized, all other commit requests for that partition are denied until the committing task fails.
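The commit arbitration described above can be sketched roughly as follows. This is a simplified, hypothetical Java model of the behavior, not the actual OutputCommitCoordinator code; the class and method names are illustrative:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of first-wins commit arbitration: the first task
// attempt to request a commit for a partition is authorized; later attempts
// are denied until the winning attempt is reported as failed.
public class CommitCoordinatorSketch {
    // partition id -> task attempt id currently authorized to commit
    private final Map<Integer, Long> authorized = new ConcurrentHashMap<>();

    public synchronized boolean canCommit(int partition, long attemptId) {
        Long current = authorized.get(partition);
        if (current == null) {
            authorized.put(partition, attemptId); // first request wins
            return true;
        }
        return current == attemptId; // everyone else is denied
    }

    // When the authorized attempt fails, the permission is released so a
    // speculative attempt can commit instead.
    public synchronized void attemptFailed(int partition, long attemptId) {
        authorized.remove(partition, attemptId);
    }
}
```

The corner case below arises exactly because step 6 of the timeline never reaches `attemptFailed`: the executor hangs instead of dying, so the driver keeps the stale authorization forever.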


      We encountered a corner case combining the two situations above, which makes Spark hang. A short timeline is described below:

      1. Task 5372 (TID 21662) starts running at 21:55.
      2. Around 22:00, the disk holding the Spark archive for that task/executor becomes corrupted, making the archive inaccessible from the executor JVM's perspective.
      3. The task continues running; at 22:05 it requests commit permission from the coordinator and starts the commit.
      4. However, due to the corrupted disk, an exception is raised in the executor JVM.
      5. The SparkUncaughtExceptionHandler kicks in, but because the jar/disk is corrupted, the handler itself throws an exception, and the halt process throws an exception too.
      6. The executor hangs with no tasks running, yet the authorized commit request remains valid on the driver side.
      7. Speculative tasks start to kick in, but with no commit permission available, all speculative tasks are killed/denied.
      8. The job hangs until our SRE kills the container from outside.

      Some screenshots are provided below.

      For this specific case, I'd like to propose eagerly loading the SparkExitCode class in
      SparkUncaughtExceptionHandler, so that the halt process can be executed rather than throwing an exception because SparkExitCode is not loadable in the scenario above.
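The idea behind the proposal can be sketched as follows. This is a simplified Java sketch rather than Spark's actual (Scala) implementation, and the class and field names are illustrative; the point is that referencing the exit-code class when the handler is constructed forces the JVM to load and initialize it up front, so the handler's error path needs no further class loading at crash time:

```java
// Hypothetical sketch of eager class loading in an uncaught-exception
// handler. If the jar on disk later becomes unreadable, the handler no
// longer hits a NoClassDefFoundError on its error path and can still
// halt the JVM.
public class EagerHandlerSketch implements Thread.UncaughtExceptionHandler {

    // Stand-in for Spark's SparkExitCode constants; an enum is used so
    // that referencing a constant genuinely triggers class initialization
    // (primitive static finals would be inlined at compile time).
    enum ExitCode {
        UNCAUGHT_EXCEPTION(50),
        OOM(52);

        final int code;
        ExitCode(int code) { this.code = code; }
    }

    // Touching an enum constant here initializes ExitCode eagerly, at
    // handler construction time rather than lazily at crash time.
    private final ExitCode defaultCode = ExitCode.UNCAUGHT_EXCEPTION;

    @Override
    public void uncaughtException(Thread t, Throwable e) {
        ExitCode code = (e instanceof OutOfMemoryError) ? ExitCode.OOM : defaultCode;
        // halt() skips shutdown hooks; by this point no class loading from
        // disk is required, so the process exits even with a corrupted jar.
        Runtime.getRuntime().halt(code.code);
    }
}
```

With this shape, step 5 of the timeline above would end with the JVM halting instead of the handler itself failing, letting the driver release the commit authorization to a speculative attempt.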




            advancedxy YE