Details
- Type: Improvement
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Fix Version/s: 3.4.0
- Labels: None
Description
There are two possible causes of an RPC failure: Task Failure and Network Failure.
(1) Task Failure: The network is healthy, but the task crashes the executor's JVM, so the RPC fails.
(2) Network Failure: The executor is healthy, but the network between the Driver and the Executor is broken, so the RPC fails.
These two kinds of failure should be handled differently. If the failure is a Task Failure, the variable `numFailures` should be incremented; once `numFailures` reaches the threshold (`spark.task.maxFailures`), Spark marks the job as failed. If the failure is a Network Failure, `numFailures` should not be incremented; the task should simply be reassigned to a new executor, so the job is not marked as failed because of a network problem.
However, Spark currently treats every RPC failure as a Task Failure, which causes unnecessary Spark job failures.
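A minimal sketch of the proposed distinction, in Scala. All names here (`RpcFailureReason`, `handleRpcFailure`, `maxTaskFailures`, `TaskState`) are hypothetical and only illustrate the intended behavior; they are not the actual TaskSetManager/TaskSchedulerImpl API:

```scala
// Illustrative sketch only: types and method names are hypothetical and do not
// mirror Spark's real scheduler classes.
object RpcFailureHandlingSketch {

  sealed trait RpcFailureReason
  case object TaskFailure extends RpcFailureReason     // task crashed the executor's JVM
  case object NetworkFailure extends RpcFailureReason  // driver-executor link broke

  final case class TaskState(numFailures: Int = 0)

  // Plays the role of the spark.task.maxFailures threshold.
  val maxTaskFailures = 4

  /** Returns the updated task state and whether the job should be marked failed. */
  def handleRpcFailure(state: TaskState, reason: RpcFailureReason): (TaskState, Boolean) =
    reason match {
      case TaskFailure =>
        // Count the failure; too many task failures fail the whole job.
        val updated = state.copy(numFailures = state.numFailures + 1)
        (updated, updated.numFailures >= maxTaskFailures)
      case NetworkFailure =>
        // Do not count the failure; just reschedule the task on another executor.
        (state, false)
    }

  def main(args: Array[String]): Unit = {
    val afterNetwork = handleRpcFailure(TaskState(numFailures = 3), NetworkFailure)
    val afterTask    = handleRpcFailure(TaskState(numFailures = 3), TaskFailure)
    println(s"network failure -> $afterNetwork") // (TaskState(3), false): job keeps running
    println(s"task failure    -> $afterTask")    // (TaskState(4), true):  job marked failed
  }
}
```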