[SPARK-24755] Executor loss can cause task to not be resubmitted - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.3.0
Fix Version/s: 2.3.3, 2.4.0
Component/s: Spark Core
Labels: None

Description

As part of SPARK-22074, when an executor is lost, TSM.executorLost currently checks "if (successful(index) && !killedByOtherAttempt(index))" to decide whether the task for a partition needs to be resubmitted.

Consider the following:

For partition P1, tasks T1 and T2 are running on exec-1 and exec-2 respectively (one of them being a speculative task).

T1 finishes successfully first.

This results in setting "killedByOtherAttempt(P1) = true" because T2 is still running.
We also end up killing task T2.

Now, if/when exec-1 goes MIA:
executorLost will no longer reschedule a task for P1, since killedByOtherAttempt(P1) == true, even though the output of P1 was hosted on exec-1 (produced by T1) and there is no other copy of P1 around (T2 was killed when T1 succeeded).
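
To make the sequence above concrete, here is a small toy model of it. This is only a sketch, not Spark's TaskSetManager code: the class `ToyTaskSet`, its fields, and the helper `run_scenario` are hypothetical names invented for illustration. It assumes one flag per partition and applies the check quoted in the description when an executor is lost.

```python
# Toy model of the scenario described above. NOT Spark's actual code:
# ToyTaskSet, its fields, and run_scenario are hypothetical illustration names.

class ToyTaskSet:
    def __init__(self, num_partitions):
        self.successful = [False] * num_partitions
        self.killed_by_other_attempt = [False] * num_partitions
        self.running = {}    # partition index -> executors with a running attempt
        self.output_on = {}  # partition index -> executor holding the successful output
        self.pending = []    # partition indices queued for resubmission

    def launch(self, index, executor):
        self.running.setdefault(index, set()).add(executor)

    def task_succeeded(self, index, executor):
        self.running[index].discard(executor)
        self.successful[index] = True
        self.output_on[index] = executor
        # Any attempt still running for this partition is killed, and the
        # per-partition flag is set as a side effect.
        if self.running[index]:
            self.killed_by_other_attempt[index] = True
            self.running[index].clear()

    def executor_lost(self, executor):
        for index, host in self.output_on.items():
            if host != executor:
                continue
            # The check quoted in the description.
            if self.successful[index] and not self.killed_by_other_attempt[index]:
                self.successful[index] = False
                self.pending.append(index)


def run_scenario(speculative):
    """Run P1 (index 0) on exec-1, optionally with a second attempt on exec-2,
    then lose exec-1. Returns the partitions queued for resubmission."""
    ts = ToyTaskSet(1)
    ts.launch(0, "exec-1")          # T1
    if speculative:
        ts.launch(0, "exec-2")      # T2
    ts.task_succeeded(0, "exec-1")  # T1 finishes first; T2 (if any) is killed
    ts.executor_lost("exec-1")
    return ts.pending


print(run_scenario(speculative=False))  # [0]: P1 is resubmitted
print(run_scenario(speculative=True))   # []: P1 is skipped although its only output is gone
```

In this model the only difference between the two runs is whether a second attempt was still running when T1 succeeded, which is enough to suppress the resubmission.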

I noticed this bug while reviewing PR #21653 for SPARK-13343.

Essentially, SPARK-22074 causes a regression (which I don't usually observe due to shuffle service, sigh), and as such the fix is broken IMO.

I don't have a PR handy for this, so if anyone wants to pick it up, please feel free!
+CC XuanYuan, who fixed SPARK-22074 initially.

Issue Links

links to

[Github] Pull Request #21729 (hthuynh2)

People

Assignee: Hieu Tri Huynh

Reporter: Mridul Muralidharan

Votes: 1

Watchers: 7

Dates

Created: 07/Jul/18 07:30

Updated: 19/Jul/18 14:53

Resolved: 19/Jul/18 14:53