Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 3.0.1, 3.1.1, 3.1.2
Fix Version/s: None
Component/s: None
Description
This is reproducible with an application that uses fewer cores than are available on the workers:
E.g. with one application that has a single executor, when the worker hosting that executor is killed, the application does not get another executor assigned, even though there are enough resources left in the cluster. This seems to be a regression introduced by https://github.com/apache/spark/commit/51de86baed0776304c6184f2c04b6303ef48df90#diff-ca694acef669f50f9b45ca0d32ab6f5a516270bb26b33c4abb704e2dc00a1a03 .
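For reference, the setup can be approximated with a driver that caps its core usage below what the cluster offers, so that only one executor is launched. This is a minimal sketch; the master URL and core counts are illustrative, not taken from the report:

import org.apache.spark.{SparkConf, SparkContext}

object SingleExecutorRepro {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("spark://master-host:7077") // standalone master; host is illustrative
      .setAppName("single-executor-repro")
      .set("spark.cores.max", "12")          // cap the app below the cluster's total cores
      .set("spark.executor.cores", "12")     // so exactly one executor fits on one worker
    val sc = new SparkContext(conf)
    // Run something long enough to kill the executor's worker mid-job, then
    // watch the master log: no replacement executor is scheduled for this app.
    sc.parallelize(1 to 10000000, 100).map(_ + 1).count()
    sc.stop()
  }
}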
The regression causes an assertion error on the master, because it receives an ExecutorStateChanged message for a 'RUNNING' -> 'RUNNING' transition instead of 'RUNNING' -> 'FAILED':
2021-08-13 14:04:12,554 [dispatcher-event-loop-2] INFO : I have been elected leader! New state: ALIVE
2021-08-13 14:04:56,489 [dispatcher-event-loop-10] INFO : Registering worker 172.27.64.1:58636 with 12 cores, 30.7 GiB RAM
2021-08-13 14:04:59,949 [dispatcher-event-loop-6] INFO : Registering worker 172.27.64.1:58694 with 12 cores, 30.7 GiB RAM
2021-08-13 14:05:20,212 [dispatcher-event-loop-2] INFO : Registering app query-frontend-null-172.27.64.1
2021-08-13 14:05:20,212 [dispatcher-event-loop-2] INFO : Registered app query-frontend-null-172.27.64.1 with ID app-20210813140520-0000
2021-08-13 14:05:20,228 [dispatcher-event-loop-2] INFO : Launching executor app-20210813140520-0000/0 on worker worker-20210813140459-172.27.64.1-58694
2021-08-13 14:05:37,991 [dispatcher-event-loop-9] ERROR: Ignoring error
java.lang.AssertionError: assertion failed: executor 0 state transfer from RUNNING to RUNNING is illegal
        at scala.Predef$.assert(Predef.scala:223)
        at org.apache.spark.deploy.master.Master$$anonfun$receive$1.applyOrElse(Master.scala:323)
        at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
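The failing check can be modeled with a small self-contained sketch (ExecutorStateModel and its method names are hypothetical; the real assertion is the one at Master.scala:323 in the stack trace above): only a LAUNCHING -> RUNNING transition is accepted into RUNNING, so a second RUNNING report for the same executor trips the assertion.

object ExecutorStateModel {
  sealed trait ExecutorState
  case object LAUNCHING extends ExecutorState
  case object RUNNING extends ExecutorState

  def onExecutorStateChanged(execId: Int, oldState: ExecutorState, newState: ExecutorState): Unit = {
    if (newState == RUNNING) {
      // Mirrors the master-side check: entering RUNNING is only legal from LAUNCHING.
      assert(oldState == LAUNCHING,
        s"executor $execId state transfer from $oldState to RUNNING is illegal")
    }
  }

  def main(args: Array[String]): Unit = {
    onExecutorStateChanged(0, LAUNCHING, RUNNING) // normal launch path: passes
    onExecutorStateChanged(0, RUNNING, RUNNING)   // RUNNING re-reported: throws AssertionError
  }
}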