[SPARK-15262] race condition in killing an executor and reregistering an executor - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 1.6.1
Fix Version/s: 1.6.2, 2.0.0
Component/s: Spark Core
Labels: None
Target Version/s: 1.6.2, 2.0.0

Description

There is a race condition when an executor is killed and reregisters at the same time. The following sequence of events reproduces it:

1. The master finds that a worker is dead.
2. The master tells the driver to remove the executor.
3. The driver removes the executor.
4. BlockManagerMasterEndpoint removes the block manager.
5. The executor finds via heartbeat that it is not registered.
6. The executor sends a request to reregister its block manager.
7. The block manager is registered again.
8. The executor is killed by the worker.
9. CoarseGrainedSchedulerBackend ignores onDisconnected because this address is not in the executor list.
10. BlockManagerMasterEndpoint.blockManagerInfo now contains a dead block manager.

Because BlockManagerMasterEndpoint.blockManagerInfo contains dead block managers, unpersisting an RDD, removing a broadcast, or cleaning a shuffle block through the RPC endpoint of a dead block manager fails with ClosedChannelException.
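
The interleaving above can be sketched with a small toy model. This is not Spark code: the classes ToyEndpoint, ToyBlockManagerMaster, and ToySchedulerBackend, the executor id "exec-1", and the use of ConnectionError in place of ClosedChannelException are all hypothetical stand-ins chosen for illustration. It only shows how the two registries can drift apart when the reregistration lands between the removal and the disconnect.

```python
class ToyEndpoint:
    """Hypothetical stand-in for the RPC endpoint of an executor's block manager."""

    def __init__(self):
        self.alive = True

    def send(self, message):
        # Stand-in for ClosedChannelException on a dead channel.
        if not self.alive:
            raise ConnectionError("channel closed")
        return "ok"


class ToyBlockManagerMaster:
    """Hypothetical stand-in for BlockManagerMasterEndpoint."""

    def __init__(self):
        self.block_manager_info = {}  # executor id -> endpoint

    def register(self, executor_id, endpoint):
        self.block_manager_info[executor_id] = endpoint

    def remove_executor(self, executor_id):
        self.block_manager_info.pop(executor_id, None)


class ToySchedulerBackend:
    """Hypothetical stand-in for CoarseGrainedSchedulerBackend."""

    def __init__(self, master):
        self.executors = set()
        self.master = master

    def remove_executor(self, executor_id):
        self.executors.discard(executor_id)
        self.master.remove_executor(executor_id)

    def on_disconnected(self, executor_id):
        # Unknown executors are ignored, so no block manager cleanup happens.
        if executor_id not in self.executors:
            return False
        self.remove_executor(executor_id)
        return True


def simulate_race():
    master = ToyBlockManagerMaster()
    backend = ToySchedulerBackend(master)
    endpoint = ToyEndpoint()
    backend.executors.add("exec-1")
    master.register("exec-1", endpoint)

    backend.remove_executor("exec-1")      # steps 1-4: driver and master forget it
    master.register("exec-1", endpoint)    # steps 5-7: still-alive executor reregisters
    endpoint.alive = False                 # step 8: worker kills the executor
    handled = backend.on_disconnected("exec-1")  # step 9: ignored

    # Step 10: the master still tracks an endpoint that is dead.
    stale = [e for e, ep in master.block_manager_info.items() if not ep.alive]

    # Consequence: a later cleanup message to the stale endpoint fails.
    try:
        master.block_manager_info["exec-1"].send("RemoveRdd")
        failed = False
    except ConnectionError:
        failed = True
    return handled, stale, failed
```

In this toy model, simulate_race() reports that the disconnect was not handled, that "exec-1" is left as a stale entry, and that the later send fails.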

Issue Links

relates to

SPARK-14559 Netty RPC didn't check channel is active before sending message (Resolved)

links to

[Github] Pull Request #13055 (andrewor14)

People

Assignee: Andrew Or

Reporter: Shixiong Zhu

Votes: 0

Watchers: 7

Dates

Created: 11/May/16 00:36

Updated: 11/May/16 22:30

Resolved: 11/May/16 20:37