[SPARK-32091] Ignore timeout error when remove blocks on the lost executor - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.4.0, 3.0.0
Fix Version/s: 3.1.0
Component/s: Spark Core
Labels: None

Description

When removing blocks (e.g. RDD, broadcast, shuffle), BlockManagerMasterEndpoint makes RPC calls to each known BlockManagerSlaveEndpoint to remove the specified blocks. The RPC call can time out when the executor has already been lost but the BlockManagerMasterEndpoint is only notified of the loss after the removal call has been issued. That timeout can in turn fail the whole query.

In this case, the error can simply be ignored, since the blocks on the lost executor can be considered removed already.
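
The idea can be illustrated with a minimal, self-contained sketch. It is not the code from the linked pull request: `RemoveBlocksSketch`, `ignoreTimeoutIfLost`, `removeOnAll`, `askRemove` and `isExecutorAlive` are hypothetical names, and the RPC layer is replaced by plain Scala `Future`s. It only shows the decision being proposed: a timeout from an executor that is no longer alive is replaced by a default reply, while a timeout from a live executor still fails the removal.

```scala
import java.util.concurrent.TimeoutException
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object RemoveBlocksSketch {
  // Hypothetical stand-in for the RPC reply: number of blocks removed on one executor.
  type RemovedCount = Int

  // Recovery handler: swallow the timeout only when the executor is already lost.
  // A timeout from an executor that is still alive is rethrown unchanged.
  def ignoreTimeoutIfLost(
      executorId: String,
      isExecutorAlive: String => Boolean,
      defaultValue: RemovedCount): PartialFunction[Throwable, RemovedCount] = {
    case t: TimeoutException =>
      if (!isExecutorAlive(executorId)) defaultValue else throw t
  }

  // Ask every known executor to remove its blocks and collect the replies.
  // `askRemove` is an assumed placeholder for the per-executor RPC call.
  def removeOnAll(
      executors: Seq[String],
      askRemove: String => Future[RemovedCount],
      isExecutorAlive: String => Boolean): Future[Seq[RemovedCount]] =
    Future.sequence(executors.map { id =>
      askRemove(id).recover(ignoreTimeoutIfLost(id, isExecutorAlive, 0))
    })
}
```

With this shape, a lost executor contributes the default value (here `0` blocks removed) instead of failing the combined future, so the caller's query can proceed.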

Issue Links

links to

[Github] Pull Request #28924 (Ngone51)

People

Assignee: wuyi

Reporter: wuyi

Votes: 0

Watchers: 3

Dates

Created: 24/Jun/20 15:01

Updated: 10/Jul/20 13:36

Resolved: 10/Jul/20 13:36