[SPARK-19831] Sending the heartbeat master from worker maybe blocked by other rpc messages - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 2.2.0
Fix Version/s: 2.2.0
Component/s: Spark Core
Labels:
None

Description

Cleaning up an application can take a long time on the worker. Because the worker extends ThreadSafeRpcEndpoint, which processes messages one at a time, a slow cleanup blocks the worker from sending heartbeats to the master. If a worker's heartbeat is blocked behind an ApplicationFinished message, the master will consider the worker dead. If that worker is running a driver, the master will then schedule the driver again. I think this is a bug in Spark. The following suggestions may solve the problem:

1. Run the application cleanup in a separate asynchronous thread, such as 'cleanupThreadExecutor', so that it does not block other RPC messages such as SendHeartbeat.

2. Do not trigger the heartbeat to the master through the receive method, because any other RPC message may block receive, and the worker would then not process the heartbeat message in time. Instead, send the heartbeat to the master from an asynchronous timer thread.
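
The two suggestions above can be illustrated with a minimal, self-contained Java sketch. It is not Spark's Worker code and does not reflect the change made in the linked pull request: the class `HeartbeatSketch`, its fields, and its methods are hypothetical names chosen for illustration, and plain `java.util.concurrent` executors stand in for the RPC endpoint's message loop. The sketch assumes a single-threaded message loop, hands the slow cleanup to a separate executor, and sends heartbeats from a timer thread, so heartbeats keep flowing while a cleanup is still running.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a worker; none of these names come from Spark.
class HeartbeatSketch {
    // Stand-in for the endpoint's single message-handling thread.
    private final ExecutorService messageLoop = Executors.newSingleThreadExecutor();
    // Suggestion 1: a dedicated thread for slow application cleanup.
    private final ExecutorService cleanupExecutor = Executors.newSingleThreadExecutor();
    // Suggestion 2: a dedicated timer thread for heartbeats.
    private final ScheduledExecutorService heartbeatTimer =
            Executors.newSingleThreadScheduledExecutor();

    final AtomicInteger heartbeatsSent = new AtomicInteger();
    final AtomicInteger appsCleaned = new AtomicInteger();

    // Start periodic heartbeats; here a "heartbeat" is just a counter increment.
    void start(long intervalMs) {
        heartbeatTimer.scheduleAtFixedRate(
                heartbeatsSent::incrementAndGet, 0, intervalMs, TimeUnit.MILLISECONDS);
    }

    // The message loop only hands the cleanup off, so it returns immediately.
    void onApplicationFinished(Runnable slowCleanup) {
        messageLoop.execute(() -> cleanupExecutor.execute(() -> {
            slowCleanup.run();
            appsCleaned.incrementAndGet();
        }));
    }

    // Stop the timer, then drain the message loop before the cleanup executor.
    void stop() {
        heartbeatTimer.shutdownNow();
        try {
            messageLoop.shutdown();
            messageLoop.awaitTermination(5, TimeUnit.SECONDS);
            cleanupExecutor.shutdown();
            cleanupExecutor.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Returns the number of heartbeats sent while a cleanup was still blocked.
    static int heartbeatsDuringBlockedCleanup() {
        HeartbeatSketch worker = new HeartbeatSketch();
        CountDownLatch releaseCleanup = new CountDownLatch(1);
        worker.start(10);
        // Simulate a slow cleanup: it waits until the latch is released.
        worker.onApplicationFinished(() -> {
            try {
                releaseCleanup.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        try {
            // Wait (up to 5 seconds) for a few heartbeats while cleanup is blocked.
            long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(5);
            while (worker.heartbeatsSent.get() < 3 && System.nanoTime() < deadline) {
                Thread.sleep(5);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        int seen = worker.heartbeatsSent.get();
        releaseCleanup.countDown(); // let the cleanup finish
        worker.stop();
        return seen;
    }

    public static void main(String[] args) {
        System.out.println("heartbeats sent during cleanup: "
                + heartbeatsDuringBlockedCleanup());
    }
}
```

If the cleanup ran directly on the message loop and heartbeats were also dispatched through that loop, the counter would stay at zero until the cleanup returned, which is the failure mode described in this issue.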

Issue Links

links to

[Github] Pull Request #17189 (hustfxj)

People

Assignee: John Fang

Reporter: John Fang

Votes: 0

Watchers: 3

Dates

Created: 06/Mar/17 05:15

Updated: 12/Mar/17 17:29

Resolved: 12/Mar/17 17:29