Spark / SPARK-28305

When the AM cannot obtain the container loss reason, the GetExecutorLossReason ask times out


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Trivial
    • Resolution: Incomplete
    • Affects Version/s: 2.4.0
    • Fix Version/s: None
    • Component/s: Spark Core, YARN

    Description

      In some cases, such as when the NM machine crashes or shuts down, the driver sends a
      GetExecutorLossReason ask to the AM, but the AM's getCompletedContainersStatuses cannot
      obtain the container's failure information.

      Because the YARN NM detection timeout is 10 minutes, controlled by the parameter
      yarn.resourcemanager.rm.container-allocation.expiry-interval-ms, the AM has to wait
      10 minutes before it can learn the cause of the container failure.

      Although the driver's ask fails, it will call recover and fall back to marking the
      executor as slave lost. However, because of the 2-minute timeout (spark.network.timeout)
      enforced by IdleStateHandler, the connection between the driver and the AM is closed
      first: the AM exits, the application finishes, and the driver exits, causing the job to
      fail.
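      The two timeouts interact badly: YARN needs up to 10 minutes to report the loss reason,
      while the driver/AM connection is torn down after 2 minutes. Below is a minimal Scala
      sketch of one possible mitigation, assuming that raising the Spark-side timeouts above
      the YARN expiry window keeps the connection alive long enough; the 720s value is
      illustrative only and is not part of this issue's resolution.

      import org.apache.spark.{SparkConf, SparkContext}

      // Sketch only: raise the driver<->AM RPC timeouts above the ~10 minute
      // YARN expiry window so IdleStateHandler does not close the connection
      // while GetExecutorLossReason is still pending. 720s is illustrative.
      val conf = new SparkConf()
        .setAppName("loss-reason-timeout-workaround")
        .set("spark.network.timeout", "720s") // idle timeout used by IdleStateHandler
        .set("spark.rpc.askTimeout", "720s")  // timeout for the driver's ask to the AM

      val sc = new SparkContext(conf)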

      AM LOG:

      19/07/08 16:56:48 [dispatcher-event-loop-0] INFO YarnAllocator: add executor 951 to pendingLossReasonRequests for get the loss reason
      19/07/08 16:58:48 [dispatcher-event-loop-26] INFO ApplicationMaster$AMEndpoint: Driver terminated or disconnected! Shutting down.
      19/07/08 16:58:48 [dispatcher-event-loop-26] INFO ApplicationMaster: Final app status: SUCCEEDED, exitCode: 0

      Driver LOG:

      19/07/08 16:58:48,476 [rpc-server-3-3] ERROR TransportChannelHandler: Connection to /xx.xx.xx.xx:19398 has been quiet for 120000 ms while there are outstanding requests. Assuming connection is dead; please adjust spark.network.timeout if this is wrong.
      19/07/08 16:58:48,476 [rpc-server-3-3] ERROR TransportResponseHandler: Still have 1 requests outstanding when connection from /xx.xx.xx.xx:19398 is closed
      19/07/08 16:58:48,510 [rpc-server-3-3] WARN NettyRpcEnv: Ignored failure: java.io.IOException: Connection from /xx.xx.xx.xx:19398 closed
      19/07/08 16:58:48,516 [netty-rpc-env-timeout] WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to get executor loss reason for executor id 951 at RPC address xx.xx.xx.xx:49175, but got no response. Marking as slave lost.
      org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply from null in 120 seconds. This timeout is controlled by spark.rpc.askTimeout
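
      The last two driver log lines show the fallback path: the ask times out
      (spark.rpc.askTimeout) and the executor is marked as slave lost. The following is a
      simplified Scala model of that behavior using plain Futures rather than Spark's internal
      RPC classes; lossReasonOrSlaveLost, askAm, and the case classes are illustrative
      stand-ins, not Spark code.

      import java.util.concurrent.TimeoutException
      import scala.concurrent.{Await, Future}
      import scala.concurrent.duration._

      // Simplified model of the driver-side fallback seen in the log above:
      // ask the AM for the executor's loss reason and, if no reply arrives
      // before the ask timeout, mark the executor as slave lost instead.
      sealed trait ExecutorLossReason
      case class ExecutorExited(exitCode: Int, message: String) extends ExecutorLossReason
      case class SlaveLost(message: String) extends ExecutorLossReason

      def lossReasonOrSlaveLost(askAm: () => Future[ExecutorLossReason],
                                askTimeout: FiniteDuration): ExecutorLossReason = {
        try {
          // In this issue the AM parks the request in pendingLossReasonRequests
          // and never replies within the timeout, so Await.result throws.
          Await.result(askAm(), askTimeout)
        } catch {
          case _: TimeoutException =>
            SlaveLost("Attempted to get executor loss reason, but got no response. " +
              "Marking as slave lost.")
        }
      }

      // An AM that never answers reproduces the timeout path.
      println(lossReasonOrSlaveLost(() => Future.never, 2.seconds))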


    People

      Assignee: Unassigned
      Reporter: dzcxzl
      Votes: 0
      Watchers: 2
