[FLINK-17947] Retry REST requests if RpcEndpoint died before responding to request - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Not a Priority
Resolution: Unresolved
Affects Version/s: 1.10.1, 1.11.0
Fix Version/s: None
Component/s: Runtime / REST
Labels:
- auto-deprioritized-major
- auto-deprioritized-minor

Description

Currently, a REST handler can send a request to a leader RpcEndpoint that shuts down (e.g. because it lost leadership) before it has a chance to respond. In this case, the ActorSystem answers with an AskTimeoutException carrying the message "Recipient[Actor[akka://flink/user/rpc/dispatcher_1#-1875884516]] had already been terminated." This exception is handled like any other exception and forwarded to the REST client. There it is treated as a normal timeout exception, which causes the operation (e.g. requesting job details) to fail.

I was wondering whether this case should be handled slightly differently. If the REST handler responded with a SERVICE_UNAVAILABLE HTTP response code, the RestClusterClient would retry the operation. One could treat it as if no leader had been available in the first place, which is similar to the situation where there is no current leader and we are waiting for the leader election to finish. Alternatively, we could extend the RestClusterClient.isConnectionProblemOrServiceUnavailable predicate to also cover this special kind of AskTimeoutException.
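To make the proposal concrete, here is a minimal sketch of the detection logic, not an actual patch. It assumes the terminated-recipient case can be recognized by walking the cause chain for an AskTimeoutException whose message contains "had already been terminated". The class TerminatedRecipientSketch, its nested AskTimeoutException stand-in, the method names, the 500 fallback, and the "Ask timed out" message are all hypothetical and only there to keep the example free of Flink and Akka dependencies.

```java
import java.util.concurrent.CompletionException;
import java.util.concurrent.TimeoutException;

public class TerminatedRecipientSketch {

    /** Hypothetical stand-in for akka.pattern.AskTimeoutException (no Akka dependency here). */
    static class AskTimeoutException extends TimeoutException {
        AskTimeoutException(String message) {
            super(message);
        }
    }

    static final int SERVICE_UNAVAILABLE = 503;
    // Assumption: any other failure maps to 500 in this sketch.
    static final int INTERNAL_SERVER_ERROR = 500;

    /**
     * Walks the cause chain and reports whether it contains an AskTimeoutException
     * saying that the recipient had already been terminated. The same check could
     * back an extended client-side retry predicate.
     */
    static boolean isRecipientTerminated(Throwable throwable) {
        for (Throwable t = throwable; t != null; t = t.getCause()) {
            String message = t.getMessage();
            if (t instanceof AskTimeoutException
                    && message != null
                    && message.contains("had already been terminated")) {
                return true;
            }
        }
        return false;
    }

    /** Handler-side variant: answer 503 so that a client retrying on 503 tries again. */
    static int toHttpStatus(Throwable failure) {
        return isRecipientTerminated(failure) ? SERVICE_UNAVAILABLE : INTERNAL_SERVER_ERROR;
    }

    public static void main(String[] args) {
        Throwable terminated = new CompletionException(
                new AskTimeoutException(
                        "Recipient[Actor[akka://flink/user/rpc/dispatcher_1#-1875884516]] "
                                + "had already been terminated."));
        // Illustrative message for an ordinary ask timeout.
        Throwable plainTimeout = new CompletionException(
                new AskTimeoutException("Ask timed out after [10000 ms]."));

        System.out.println(toHttpStatus(terminated));   // prints 503
        System.out.println(toHttpStatus(plainTimeout)); // prints 500
    }
}
```

Matching on the message text is brittle, so a real change would probably want a more robust signal. The sketch only illustrates that the two options (503 from the handler, or a wider client predicate) can share one check.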

cc chesnay

Issue Links

causes: FLINK-17750 YARNHighAvailabilityITCase.testKillYarnSessionClusterEntrypoint failed on azure (Closed)

People

Assignee: Unassigned

Reporter: Till Rohrmann

Votes: 0

Watchers: 3

Dates

Created: 26/May/20 14:46

Updated: 27/Nov/21 23:07