Flink / FLINK-17947

Retry REST requests if RpcEndpoint died before responding to request

Details

    Description

      Currently, a REST handler can send a request to a leader RpcEndpoint that shuts down before it has a chance to respond (e.g. because it lost leadership). In this case, the ActorSystem completes the request with an AskTimeoutException carrying the message "Recipient[Actor[akka://flink/user/rpc/dispatcher_1#-1875884516]] had already been terminated." This exception is handled like any other exception and forwarded to the REST client. The client treats it as a normal timeout exception, which causes the operation (e.g. requesting job details) to fail.

      I was wondering whether this case should be handled differently. If the REST handler responded with a SERVICE_UNAVAILABLE HTTP response code, the RestClusterClient would retry the operation. One could treat it as if no leader had been available in the first place, which is similar to the situation where there is no current leader and we are waiting for the leader election to finish. Alternatively, we could extend the RestClusterClient.isConnectionProblemOrServiceUnavailable predicate to also cover AskTimeoutExceptions that report a terminated recipient.
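
      To make the first option concrete, here is a minimal, self-contained sketch of how a handler could tell the "recipient terminated" case apart from an ordinary timeout and map it to 503. This is not Flink code: the class TerminatedRecipientSketch, its methods, and the nested AskTimeoutException are hypothetical stand-ins so that the example compiles without Akka or Flink on the classpath. Matching on the message text is an assumption based on the message quoted above.

```java
import java.util.concurrent.CompletionException;
import java.util.concurrent.TimeoutException;

public class TerminatedRecipientSketch {

    // Hypothetical stand-in for Akka's AskTimeoutException, so the sketch
    // compiles on its own. Real code would check the Akka class instead.
    static class AskTimeoutException extends TimeoutException {
        AskTimeoutException(String message) {
            super(message);
        }
    }

    static final int SERVICE_UNAVAILABLE = 503;
    static final int INTERNAL_SERVER_ERROR = 500;

    // Assumption: the terminated-recipient case is only recognizable by
    // this fragment of the exception message quoted in the description.
    static final String TERMINATED_MARKER = "had already been terminated";

    // Walks the cause chain, because futures usually wrap the original
    // failure (e.g. in a CompletionException).
    static boolean isRecipientTerminated(Throwable failure) {
        for (Throwable cur = failure; cur != null; cur = cur.getCause()) {
            if (cur instanceof AskTimeoutException
                    && cur.getMessage() != null
                    && cur.getMessage().contains(TERMINATED_MARKER)) {
                return true;
            }
        }
        return false;
    }

    // Maps a failure to the HTTP status the handler would send back.
    // A 503 lets a retrying client treat it like "no leader available".
    static int toHttpStatus(Throwable failure) {
        return isRecipientTerminated(failure)
                ? SERVICE_UNAVAILABLE
                : INTERNAL_SERVER_ERROR;
    }

    public static void main(String[] args) {
        Throwable terminated = new CompletionException(
                new AskTimeoutException(
                        "Recipient[Actor[akka://flink/user/rpc/dispatcher_1#-1875884516]]"
                                + " had already been terminated."));
        Throwable plainTimeout = new CompletionException(
                new AskTimeoutException("Ask timed out after the configured timeout."));

        System.out.println(toHttpStatus(terminated));   // prints 503
        System.out.println(toHttpStatus(plainTimeout)); // prints 500
    }
}
```

      The same isRecipientTerminated check could instead be applied on the client side, inside the retry predicate, which is the second option above. Either way, matching on the message string is fragile and would break if the message wording changes.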

      cc chesnay

            People

              Assignee: Unassigned
              Reporter: Till Rohrmann (trohrmann)
              Votes: 0
              Watchers: 3
