As requested, here's a design doc:
Client Retry/Fail Over in IPC Overview
The goal of
HDFS-1973 is to provide a facility for clients to fail over and retry an RPC to a different NN in the event of failure, in support of HDFS HA. Since the HDFS RPC mechanisms are built on top of the Common IPC library, this is a natural place to implement this functionality. Furthermore, since client fail over in general has challenges which are not HDFS-specific, we may be able to reuse this mechanism for other HA services in the future (e.g. perhaps the MRv2 Resource Manager.)
This mechanism should be able to support the following:
- A way to support multiple distinct ways for determining which object an RPC should be attempted against.
- A way of specifying customized failover strategies. For example, some HA service may only support a pair of machines to serve requests, while another might support an arbitrary number. The strategy for failing over to a new machine might be different for these cases.
- The method for specifying a retry/failover strategy should be able to control both retry and failover logic. For example, a strategy may want to try a request to server A once, attempt a failover to server B immediately, fail back to server A and then retry several times with some amount of backoff.
- A way for a remote process to indicate that it is not the appropriate process to serve the request, and that an attempt should be made to another process.
- A way of specifying that some operations in an IPC interface are not safe to be failed over and retried.
The Common IPC library already supports retrying an IPC against the same remote process. This is done via the classes in the o.a.h.io.retry package. The exact desired retry semantics can be specified using an implementation of the RetryPolicy interface. An implementation of this interface is responsible for determining whether or not a particular method call should be retried given the number of times its already been tried and the particular exception which caused the method to fail. An implementer of this interface can either choose to have the failed method retried, or the exception re-thrown.
In a sense, client-side failover is exactly the same as the existing retry mechanism, except method calls can be retried against a different proxy object. Thus, this JIRA proposes to extend the existing retry facility to also support client failover.
Presently, the RetryPolicy.shouldRetry method only returns a boolean to indicate whether a method invocation should be retried or considered to have failed. This can be augmented to return an enum value to indicate whether a particular method invocation should fail, be retried against the same object, or retried on another object. In order to determine what object a method should be tried against, this JIRA introduces the concept of a FailoverProxyProvider. An implementer of this interface is capable of getting a proxy object and initiating a client failover when the result of RetryPolicy.shouldRetry indicates a failover should be performed for the particular RetryPolicy the RetryInvocationHandler is configured to use. This addresses goals 1, 2, and 3 from above.
To address goal 4, this JIRA introduces a new exception type - StandbyException - which can be thrown by remote processes to indicate that it is not the appropriate process to handle requests for a given service at this time. RetryPolicy implementations may choose to handle this exception differently than other exception types when determining whether or not to retry or fail over an operation.
Though there may be circumstances in which a client may desire a more complex retry/failover strategy, most clients will want to failover only on network exceptions in which an RPC is guaranteed to have not reached the remote process, or in cases in which the particular method to be retried will have no mutative ill-effects if retried (e.g. read operations or idempotent write operations.) This JIRA thus introduces a new RetryPolicy implementation - FailoverOnNetworkExceptionRetry. This retry policy fails over whenever a method call fails in such a way as to guarantee that it did not reach the original remote process, or when retrying a method which can be safely retried. In order to know which methods of an IPC interface can be safely retried, this JIRA introduces an @Idempotent method annotation which, if present, will be passed on to the retry policy by the RetryInvocationHandler when determining whether or not to retry a method. This addresses goal 5.