Client Failover overview
On a failover between the active and standby NNs, clients must be redirected to the new active NN. The goal of HDFS-1623 is to provide a framework for HDFS HA which can support multiple underlying mechanisms. As such, the client failover approach should likewise support multiple options.
Cases to support
- Proxy-based client failover. Clients always communicate with an in-band proxy service which forwards all RPCs on to the correct NN. On failure, a process causes this proxy to begin sending requests to the now-active NN.
- Virtual IP-based client failover. Clients always connect to a hostname which resolves to a particular IP address. On failure of the active NN, a process is initiated to reassign that IP address so that packets destined for it are delivered to the now-active NN. (From a client's perspective, this case is equivalent to case #1.)
- ZooKeeper-based client failover. The URI to contact the active NN is stored in ZooKeeper or some other highly-available service. Clients look up which NN to talk to by communicating with ZK to discern the currently active NN. On failure, some process causes the address stored in ZK to be changed to point to the now-active NN.
- Configuration-based client failover. Clients are configured with a set of NN addresses to try until an operation succeeds. This configuration might exist in client-side configuration files, or perhaps in DNS via an SRV record that lists the NNs with different priorities.
This proposal assumes that NN fencing works, and that after a failover any standby NN is either unreachable or will throw a StandbyException on any RPC from a client. That is, a client cannot receive incorrect results if it happens to contact the wrong NN. This proposal also presumes that there is no direct coordination required between any central failover coordinator and clients, i.e. there is an intermediate name resolution system of some sort (ZK, DNS, local configuration, etc.).
The commit of
HADOOP-7380 already introduced a facility whereby an IPC RetryInvocationHandler can utilize a FailoverProxyProvider implementation to perform the appropriate client-side action in the event of failover. At the moment, the only implementation of a FailoverProxyProvider is the DefaultFailoverProxyProvider, which does nothing in the case of failover. HADOOP-7380 also added an @Idempotent annotation which can be used to identify which methods can be safely retried during a failover event.
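For reference, the facility introduced by HADOOP-7380 centers on an interface of roughly the following shape (a simplified sketch; see the committed code for the exact signatures):

  import java.io.Closeable;

  public interface FailoverProxyProvider extends Closeable {
    // The protocol interface (e.g. ClientProtocol) this provider proxies.
    Class<?> getInterface();

    // The proxy object the IPC layer should currently invoke methods on.
    Object getProxy();

    // Called by RetryInvocationHandler when a failover-triggering exception
    // (e.g. StandbyException) is seen; the provider should arrange for
    // subsequent calls to getProxy() to reach the now-active NN.
    void performFailover(Object currentProxy);
  }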
What remains, then, is:
- To implement FailoverProxyProviders which can support the cases outlined above (and perhaps others).
- To provide a mechanism to select which FailoverProxyProvider implementation to use for a given HDFS URI.
- To annotate the appropriate HDFS ClientProtocol interface methods with the @Idempotent tag (a sketch follows this list).
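As an illustration of the last point, read-only operations are the obvious first candidates. The excerpt below is purely indicative; the exact set of methods to annotate would be determined during implementation:

  public interface ClientProtocol {
    // Reading metadata has no side effects, so a duplicate call made after
    // a failover is harmless and can be retried automatically.
    @Idempotent
    HdfsFileStatus getFileInfo(String src) throws IOException;

    @Idempotent
    LocatedBlocks getBlockLocations(String src, long offset, long length)
        throws IOException;
  }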
Cases 1 and 2 above can be achieved by implementing a single FailoverProxyProvider which simply attempts to reconnect to the same hostname/IP address on failover. Cases 3 and 4 can be implemented as distinct custom FailoverProxyProviders.
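A minimal sketch of the provider covering cases 1 and 2 follows. The class name is hypothetical, and the proxy is assumed to be handed in by whatever code constructs the RPC client; the provider never changes the address the client dials, relying on the proxy service or VIP having been repointed underneath it:

  import java.io.IOException;

  // Hypothetical provider for the proxy/VIP cases: the address the client
  // connects to never changes, so "failing over" is just reconnecting to it.
  public class SameAddressFailoverProxyProvider implements FailoverProxyProvider {
    private final Class<?> iface;
    private final Object proxy;

    public SameAddressFailoverProxyProvider(Class<?> iface, Object proxy) {
      this.iface = iface;
      this.proxy = proxy;
    }

    public Class<?> getInterface() {
      return iface;
    }

    public Object getProxy() {
      // Always the same proxy; the underlying RPC connection will simply be
      // re-established against the same hostname/VIP once it breaks.
      return proxy;
    }

    public void performFailover(Object currentProxy) {
      // Nothing to do client-side: the proxy service or VIP has already been
      // repointed at the now-active NN by the failover process.
    }

    public void close() throws IOException {
      // Nothing to release in this sketch.
    }
  }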
A mechanism to select the appropriate FailoverProxyProvider implementation
I propose we add a mechanism to configure a mapping from URI authority -> FailoverProxyProvider implementation. Absolute URIs which previously specified the NN host name will instead contain a logical cluster name (which might be chosen to be identical to one of the NN's host names) which will be used by the chosen FailoverProxyProvider to determine the appropriate host to connect to. Introducing the concept of a cluster name will be a useful abstraction in general: if, for example, someone develops a fully-distributed NN in the future, the cluster name will still apply.
On instantiation of a DFSClient (or other user of an HDFS URI, e.g. HFTP), the mapping would be checked to see if there's an entry for the given URI authority. If there is not, then a normal RPC client with a socket connected to the given authority will be created, as is done today, with a DefaultFailoverProxyProvider. If there is an entry, then the authority will be treated as a logical cluster name, a FailoverProxyProvider of the correct type will be instantiated (via a factory class), and an RPC client will be created which utilizes this FailoverProxyProvider. The various FailoverProxyProvider implementations are responsible for their own configuration.
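A sketch of what that factory lookup might look like (the class name and the configuration key prefix are placeholders, not a final proposal):

  import java.net.URI;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.util.ReflectionUtils;

  public class FailoverProxyProviderFactory {
    // Placeholder property prefix; the real key name is to be decided.
    private static final String PREFIX = "dfs.client.failover.proxy.provider.";

    // Returns the FailoverProxyProvider configured for the given URI's
    // authority, or null if the authority has no mapping (in which case the
    // caller falls back to a plain RPC proxy, as DFSClient does today).
    public static FailoverProxyProvider createIfConfigured(URI uri, Configuration conf) {
      Class<? extends FailoverProxyProvider> clazz =
          conf.getClass(PREFIX + uri.getAuthority(), null, FailoverProxyProvider.class);
      if (clazz == null) {
        return null;
      }
      // The provider reads whatever further configuration it needs (ZK
      // quorum, list of NN addresses, etc.) from conf itself.
      return ReflectionUtils.newInstance(clazz, conf);
    }
  }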
As a straw man example, consider the following configuration:
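(The property name and the provider's package are placeholders; the intent is simply to map the authority "cluster1.foo.com" to a provider class.)

  <property>
    <name>dfs.client.failover.proxy.provider.cluster1.foo.com</name>
    <value>org.apache.hadoop.hdfs.ZookeeperFailoverProxyProvider</value>
  </property>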
This would cause URIs which begin with hdfs://cluster1.foo.com to use the ZookeeperFailoverProxyProvider. Slash-relative URIs would also default to using this provider. An absolute URI which, for example, referenced an NN in another, non-HA-enabled cluster (e.g. nn.cluster2.foo.com) would default to using the DefaultFailoverProxyProvider.
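From client code, nothing changes beyond the URI used; an illustration of the two paths described above (host names match the straw man configuration):

  import java.io.IOException;
  import java.net.URI;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;

  public class FailoverUriExample {
    public static void main(String[] args) throws IOException {
      Configuration conf = new Configuration();

      // Authority has a mapping: "cluster1.foo.com" is a logical cluster
      // name, so the active NN is located via ZookeeperFailoverProxyProvider.
      FileSystem haFs = FileSystem.get(URI.create("hdfs://cluster1.foo.com/"), conf);

      // No mapping for this authority: a plain RPC connection is made
      // straight to nn.cluster2.foo.com, with no client-side failover.
      FileSystem plainFs = FileSystem.get(URI.create("hdfs://nn.cluster2.foo.com/"), conf);
    }
  }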
- I believe this scheme will work transparently with viewfs. Instead of configuring the mount table to communicate with a particular NN for a given portion of the name space, one would configure viewfs to use the logical cluster name, which, when paired with the URI authority -> FailoverProxyProvider mapping, will cause the appropriate FailoverProxyProvider to be selected and the appropriate NN to be located (a sketch of such a mount table entry follows this list). I'm no viewfs expert and so would love to hear any thoughts on this.
- Are there any desirable client failover mechanisms I'm forgetting about?
- I'm sure there are places (which I haven't fully identified yet) where host names are cached on the client side. Those may need to be changed as well.
- Federation already introduced a concept of a "cluster ID", but in Federation this is not intended to be user-facing. Should we combine this notion with the "logical cluster name" I described above?
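To make the viewfs point above concrete, a mount table entry might look something like the following (the mount point and paths are made up; the salient detail is that the link target uses the logical cluster name rather than a specific NN host):

  <property>
    <name>fs.viewfs.mounttable.default.link./user</name>
    <value>hdfs://cluster1.foo.com/user</value>
  </property>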