The DN listens on multiple IP addresses (the default dfs.datanode.address is the wildcard); however, per HADOOP-6867, only the source address (IP) of the registration is given to clients.
HADOOP-985 made clients access datanodes by IP, primarily to avoid the latency of a DNS lookup. This had the side effect of breaking DN multihoming: the client cannot route the IP exposed by the NN if the DN registers with an interface that has a cluster-private IP. To fix this, let's add back the option for Datanodes to be accessed by hostname.
This can be done by:
- Modifying the primary field of the Datanode descriptor to be the hostname, or
- Modifying Client/Datanode <-> Datanode access to use the hostname field instead of the IP
The second approach does not require an incompatible client protocol change and is much less invasive. It limits the modification to the places where clients and Datanodes connect, rather than changing all uses of Datanode identifiers.
New client and Datanode configuration options are introduced:
- dfs.client.use.datanode.hostname indicates whether client-to-datanode connections should use the datanode hostname (clients outside the cluster may not be able to route the IP)
- dfs.datanode.use.datanode.hostname indicates whether Datanodes should use hostnames when connecting to other Datanodes for data transfer
If the configuration options are not used, there is no change in the current behavior.
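As a rough illustration of the second approach, the sketch below picks the connect address from either the IP or the hostname field of a Datanode identifier, depending on a boolean flag. The class DatanodeIdSketch, its field names, and the sample IP, hostname, and port are all hypothetical and used only for illustration; this is not the actual HDFS code. The flag stands in for the value of dfs.client.use.datanode.hostname (on a client) or dfs.datanode.use.datanode.hostname (on a Datanode), and is assumed to be false when the option is not set.

```java
/**
 * Hypothetical stand-in for a Datanode identifier that carries both the
 * registration IP and the hostname. Not the real HDFS class.
 */
final class DatanodeIdSketch {
    final String ipAddr;    // IP the DN registered with (may be cluster-private)
    final String hostName;  // hostname reported by the DN
    final int xferPort;     // data transfer port

    DatanodeIdSketch(String ipAddr, String hostName, int xferPort) {
        this.ipAddr = ipAddr;
        this.hostName = hostName;
        this.xferPort = xferPort;
    }

    /**
     * Returns the "address:port" string to connect to for data transfer.
     * With useHostname == false the IP is used, matching the current behavior.
     */
    String xferAddr(boolean useHostname) {
        return (useHostname ? hostName : ipAddr) + ":" + xferPort;
    }

    public static void main(String[] args) {
        // Sample values are made up for the example.
        DatanodeIdSketch dn = new DatanodeIdSketch("10.0.0.5", "dn1.example.com", 50010);

        // Assumed default: option not set, so connect by IP.
        boolean useHostname = false;
        System.out.println(dn.xferAddr(useHostname));

        // Option enabled: connect by hostname, letting the caller's DNS resolve it.
        useHostname = true;
        System.out.println(dn.xferAddr(useHostname));
    }
}
```

Because the choice is made only at the point where a connection is opened, the identifier itself and everything else that uses it stay unchanged, which is why this approach is the less invasive of the two.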