Hadoop HDFS / HDFS-3150

Add option for clients to contact DNs via hostname

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.0.0, 2.0.0-alpha
    • Fix Version/s: 1.1.0, 2.0.2-alpha
    • Component/s: datanode, hdfs-client
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      The DN listens on multiple IP addresses (the default dfs.datanode.address is the wildcard); however, per HADOOP-6867, only the source address (IP) of the registration is given to clients. HADOOP-985 made clients access datanodes by IP, primarily to avoid the latency of a DNS lookup. This had the side effect of breaking DN multihoming: the client cannot route to the IP exposed by the NN if the DN registers with an interface that has a cluster-private IP. To fix this, let's add back the option for Datanodes to be accessed by hostname.

      This can be done by:

      1. Modifying the primary field of the Datanode descriptor to be the hostname, or
      2. Modifying Client/Datanode <-> Datanode access to use the hostname field instead of the IP

      Approach #2 does not require an incompatible client protocol change, and is much less invasive. It minimizes the scope of modification to just places where clients and Datanodes connect, vs changing all uses of Datanode identifiers.
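To make approach #2 concrete, here is a minimal Java sketch of the idea, not the actual patch. The class and method names (DatanodeAddressSketch, Datanode, xferAddr, connectAddress) and the sample address values are hypothetical. The descriptor keeps both the IP and the hostname, and only the code that opens a connection decides which field to use, based on a boolean flag.

```java
import java.net.InetSocketAddress;

public class DatanodeAddressSketch {

    /** Hypothetical stand-in for a Datanode descriptor; not Hadoop's own class. */
    public static final class Datanode {
        public final String ipAddr;    // source IP seen at registration
        public final String hostName;  // hostname reported by the Datanode
        public final int xferPort;     // data transfer port

        public Datanode(String ipAddr, String hostName, int xferPort) {
            this.ipAddr = ipAddr;
            this.hostName = hostName;
            this.xferPort = xferPort;
        }
    }

    /** Returns "host:port", choosing the hostname or the IP based on the flag. */
    public static String xferAddr(Datanode dn, boolean useHostname) {
        String host = useHostname ? dn.hostName : dn.ipAddr;
        return host + ":" + dn.xferPort;
    }

    /**
     * Builds the socket address a caller would connect to. It is left
     * unresolved so this sketch performs no DNS lookup; a real caller
     * would resolve the name when it connects.
     */
    public static InetSocketAddress connectAddress(Datanode dn, boolean useHostname) {
        String host = useHostname ? dn.hostName : dn.ipAddr;
        return InetSocketAddress.createUnresolved(host, dn.xferPort);
    }

    public static void main(String[] args) {
        // Example values only; the cluster-private IP may be unroutable for outside clients.
        Datanode dn = new Datanode("10.0.0.5", "dn1.example.com", 50010);
        System.out.println(xferAddr(dn, false)); // 10.0.0.5:50010
        System.out.println(xferAddr(dn, true));  // dn1.example.com:50010
    }
}
```

Because the descriptor itself is unchanged, nothing that identifies Datanodes by IP needs to change; only the connection sites consult the flag.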

      New client and Datanode configuration options are introduced:

      • dfs.client.use.datanode.hostname indicates that all client-to-datanode connections should use the datanode hostname (as clients outside the cluster may not be able to route to the IP)
      • dfs.datanode.use.datanode.hostname indicates whether Datanodes should use hostnames when connecting to other Datanodes for data transfer

      If the configuration options are not used, there is no change in the current behavior.
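As an illustration, the two options might be enabled in hdfs-site.xml roughly as follows. This is a sketch under assumptions: that both are boolean properties, that true enables hostname-based connections, and that hdfs-site.xml is where they are set. The description above only names the options. The client option would need to be present in the configuration the client actually loads.

```xml
<!-- Sketch only: assumes boolean values, with true meaning "use hostnames". -->
<configuration>
  <property>
    <name>dfs.client.use.datanode.hostname</name>
    <value>true</value>
  </property>
  <property>
    <name>dfs.datanode.use.datanode.hostname</name>
    <value>true</value>
  </property>
</configuration>
```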

      1. hdfs-3150-b1.txt
        32 kB
        Eli Collins
      2. hdfs-3150-b1.txt
        32 kB
        Eli Collins
      3. hdfs-3150.txt
        50 kB
        Eli Collins
      4. hdfs-3150.txt
        50 kB
        Eli Collins
      5. hdfs-3150.txt
        50 kB
        Eli Collins
      6. hdfs-3150.txt
        50 kB
        Eli Collins

          Activity

          Eli Collins created issue -
          Eli Collins made changes -
          Field Original Value New Value
          Attachment hdfs-3150-b1.txt [ 12520691 ]
          Eli Collins made changes -
          Attachment hdfs-3150-b1.txt [ 12520845 ]
          Eli Collins made changes -
          Hadoop Flags Reviewed [ 10343 ]
          Target Version/s 1.1.0 [ 12317959 ]
          Fix Version/s 1.1.0 [ 12317959 ]
          Resolution Fixed [ 1 ]
          Status Open [ 1 ] Resolved [ 5 ]
          Eli Collins made changes -
          Assignee Eli Collins [ eli2 ] Eli Collins [ eli ]
          Eli Collins made changes -
          Parent HDFS-3140 [ 12547916 ]
          Issue Type Sub-task [ 7 ] New Feature [ 2 ]
          Eli Collins made changes -
          Summary "Add option for clients to contact DNs via hostname in branch-1" "Add option for clients to contact DNs via hostname"
          Affects Version/s 2.0.0-alpha [ 12320353 ]
          Affects Version/s 1.0.0 [ 12318243 ]
          Target Version/s 2.2.0-alpha [ 12322472 ]
          Description (original value): Per the document attached to HADOOP-8198, this is just for branch-1, and unbreaks DN multihoming. The datanode can be configured to listen on a bond, or on all interfaces by specifying the wildcard in the dfs.datanode.*.address configuration options; however, per HADOOP-6867, only the source address of the registration is exposed to clients. HADOOP-985 made clients access datanodes by IP, primarily to avoid the latency of a DNS lookup. This had the side effect of breaking DN multihoming. To fix it, let's add back the option for Datanodes to be accessed by hostname. This can be done by:
          # Modifying the primary field of the Datanode descriptor to be the hostname, or
          # Modifying Client/Datanode <-> Datanode access to use the hostname field instead of the IP

          I'd like to go with approach #2 as it does not require making an incompatible change to the client protocol, and is much less invasive. It minimizes the scope of modification to just places where clients and Datanodes connect, vs changing all uses of Datanode identifiers.

          New client and Datanode configuration options are introduced:
          - {{dfs.client.use.datanode.hostname}} indicates all client to datanode connections should use the datanode hostname (as clients outside cluster may not be able to route the IP)
          - {{dfs.datanode.use.datanode.hostname}} indicates whether Datanodes should use hostnames when connecting to other Datanodes for data transfer

          If the configuration options are not used, there is no change in the current behavior.

          I'm doing something similar to #1 btw in trunk in HDFS-3144 - refactoring the use of DatanodeID to use the right field (IP, IP:xferPort, hostname, etc) based on the context the ID is being used in, vs always using the IP:xferPort as the Datanode's name, and using the name everywhere.
          Description (new value): same text as the Description section above.
          Eli Collins made changes -
          Attachment hdfs-3150.txt [ 12539963 ]
          Eli Collins made changes -
          Resolution Fixed [ 1 ]
          Status Resolved [ 5 ] Reopened [ 4 ]
          Eli Collins made changes -
          Status Reopened [ 4 ] Patch Available [ 10002 ]
          Eli Collins made changes -
          Attachment hdfs-3150.txt [ 12540361 ]
          Eli Collins made changes -
          Attachment hdfs-3150.txt [ 12540786 ]
          Eli Collins made changes -
          Attachment hdfs-3150.txt [ 12540836 ]
          Eli Collins made changes -
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Target Version/s 2.2.0-alpha [ 12322472 ]
          Fix Version/s 2.2.0-alpha [ 12322472 ]
          Resolution Fixed [ 1 ]
          Arun C Murthy made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Eli Collins made changes -
          Link This issue is related to HDFS-3140 [ HDFS-3140 ]
          zhaoyunjiong made changes -
          Link This issue relates to MAPREDUCE-5495 [ MAPREDUCE-5495 ]
          Arpit Agarwal made changes -
          Link This issue relates to HDFS-6273 [ HDFS-6273 ]

            People

            • Assignee:
              Eli Collins
              Reporter:
              Eli Collins
            • Votes:
              1
              Watchers:
              11