  Hadoop HDFS / HDFS-1379

Multihoming brokenness in HDFS


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 0.20.1
    • Fix Version/s: None
    • Component/s: hdfs-client, namenode
    • Labels: None
    • Environment:

      Multi-homed namenode and datanodes. hadoop-0.20.1 (Cloudera distribution on Linux)

    Description

      We have a setup where, because we only have a very few machines (4 x 16-core), we're looking at co-locating namenodes and datanodes. We also have front-end and back-end networks. The set-up is something like:

      • machine1
        • front-end 10.18.80.80
        • back-end 192.168.24.40
      • machine2
        • front-end 10.18.80.82
        • back-end 192.168.24.41
      • machine3
        • front-end 10.18.80.84
        • back-end 192.168.24.42
      • machine4
        • front-end 10.18.80.86
        • back-end 192.168.24.43

      On each, the property slave.host.name is configured with the 192.168.24.x address (the .dns.interface settings don't actually seem to help, but that's a separate problem), dfs.datanode.address is bound to the 192.168.24.x address on :50010, and dfs.datanode.ipc.address is similarly bound there.
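
      As a rough sketch of what that looks like on machine1 (the property names are the 0.20-era ones from the paragraph above; the IPC port 50020 is an assumption, since only :50010 is stated in this report), the per-datanode hdfs-site.xml carries something like:

          <property>
            <name>slave.host.name</name>
            <value>192.168.24.40</value>
          </property>
          <property>
            <name>dfs.datanode.address</name>
            <value>192.168.24.40:50010</value>
          </property>
          <property>
            <name>dfs.datanode.ipc.address</name>
            <value>192.168.24.40:50020</value>
          </property>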

      In order to get efficient use of our machines, we bring up a namenode on one of them (this then rsyncs the latest namenode fsimage etc.) by bringing up a VIP on each side (we use the 10.18.80.x side for monitoring, rather than actual Hadoop comms) and binding the namenode to that; on the inside this is 192.168.24.19.
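
      For context, clients and datanodes then reach the namenode through that inside VIP. A minimal sketch, assuming the 0.20-era fs.default.name property and the usual namenode RPC port 8020 (the port is not stated in this report), would be:

          <property>
            <name>fs.default.name</name>
            <value>hdfs://192.168.24.19:8020</value>
          </property>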

      The namenode now knows about 4 datanodes: 192.168.24.40/1/2/3. These datanodes know how they're bound. However, when the namenode is telling an external HDFS client where to store blocks, it gives out 192.168.24.19:50010 as one of the addresses (despite the datanode not being bound there), because that's where the datanode->namenode RPC comes from.
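
      The misreported address is easy to see from a plain client. The following is only an illustrative sketch (the class name and command-line path argument are made up; the FileSystem/BlockLocation calls are the standard public API): it asks the namenode for the block locations of an existing file and prints the host:port pairs it hands out, one of which shows up as 192.168.24.19:50010 in the broken case.

          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.fs.BlockLocation;
          import org.apache.hadoop.fs.FileStatus;
          import org.apache.hadoop.fs.FileSystem;
          import org.apache.hadoop.fs.Path;

          public class ShowBlockAddresses {
            public static void main(String[] args) throws Exception {
              // Picks up core-site.xml / hdfs-site.xml from the classpath.
              Configuration conf = new Configuration();
              FileSystem fs = FileSystem.get(conf);

              // Ask the namenode where the blocks of the given file live.
              FileStatus status = fs.getFileStatus(new Path(args[0]));
              BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

              for (BlockLocation block : blocks) {
                // getNames() returns the datanode addresses as host:port strings,
                // exactly as the namenode reports them to clients.
                for (String name : block.getNames()) {
                  System.out.println(name);
                }
              }
            }
          }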

      This is wrong because, if you've bound the datanode explicitly (using dfs.datanode.address), that should be the only address the namenode can give out (given your comms model, it's reasonable not to support NAT between clients and data slaves). If you bind it to *, then the normal rules for slave.host.name, dfs.datanode.dns.interface etc. should take precedence.

      This may already be fixed in releases later than 0.20.1, but if it isn't, it probably should be: you explicitly allow binding of the datanode addresses, and it's unreasonable to expect that the datanode's own comms will always come from those bound addresses, especially in multi-homed environments, where separating traffic out by network (especially when dealing with large volumes of data) is useful.

    Attachments

    Issue Links

    Activity

    People

    • Assignee: Unassigned
    • Reporter: mbm (Matthew Byng-Maddick)
    • Votes: 0
    • Watchers: 10

    Dates

    • Created:
    • Updated:
    • Resolved: