Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.10.1
    • Fix Version/s: 0.11.0
    • Component/s: None
    • Labels: None

      Description

      This issue assumes that HDFS runs on a cluster of computers spread across many racks. Communication between two nodes on different racks needs to go through switches, and the bandwidth in or out of a rack may be less than the total bandwidth of the machines within the rack. The purpose of rack-aware replica placement is to improve data reliability, availability, and network bandwidth utilization. The basic idea is that each data node determines which rack it belongs to at startup and notifies the name node of its rack id upon registration. The name node maintains a rackid-to-datanode map and tries to place replicas across racks.

      1. rack.patch
        101 kB
        Hairong Kuang
      2. Rack_aware_HDFS_proposal.pdf
        16 kB
        Hairong Kuang

        Issue Links

          Activity

          Tsz Wo Nicholas Sze added a comment -

          Do you really need four or more levels? How about "/CoreDC-DC/rack/host"?

          shifeng added a comment -

          I have a question:
          The cluster topology map has to match the physical network topology, doesn't it?
          I want to build a four-level physical network topology (/CoreDC/DC/rack/host), or deeper. Does rack awareness support this?

          Has this question been resolved?

          shifeng added a comment -

          Is this question resolved?
          I searched but found no answer.

          Szuyi added a comment -

          I have a question:
          The cluster topology map has to match the physical network topology, doesn't it?
          I want to build a four-level physical network topology (/CoreDC/DC/rack/host), or deeper. Does rack awareness support this?

          James P. White added a comment -

          As described in HADOOP-952, I get a divide by zero ArithmeticException in 0.11.0 due to this patch.

          http://svn.apache.org/viewvc/lucene/hadoop/trunk/src/java/org/apache/hadoop/dfs/FSNamesystem.java?annotate=502821&diff_format=h#l2593

          2593 : int maxNodesPerRack =
          2594 : (totalNumOfReplicas-1)/clusterMap.getNumOfRacks()+2;
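
          The divisor here is clusterMap.getNumOfRacks(), which can be 0 before any datanode has registered its network location. A minimal sketch of a guard, assuming only the accessors shown in the snippet above (illustrative only; the actual fix is tracked in HADOOP-952):

          // Sketch of a guard: treat an empty topology as a single rack so the
          // divisor can never be zero (illustrative only; see HADOOP-952).
          int numOfRacks = Math.max(clusterMap.getNumOfRacks(), 1);
          int maxNodesPerRack = (totalNumOfReplicas - 1) / numOfRacks + 2;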

          Doug Cutting added a comment -

          I just committed this. Thanks, Hairong!

          Hadoop QA added a comment -

          +1, because http://issues.apache.org/jira/secure/attachment/12350262/rack.patch applied and successfully tested against trunk revision r502749.

          Hairong Kuang added a comment -

          I decided to ask data nodes to re-register after a name node restarts. To avoid making the rack-aware patch too bulky, I created a separate issue, HADOOP-937, to address data node re-registration.

          Hairong Kuang added a comment -

          Currently, when a name node restarts, it asks each data node to resend its block report. So what we can do is include a data node's network location as part of the block report, and then the name node adjusts the network topology upon the arrival of each block report.

          dhruba borthakur added a comment -

          In the current code, topology information is NOT persisted into the fsimage. This is perfect. However, when the namenode restarts it does not re-create the topology map; it assumes that all datanodes are in the default rack.

          Sameer Paranjpye added a comment -

          Is the topology persistence a consequence of us serializing Datanode descriptors to the image? This was implemented a while ago and has probably never made sense.

          We should file a follow-on issue to not persist Datanode information in the image.

          dhruba borthakur added a comment -

          +1 Code Reviewed.

          One issue that came up in the code review was the persistence of Network Topology in the case of namenode restart. Our consensus was to not persistently store the network topology but require the namenode to reconstruct the network topology upon restart.

          Hairong Kuang added a comment -

          Although the 2nd replica is placed off-rack, the 3rd replica is placed on the same rack as the first replica. Besides, chooseTarget returns a sorted list of the chosen targets so that they form a shortest path that starts from the writer and traverses all the targets. So in the returned list, the 1st element is the local node, the 2nd is the same-rack node, and the 3rd is the off-rack node.

          Dennis Kubes added a comment -

          Suggestion for a slight change to replica placement

          • place the first replica in a node closest to the writer.
          • place the 2nd replica on a different node on the same rack as the 1st node.

          If there are only 2 replicas, shouldn't the 2nd replica be placed on a different rack, or at least across a switch?

          Hairong Kuang added a comment -

          The attached patch includes:

          1. data structures for describing cluster network topology
          a. org.apache.hadoop.net.NetworkTopology: class for manipulating the tree-like cluster topology.
          . InnerNode: represents racks/datacenters
          . Leaves: represent datanodes
          b. org.apache.hadoop.net.Node: class for abstracting a node in NetworkTopology
          . name: name of the node
          . networkLocation: indicates which datacenter and rack this node is located in
          c. org.apache.hadoop.net.NodeBase: a base implementation of Node
          2. code for implementing the replica placement strategy
          a. org.apache.hadoop.dfs.FSNamesystem.Replicator
          . chooseTarget(numOfReplicas, writer, excludedNodes, blockSize): choose numOfReplicas targets
          . chooseTarget(numOfReplicas, writer, choosenNodes, excludedNodes, block): re-replicate numOfReplicas targets
          Targets are returned in an order that forms a shortest path starting from the writer and traversing all targets
          3. protocol changes
          a. org.apache.hadoop.dfs.ClientProtocol
          . open is enhanced by passing the clientMachine name. In the returned blocks, the containing data nodes are sorted by their distance to the clientMachine
          b. org.apache.hadoop.dfs.DatanodeProtocol
          . register is enhanced by passing the data node's network location. The NetworkTopology is built while registering data nodes.
          4. data structure changes to reflect data node network location information
          a. org.apache.hadoop.dfs.Datanode
          b. org.apache.hadoop.dfs.DatanodeInfo
          5. various test cases
          a. org.apache.hadoop.net.TestNetworkTopology: tests NetworkTopology
          b. org.apache.hadoop.dfs.TestReplicationPolicy: tests replica placement
          c. org.apache.hadoop.dfs.TestRackAwareReplication: tests replica placement/datanode registration by bringing up a MiniDFSCluster
          d. org.apache.hadoop.dfs.MiniDFSCluster: enhanced to start datanodes at different network locations

          Please review! I appreciate your feedback.
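
          A minimal sketch of how the tree-like NetworkTopology described above can be exercised. The constructor and method names (NodeBase, add, getDistance) are assumptions based on the class names listed in the patch summary and may not match the committed code exactly:

          NetworkTopology cluster = new NetworkTopology();
          // Leaves are datanodes; the network location string names the datacenter and rack.
          Node n1 = new NodeBase("h1", "/d1/rack1");
          Node n2 = new NodeBase("h2", "/d1/rack1");
          Node n3 = new NodeBase("h3", "/d1/rack2");
          cluster.add(n1);
          cluster.add(n2);
          cluster.add(n3);
          // Distance grows with the number of tree edges between two leaves,
          // so same-rack nodes come out closer than off-rack nodes.
          int sameRack = cluster.getDistance(n1, n2);  // e.g. 2
          int offRack  = cluster.getDistance(n1, n3);  // e.g. 4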

          Runping Qi added a comment -

          When re-replicating, we should try to place replicas on different racks whenever it is feasible.

          To minimize write latency, when we write a block that needs more than 2 replicas, does it make sense to let the client return after writing the first 2 copies onto different racks, and let the background re-replication process take care of the other replicas?

          Hairong Kuang added a comment -

          Please see my revised proposal, attached.

          Sameer Paranjpye added a comment -

          > Should we consider a mobile network ...

          No, that is not the concern. The concern is that in a large enough installation (1000s of nodes) there are going to be several instances a week of nodes going down and getting repaired. When the nodes return they may not be in the same racks or on the same switches. This probably does not happen by design, but mistakes are always possible.

          > I would simply update network topology when a datanode registers or exits.

          Certainly, that seems like the right approach. The question is how the topology is updated.

          Is it by updating a central configuration file and having the namenode read it? This implies potentially updating configuration every time anything changes in a datacenter.

          Is it by running timing experiments every time a datanode registers? This can be biased by transient network conditions.

          Neither of the above seem like a productive use of admin or namenode time.

          Instead of a network topology interface, why not have a network location interface on the Datanode.

          public interface NetworkLocation {
            String[] getLocation();
          }

          This returns an array of hubs that characterize a node's location on the network. This is probably an array of strings of the form <key>=<value>. So a possible output could be:

          rack=r1, switch=s2, datacenter=d3, ...

          as many levels as are desirable. Also, network distances between nodes aren't the only things that are interesting. I think it's useful to distinguish between a rack and a switch because a rack is commonly a physical power domain.

          Given this output from each Datanode we can then have a concrete implementation of NetworkTopology that simply tracks the membership of each hub. Finding the distance between two nodes is done by comparing their arrays of hubs and stopping where they differ.
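
          A hypothetical sketch of that comparison (the helper and its conventions are illustrative, not part of the patch), assuming each array lists hubs from the innermost (rack) to the outermost (datacenter):

          // Hypothetical helper: count the hub levels at which two nodes differ.
          static int distance(String[] a, String[] b) {
            int i = a.length - 1, j = b.length - 1;
            // Walk inward from the outermost hub while the two nodes share hubs.
            while (i >= 0 && j >= 0 && a[i].equals(b[j])) { i--; j--; }
            // Each unshared level on either side is one hop further from the common hub;
            // nodes sharing every hub (same rack) get distance 0 under this convention.
            return (i + 1) + (j + 1);
          }

          For example, {"rack=r1", "switch=s2", "datacenter=d3"} versus {"rack=r9", "switch=s2", "datacenter=d3"} yields 2, while identical arrays yield 0.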

          > Allocating all blocks of a file to the same 3 racks limits the aggregate read bandwidth.

          It limits the aggregate read bandwidth to that particular file only. Every file will have its own set of 3 racks, so I don't think it affects overall filesystem bandwidth. On the other hand, it potentially gives a client that is reading an entire file better locality.

          > Users may specify its replica placement policy

          I agree with Doug. This seems like a subsequent feature.

          Doug Cutting added a comment -

          > Should we consider a mobile network [...]

          I think it's easy to construct interfaces that permit this, but I don't think we should attempt to support this in at least the initial implementation.

          > it can not express the case when nodes are connected by a switch

          How so? It should be easy to implement this so that nodes connected by a switch are always closer to one another than to any other nodes, no?

          > Users may specify its replica placement policy

          I see that as a subsequent feature. I think in the first version simply placing things at progressive distances until all the replicas are placed will satisfy the majority of applications.

          Hairong Kuang added a comment -

          1. Network topology construction
          Should we consider a mobile network with laptops running datanodes moving around? Otherwise, once a datanode gets started, the possibility that the node moves to a different location is slim. I would simply update the network topology when a datanode registers or exits.

          The # of hops between hubs can be specified in a configuration file which is read by the namenode at startup.

          2. Network topology interface
          Doug, I like the interface that you described. But it looks like it cannot express the case where nodes are connected by a switch and the distance between two nodes is 1. It also needs a method to expose all nodes that belong to a hub.

          3. Block placement strategy
          Allocating all blocks of a file to the same 3 racks limits the aggregate read bandwidth. I do not see much benefit in it.

          I am thinking of allowing users to specify a replica placement policy at runtime when they set the replication factor of a file. They could use any predefined placement policy or provide a user-defined one.

          Users could specify a replica placement policy using a declarative language, something like:

          replica 1: same node
          replica 2: same rack
          replica 3: different rack
          others: random

          Any comment?

          Doug Cutting added a comment -

          How about a topology interface?

          public interface NetworkTopology {
            HubDistance[] getHubDistances(String host);
          }

          public interface HubDistance {
            public String getHubName();
            public int getHops();
          }

          The namenode can use this to refresh things dynamically, or it might be read statically from a config file, depending on the implementation. The implementation class can be specified in the config. The default could just chop up hostnames, so that foo.bar.co.uk yields hops <foo.bar.co.uk,0>, <bar.co.uk,2>, <co.uk,3>, <uk,4>. Someone could implement this with DNS too.

          To compute a distance, you find the common hub between two nodes and sum the hops. To find nearby nodes, look for nodes sharing a nearby hub. Etc.
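
          A sketch of that default strategy, purely illustrative (the interfaces above are only a proposal, and this helper and its name are assumptions):

          // Illustrative only: derive hub names by chopping a hostname at each dot,
          // e.g. foo.bar.co.uk -> foo.bar.co.uk, bar.co.uk, co.uk, uk.
          static java.util.List<String> hubsOf(String host) {
            java.util.List<String> hubs = new java.util.ArrayList<String>();
            String rest = host;
            while (true) {
              hubs.add(rest);
              int dot = rest.indexOf('.');
              if (dot < 0) break;
              rest = rest.substring(dot + 1);
            }
            return hubs;
          }
          // The distance between two hosts is then found by locating the first hub
          // they share and summing the hop counts assigned to the levels below it.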

          Sameer Paranjpye added a comment -

          Declaring the topology in the slaves file could be brittle. Nodes go down, get repaired and may or may not return to the same location. Particularly if nodes are spread across datacenters, keeping the slaves file in sync with every change in datacenters will quickly become cumbersome.

          Ditto for constructing the topology with timing experiments on the namenode. Timing experiments can be unreliable and biased by transient network congestion. A persistent topology map is error prone because nodes can move around. It's also not clear how frequently the topology map would be constructed. Every time node(s) are added or return?

          It appears feasible to determine the location of a node in a network with a local operation, probably by running an installation-specific script. The Datanode reports its location to the namenode upon registration, which updates a dynamic topology map. This map can be cheaply re-constructed at startup when processing Datanode registrations and block reports.

          On the subject of replica placement, would there be any value in co-locating blocks from a single file at a rack or datacenter level?

          Say, block 0 is placed on <node a, rack p>, <node b, rack p> and <node c, rack q> i.e. first replica node local, next rack local and the third on a different rack. Would it make sense to put block 1 and subsequent blocks similarly?

          So block 1 could be on <node a, rack p>, <node d, rack p> and <node e, rack q>. Then all blocks from the file would end up on racks 'p' and 'q', and it would be easy to get rack locality for the entire file.

          This probably needs a lot more modeling and analysis...

          Hairong Kuang added a comment -

          I think all we need is a rack distance matrix, so the storage is much less. But after I talked to Koji, it looks like it is hard to pre-compute an accurate distance between nodes. So a simple thing to do is to assume that the distance between two nodes in different datacenters is 3 or 4. No matrix is needed.

          I am working on designing an interface for network topology and an interface for replica placement policy. Will post them later.

          Doug Cutting added a comment -

          I think that at this point what we want is a declarative description of the cluster topology, provided by the administrator, rather than a distance function. For example, we could extend the format of the slaves file to list all of the network levels that a node lives at:

          <hostname> <rackswitchname> <clusterswitchname> <datacentername>, etc.

          This can be used not just to compute the distance between any two nodes, but also to quickly list nodes that are close to a node. The latter is hard with a distance function without pre-computing all inter-node distances. With 10k nodes, that's 100M distances, which is expensive.
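
          A sketch of reading such an extended slaves file (the format itself is only a proposal here; the file name and field layout are assumptions, and an enclosing method that declares throws java.io.IOException is assumed):

          // Illustrative only: parse "<hostname> <rackswitchname> <clusterswitchname> <datacentername>"
          // lines into a hostname -> network-levels map.
          java.util.Map<String, String[]> levels = new java.util.HashMap<String, String[]>();
          java.io.BufferedReader in = new java.io.BufferedReader(new java.io.FileReader("conf/slaves"));
          String line;
          while ((line = in.readLine()) != null) {
            line = line.trim();
            if (line.length() == 0 || line.startsWith("#")) continue;  // skip blanks and comments
            String[] fields = line.split("\\s+");
            String[] hubs = new String[fields.length - 1];
            System.arraycopy(fields, 1, hubs, 0, hubs.length);         // everything after the hostname
            levels.put(fields[0], hubs);
          }
          in.close();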

          Hairong Kuang added a comment -

          Is there any way that we could find out the distance (# of hops) between two nodes at runtime? Traceroute prints all the hops, but it has to run on one of the nodes.

          Once the namenode can find out the distance between any two nodes, it can find out which nodes belong to the same rack (nodes with a distance of 1 belong to the same rack). It can also get the distance between any two racks.

          Once the name node has this distance info, it can perform the following rack-aware optimizations:
          1. replica placement policy
          It's similar to what Konstantin proposed but with a slight change.

          • place the first replica on a node closest to the writer.
          • place the 2nd replica on a different node on the same rack as the 1st node.
          • place the 3rd replica on a different rack at a distance of 2 from the first node.
          • place the rest of the replicas on random not-yet-selected nodes.
          2. organize the write pipeline in ascending order of distance to the writer.
          3. for reading, the best node is defined as the node closest to the reader.
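
          A small illustrative sketch of point 2, ordering the write pipeline by ascending distance to the writer. The targets, writer, and distance(...) names are hypothetical stand-ins for the name node's data structures and topology map:

          // Illustrative only: sort chosen targets so the pipeline starts near the writer.
          static void orderPipeline(java.util.List<String> targets, final String writer) {
            java.util.Collections.sort(targets, new java.util.Comparator<String>() {
              public int compare(String d1, String d2) {
                return distance(writer, d1) - distance(writer, d2);  // hypothetical distance()
              }
            });
          }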
          Doug Cutting added a comment -

          We need an extensible API for describing network topology & distance. Here's a stab at one.

          public interface NetworkLocatable {
            /** Return the names of each network level of this node: host, rack, datacenter, etc. */
            String[] getNetworkLocation();
          }

          It is tempting to use something more abstract, e.g. a distance method, but that would not provide a concise indication of the topology, and would require linear scans, or storage of a full distance matrix, when, e.g., placing blocks.

          As a first implementation, I like the idea of trying to distribute blocks to increasing network levels: first to the same host, then to the same rack, then to the same datacenter, etc. until the replication level is exhausted. If the replication level is higher than the number of levels, then a uniform distribution at the highest level should be sought.

          Down the road we may wish to make this strategy extensible, or to have a more elaborate distance calculation, but we can probably first get a lot of mileage from this simple model.

          Koji Noguchi added a comment -

          If the nodes are connected with switches, I think it'll be hard to figure out how many 'hops' there are between nodes.
          Switches are transparent to the nodes.
          For example, traceroute only shows the routers and not the switches.

          Konstantin Shvachko added a comment -
          • Node locality
            Maybe we should build the cluster topology map based on network hops as a measure of node locality, rather than specifying externally which rack each node belongs to.
            The map is built at startup. It takes O(#racks * #nodes) communications to partition all nodes into racks.
            The topology map can be persistent, which will make it easier to restart the cluster.
            We should keep the topology map separate so that e.g. map-reduce can use it.
            If a client runs on one of the cluster machines then it should know its rack for both reads and writes.
            If not, then this is not applicable.
            I like a hierarchical topology map, like datacenters (2-3 hops). What is the distance between datacenters in hops?
            But probably not too deep. The rest may just be considered relatively far away from each other.
          • Replica placement
            Is it fair to say that the replica placement strategy is:
          • place the first replica locally if possible (the client runs on a cluster machine); if not, on an arbitrary node.
          • place the second replica on a different node on the same rack as the first node.
          • place the third replica on a different rack but in the same datacenter as the second node.
            If we don't have a further hierarchical level, then
          • place the 4th replica and all subsequent ones randomly on any node not yet selected.
          • Different placement strategies
            I think we should define an interface responsible for block placement:

            interface BlockReplicator { // just an example
              chooseTargets();
              getTopology();
              distance(d1, d2);
              ...
            }

            So that we could implement different replication strategies and use them interchangeably,
            if not at runtime then at least at compile time.

          Milind Bhandarkar added a comment -

          rackId is associated with the datanode.
          Automatic rebalancing is orthogonal to this proposal. The rebalancing feature will use the same chooseTargets method that implements this rack awareness, and so the replicas in that case will conform to the rack-aware logic.

          dhruba borthakur added a comment -

          I have two questions:

          1. Is the rackId associated with a datanode, or is it associated with the data directories in a datanode?
          2. Suppose that all three replicas were allocated in the same rack because there wasn't any space available on other racks. Now, if more racks are added to the system, will HDFS automatically rebalance the replicas to conform to your rack-aware logic?

          Bryan Pendleton added a comment -

          Seems implied here, but I'll state it anyway, just in case it should be a separate issue:

          Reads should also be rack-aware. Requests to the NameNode can take the rack as an argument and order the returned locations by rack locality. This implies that there should be a mechanism for a client to be told/discover what rack it's on, too, so the argument to the datanode isn't the only place rack information is useful.

          Doug Cutting added a comment -

          Should we think about super-rack locality at this point? For example, Amazon EC2 can allocate nodes in a cluster across multiple datacenters, or at least a few network hops away. This is obviously not an optimal configuration (although, in my 20-node benchmarks, it didn't slow things noticeably). Should we try to support things like this now?

          http://developer.amazonwebservices.com/connect/thread.jspa?threadID=11615

          Network hops would be one way to measure things. Localhost is zero hops, same rack is one, different rack in the same datacenter is two, and so on. This could be represented by a list of ids per node (host, rack, datacenter, ...).

          Hairong Kuang added a comment -

          I plan to address the issue from three areas:

          • Determine rack id

          Each data node gets its rack id from the command line. The data node supports an option "-r <rack id>" or "-rack <rack id>". A rack id is a unique string representation of the rack that the data node belongs to. It usually consists of the name/IP plus the port number of the switch to which the data node directly connects. If the option is not set, the data node belongs to a default rack.

          How to get the rack id is proprietary to each organization. So a mechanism needs to be provided when starting HDFS, for example a script which prints the rack id on the screen. The output is fed to the data node when it starts.

          The rack id is then stored in DatanodeID and sent to the name node as part of the registration information.

          • Maintain rackid-to-datanodes map

          A name node maintains a rack-id-to-data-node-descriptors map that maps a rack id to the list of data nodes that belong to the rack. When the name node receives a registration message from a data node, it first checks whether the map already has an entry for the data node. If yes, it removes the old entry. It then adds a new entry. When the name node removes a data node, the data node's entry in the map is also removed.

          If all data nodes start without providing a rack id, the map contains one default rack id mapping to all the data nodes in the system. So HDFS will behave the same as it is now.

          • Place replicas

          A simple policy is to place replicas across racks. This prevents losing data when an entire rack fails and allows making use of bandwidth from multiple racks when reading a file. This policy evenly distributes replicas in the cluster and thus makes it easy to balance load when a map/reduce job with the data as input is scheduled to run. However, the cost of the policy is the write expense: a block needs to be written to multiple racks. One minor optimization is to place one replica on the local node, where the writer is located; if that is not possible, place it on a different node in the local rack.

          For the most common case, when the replication factor is three, another possible placement policy is to place one replica on the local node, one on a different node in the local rack, and one in a different rack. This policy cuts the out-of-rack write traffic and hence improves write performance. Because the chance of rack failure is far less than that of node failure, it does not affect data reliability and availability much. But it reduces the aggregate network bandwidth when reading a file, since a block is now placed in two racks rather than three. The replicas of a file are not distributed evenly across the racks: one third of the replicas is on one node, two thirds are on one rack (including that node), and the remaining third is evenly distributed across the other racks. But a map/reduce job still gets a chance to balance its load because each block has one replica placed in a random rack. Overall, I feel that this placement policy is a good trade-off: it greatly improves write performance at little cost to data reliability or read performance.
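
          A minimal sketch of the rackid-to-datanodes map maintenance described above, assuming a plain in-memory map keyed by rack id; the names here are illustrative, not the patch's actual classes:

          // Illustrative only: maintain rackId -> datanodes as datanodes register.
          // (The two maps below are fields of the name node in this sketch.)
          java.util.Map<String, java.util.Set<String>> rackMap = new java.util.HashMap<String, java.util.Set<String>>();
          java.util.Map<String, String> nodeToRack = new java.util.HashMap<String, String>();

          void register(String datanode, String rackId) {
            String oldRack = nodeToRack.remove(datanode);
            if (oldRack != null) {
              rackMap.get(oldRack).remove(datanode);          // drop any stale entry first
            }
            if (rackId == null) rackId = "/default-rack";     // no rack given: use the default rack
            java.util.Set<String> nodes = rackMap.get(rackId);
            if (nodes == null) {
              nodes = new java.util.HashSet<String>();
              rackMap.put(rackId, nodes);
            }
            nodes.add(datanode);
            nodeToRack.put(datanode, rackId);
          }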


            People

             • Assignee: Hairong Kuang
             • Reporter: Hairong Kuang
             • Votes: 0
             • Watchers: 6

              Dates

              • Created:
                Updated:
                Resolved:

                Development