|
I think the main problem with the heartbeats right now is the 1 minute timeout before they fail.
Reducing the timeout to say 3 seconds (or may be 0 seconds we will need to experimant with that) could be an easy short term solution. First of all, this will randomize data-nodes' access to the name-node, and give them equal chances to acknowledge their existence within the 10 minute interval. Secondly, we will let other requests, the most often of which are leases extensions, to go through and succeed, which eventually will reduce the failure rate of map-reduce tasks. IMO, this will not increase the load on the name-node, because the name-node does nothing to reject I like the idea of self-adjusting heartbeats. Why not, if a data-node observes a consistent 30% I agree the request rates should not increase linearly with the cluster size. This makes self-adjustments > IMO, this will not increase the load on the name-node ...
It depends on where the bottlenecks are in the namenode. For example, if heartbeats are already using 75% of its capacity, and we want replications to use the last 25%, then vastly increasing the heartbeat rate will starve the replications. To my thinking, we should design things so that we don't see timeouts in normal operation (except when trying to contact nodes that are malfunctioning). In particular, we shouldn't use timeouts as a primary control mechanism. Also note that there are different kinds of timeouts. There are connect timeouts, which mean that the server never saw the request. Response timeouts, however, usually mean that the server has recieved the request and will eventually respond to it, but just not in the time you're willing to wait. In the latter case, the server won't notice that the client has timed out until after the response has been computed. So, if you're going to retry sooner, you should only do so after a connect timeout, and even then I'd argue that this is a poor solution. How about this approach:
A directory walk may also be a thing to look at - the NameNode walks the file system hierarchy and for each file it pings the set of all DataNodes containing the blocks of that file but in this case, we will end up pinging the same DataNode multiple times. To avoid this, the (DataNode->{blocks}) mapping can be used. Devaraj: I think your proposal is part of the long-term solution. A short-term fix will simply be to control the rate of replication and increase the heartbeat interval. Longer-term we should consider the sort of inverted control you propose.
This issue has been addressed by the following fixes:
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Think of it this way, the namenode's observed current capacity is 200 heartbeats per second and 50 block replications per second. We're attempting in excess of 50 replications and still attempting 200 heartbeats, and the many of the heartbeats are failing to arrive in a timely manner (as are probably many of the replication reports, but those are less critical). Retrying heartbeats sooner will just increase the load on the namenode, aggravating the problem.
The other thing to do is limit the heartbeat traffic. Currently, heartbeat traffic is proportional to cluster size, which is not scalable. As a simple measure, we can make the heartbeat interval configurable. Longer term we can make it adaptive. Longer-yet, we could even consider inverting the control, so that the namenode pings datanodes to check if they're alive and hand them work.
Another long-term fix would of course be to improve the namenode's performance and lessen its bottlenecks, so that it can handle more requests per second. But no matter how much we do this, we still need to make sure that all request rates are limited, and do not increase linearly with cluster size.