Hadoop Common
HADOOP-572

Chain reaction in a big cluster caused by simultaneous failure of only a few data-nodes.

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.6.2
    • Fix Version/s: 0.8.0
    • Component/s: None
    • Labels:
      None
    • Environment:

      Large dfs cluster

      Description

      I've observed a cluster crash caused by the simultaneous failure of only 3 data-nodes.
      The crash is reproducible. In order to reproduce it you need a rather large cluster.
      To simplify calculations I'll consider a 600-node cluster as an example.
      The cluster should also contain a substantial amount of data.
      We will need at least 3 data-nodes containing 10,000+ blocks each.
      Now suppose that these 3 data-nodes fail at the same time, and the name-node
      starts replicating all missing blocks belonging to those nodes.
      Based on experimental data, the name-node can replicate about 50 blocks per second on average.
      That means it will take more than 10 minutes, which is the heartbeat expiration interval,
      to replicate all 30,000+ blocks.
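
      A worked version of the arithmetic above (using only the figures already stated: 3 failed
      nodes, 10,000+ blocks each, about 50 replications per second):

      \[ 3 \times 10{,}000 = 30{,}000 \text{ blocks}, \qquad \frac{30{,}000 \text{ blocks}}{50 \text{ blocks/s}} = 600 \text{ s} = 10 \text{ min}, \]

      so re-replicating the 30,000+ missing blocks takes longer than the 10-minute heartbeat
      expiration interval.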

      With the 3-second heartbeat interval there are 600 / 3 = 200 heartbeats hitting the name-node every second.
      Under heavy replication load the name-node accepts about 50 heartbeats per second.
      So at most 3/4 of all heartbeats remain unserved.

      Each node SHOULD send 200 heartbeats during the 10-minute interval, and each time the probability
      of a heartbeat going unserved is 3/4 or less.
      So the probability that all 200 heartbeats fail is (3/4) ** 200 = 0 from a practical standpoint.

      IN FACT, since the current implementation sets the RPC timeout to 1 minute, a failed heartbeat takes
      1 minute and 8 seconds to complete, and under these circumstances each data-node can send only
      9 heartbeats during the 10-minute interval. Thus, the probability that all 9 of them fail is 0.075,
      which means that we will lose 45 nodes out of 600 by the end of the 10-minute interval.
      From that point on the name-node will be constantly replicating blocks and losing more nodes, and
      becomes effectively dysfunctional.
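
      A worked version of the probability estimate (same figures as above: 68 s per failed heartbeat,
      a 600 s expiration interval, and a 3/4 chance that any single heartbeat goes unserved):

      \[ \frac{600 \text{ s}}{68 \text{ s}} \approx 9 \text{ heartbeats per node}, \qquad \left(\tfrac{3}{4}\right)^{9} \approx 0.075, \qquad 600 \times 0.075 \approx 45 \text{ nodes lost}, \]

      whereas \(\left(\tfrac{3}{4}\right)^{200} \approx 10^{-25}\), which is negligible when all 200
      heartbeats fit in the interval.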

      A map-reduce framework running on top of it makes things deteriorate even faster, because failing
      tasks and jobs try to remove files and re-create them, increasing the overall load on
      the name-node.

      I see at least 2 problems that contribute to the chain reaction described above.
      1. A heartbeat failure takes too long (1 minute 8 seconds).
      2. The name-node's synchronized operations should be more fine-grained.


          Activity

          Sameer Paranjpye added a comment -

          This issue has been addressed by the following fixes:

          HADOOP-725
          HADOOP-641
          HADOOP-642
          HADOOP-255

          Doug Cutting added a comment -

          Devaraj: I think your proposal is part of the long-term solution. A short-term fix will simply be to control the rate of replication and increase the heartbeat interval. Longer-term we should consider the sort of inverted control you propose.

          Devaraj Das added a comment -

          How about this approach:

          • The DataNodes register with the NameNode once, after the latter comes up (just once - no heartbeats).
          • Whenever the NameNode requires the services of a DataNode (to store/delete blocks), it pings the chosen DataNode to see whether it is indeed alive. In response to the "ping", the DataNode sends its latest status (block report, free disk space, etc.). The response can also be controlled - if the DataNode sent its status within the last 30 secs, it doesn't send it again.
          • For each DataNode, the NameNode maintains a list of the blocks it hosts.
          • A separate thread in the NameNode pings the set of known DataNodes, and whenever it cannot ping a particular DataNode, it issues replication requests for the blocks that were on that DataNode (using the inverse mapping from a "block" to the set of DataNodes containing that block).

          A directory walk may also be a thing to look at - the NameNode walks the file system hierarchy and for each file it pings the set of all DataNodes containing the blocks of that file, but in this case we will end up pinging the same DataNode multiple times. To avoid this, the (DataNode -> {blocks}) mapping can be used.
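
          A minimal sketch of the inverted-control idea described in this comment, assuming a simple
          in-memory (DataNode -> {blocks}) map; ping() and scheduleReplication() are hypothetical
          placeholders, not the actual Hadoop RPC or replication machinery:

          import java.util.Map;
          import java.util.Set;
          import java.util.concurrent.ConcurrentHashMap;

          class DataNodeMonitor implements Runnable {
            // The (DataNode -> {blocks}) mapping from the proposal above.
            private final Map<String, Set<Long>> blocksByNode = new ConcurrentHashMap<>();
            private final long pingIntervalMs = 30_000L;  // assumed ping period

            // One-time registration when a DataNode comes up; no periodic heartbeats.
            void register(String nodeId, Set<Long> blockIds) {
              blocksByNode.put(nodeId, blockIds);
            }

            @Override
            public void run() {
              while (!Thread.currentThread().isInterrupted()) {
                for (Map.Entry<String, Set<Long>> entry : blocksByNode.entrySet()) {
                  if (!ping(entry.getKey())) {
                    // Node is unreachable: re-replicate every block it was hosting.
                    scheduleReplication(entry.getValue());
                    blocksByNode.remove(entry.getKey());
                  }
                }
                try {
                  Thread.sleep(pingIntervalMs);
                } catch (InterruptedException e) {
                  return;
                }
              }
            }

            // Placeholders for the RPC and replication machinery; not real Hadoop calls.
            private boolean ping(String nodeId) { return true; }
            private void scheduleReplication(Set<Long> blockIds) { /* enqueue replication requests */ }
          }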

          Doug Cutting added a comment -

          > IMO, this will not increase the load on the name-node ...

          It depends on where the bottlenecks are in the namenode. For example, if heartbeats are already using 75% of its capacity, and we want replications to use the last 25%, then vastly increasing the heartbeat rate will starve the replications. To my thinking, we should design things so that we don't see timeouts in normal operation (except when trying to contact nodes that are malfunctioning). In particular, we shouldn't use timeouts as a primary control mechanism.

          Also note that there are different kinds of timeouts. There are connect timeouts, which mean that the server never saw the request. Response timeouts, however, usually mean that the server has received the request and will eventually respond to it, just not in the time you're willing to wait. In the latter case, the server won't notice that the client has timed out until after the response has been computed. So, if you're going to retry sooner, you should only do so after a connect timeout, and even then I'd argue that this is a poor solution.
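
          For illustration only, a plain java.net sketch of the two kinds of timeouts being
          distinguished here; this is not the Hadoop IPC code:

          import java.io.IOException;
          import java.net.InetSocketAddress;
          import java.net.Socket;
          import java.net.SocketTimeoutException;

          class TimeoutKinds {
            static void call(String host, int port) throws IOException {
              try (Socket socket = new Socket()) {
                // Connect timeout: if this expires, the server never saw the request.
                socket.connect(new InetSocketAddress(host, port), 3_000);
                // Read (response) timeout: if this expires, the server most likely received
                // the request and will still compute a response nobody is waiting for.
                socket.setSoTimeout(60_000);
                socket.getOutputStream().write(1);          // send a trivial one-byte request
                int reply = socket.getInputStream().read(); // may throw SocketTimeoutException
                System.out.println("reply = " + reply);
              } catch (SocketTimeoutException e) {
                // Per the comment above, only a connect timeout is reasonable grounds for an
                // early retry; a response timeout means the server is still doing the work.
                System.out.println("timed out: " + e.getMessage());
              }
            }
          }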

          Konstantin Shvachko added a comment -

          I think the main problem with the heartbeats right now is the 1-minute timeout before they fail.
          Reducing the timeout to, say, 3 seconds (or maybe 0 seconds; we will need to experiment with that)
          could be an easy short-term solution.
          First of all, this will randomize data-nodes' access to the name-node, and give them equal chances to
          acknowledge their existence within the 10-minute interval.
          Secondly, we will let other requests, the most frequent of which are lease extensions, go through
          and succeed, which will eventually reduce the failure rate of map-reduce tasks.

          IMO, this will not increase the load on the name-node, because the name-node does nothing to reject
          timed-out requests. I expect the name-node replication rate to drop automatically, since it will be
          processing other requests between individual block replications, which is desirable. The data-nodes
          will have to live with a higher rate of TimeoutExceptions, which is acceptable.

          I like the idea of self-adjusting heartbeats. Why not: if a data-node observes a consistent 30%
          rate of timed-out heartbeats, it should increase the heartbeat interval by 30%. And making the adjusted
          parameters persistent is a good alternative to manual configuration.

          I agree the request rates should not increase linearly with the cluster size. This makes self-adjustments
          even more important, since optimizing configurations for different cluster sizes becomes a non-trivial task.
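
          A minimal sketch of the self-adjusting heartbeat idea above; the window size, threshold
          handling, and class name are assumptions, not existing Hadoop configuration:

          // Hypothetical DataNode-side adjustment: if a sustained share of recent heartbeats
          // times out, grow the heartbeat interval by that share (e.g. 30% timeouts -> +30%).
          class AdaptiveHeartbeat {
            private long intervalMs = 3_000L;  // initial 3-second heartbeat interval
            private int sent = 0;
            private int timedOut = 0;

            // Record the outcome of one heartbeat attempt.
            void record(boolean wasTimeout) {
              sent++;
              if (wasTimeout) {
                timedOut++;
              }
              if (sent >= 100) {               // evaluate over a window of 100 heartbeats (assumed)
                double timeoutRate = (double) timedOut / sent;
                if (timeoutRate >= 0.30) {
                  intervalMs = (long) (intervalMs * (1.0 + timeoutRate));
                }
                sent = 0;
                timedOut = 0;
              }
            }

            long currentIntervalMs() {
              return intervalMs;
            }
          }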

          Doug Cutting added a comment -

          A namenode that drops 75% of its requests for 10 minutes is a problem. I think the first thing to do is to control the replication rate, so that fewer than 50 replications are attempted per second. This is fairly simple to do, since the namenode controls the issuance of replication requests. For example, it can limit the number of outstanding replications, which will effectively control the rate.

          Think of it this way: the namenode's observed current capacity is 200 heartbeats per second and 50 block replications per second. We're attempting in excess of 50 replications and still attempting 200 heartbeats, and many of the heartbeats are failing to arrive in a timely manner (as are probably many of the replication reports, but those are less critical). Retrying heartbeats sooner will just increase the load on the namenode, aggravating the problem.

          The other thing to do is limit the heartbeat traffic. Currently, heartbeat traffic is proportional to cluster size, which is not scalable. As a simple measure, we can make the heartbeat interval configurable. Longer term, we can make it adaptive. Longer yet, we could even consider inverting the control, so that the namenode pings datanodes to check if they're alive and hands them work.

          Another long-term fix would of course be to improve the namenode's performance and lessen its bottlenecks, so that it can handle more requests per second. But no matter how much we do this, we still need to make sure that all request rates are limited, and do not increase linearly with cluster size.
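
          A minimal sketch of limiting the number of outstanding replications with a counting
          semaphore, which indirectly bounds the replication rate; the cap of 50 and the class
          name are assumptions, not actual Hadoop code:

          import java.util.concurrent.Semaphore;

          // Hypothetical NameNode-side throttle: cap the number of outstanding replication
          // requests so replication work cannot starve heartbeat processing.
          class ReplicationThrottle {
            private final Semaphore outstanding = new Semaphore(50);  // assumed cap

            // Called before issuing a replication request; false means "at the cap, try later".
            boolean tryIssue() {
              return outstanding.tryAcquire();
            }

            // Called when a DataNode reports the replication as finished (or failed).
            void complete() {
              outstanding.release();
            }
          }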


            People

            • Assignee:
              Sameer Paranjpye
            • Reporter:
              Konstantin Shvachko
            • Votes:
              0
            • Watchers:
              0
