Hadoop Common
  HADOOP-1255

Name-node falls into infinite loop trying to remove a dead node.

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.12.3
    • Fix Version/s: 0.13.0
    • Component/s: None
    • Labels: None

      Description

      Under certain conditions the name-node falls into an infinite loop in heartbeatCheck().
      It's rather hard to reproduce. I'm running a one-node cluster: 1 name-node, 1 data-node.
      The data-node dies, and 10 minutes later I get

      07/04/12 10:40:34 INFO net.NetworkTopology: Removing a node: /default-rack/0.0.0.0:50077
      07/04/12 10:44:35 INFO dfs.StateChange: BLOCK* NameSystem.heartbeatCheck: lost heartbeat from 0.0.0.0:50077
      ...................................................
      07/04/12 10:45:17 INFO net.NetworkTopology: Removing a node: /default-rack/0.0.0.0:50077
      07/04/12 10:47:44 INFO dfs.StateChange: BLOCK* NameSystem.heartbeatCheck: lost heartbeat from 0.0.0.0:50077

      Here is what I see in the debugger:
      FSNamesystem.heartbeats contains 2 identical (same instance) DatanodeDescriptor entries, and both have
      DatanodeDescriptor.isAlive = false. heartbeatCheck() correctly detects that there is a dead node in
      the list, but removeDatanode() does not delete the node from heartbeats because the node is dead.
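
      The state described above can be modeled with a small, self-contained sketch. This is not the actual FSNamesystem code: the class HeartbeatLoopSketch, its Node type, the list-backed heartbeats field, and the maxPasses bound are all assumptions made for illustration, and the bound exists only so the example terminates.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of the reported state; not Hadoop source code.
class HeartbeatLoopSketch {
    static class Node {
        boolean isAlive = true;
    }

    // Assumed to be a plain list, so the same instance can appear twice.
    final List<Node> heartbeats = new ArrayList<>();

    // Models the reported behaviour: a node that is already dead is left in the list.
    void removeDatanode(Node n) {
        if (n.isAlive) {
            heartbeats.remove(n); // removes only the first reference to n
            n.isAlive = false;
        }
    }

    // Keeps trying to remove listed nodes; returns the number of passes used.
    // maxPasses is an artificial bound so this sketch cannot spin forever.
    int heartbeatCheck(int maxPasses) {
        int passes = 0;
        while (!heartbeats.isEmpty() && passes < maxPasses) {
            removeDatanode(heartbeats.get(0));
            passes++;
        }
        return passes;
    }
}
```

      With one reference in the list, the check finishes after a single pass. With two references to the same instance, the first pass removes one reference and marks the node dead, so every later pass finds the remaining reference and fails to remove it.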

      1. heartbeat.patch
        0.7 kB
        Hairong Kuang
      2. heartbeat.patch
        1 kB
        Hairong Kuang


          Activity

          Hadoop QA added a comment - Integrated in Hadoop-Nightly #83 (See http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/83/ )
          Doug Cutting added a comment -

          I just committed this. Thanks, Hairong!

          Hadoop QA added a comment - +1 http://issues.apache.org/jira/secure/attachment/12356681/heartbeat.patch applied and successfully tested against trunk revision r536239. Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/123/testReport/ Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/123/console
          Hairong Kuang added a comment -

          Many thanks to Konstantin, who spent time reproducing the infinite loop problem that he hit some time ago. I looked at his case and found that the loop was caused by HADOOP-1256. Since HADOOP-1256 already took care of his problem, I am going to mark this patch available.

          dhruba borthakur added a comment -

          +1. Code looks good. Let's get this into the next release as soon as possible.

          Hairong Kuang added a comment -

          Updated the patch to the latest trunk.

          Koji Noguchi added a comment -

          Whenever this happens, we have to restart the dfs.

          Christian Kunz added a comment -

          Just for the record, our namenode servers with release 0.12.3 got into this situation twice, once with a 1000-node cluster, once with a 500-node cluster. In this situation the server spits out 300+ messages per sec and becomes rather unresponsive to DFS clients.

          Hairong Kuang added a comment -

          Konstantin, this is interesting! I will take a look at how HADOOP-1256 causes the infinite loop after I come back from my vacation.

          Konstantin Shvachko added a comment -

          I still get the infinite loop with this patch.
          But HADOOP-1256 fixes it.

          Hairong Kuang added a comment -

          After much investigation, I was able to reproduce the problem. It is caused by the same datanode registering more than once. Each registration puts the DatanodeDescriptor in the heartbeat queue. When the heartbeat queue has more than one reference to the same DatanodeDescriptor and the datanode loses a heartbeat, heartbeatCheck will get into an infinite loop.

          This problem could be fixed either by doing a contains check before adding a DatanodeDescriptor to the heartbeat queue or by using a collection type that disallows duplicate entries for the heartbeat queue.
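
          The two options mentioned above can be sketched roughly as follows. This is only an illustration and not the committed patch: HeartbeatDedupSketch, Node, addIfAbsent, and addToSet are hypothetical names, and Node relies on default identity-based equality, which matches the "same instance" case from the description.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the two proposed fixes; not the actual Hadoop patch.
class HeartbeatDedupSketch {
    // Stand-in for DatanodeDescriptor; equality is by object identity.
    static class Node {
        final String name;
        Node(String name) { this.name = name; }
    }

    // Option 1: keep a list, but do a contains check before adding.
    // Returns true only if the node was actually added.
    static boolean addIfAbsent(List<Node> heartbeats, Node n) {
        if (heartbeats.contains(n)) {
            return false; // repeated registration, ignore
        }
        return heartbeats.add(n);
    }

    // Option 2: use a collection that disallows duplicates.
    // Set.add returns false when the element is already present.
    static boolean addToSet(Set<Node> heartbeats, Node n) {
        return heartbeats.add(n);
    }

    public static void main(String[] args) {
        Node dn = new Node("0.0.0.0:50077");

        List<Node> list = new ArrayList<>();
        addIfAbsent(list, dn);
        addIfAbsent(list, dn); // second registration is a no-op
        System.out.println("list size: " + list.size());

        Set<Node> set = new LinkedHashSet<>();
        addToSet(set, dn);
        addToSet(set, dn); // second registration is a no-op
        System.out.println("set size: " + set.size());
    }
}
```

          The contains check is linear in the size of the list, while a hash-based set makes the duplicate check constant time on average; LinkedHashSet is used here only to keep insertion order for iteration.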


            People

            • Assignee:
              Hairong Kuang
              Reporter:
              Konstantin Shvachko
            • Votes:
              0
              Watchers:
              1
