Details
- Type: Bug
- Status: Closed
- Priority: Blocker
- Resolution: Fixed
- 0.12.3
- None
- None
Description
Under certain conditions the name-node falls into an infinite loop in heartbeatCheck().
It is rather hard to reproduce. I'm running a one-node cluster: 1 name-node, 1 data-node.
The data-node dies, and 10 minutes later I get:
07/04/12 10:40:34 INFO net.NetworkTopology: Removing a node: /default-rack/0.0.0.0:50077
07/04/12 10:44:35 INFO dfs.StateChange: BLOCK* NameSystem.heartbeatCheck: lost heartbeat from 0.0.0.0:50077
...................................................
07/04/12 10:45:17 INFO net.NetworkTopology: Removing a node: /default-rack/0.0.0.0:50077
07/04/12 10:47:44 INFO dfs.StateChange: BLOCK* NameSystem.heartbeatCheck: lost heartbeat from 0.0.0.0:50077
Here is what I see in the debugger:
FSNamesystem.heartbeats contains 2 identical entries (the same DatanodeDescriptor instance), and
DatanodeDescriptor.isAlive = false for both. heartbeatCheck() correctly detects that there is a dead node in
the list, but removeDatanode() does not delete the node from heartbeats because the node is dead.
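To make the failure mode concrete, here is a minimal, self-contained sketch of the situation described above. It is not the actual FSNamesystem code: the Node class, the removeEvenIfDead flag, and the maxPasses cap are all hypothetical, introduced only so the loop can be run and stopped safely. It assumes, as the debugger observation suggests, that removal from the heartbeats list is skipped when the node is already marked dead.

```java
import java.util.ArrayList;
import java.util.List;

public class HeartbeatLoopSketch {
    // Hypothetical stand-in for DatanodeDescriptor; only the isAlive flag matters here.
    static class Node {
        final String name;
        boolean isAlive;

        Node(String name, boolean isAlive) {
            this.name = name;
            this.isAlive = isAlive;
        }
    }

    // Simplified model of removeDatanode(). With removeEvenIfDead == false it mirrors
    // the reported behaviour: a node that is already dead is left in the list.
    // The removeEvenIfDead == true branch is only an illustrative contrast,
    // not the fix that was actually committed.
    static void removeDatanode(List<Node> heartbeats, Node node, boolean removeEvenIfDead) {
        if (node.isAlive || removeEvenIfDead) {
            heartbeats.removeIf(n -> n == node); // drops every reference to this instance
            node.isAlive = false;
        }
    }

    // Simplified model of heartbeatCheck(): keep looking for a dead node and try to
    // remove it. Returns the number of removal attempts needed before the list was
    // clean, or -1 if the list still held a dead node after maxPasses attempts
    // (the real loop has no such cap, hence the hang).
    static int heartbeatCheck(List<Node> heartbeats, boolean removeEvenIfDead, int maxPasses) {
        for (int pass = 1; pass <= maxPasses; pass++) {
            Node dead = null;
            for (Node n : heartbeats) {
                if (!n.isAlive) {
                    dead = n;
                    break;
                }
            }
            if (dead == null) {
                return pass - 1;
            }
            removeDatanode(heartbeats, dead, removeEvenIfDead);
        }
        return -1;
    }

    public static void main(String[] args) {
        // Same instance registered twice and already marked dead, as seen in the debugger.
        Node node = new Node("0.0.0.0:50077", false);
        List<Node> heartbeats = new ArrayList<>();
        heartbeats.add(node);
        heartbeats.add(node);

        System.out.println("guarded removal:   " + heartbeatCheck(heartbeats, false, 5));
        System.out.println("unguarded removal: " + heartbeatCheck(heartbeats, true, 5));
    }
}
```

With the guarded removal the list never changes, so every pass finds the same dead node again and only the artificial cap ends the loop. That matches the repeated "lost heartbeat" messages in the log above.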
Attachments
Issue Links
- relates to HADOOP-1312 heartbeat monitor thread goes away (Closed)