[HADOOP-641] Name-node should demand a block report from resurrected data-nodes. - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 0.1.0, 0.7.2
Fix Version/s: 0.8.0
Component/s: None
Labels: None

Description

1. This bug contributed to the crash discussed in HADOOP-572.
When the name-node is busy and cannot process all requests from its clients,
it can declare a data-node dead and discard its blocks, adding them to the neededReplications list.
When it finally receives a heartbeat from this data-node, it resurrects the node but not the node's blocks,
and hence continues to replicate them.
Eventually the name-node will receive a block report from this data-node, but that could take up
to 1 hour. During this time it proceeds with unnecessary block replications, which could be avoided if the
data-node sent its block report right after the resurrection.

I modified the code so that the name-node requests a block report when it receives a heartbeat from a dead data-node.
I introduced a new command type in the BlockCommand class.
I replaced the multiple boolean indicators of the command type with a single enum field.
I changed the DatanodeProtocol version.
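
The heartbeat change described above can be sketched roughly as follows. This is a simplified model and not the code in ResurrectDN.patch: the class `HeartbeatSketch`, the `Action` enum and its constants, and the methods `checkExpired` and `handleHeartbeat` are all hypothetical names chosen for illustration. It shows only the decision itself: a heartbeat from a node previously marked dead revives the node and returns a command asking for an immediate block report.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, simplified model of the name-node side; not the actual Hadoop classes. */
class HeartbeatSketch {

    /** A single enum field stands in for several boolean command-type flags (names assumed). */
    enum Action { NONE, TRANSFER, INVALIDATE, BLOCK_REPORT }

    /** Minimal per-node state tracked by the name-node in this sketch. */
    static final class NodeInfo {
        long lastHeartbeatMillis;
        boolean alive = true;

        NodeInfo(long now) {
            this.lastHeartbeatMillis = now;
        }
    }

    static final class NameNode {
        private final long expiryMillis;
        private final Map<String, NodeInfo> nodes = new HashMap<>();

        NameNode(long expiryMillis) {
            this.expiryMillis = expiryMillis;
        }

        void register(String nodeId, long now) {
            nodes.put(nodeId, new NodeInfo(now));
        }

        /** Marks as dead every node whose last heartbeat is older than the expiry interval. */
        void checkExpired(long now) {
            for (NodeInfo node : nodes.values()) {
                if (node.alive && now - node.lastHeartbeatMillis > expiryMillis) {
                    // A real name-node would also queue the node's blocks for re-replication here.
                    node.alive = false;
                }
            }
        }

        /** Handles a heartbeat; a node coming back from the dead is asked for a block report. */
        Action handleHeartbeat(String nodeId, long now) {
            NodeInfo node = nodes.get(nodeId);
            if (node == null) {
                throw new IllegalArgumentException("unregistered node: " + nodeId);
            }
            boolean wasDead = !node.alive;
            node.alive = true;
            node.lastHeartbeatMillis = now;
            return wasDead ? Action.BLOCK_REPORT : Action.NONE;
        }
    }
}
```

In this model the block report is requested once, on the first heartbeat after resurrection; subsequent heartbeats return `Action.NONE` again.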

2. This patch also includes a fix for data-node registration. If a data-node times out during registration,
it silently exits, which is hard to notice in a cluster with a large number of nodes. This patch places registration in a loop
so that it can retry.
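
A minimal sketch of such a registration retry loop is shown below. It is not the patch itself: the `Registrar` interface, the `registerWithRetry` method, the attempt limit, and the sleep interval are all assumptions made for illustration (the patch may well retry without a bound). The point is that a timeout is logged and retried rather than ending the process silently.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;

/** Hypothetical sketch of data-node registration with retries; not the actual Hadoop code. */
class RegistrationRetrySketch {

    /** Stand-in for the registration RPC to the name-node (assumed interface). */
    interface Registrar {
        String register() throws IOException;
    }

    /**
     * Calls register() until it succeeds or maxAttempts timeouts have occurred.
     * Each timeout is logged so that a failing node is visible in its log.
     */
    static String registerWithRetry(Registrar registrar, int maxAttempts, long sleepMillis)
            throws IOException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        SocketTimeoutException lastTimeout = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return registrar.register();
            } catch (SocketTimeoutException e) {
                lastTimeout = e;
                System.err.println("Registration attempt " + attempt + " timed out");
                if (attempt < maxAttempts) {
                    Thread.sleep(sleepMillis); // back off before the next attempt
                }
            }
        }
        // All attempts timed out: fail loudly instead of exiting silently.
        throw lastTimeout;
    }
}
```

Only timeouts are retried here; any other `IOException` propagates immediately, on the assumption that it signals a problem a retry would not fix.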

Attachments

ResurrectDN.patch (12 kB), 26/Oct/06 01:38, Konstantin Shvachko

Issue Links

is related to

HADOOP-572 Chain reaction in a big cluster caused by simultaneous failure of only a few data-nodes.

Closed

People

Assignee: Konstantin Shvachko

Reporter: Konstantin Shvachko

Votes: 0

Watchers: 0

Dates

Created: 26/Oct/06 01:06

Updated: 03/Nov/06 22:40

Resolved: 26/Oct/06 20:22