[HADOOP-725] chooseTargets method in FSNamesystem is very inefficient - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 0.8.0
Fix Version/s: 0.9.0
Component/s: None
Labels: None
Environment: All

Description

Currently, the chooseTargets method, which selects datanodes for block placement, takes more than 20% of CPU time on a namenode. According to the profiler, it is the most time-consuming namenode method. This inefficiency has already contributed to a cascading crash in DFS. As datanodes went down, new locations had to be found for the blocks on the dead datanodes. Because this was done inside a synchronized method, it locked the whole namesystem for several minutes. No heartbeats could be processed during that interval, so the namenode marked more datanodes dead, which caused further failures. This is detailed in HADOOP-572.

The patch I am about to upload reduces the time taken in the chooseTargets method to be proportional to nReplicas per block, instead of proportional to (nDataNodes * nReplicas) as in the current implementation. Also, when a number of datanodes crash, their blocks are put on the pendingReplications list one datanode at a time in a synchronized section. (Currently, the synchronized section processes ALL the dead datanodes, thus locking the namesystem for a considerable amount of time.) This patch will also add a unit test to check replication.
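
The two ideas above can be illustrated with a minimal Java sketch. This is not the code from chooseTargets.patch, whose contents are not shown here. The class ReplicationSketch, the Node type, the namesystemLock field, and the free-space eligibility check are all hypothetical names and assumptions made for illustration. The sketch draws distinct random candidates with a sparse partial shuffle, so the expected work tracks nReplicas rather than nDataNodes when most nodes are eligible, and it takes the lock once per dead datanode rather than once for the whole batch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Random;

final class ReplicationSketch {
    /** Hypothetical stand-in for a datanode: name, free space, and the blocks it holds. */
    static final class Node {
        final String name;
        final long remainingBytes;
        final List<String> blocks;

        Node(String name, long remainingBytes, List<String> blocks) {
            this.name = name;
            this.remainingBytes = remainingBytes;
            this.blocks = blocks;
        }
    }

    // Assumed stand-ins for the namesystem lock and the pendingReplications list.
    private final Object namesystemLock = new Object();
    final List<String> pendingReplications = new ArrayList<>();

    /**
     * Picks up to nReplicas distinct nodes with enough free space.
     * Uses a sparse partial Fisher-Yates shuffle: only touched positions are
     * stored, so no O(nDataNodes) scan or copy is needed up front.
     */
    static List<Node> chooseTargets(List<Node> nodes, int nReplicas, long blockSize, Random rng) {
        List<Node> chosen = new ArrayList<>();
        Map<Integer, Integer> swapped = new HashMap<>();
        int n = nodes.size();
        for (int i = 0; i < n && chosen.size() < nReplicas; i++) {
            int j = i + rng.nextInt(n - i);            // random position in [i, n)
            int atJ = swapped.getOrDefault(j, j);      // index currently at position j
            int atI = swapped.getOrDefault(i, i);      // index currently at position i
            swapped.put(j, atI);                       // position i is never read again
            Node candidate = nodes.get(atJ);
            if (candidate.remainingBytes >= blockSize) {
                chosen.add(candidate);
            }
        }
        return chosen;
    }

    /**
     * Queues the blocks of each dead node under its own short critical section,
     * so other work (such as heartbeats) can take the lock between nodes.
     * Returns the number of separate lock acquisitions.
     */
    int processDeadNodes(Queue<Node> deadNodes) {
        int lockAcquisitions = 0;
        Node dead;
        while ((dead = deadNodes.poll()) != null) {
            synchronized (namesystemLock) {
                pendingReplications.addAll(dead.blocks);
                lockAcquisitions++;
            }
        }
        return lockAcquisitions;
    }
}
```

In the worst case, when few nodes are eligible, the selection loop still degrades to visiting every node once, so the nReplicas bound in this sketch is an expected cost, not a guarantee.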

Attachments

chooseTargets.patch
16/Nov/06 00:17
24 kB
Milind Barve

Issue Links

relates to

HADOOP-572 Chain reaction in a big cluster caused by simultaneous failure of only a few data-nodes.

Closed

People

Assignee: Milind Barve

Reporter: Milind Barve

Votes: 0

Watchers: 0

Dates

Created: 15/Nov/06 23:40

Updated: 08/Jul/09 16:42

Resolved: 20/Nov/06 23:25