Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Cannot Reproduce
Affects Version/s: 0.10.1
Fix Version/s: None
Component/s: None
Labels: None
Description
There were about 200 files that had some under-replicated blocks. A "dfs -setrep 4" followed by a "dfs -setrep 3" was done on these files. Most of the replications took place but the namenode CPU usage got stuck at 99%. The cluster has about 450 datanodes.
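For reference, the same raise-then-lower replication step can be driven through the Hadoop FileSystem API. The sketch below is illustrative only: the path is hypothetical, and the report used the "dfs -setrep" shell command rather than this code.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch of the equivalent of "dfs -setrep 4" followed by "dfs -setrep 3"
// on a single file; the path below is illustrative only.
public class SetRepExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/example/under-replicated-file");

        fs.setReplication(file, (short) 4);  // raise target replication to 4
        fs.setReplication(file, (short) 3);  // then lower it back to 3
    }
}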
In the stack trace of the namenode, we saw that there is always one thread of the following type:
"IPC Server handler 3 on 8020" daemon prio=1 tid=0x0000002d941c7d30 nid=0x2d52 runnable [0x0000000042072000..0x0000000042072eb0]
    at org.apache.hadoop.dfs.FSDirectory.getFileByBlock(FSDirectory.java:745)
    - waiting to lock <0x0000002aa212f030> (a org.apache.hadoop.dfs.FSDirectory$INode)
    at org.apache.hadoop.dfs.FSNamesystem.pendingTransfers(FSNamesystem.java:2155)
    - locked <0x0000002aa210f6b8> (a java.util.TreeSet)
    - locked <0x0000002aa21401a0> (a org.apache.hadoop.dfs.FSNamesystem)
    at org.apache.hadoop.dfs.NameNode.sendHeartbeat(NameNode.java:521)
    at sun.reflect.GeneratedMethodAccessor55.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:585)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:337)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:538)
Also, the namenode is currently not issuing any replication requests (as seen from the namenode log). A new "setrep" command was processed immediately.
My belief is that there is a block (or blocks) permanently stuck in neededReplication. This causes every heartbeat request to do a lot of additional processing, thus leading to higher CPU usage. One possibility is that all datanodes that host the replicas of the block in neededReplication are down.
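To illustrate the suspected mechanism, here is a minimal, self-contained sketch. It is not the actual FSNamesystem code; the classes and fields are hypothetical stand-ins. The point it shows: a block whose replica hosts are all dead never leaves the needed-replications queue, so every heartbeat re-scans it under the namesystem lock and consumes CPU without making progress.

import java.util.*;

// Hypothetical sketch of the suspected failure mode, not Hadoop code.
public class StuckReplicationSketch {

    // Stand-ins for blocks and datanodes.
    record Block(long id) {}
    record Datanode(String name, boolean alive) {}

    private final Set<Block> neededReplications =
            new TreeSet<>(Comparator.comparingLong(Block::id));
    private final Map<Block, List<Datanode>> replicaLocations = new HashMap<>();

    // Called for every datanode heartbeat, analogous in spirit to the
    // FSNamesystem.pendingTransfers() frame in the stack trace above.
    synchronized List<Block> pendingTransfers(Datanode heartbeatingNode) {
        // heartbeatingNode only mirrors the heartbeat-driven call; unused here.
        List<Block> work = new ArrayList<>();
        for (Iterator<Block> it = neededReplications.iterator(); it.hasNext(); ) {
            Block b = it.next();
            // Look for a live datanode that can serve as the replication source.
            boolean hasLiveSource = replicaLocations
                    .getOrDefault(b, List.of())
                    .stream()
                    .anyMatch(Datanode::alive);
            if (hasLiveSource) {
                work.add(b);
                it.remove();  // progress: the block leaves the queue
            }
            // If no live source exists, the block is neither scheduled nor
            // removed, so the next heartbeat repeats this scan: wasted CPU.
        }
        return work;
    }

    public static void main(String[] args) {
        StuckReplicationSketch ns = new StuckReplicationSketch();
        Block stuck = new Block(42L);
        ns.neededReplications.add(stuck);
        ns.replicaLocations.put(stuck,
                List.of(new Datanode("dn1", false), new Datanode("dn2", false)));

        // Every heartbeat rescans the stuck block and schedules nothing.
        for (int i = 0; i < 3; i++) {
            List<Block> work = ns.pendingTransfers(new Datanode("dn3", true));
            System.out.println("heartbeat " + i + ": scheduled " + work.size() + " transfers");
        }
    }
}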
Attachments
Issue Links
- blocks
  - HADOOP-1117 DFS Scalability: When the namenode is restarted it consumes 80% CPU (Closed)
- is related to
  - HADOOP-1133 Tools to analyze and debug namenode on a production cluster (Closed)