Issue Details (XML | Word | Printable)

Key: HADOOP-2606
Type: Bug Bug
Status: Closed Closed
Resolution: Fixed
Priority: Major Major
Assignee: Konstantin Shvachko
Reporter: Koji Noguchi
Votes: 0
Watchers: 2
Operations

If you were logged in you would be able to see more operations.
Hadoop Common

Namenode unstable when replicating 500k blocks at once

Created: 14/Jan/08 11:58 PM   Updated: 08/Jul/09 04:42 PM
Return to search
Component/s: None
Affects Version/s: 0.14.3
Fix Version/s: 0.17.0

Time Tracking:
Not Specified

File Attachments:
  Size
Text File Licensed for inclusion in ASF works ReplicatorNew.patch 2008-03-14 11:49 AM Konstantin Shvachko 43 kB
Text File Licensed for inclusion in ASF works ReplicatorNew1.patch 2008-03-18 03:12 AM Konstantin Shvachko 47 kB
Text File Licensed for inclusion in ASF works ReplicatorNew2.patch 2008-03-18 11:33 PM Konstantin Shvachko 47 kB
Text File ReplicatorTestOld.patch 2008-03-14 11:37 AM Konstantin Shvachko 38 kB
Issue Links:
Reference

Resolution Date: 19/Mar/08 12:16 AM


 Description  « Hide
We tried to decommission about 40 nodes at once, each containing 12k blocks. (about 500k total)
(This also happened when we first tried to decommission 2 million blocks)

Clients started experiencing "java.lang.RuntimeException: java.net.SocketTimeoutException: timed out waiting for rpc
response" and namenode was in 100% cpu state.

It was spending most of its time on one thread,

"org.apache.hadoop.dfs.FSNamesystem$ReplicationMonitor@7f401d28" daemon prio=10 tid=0x0000002e10702800 nid=0x6718
runnable [0x0000000041a42000..0x0000000041a42a30]
java.lang.Thread.State: RUNNABLE
at org.apache.hadoop.dfs.FSNamesystem.containingNodeList(FSNamesystem.java:2766)
at org.apache.hadoop.dfs.FSNamesystem.pendingTransfers(FSNamesystem.java:2870)

  • locked <0x0000002aa3cef720> (a org.apache.hadoop.dfs.UnderReplicatedBlocks)
  • locked <0x0000002aa3c42e28> (a org.apache.hadoop.dfs.FSNamesystem)
    at org.apache.hadoop.dfs.FSNamesystem.computeDatanodeWork(FSNamesystem.java:1928)
    at org.apache.hadoop.dfs.FSNamesystem$ReplicationMonitor.run(FSNamesystem.java:1868)
    at java.lang.Thread.run(Thread.java:619)

We confirmed that Namenode was not in the fullGC states when these problem happened.

Also, dfsadmin -metasave was showing "Blocks waiting for replication" was decreasing very slowly.

I believe this is not specific to decommission and same problem would happen if we lose one rack.



 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
dhruba borthakur added a comment - 15/Jan/08 12:27 AM
It appears to me that the ReplicationMonitor thread wakes up every 3 seconds and does one iteration. Each iteration scans the neededReplication list once for every datanode.

If a cluster has 2000 datanodes and 20K blocks per datanode, then decommissioning 40 nodes means that the size of the neededReplication list is almost 8 million. Thus, this list of 8 million is scanned 2000 times every 3 seconds. Heavy CPU consumption!


Konstantin Shvachko added a comment - 07/Mar/08 04:12 AM
I spent some time investigating this issue.
ReplicationMonitor is definitely a problem when we have a lot of under-replicated blocks. Here is how it works.
ReplicationMonitor wakes up every 3 secs and selects 32% of data-nodes.
For each selected data-node the monitor scans the list of under-replicated blocks (called neededReplications)
and selects two blocks from that list that the current node can replicate.

If we have 2000 nodes and 500,000 blocks each iteration of the monitor (the one that happens
every 3 seconds) consists of about 640 searches in the list of 500,000 blocks.
Each search is a sequential scan of the list until 2 blocks are found.
This sure can take a lot of time on average, and is especially expensive if a data-node
does not contain replicas of the blocks in the list.

Rather than optimizing this algorithm I propose to change it so that instead of choosing
data-nodes and then looking for related blocks the ReplicationMonitor selected
under replicated blocks and assigned for replication to one of the data-nodes it belongs to.
We of course should avoid the case when a lot of blocks (more than 4?) are assigned for replication to the same node.
So if all nodes a block belongs to have already been scheduled for a lot of replications, the block should be skipped.
The number of blocks to scan during one sweep should depend on the number of live data-nodes.
I'd say double that number.


dhruba borthakur added a comment - 10/Mar/08 05:29 PM
This approach look good. It might make sense to create a "replication benchmark" so that the performance gain using the new algorithm can be measured. Also, it will help to ensure that we do not regress on replication performance in the future.

Konstantin Shvachko added a comment - 14/Mar/08 11:37 AM
ReplicatorTestOld.patch contains both the new and the old versions of the replication scheduling algorithms, as well as the benchmark to compare them.
It is not intended for committing, but if anybody wants to run and compare it is here.

Konstantin Shvachko added a comment - 14/Mar/08 11:49 AM
This patch implements the approach mentioned above.
Namely, replication monitorscans the list of under-replicated blocks and schedules them for replication to and from appropriate data-nodes. This is in contrast to the current approach when we choose a node and then scan the list in order to choose a small number of blocks that the chosen node can replicate. The new algorithm tries to schedule more replications on nodes with ongoing decommission. It also does not schedule any replications on nodes that are already in decommissioned state, this part was not present in the previous algorithm.

The patch also presents a benchmark and a test.
The benchmark directly calls the replication scheduler until all blocks are replicated and measures how many blocks per second on average it can schedule. The test runs the benchmark with default parameters.

I ran the test for the old version and for the new one.
On my machine the new replicator processes about 9700 blocks per second while the old one does only 640, which is about 15 times faster.
This of course does not mean that blocks will be replicated 15 times faster in a real cluster. This just means that replication monitor will consume much less cpu and will let other name-node operations run faster.

For those who want to accelerate replication: you need to adjust an undocumented configuration parameter "dfs.max-repl-streams", which defines maximal number of replications a data-node is allowed to handle at one time. The default it is 2.

TestReplication is supposed to fail with the new algorithm. The problem is that data-nodes do not report to the name-node crc exceptions obtained during replications. Previously another data-node (if exists) would be chosen as source for the block, and the replication will finally succeed. But now the same source node is deterministically chosen all the time. I think data-nodes should report crc-exceptions the same as clients do. I'll file a bug for discussion.


dhruba borthakur added a comment - 14/Mar/08 09:08 PM
If the namenode always deterministically choses the same datanode as the source of a replication request and the source machine has a problem (bad disk, crc error, read-only partition, etc.etc) then the replication request will never be successful.

It could also be the case that maybe there is a non-transient network failure between the source datanode and the target datanode. However, both the datanodes are successfully sending heartbeats to the namenode. No CRCs error occuring here. However, the replication request between these two datanodes will keep on failing permanently.

Isn't it better if we can ensure that the namenodes tries different datanodes as the source of a replication request?


Hadoop QA added a comment - 15/Mar/08 04:20 PM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12377892/ReplicatorNew.patch
against trunk revision 619744.

@author +1. The patch does not contain any @author tags.

tests included +1. The patch appears to include 3 new or modified tests.

javadoc +1. The javadoc tool did not generate any warning messages.

javac +1. The applied patch does not generate any new javac compiler warnings.

release audit +1. The applied patch does not generate any new release audit warnings.

findbugs -1. The patch appears to introduce 1 new Findbugs warnings.

core tests -1. The patch failed core unit tests.

contrib tests +1. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1973/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1973/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1973/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1973/console

This message is automatically generated.


Konstantin Shvachko added a comment - 18/Mar/08 03:12 AM
This fixes find bugs, and I also randomized target selection in order TestReplication could pass.
I think another solution should be worked out as a part of HADOOP-3035.

dhruba borthakur added a comment - 18/Mar/08 06:58 PM
1. This patch exits the ReplicationMonitor thread when it receives Interruptedexception. This is nice, because it helps unit tests that restart namenode. Maybe we can make the same change for all other FSNamesystem deamons, e.g. DecommissionedMonitor, ResolutionMonitor, etc.

2. A typo "arleady reached replication limit". Should be "already ....".

3. If a block in neededReplication does not belong to any file, we silently remove it from neededreplication. This is a cannot happen case and we could log a message in the log.

4. This patch prefers nodes-being-decommissioned to be source of replication requests. When a node changes to the decommmissioned state, the administrator is likely to shutdown that node. There is a higher probability that node is currently serving a replication request. That repliaction request will timeout because the machine was shutdown. This is probably acceptable.

5. FSNamesystem.chooseSourceDatanode() should always return a node if possible. In the current code, this is not guaranteed because r.nextBoolean() may return false for many invocations at a stretch. It might be a good idea to do the following at the end of chooseSourceDatanode:

if (srcNode == null) {
srcNode = first datanode in list that has not reached its limit
}

6. There used to be an important log message that described a replication request:
" pending Transfer .... ask node ... ".
This has changed to
" computeReplicationWork .. ask node.."
Maybe it is a better idea to not have the name of the method in the log messages. Otherwise, when the method name changes (in the future) that log message changes too and makes it harder for people accustomed to earlier log messages to debug the system.

7, Typo in NNThroughOPutbenchmark.isInPorgress(). It should be isInProgress().


Hadoop QA added a comment - 18/Mar/08 11:06 PM
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12378097/ReplicatorNew1.patch
against trunk revision 619744.

@author +1. The patch does not contain any @author tags.

tests included +1. The patch appears to include 3 new or modified tests.

javadoc +1. The javadoc tool did not generate any warning messages.

javac +1. The applied patch does not generate any new javac compiler warnings.

release audit +1. The applied patch does not generate any new release audit warnings.

findbugs +1. The patch does not introduce any new Findbugs warnings.

core tests +1. The patch passed core unit tests.

contrib tests +1. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1988/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1988/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1988/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1988/console

This message is automatically generated.


Konstantin Shvachko added a comment - 18/Mar/08 11:33 PM
> 1. Interruptedexception or all other FSNamesystem deamons, e.g. DecommissionedMonitor, ResolutionMonitor, etc.
Yes, we should do that. And probably not only name system daemons. I'll file a jira.

> 2. A typo
> 3. If a block in neededReplication
Done

> 4. This patch prefers nodes-being-decommissioned to be source of replication requests.
My understanding is that the node in decommission-in-progress state SHOULD not be shutdown until its state changes to decommissioned.
And the state can be changed to decommissioned only if all its blocks are replicated no matter who performs replications.
If the machine is shutdown anyway then the block will eventually be replicated by another machine.

> 5. FSNamesystem.chooseSourceDatanode() should always return a node if possible.
This is a good catch thanks I corrected it.

> 6.
Removed method names from the state change logs. We should not use method names in the future because
NameNode.stateChangeLog() prints the name automatically.

> 7.
Done.


dhruba borthakur added a comment - 18/Mar/08 11:49 PM
+1 Code look good.

Konstantin Shvachko added a comment - 19/Mar/08 12:16 AM
I just committed this.

Hudson added a comment - 19/Mar/08 11:37 AM