|
I spent some time investigating this issue.
ReplicationMonitor is definitely a problem when we have a lot of under-replicated blocks. Here is how it works. ReplicationMonitor wakes up every 3 secs and selects 32% of data-nodes. For each selected data-node the monitor scans the list of under-replicated blocks (called neededReplications) and selects two blocks from that list that the current node can replicate. If we have 2000 nodes and 500,000 blocks each iteration of the monitor (the one that happens Rather than optimizing this algorithm I propose to change it so that instead of choosing This approach look good. It might make sense to create a "replication benchmark" so that the performance gain using the new algorithm can be measured. Also, it will help to ensure that we do not regress on replication performance in the future.
ReplicatorTestOld.patch contains both the new and the old versions of the replication scheduling algorithms, as well as the benchmark to compare them.
It is not intended for committing, but if anybody wants to run and compare it is here. This patch implements the approach mentioned above.
Namely, replication monitorscans the list of under-replicated blocks and schedules them for replication to and from appropriate data-nodes. This is in contrast to the current approach when we choose a node and then scan the list in order to choose a small number of blocks that the chosen node can replicate. The new algorithm tries to schedule more replications on nodes with ongoing decommission. It also does not schedule any replications on nodes that are already in decommissioned state, this part was not present in the previous algorithm. The patch also presents a benchmark and a test. I ran the test for the old version and for the new one. For those who want to accelerate replication: you need to adjust an undocumented configuration parameter "dfs.max-repl-streams", which defines maximal number of replications a data-node is allowed to handle at one time. The default it is 2. TestReplication is supposed to fail with the new algorithm. The problem is that data-nodes do not report to the name-node crc exceptions obtained during replications. Previously another data-node (if exists) would be chosen as source for the block, and the replication will finally succeed. But now the same source node is deterministically chosen all the time. I think data-nodes should report crc-exceptions the same as clients do. I'll file a bug for discussion. If the namenode always deterministically choses the same datanode as the source of a replication request and the source machine has a problem (bad disk, crc error, read-only partition, etc.etc) then the replication request will never be successful.
It could also be the case that maybe there is a non-transient network failure between the source datanode and the target datanode. However, both the datanodes are successfully sending heartbeats to the namenode. No CRCs error occuring here. However, the replication request between these two datanodes will keep on failing permanently. Isn't it better if we can ensure that the namenodes tries different datanodes as the source of a replication request? -1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12377892/ReplicatorNew.patch against trunk revision 619744. @author +1. The patch does not contain any @author tags. tests included +1. The patch appears to include 3 new or modified tests. javadoc +1. The javadoc tool did not generate any warning messages. javac +1. The applied patch does not generate any new javac compiler warnings. release audit +1. The applied patch does not generate any new release audit warnings. findbugs -1. The patch appears to introduce 1 new Findbugs warnings. core tests -1. The patch failed core unit tests. contrib tests +1. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1973/testReport/ This message is automatically generated. This fixes find bugs, and I also randomized target selection in order TestReplication could pass.
I think another solution should be worked out as a part of 1. This patch exits the ReplicationMonitor thread when it receives Interruptedexception. This is nice, because it helps unit tests that restart namenode. Maybe we can make the same change for all other FSNamesystem deamons, e.g. DecommissionedMonitor, ResolutionMonitor, etc.
2. A typo "arleady reached replication limit". Should be "already ....". 3. If a block in neededReplication does not belong to any file, we silently remove it from neededreplication. This is a cannot happen case and we could log a message in the log. 4. This patch prefers nodes-being-decommissioned to be source of replication requests. When a node changes to the decommmissioned state, the administrator is likely to shutdown that node. There is a higher probability that node is currently serving a replication request. That repliaction request will timeout because the machine was shutdown. This is probably acceptable. 5. FSNamesystem.chooseSourceDatanode() should always return a node if possible. In the current code, this is not guaranteed because r.nextBoolean() may return false for many invocations at a stretch. It might be a good idea to do the following at the end of chooseSourceDatanode: if (srcNode == null) { 6. There used to be an important log message that described a replication request: 7, Typo in NNThroughOPutbenchmark.isInPorgress(). It should be isInProgress(). +1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12378097/ReplicatorNew1.patch against trunk revision 619744. @author +1. The patch does not contain any @author tags. tests included +1. The patch appears to include 3 new or modified tests. javadoc +1. The javadoc tool did not generate any warning messages. javac +1. The applied patch does not generate any new javac compiler warnings. release audit +1. The applied patch does not generate any new release audit warnings. findbugs +1. The patch does not introduce any new Findbugs warnings. core tests +1. The patch passed core unit tests. contrib tests +1. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1988/testReport/ This message is automatically generated. > 1. Interruptedexception or all other FSNamesystem deamons, e.g. DecommissionedMonitor, ResolutionMonitor, etc.
Yes, we should do that. And probably not only name system daemons. I'll file a jira. > 2. A typo > 4. This patch prefers nodes-being-decommissioned to be source of replication requests. > 5. FSNamesystem.chooseSourceDatanode() should always return a node if possible. > 6. > 7. I just committed this.
Integrated in Hadoop-trunk #433 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/433/
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
If a cluster has 2000 datanodes and 20K blocks per datanode, then decommissioning 40 nodes means that the size of the neededReplication list is almost 8 million. Thus, this list of 8 million is scanned 2000 times every 3 seconds. Heavy CPU consumption!