Issue Details (XML | Word | Printable)

Key: HADOOP-2767
Type: Bug Bug
Status: Closed Closed
Resolution: Fixed
Priority: Minor Minor
Assignee: Hairong Kuang
Reporter: Mark Butler
Votes: 0
Watchers: 0
Operations

If you were logged in you would be able to see more operations.
Hadoop Common

org.apache.hadoop.net.NetworkTopology.InnerNode#getLeaf does not return the last node on a rack when used with an excluded node

Created: 01/Feb/08 03:58 PM   Updated: 08/Jul/09 04:42 PM
Return to search
Component/s: None
Affects Version/s: 0.15.3
Fix Version/s: 0.17.0

Time Tracking:
Not Specified

File Attachments:
  Size
Text File Licensed for inclusion in ASF works excludedLeaf.patch 2008-02-04 10:54 PM Hairong Kuang 3 kB
Text File Licensed for inclusion in ASF works excludedLeaf2.patch 2008-02-05 04:00 PM Mark Butler 6 kB
Text File Licensed for inclusion in ASF works excludedLeaf3.patch 2008-02-05 10:06 PM Hairong Kuang 6 kB
Text File Licensed for inclusion in ASF works excludedLeaf4.patch 2008-02-06 12:19 AM Hairong Kuang 6 kB
Java Source File Licensed for inclusion in ASF works NetworkTopologyTest.java 2008-02-01 06:07 PM Mark Butler 3 kB

Resolution Date: 07/Feb/08 02:54 PM


 Description  « Hide
I have written some test code that shows NetworkTopology.InnerNode#getLeaf will never return the last node on the rack if it is called with an excludedNode (for example the first node on the rack).

Consequently I suspect that NetworkTopology.chooseRandom() will never returns the last node on the remote rack for the second replica in DFS.

I have some test code that demonstrates this problem at the getLeaf level, although it is necessary to change the visibility of the NetworkTopology.InnerNode, NetworkTopology.InnerNode#getLeaf and NetworkTopology.getNode from private to package default to run the test.

TODO: Demonstrate problem at NetworkTopology.chooseRandom level, then submit candidate fix.



 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Mark Butler added a comment - 01/Feb/08 03:59 PM
This test case demonstrates the problem with last nodes on the rack at the getLeaf level.

Mark Butler added a comment - 01/Feb/08 06:07 PM
This amended test case demonstrates that NetworkTopology.chooseRandom(~node) will not return the last node on the rack.

This is called by

org.apache.hadoop.dfs.ReplicationTargetChooser#chooseRandom(int, String, List<Node>),

which in turn is called by

chooseRandom(int, String, List<Node>, long, int, List<DatanodeDescriptor>
chooseRandom(String, List<Node>, long, int, List<DatanodeDescriptor>

which are called by

chooseRemoteRack(int, DatanodeDescriptor, List<Node>, long, int, List<DatanodeDescriptor>)
chooseTarget(int, DatanodeDescriptor, List<Node>, long, int, List<DatanodeDescriptor>)

chooseTarget is an important method for placing replicas.


Mark Butler added a comment - 01/Feb/08 06:26 PM
Further examination shows the problem only occurs when you are excluding a node, not when you are excluding a rack. So it may not be a problem for DFS ...

However, if it does occur, it is not just a problem of not using a node, it also causes another node to be a "hotspot" as it is twice as likely as any other node to get a replica placement., which is clearly undesirable ..

Can someone more familiar with DFS replication code comment if this is likely to be a problem?


Hairong Kuang added a comment - 04/Feb/08 10:54 PM
Mark, thank you for your time investigating this issue. Yes, this is a bug, wich occurs when the excluded node is a leaf. Fortunately dfs block allocation strategy only excludes racks. So the bug does not have an effect on the dfs block allocation. Could you please check if the attached patch fixed the problem?

Mark Butler added a comment - 05/Feb/08 03:58 PM - edited
Hairong, thanks for the speedy reply.

As you note, looking at the code, DFS does not call this to exclude nodes. So it is not a problem for DFS. Specifically excluding racks happens in NetworkTopology.chooseRandom(String) while excluding nodes occurs in ReplicationTargetChooser.chooseRandom(int numReplicas, String nodes, List<Node> excludedNodes).

However you have written the patch, so my suggestion is we fix this. I took a look at the patch and there is a small error in it - it said

if (excludedNode instanceof InnerNode) { numOfExcludedLeaves = ((InnerNode)excludedNode).getNumOfLeaves(); isLeaf = true; }

but if it is an inner node, it can't be a leaf? Consequently it fails the test so I think it should be

if (excludedNode instanceof InnerNode) { numOfExcludedLeaves = ((InnerNode)excludedNode).getNumOfLeaves(); } else { isLeaf = true; }

I've produced an updated patch and also modified my test code and incorporated it into TestNetworkTopology.java. This should resolve the issue.

Thanks again!


Mark Butler added a comment - 05/Feb/08 04:00 PM
This is a modified patch file, it fixes a problem with the previous patch and adds unit tests for this change.

Hairong Kuang added a comment - 05/Feb/08 10:06 PM
Mark, thank you for providing junit tests for the patch. They look good. The patch that I am uploading have two minor changes.

Tsz Wo (Nicholas), SZE added a comment - 05/Feb/08 11:03 PM
excludedLeaf3.patch: +1
codes look good, I suggest the following change:
...
      int numOfExcludedLeaves = 1;
      boolean isLeaf = !(excludedNode instanceof InnerNode);
      if (!isLeaf) {
        numOfExcludedLeaves = ((InnerNode)excludedNode).getNumOfLeaves();
      }

Hairong Kuang added a comment - 06/Feb/08 12:19 AM
Incorporated Nicholas's comment.

Hadoop QA added a comment - 06/Feb/08 03:07 AM
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12374834/excludedLeaf4.patch
against trunk revision 616796.

@author +1. The patch does not contain any @author tags.

javadoc +1. The javadoc tool did not generate any warning messages.

javac +1. The applied patch does not generate any new javac compiler warnings.

release audit +1. The applied patch does not generate any new release audit warnings.

findbugs +1. The patch does not introduce any new Findbugs warnings.

core tests +1. The patch passed core unit tests.

contrib tests +1. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1747/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1747/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1747/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1747/console

This message is automatically generated.


dhruba borthakur added a comment - 07/Feb/08 02:54 PM
I just committed this. Thanks Mark and Hairong!

Hudson added a comment - 08/Feb/08 12:01 PM