Hadoop HDFS / HDFS-10320

Rack failures may result in NN termination

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.6.0
    • Fix Version/s: 2.8.0, 3.0.0-alpha1
    • Component/s: None

      Description

      If rack failures leave only 1 rack available, BlockPlacementPolicyDefault#chooseRandom may get an InvalidTopologyException when calling NetworkTopology#chooseRandom, which then propagates all the way out to BlockManager's ReplicationMonitor thread and terminates the NN.

      Log:

      2016-02-24 09:22:01,514  WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Failed to place enough replicas, still in need of 1 to reach 3 (unavailableStorages=[], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy
      
      2016-02-24 09:22:01,958  ERROR org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: ReplicationMonitor thread received Runtime exception. 
      org.apache.hadoop.net.NetworkTopology$InvalidTopologyException: Failed to find datanode (scope="" excludedScope="/rack_a5").
      	at org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:729)
      	at org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:694)
      	at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:635)
      	at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRemoteRack(BlockPlacementPolicyDefault.java:580)
      	at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:348)
      	at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:214)
      	at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:111)
      	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationWork.chooseTargets(BlockManager.java:3746)
      	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationWork.access$200(BlockManager.java:3711)
      	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1400)
      	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1306)
      	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:3682)
      	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:3634)
      	at java.lang.Thread.run(Thread.java:745)
      
      1. HDFS-10320.01.patch
        2 kB
        Xiao Chen
      2. HDFS-10320.02.patch
        4 kB
        Xiao Chen
      3. HDFS-10320.03.patch
        21 kB
        Xiao Chen
      4. HDFS-10320.04.patch
        22 kB
        Xiao Chen
      5. HDFS-10320.05.patch
        22 kB
        Xiao Chen
      6. HDFS-10320.06.patch
        22 kB
        Xiao Chen

        Issue Links

          Activity

          Xiao Chen added a comment -

          The failure is due to a race, since in BPPD#chooseRandom we calculate available nodes before the while loop.
          This bug only happens under the following conditions:

          1. numOfAvailableNodes is calculated before the while loop
          2. A rack failure leaves only nodes on the same rack as the current replica. In the occurrence we saw, the cluster had only 2 racks, and 1 rack failed.
          3. BPPD#chooseDataNode -> NetworkTopology#chooseRandom, current rack is in excludedScope, so no datanodes can be chosen.

          IMHO, the fix would be to fall back to the current rack and log a warning message - HDFS has no option but to replicate on the only rack alive. Administrators are expected to recover the failed rack(s).
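          The time-of-check/time-of-use race described above can be sketched with a toy class. This is a minimal, self-contained illustration (ToyTopology and its methods are invented for this sketch, not the actual Hadoop NetworkTopology): the available-node count is read once, but the topology can shrink before selection happens, at which point choosing with the local rack excluded has nothing left and the exception escapes.

          ```java
          import java.util.ArrayList;
          import java.util.List;
          import java.util.Random;

          // Toy stand-in for NetworkTopology, illustrating the race only.
          class ToyTopology {
              private final List<String> nodes = new ArrayList<>();
              private final Random rand = new Random();

              synchronized void add(String node) { nodes.add(node); }

              synchronized void removeRack(String rackPrefix) {
                  // Simulates a whole rack failing.
                  nodes.removeIf(n -> n.startsWith(rackPrefix));
              }

              synchronized int countAvailable() { return nodes.size(); }

              // Mirrors NetworkTopology#chooseRandom's failure mode: throws when
              // nothing outside the excluded scope is left.
              synchronized String chooseRandom(String excludedScope) {
                  List<String> candidates = new ArrayList<>();
                  for (String n : nodes) {
                      if (!n.startsWith(excludedScope)) {
                          candidates.add(n);
                      }
                  }
                  if (candidates.isEmpty()) {
                      throw new IllegalStateException(
                          "Failed to find datanode (excludedScope=" + excludedScope + ")");
                  }
                  return candidates.get(rand.nextInt(candidates.size()));
              }
          }

          public class RaceSketch {
              public static void main(String[] args) {
                  ToyTopology topo = new ToyTopology();
                  topo.add("/rack_a5/dn1");
                  topo.add("/rack_b2/dn2");

                  // Step 1: the available count is captured before the loop.
                  int numOfAvailableNodes = topo.countAvailable();

                  // Step 2: the remote rack fails between the check and the choose.
                  topo.removeRack("/rack_b2");

                  // Step 3: choosing with the local rack excluded has nothing left,
                  // so the exception escapes (in HDFS, up to ReplicationMonitor).
                  try {
                      topo.chooseRandom("/rack_a5");
                  } catch (IllegalStateException e) {
                      System.out.println("analogue of InvalidTopologyException: " + e.getMessage());
                  }
              }
          }
          ```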

          Xiao Chen added a comment -

          Patch 1 logs an error when NetworkTopology#chooseRandom cannot choose any nodes, then BPPD will try to fall back on local rack.

          I haven't come up with a decent unit test, since we'd want to test BPPD's chooseRemoteRack to cover our change, but it's hard to reach the race condition from a test. NT#countNumOfAvailableNodes will return 0 if we fail the racks in the test beforehand, and mocking it to return non-zero will make the loop in BPPD#chooseRandom never exit... Any comments are highly appreciated. Thanks!

          Xiao Chen added a comment -

          I went through this again, and feel the current way of falling back to the local rack in chooseRemoteRack is desirable - it follows the current code logic (and the comment of that method).

          Ming Ma, may I get your view on this? Thanks in advance!

          Ming Ma added a comment -

          Xiao Chen, regarding the expected behavior: besides your suggestion (continue to allocate on the same rack as the other replicas, violating the policy), another option is to skip the allocation of this replica (BM asks for 3, but BPPD only returns 2). Normally when there is only one rack (or with your steps 1 and 2 swapped), BPPD's chooseTarget method skips the additional replica allocation instead of forcefully allocating the additional replica on the same rack. The benefit of the skip behavior is that it is more consistent regardless of the race condition; it also keeps the invariant that BPPD never tries to violate its policy. What do you think? BTW, it appears your latest patch is for a different jira.

          Xiao Chen added a comment -

          Thanks Ming Ma for the advice. I hadn't understood the placement code correctly. Following your suggestion, I found that although chooseRemoteRack tries to fall back, it doesn't actually choose the local node, due to constraints such as maxReplicasPerRack and excludedNodes. TestReplicationPolicy#testChooseTarget6 covers this case - only 1 node is chosen even when 2 are asked for.

          I now believe we should try our best not to violate the policy. Thus, it's better not to fall back in chooseRemoteRack in case of InvalidTopologyException. Patch 2 reflects this idea. A unit test is hard to come by, due to the difficulty of reproducing the race condition inside BPPD#chooseRandom from a test class...

          (I removed the embarrassing wrong-jira patch.)

          Ming Ma added a comment -

          Thanks Xiao Chen! Should we push the InvalidTopologyException handling down into the chooseDataNode function? In addition, it seems chooseRandom has special handling to refresh numOfAvailableNodes for a similar situation to the one you raised, e.g., another thread could remove a node from the Topology around the same time. Maybe we can rework the function: instead of using numOfAvailableNodes, add a new function to Topology that supports excludedNodes, something like:

          Topology.java
          // return null if no node can be chosen.
          public Node tryToChooseRandom(String scope, Collection<Node> excludedNodes);
          
          BlockPlacementPolicyDefault.java
            protected DatanodeDescriptor chooseDataNode(final String scope, final Collection<Node> excludedNodes) {
              return (DatanodeDescriptor) clusterMap.tryToChooseRandom(scope, excludedNodes);
            }
          
          ...
          DatanodeDescriptor chosenNode = chooseDataNode(scope, excludedNodes);
          if (chosenNode == null) {
            break;
          }
          

          For the unit test, mock might be useful, for example, have the mock Topology remove the other racks as part of countNumOfAvailableNodes call.
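          The proposed contract above - return null instead of throwing, and have the caller break - can be sketched with a toy class. This is an illustrative stand-in under the comment's assumptions (ToyTopology2 and its method names are invented for the sketch, not the final Hadoop API):

          ```java
          import java.util.ArrayList;
          import java.util.Collection;
          import java.util.HashSet;
          import java.util.List;
          import java.util.Random;
          import java.util.Set;

          // Toy stand-in for the proposed Topology API.
          class ToyTopology2 {
              private final List<String> nodes = new ArrayList<>();
              private final Random rand = new Random();

              void add(String node) { nodes.add(node); }

              // Proposed contract: return null if no node can be chosen,
              // rather than throwing InvalidTopologyException.
              String tryToChooseRandom(String scope, Collection<String> excludedNodes) {
                  List<String> candidates = new ArrayList<>();
                  for (String n : nodes) {
                      if (n.startsWith(scope) && !excludedNodes.contains(n)) {
                          candidates.add(n);
                      }
                  }
                  return candidates.isEmpty()
                      ? null
                      : candidates.get(rand.nextInt(candidates.size()));
              }
          }

          public class NullContractSketch {
              public static void main(String[] args) {
                  ToyTopology2 topo = new ToyTopology2();
                  topo.add("/rack_a5/dn1");
                  Set<String> excludes = new HashSet<>();
                  excludes.add("/rack_a5/dn1");

                  // Caller pattern from the comment above: stop trying on null
                  // instead of letting an exception crash the caller.
                  String chosen = topo.tryToChooseRandom("/rack_a5", excludes);
                  if (chosen == null) {
                      System.out.println("no node available; skip this replica");
                  }
              }
          }
          ```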

          Xiao Chen added a comment -

          Hi Ming Ma,
          Thanks a lot for the enlightening suggestion. Given we've fixed some things along this line before (e.g. HDFS-4937) and the code is tricky (I think), refactoring it for better maintainability sounds like a great idea to me.

          Patch 3 attempts to do this, in the way proposed above. (I considered other options, but Ming's suggestion seems to be the best.)

          • Add an overload of NetworkTopology#chooseRandom that takes excludedNodes.
          • The 'choose a node that's not excluded' loop is now inside NetworkTopology.
          • Kept the same approach to choosing a random node as before. That is, simply get a random node and check it against excludedNodes, until a satisfying one is found. countNumOfAvailableNodes is checked before the loop, inside the lock, so we shouldn't have any infinite loops.
          • No NetworkTopology#InvalidTopologyException is thrown from chooseRandom. If no node is available, return null. Of course this is incompatible, but since the class is marked as unstable, I think we're fine. This also fixes the original issue of this jira - a failure to choose a node shouldn't terminate the NN - since no exception is thrown now.
          • Updated the necessary callers in Hadoop to handle the above change.
          Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 12s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
          0 mvndep 0m 16s Maven dependency ordering for branch
          +1 mvninstall 6m 42s trunk passed
          +1 compile 6m 15s trunk passed with JDK v1.8.0_91
          +1 compile 6m 49s trunk passed with JDK v1.7.0_95
          +1 checkstyle 1m 7s trunk passed
          +1 mvnsite 1m 49s trunk passed
          +1 mvneclipse 0m 27s trunk passed
          +1 findbugs 3m 30s trunk passed
          +1 javadoc 1m 59s trunk passed with JDK v1.8.0_91
          +1 javadoc 2m 53s trunk passed with JDK v1.7.0_95
          0 mvndep 0m 58s Maven dependency ordering for patch
          +1 mvninstall 1m 33s the patch passed
          +1 compile 5m 56s the patch passed with JDK v1.8.0_91
          +1 javac 5m 56s the patch passed
          +1 compile 6m 48s the patch passed with JDK v1.7.0_95
          +1 javac 6m 48s the patch passed
          -1 checkstyle 1m 11s root: patch generated 1 new + 229 unchanged - 4 fixed = 230 total (was 233)
          +1 mvnsite 1m 51s the patch passed
          +1 mvneclipse 0m 28s the patch passed
          +1 whitespace 0m 0s Patch has no whitespace issues.
          +1 findbugs 4m 3s the patch passed
          +1 javadoc 2m 1s the patch passed with JDK v1.8.0_91
          +1 javadoc 2m 54s the patch passed with JDK v1.7.0_95
          +1 unit 7m 28s hadoop-common in the patch passed with JDK v1.8.0_91.
          -1 unit 59m 16s hadoop-hdfs in the patch failed with JDK v1.8.0_91.
          +1 unit 7m 44s hadoop-common in the patch passed with JDK v1.7.0_95.
          -1 unit 56m 27s hadoop-hdfs in the patch failed with JDK v1.7.0_95.
          +1 asflicense 0m 26s Patch does not generate ASF License warnings.
          192m 24s



          Reason Tests
          JDK v1.8.0_91 Failed junit tests hadoop.hdfs.shortcircuit.TestShortCircuitCache
            hadoop.hdfs.TestHFlush
          JDK v1.7.0_95 Failed junit tests hadoop.hdfs.TestHFlush
            hadoop.hdfs.TestDFSClientRetries



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:cf2ee45
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12801819/HDFS-10320.03.patch
          JIRA Issue HDFS-10320
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux 4273c13f1ba9 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 9e8411d
          Default Java 1.7.0_95
          Multi-JDK versions /usr/lib/jvm/java-8-oracle:1.8.0_91 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_95
          findbugs v3.0.0
          checkstyle https://builds.apache.org/job/PreCommit-HDFS-Build/15339/artifact/patchprocess/diff-checkstyle-root.txt
          unit https://builds.apache.org/job/PreCommit-HDFS-Build/15339/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.8.0_91.txt
          unit https://builds.apache.org/job/PreCommit-HDFS-Build/15339/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.7.0_95.txt
          unit test logs https://builds.apache.org/job/PreCommit-HDFS-Build/15339/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.8.0_91.txt https://builds.apache.org/job/PreCommit-HDFS-Build/15339/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.7.0_95.txt
          JDK v1.7.0_95 Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/15339/testReport/
          modules C: hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs U: .
          Console output https://builds.apache.org/job/PreCommit-HDFS-Build/15339/console
          Powered by Apache Yetus 0.2.0 http://yetus.apache.org

          This message was automatically generated.

          Ming Ma added a comment -

          Thanks Xiao Chen! The patch looks good overall. A couple of questions:

          • Although it requires extra work, it might be useful to add a new unit test to verify the race condition you identified.
          • It seems the following line should always return true, given it has explicitly excluded those nodes earlier.
                  if (excludedNodes.add(chosenNode)) { //was not in the excluded list
            
          • Is the following line still necessary given it was added to excludedNodes earlier?
            addToExcludedNodes(chosenNode, excludedNodes);
            
          • The following change in webHDFS seems right. But I wonder whether that means it didn't take excludes into consideration before, in some scenarios.
                return (DatanodeDescriptor)bm.getDatanodeManager().getNetworkTopology(
                    ).chooseRandom(NodeBase.ROOT, excludes);
            
          Xiao Chen added a comment -

          Thank you Ming Ma! Patch 4 is attached:

          Although it requires extra work, it might be useful to add a new unit test to verify the race condition you identified.

          I agree it's usually better to verify the race condition, since the code path is not tested in normal cases. In this case, however, our rewrite makes the race condition no longer 'special' - it's just returning null from chooseDataNode, which we know we can handle and the path is covered by existing tests. So I didn't add a unit test here. Feel free to point out if I missed a point.

          It seems the following line should always return true given it has explicitly exclude those nodes earlier.

          Correct, I left it only as a precondition check. Refactored to do the precondition check beforehand and omit the if-else.

          Is the following line still necessary given it was added to excludedNodes earlier?

          Good catch, removed. I don't think it was needed before this patch either... since it's inside the if (excludedNodes.add(chosenNode)) block, right?
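          For reference, the Set#add contract this hinges on: add() returns true only when the element was not already present, which is why the if-branch is always taken once the chosen node is guaranteed to be fresh. A quick standalone illustration:

          ```java
          import java.util.HashSet;
          import java.util.Set;

          public class SetAddSketch {
              public static void main(String[] args) {
                  Set<String> excludedNodes = new HashSet<>();
                  // First add: element not present yet, returns true.
                  System.out.println(excludedNodes.add("/rack_a5/dn1")); // true
                  // Second add: element already present, returns false.
                  System.out.println(excludedNodes.add("/rack_a5/dn1")); // false
              }
          }
          ```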

          The following change in webHDFS seems right. But wonder if that means it didn't take excludes into consideration before under some scenario.

          This seems to have been a miss when HDFS-6616 added excludeDatanodes support for webhdfs. It's probably a rare enough case that no one has run into it.

          Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 23m 14s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
          0 mvndep 1m 24s Maven dependency ordering for branch
          +1 mvninstall 8m 41s trunk passed
          +1 compile 10m 41s trunk passed with JDK v1.8.0_91
          +1 compile 9m 23s trunk passed with JDK v1.7.0_95
          +1 checkstyle 1m 27s trunk passed
          +1 mvnsite 2m 30s trunk passed
          +1 mvneclipse 0m 38s trunk passed
          +1 findbugs 3m 58s trunk passed
          +1 javadoc 2m 29s trunk passed with JDK v1.8.0_91
          +1 javadoc 3m 6s trunk passed with JDK v1.7.0_95
          0 mvndep 0m 16s Maven dependency ordering for patch
          +1 mvninstall 1m 45s the patch passed
          +1 compile 9m 6s the patch passed with JDK v1.8.0_91
          +1 javac 9m 6s the patch passed
          +1 compile 7m 48s the patch passed with JDK v1.7.0_95
          +1 javac 7m 48s the patch passed
          -1 checkstyle 1m 10s root: patch generated 1 new + 228 unchanged - 5 fixed = 229 total (was 233)
          +1 mvnsite 2m 1s the patch passed
          +1 mvneclipse 0m 29s the patch passed
          +1 whitespace 0m 0s Patch has no whitespace issues.
          +1 findbugs 4m 18s the patch passed
          +1 javadoc 2m 22s the patch passed with JDK v1.8.0_91
          +1 javadoc 3m 6s the patch passed with JDK v1.7.0_95
          +1 unit 10m 22s hadoop-common in the patch passed with JDK v1.8.0_91.
          -1 unit 105m 35s hadoop-hdfs in the patch failed with JDK v1.8.0_91.
          +1 unit 10m 16s hadoop-common in the patch passed with JDK v1.7.0_95.
          -1 unit 90m 2s hadoop-hdfs in the patch failed with JDK v1.7.0_95.
          +1 asflicense 0m 31s Patch does not generate ASF License warnings.
          318m 8s



          Reason Tests
          JDK v1.8.0_91 Failed junit tests hadoop.hdfs.security.TestDelegationTokenForProxyUser
            hadoop.hdfs.TestFileAppend
            hadoop.hdfs.TestRollingUpgrade
            hadoop.hdfs.server.blockmanagement.TestReplicationPolicyWithNodeGroup
            hadoop.hdfs.TestAsyncDFSRename
          JDK v1.8.0_91 Timed out junit tests org.apache.hadoop.fs.viewfs.TestViewFileSystemWithAcls
          JDK v1.7.0_95 Failed junit tests hadoop.hdfs.server.namenode.TestNameNodeMetadataConsistency
            hadoop.hdfs.server.datanode.fsdataset.impl.TestFsDatasetImpl
            hadoop.hdfs.server.namenode.TestEditLog
            hadoop.metrics2.sink.TestRollingFileSystemSinkWithSecureHdfs
            hadoop.hdfs.server.blockmanagement.TestReplicationPolicyWithNodeGroup
            hadoop.hdfs.TestAsyncDFSRename



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:cf2ee45
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12802136/HDFS-10320.04.patch
          JIRA Issue HDFS-10320
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux 27c06dace28d 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 355325b
          Default Java 1.7.0_95
          Multi-JDK versions /usr/lib/jvm/java-8-oracle:1.8.0_91 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_95
          findbugs v3.0.0
          checkstyle https://builds.apache.org/job/PreCommit-HDFS-Build/15356/artifact/patchprocess/diff-checkstyle-root.txt
          unit https://builds.apache.org/job/PreCommit-HDFS-Build/15356/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.8.0_91.txt
          unit https://builds.apache.org/job/PreCommit-HDFS-Build/15356/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.7.0_95.txt
          unit test logs https://builds.apache.org/job/PreCommit-HDFS-Build/15356/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.8.0_91.txt https://builds.apache.org/job/PreCommit-HDFS-Build/15356/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.7.0_95.txt
          JDK v1.7.0_95 Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/15356/testReport/
          modules C: hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs U: .
          Console output https://builds.apache.org/job/PreCommit-HDFS-Build/15356/console
          Powered by Apache Yetus 0.2.0 http://yetus.apache.org

          This message was automatically generated.

          xiaochen Xiao Chen added a comment -

          After a good night's sleep, I see the addToExcludedNodes(chosenNode, excludedNodes); line is needed for subclasses. Jenkins is complaining about it as well... Patch 5 adds it back. Sorry I missed that in my last patch.
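          For context, the subclass dependency works roughly like this (a hypothetical, simplified sketch: BasePolicy, NodeGroupPolicy, and the groupMembers map are invented stand-ins, not the real BlockPlacementPolicy classes). A node-group-aware subclass overrides the hook to exclude every node in the chosen node's group, so dropping the base-class call silently breaks those subclasses:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class BasePolicy {
  /** Hook: by default, only the chosen node itself is excluded. */
  int addToExcludedNodes(String chosenNode, Set<String> excluded) {
    return excluded.add(chosenNode) ? 1 : 0;
  }
}

class NodeGroupPolicy extends BasePolicy {
  // Invented input for the sketch: group path -> member node paths.
  private final Map<String, List<String>> groupMembers;

  NodeGroupPolicy(Map<String, List<String>> groupMembers) {
    this.groupMembers = groupMembers;
  }

  /** Excludes every node in the chosen node's group, not just the chosen node. */
  @Override
  int addToExcludedNodes(String chosenNode, Set<String> excluded) {
    String group = chosenNode.substring(0, chosenNode.lastIndexOf('/'));
    int added = 0;
    for (String n : groupMembers.getOrDefault(group, List.of(chosenNode))) {
      if (excluded.add(n)) {
        added++;
      }
    }
    return added;
  }
}
```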

          mingma Ming Ma added a comment -

          for subclasses

          Good catch. Maybe rename the method or add more comments there.

          Another thing: in NetworkTopology.java's {{chooseRandom(final String scope, String excludedScope, final Collection<Node> excludedNodes)}}, there is an existing numOfDatanodes; can it take the excludes into account, or do you still need to recompute a separate availableNodes?

          xiaochen Xiao Chen added a comment -

          Maybe rename the method or add more comments there.

          Sure, I updated the original comment before that addToExcludedNodes call.

          numOfDatanodes vs. availableNodes in NetworkTopology.java's chooseRandom

          This is the fun part. They're different things. InnerNode#getNumOfLeaves returns the total number of leaves, and the 'randomly choose 1' is done by innerNode.getLeaf(leaveIndex, node), providing a (randomly generated) index and the (most ancestral) node from excludedScope. I checked all the way down for the feasibility of adding excludedNodes to getLeaf when coming up with patch 3, but decided to keep the current implementation for 2 reasons:

          • Less change. We don't have to change all the way into InnerNode for this bug fix, hence less effort.
          • It is more consistent with current behavior. Currently we loop in BPPD: if we get a node that's already excluded, we call chooseDataNode again. This patch simply moves that loop inside.
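          The retry-loop approach can be sketched roughly like this (a hypothetical, self-contained illustration in plain Java; this is not the actual NetworkTopology code, and the names and types are simplified for the sketch):

```java
import java.util.Collection;
import java.util.List;
import java.util.Random;

class ChooseRandomSketch {
  private final Random rand = new Random();

  /**
   * Pick a random candidate that is not excluded, or return null if none
   * is available. Returning null (instead of throwing) is what would keep
   * the caller's ReplicationMonitor thread alive.
   */
  <T> T chooseRandom(List<T> candidates, Collection<T> excluded) {
    // Recompute what is actually available; the total candidate count
    // alone cannot tell us whether the excludes have eaten everything.
    int availableNodes = 0;
    for (T c : candidates) {
      if (!excluded.contains(c)) {
        availableNodes++;
      }
    }
    if (availableNodes <= 0) {
      return null;
    }
    // Safe to loop: at least one non-excluded candidate exists.
    T ret = null;
    while (ret == null) {
      T c = candidates.get(rand.nextInt(candidates.size()));
      if (!excluded.contains(c)) {
        ret = c;
      }
    }
    return ret;
  }
}
```

          Counting availableNodes up front is what lets the method return "no node" to the caller instead of looping forever or failing when the excludes cover the whole scope.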

          One implementation detail I also considered: in NetworkTopology.java's chooseRandom, without changing InnerNode, we could maintain an index mapping of the available nodes, randomly choose an index from the mapping, then get the node by that index. If the node is in excludedNodes, we remove its index from the mapping. Although this would make the loop run fewer iterations (since each draw hits a different node), for HDFS clusters with an enormous number of DNs, the space consumption and the overhead of setting up the index mapping outweigh the benefit. I assume this is why we had that simple loop in BPPD in the first place.
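          The index-mapping alternative described above might look like this (again a hypothetical sketch, not proposed code; the generic chooseWithoutRepeats name is invented for illustration). The O(n) copy up front is the setup overhead weighed against the saved iterations:

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Random;

class IndexMappingSketch {
  private final Random rand = new Random();

  /**
   * Pick a random non-excluded candidate, or return null if none remains.
   * Each draw hits a node not seen before, because excluded picks are
   * dropped from the working list.
   */
  <T> T chooseWithoutRepeats(List<T> candidates, Collection<T> excluded) {
    // O(n) space and time: this is the setup cost the loop savings must beat.
    List<T> eligible = new ArrayList<>(candidates);
    while (!eligible.isEmpty()) {
      int i = rand.nextInt(eligible.size());
      T pick = eligible.get(i);
      if (!excluded.contains(pick)) {
        return pick;
      }
      // Swap-remove in O(1) so each later draw sees a strictly smaller set.
      eligible.set(i, eligible.get(eligible.size() - 1));
      eligible.remove(eligible.size() - 1);
    }
    return null;
  }
}
```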

          Please let me know what you think. Thanks!

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 15s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
          0 mvndep 1m 39s Maven dependency ordering for branch
          +1 mvninstall 7m 53s trunk passed
          +1 compile 6m 33s trunk passed with JDK v1.8.0_91
          +1 compile 7m 8s trunk passed with JDK v1.7.0_95
          +1 checkstyle 1m 9s trunk passed
          +1 mvnsite 1m 51s trunk passed
          +1 mvneclipse 0m 29s trunk passed
          +1 findbugs 3m 46s trunk passed
          +1 javadoc 2m 8s trunk passed with JDK v1.8.0_91
          +1 javadoc 3m 7s trunk passed with JDK v1.7.0_95
          0 mvndep 0m 15s Maven dependency ordering for patch
          +1 mvninstall 1m 37s the patch passed
          +1 compile 6m 39s the patch passed with JDK v1.8.0_91
          +1 javac 6m 39s the patch passed
          +1 compile 6m 44s the patch passed with JDK v1.7.0_95
          +1 javac 6m 44s the patch passed
          +1 checkstyle 1m 6s root: patch generated 0 new + 229 unchanged - 5 fixed = 229 total (was 234)
          +1 mvnsite 1m 47s the patch passed
          +1 mvneclipse 0m 28s the patch passed
          +1 whitespace 0m 0s Patch has no whitespace issues.
          +1 findbugs 4m 2s the patch passed
          +1 javadoc 1m 57s the patch passed with JDK v1.8.0_91
          +1 javadoc 2m 52s the patch passed with JDK v1.7.0_95
          -1 unit 6m 33s hadoop-common in the patch failed with JDK v1.8.0_91.
          -1 unit 58m 40s hadoop-hdfs in the patch failed with JDK v1.8.0_91.
          +1 unit 7m 24s hadoop-common in the patch passed with JDK v1.7.0_95.
          -1 unit 54m 43s hadoop-hdfs in the patch failed with JDK v1.7.0_95.
          +1 asflicense 0m 29s Patch does not generate ASF License warnings.
          192m 40s



          Reason Tests
          JDK v1.8.0_91 Failed junit tests hadoop.net.TestDNS
            hadoop.hdfs.TestHFlush
            hadoop.hdfs.server.datanode.TestDataNodeLifeline
            hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery
          JDK v1.7.0_95 Failed junit tests hadoop.hdfs.TestHFlush



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:cf2ee45
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12802225/HDFS-10320.05.patch
          JIRA Issue HDFS-10320
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux 1ea158177cf0 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 36972d6
          Default Java 1.7.0_95
          Multi-JDK versions /usr/lib/jvm/java-8-oracle:1.8.0_91 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_95
          findbugs v3.0.0
          unit https://builds.apache.org/job/PreCommit-HDFS-Build/15360/artifact/patchprocess/patch-unit-hadoop-common-project_hadoop-common-jdk1.8.0_91.txt
          unit https://builds.apache.org/job/PreCommit-HDFS-Build/15360/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.8.0_91.txt
          unit https://builds.apache.org/job/PreCommit-HDFS-Build/15360/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.7.0_95.txt
          unit test logs https://builds.apache.org/job/PreCommit-HDFS-Build/15360/artifact/patchprocess/patch-unit-hadoop-common-project_hadoop-common-jdk1.8.0_91.txt https://builds.apache.org/job/PreCommit-HDFS-Build/15360/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.8.0_91.txt https://builds.apache.org/job/PreCommit-HDFS-Build/15360/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.7.0_95.txt
          JDK v1.7.0_95 Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/15360/testReport/
          modules C: hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs U: .
          Console output https://builds.apache.org/job/PreCommit-HDFS-Build/15360/console
          Powered by Apache Yetus 0.2.0 http://yetus.apache.org

          This message was automatically generated.

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 13s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
          0 mvndep 0m 15s Maven dependency ordering for branch
          +1 mvninstall 6m 45s trunk passed
          +1 compile 5m 54s trunk passed with JDK v1.8.0_91
          +1 compile 6m 52s trunk passed with JDK v1.7.0_95
          +1 checkstyle 1m 10s trunk passed
          +1 mvnsite 1m 50s trunk passed
          +1 mvneclipse 0m 29s trunk passed
          +1 findbugs 3m 30s trunk passed
          +1 javadoc 2m 2s trunk passed with JDK v1.8.0_91
          +1 javadoc 2m 51s trunk passed with JDK v1.7.0_95
          0 mvndep 0m 14s Maven dependency ordering for patch
          +1 mvninstall 1m 28s the patch passed
          +1 compile 5m 40s the patch passed with JDK v1.8.0_91
          +1 javac 5m 40s the patch passed
          +1 compile 6m 43s the patch passed with JDK v1.7.0_95
          +1 javac 6m 43s the patch passed
          +1 checkstyle 1m 8s root: patch generated 0 new + 229 unchanged - 5 fixed = 229 total (was 234)
          +1 mvnsite 1m 49s the patch passed
          +1 mvneclipse 0m 28s the patch passed
          +1 whitespace 0m 0s Patch has no whitespace issues.
          +1 findbugs 4m 3s the patch passed
          +1 javadoc 2m 1s the patch passed with JDK v1.8.0_91
          +1 javadoc 2m 53s the patch passed with JDK v1.7.0_95
          -1 unit 6m 52s hadoop-common in the patch failed with JDK v1.8.0_91.
          -1 unit 58m 51s hadoop-hdfs in the patch failed with JDK v1.8.0_91.
          +1 unit 7m 20s hadoop-common in the patch passed with JDK v1.7.0_95.
          -1 unit 54m 37s hadoop-hdfs in the patch failed with JDK v1.7.0_95.
          +1 asflicense 0m 25s Patch does not generate ASF License warnings.
          187m 47s



          Reason Tests
          JDK v1.8.0_91 Failed junit tests hadoop.ha.TestZKFailoverController
            hadoop.net.TestDNS
            hadoop.hdfs.TestFileAppend
            hadoop.hdfs.server.namenode.TestEditLog
          JDK v1.7.0_95 Failed junit tests hadoop.hdfs.TestHFlush



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:cf2ee45
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12802270/HDFS-10320.06.patch
          JIRA Issue HDFS-10320
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux 70afe1921c84 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 7bd418e
          Default Java 1.7.0_95
          Multi-JDK versions /usr/lib/jvm/java-8-oracle:1.8.0_91 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_95
          findbugs v3.0.0
          unit https://builds.apache.org/job/PreCommit-HDFS-Build/15364/artifact/patchprocess/patch-unit-hadoop-common-project_hadoop-common-jdk1.8.0_91.txt
          unit https://builds.apache.org/job/PreCommit-HDFS-Build/15364/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.8.0_91.txt
          unit https://builds.apache.org/job/PreCommit-HDFS-Build/15364/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.7.0_95.txt
          unit test logs https://builds.apache.org/job/PreCommit-HDFS-Build/15364/artifact/patchprocess/patch-unit-hadoop-common-project_hadoop-common-jdk1.8.0_91.txt https://builds.apache.org/job/PreCommit-HDFS-Build/15364/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.8.0_91.txt https://builds.apache.org/job/PreCommit-HDFS-Build/15364/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.7.0_95.txt
          JDK v1.7.0_95 Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/15364/testReport/
          modules C: hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs U: .
          Console output https://builds.apache.org/job/PreCommit-HDFS-Build/15364/console
          Powered by Apache Yetus 0.2.0 http://yetus.apache.org

          This message was automatically generated.

          mingma Ming Ma added a comment -

          Xiao Chen thanks for the explanation! +1 for the latest patch.

          mingma Ming Ma added a comment -

          Xiao Chen thanks for the contribution. I have committed the patch to trunk, branch-2 and branch-2.8.

          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-trunk-Commit #9718 (See https://builds.apache.org/job/Hadoop-trunk-Commit/9718/)
          HDFS-10320. Rack failures may result in NN terminate. (Xiao Chen via mingma: rev 1268cf5fbe4458fa75ad0662512d352f9e8d3470)

          • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/AvailableSpaceBlockPlacementPolicy.java
          • hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/net/NetworkTopology.java
          • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockPlacementPolicyDefault.java
          • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/web/resources/NamenodeWebHdfsMethods.java
          • hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/net/TestNetworkTopology.java
          xiaochen Xiao Chen added a comment -

          Thank you Ming Ma for the valuable thoughts, reviews, and commit!


            People

            • Assignee:
              xiaochen Xiao Chen
              Reporter:
              xiaochen Xiao Chen
            • Votes:
              0
              Watchers:
              15

              Dates

              • Created:
                Updated:
                Resolved:
