Hadoop HDFS / HDFS-1332

When unable to place replicas, BlockPlacementPolicy should log reasons nodes were excluded

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.23.0
    • Component/s: namenode
    • Labels:
    • Hadoop Flags: Reviewed

      Description

      Whenever the block placement policy determines that a node is not a "good target" it could add the reason for exclusion to a list, and then when we log "Not able to place enough replicas" we could say why each node was refused. This would help new users who are having issues on pseudo-distributed clusters (e.g. because their data dir is on /tmp and /tmp is full). Right now it's very difficult to figure out the issue.
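The proposed mechanism can be sketched roughly as follows; the class and method names here are illustrative stand-ins, not taken from the actual patch:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the proposal: record why each candidate node was
// rejected, then flatten those (datanode, reason) pairs into the failure message.
public class PlacementReasonSketch {
  static String buildMessage(Map<String, String> nodeToReason) {
    StringBuilder msg = new StringBuilder("Not able to place enough replicas.");
    for (Map.Entry<String, String> e : nodeToReason.entrySet()) {
      msg.append("[").append(e.getKey()).append(": ").append(e.getValue()).append(" ]");
    }
    return msg.toString();
  }

  public static void main(String[] args) {
    Map<String, String> reasons = new LinkedHashMap<>();
    // Each rejected candidate records why it was excluded.
    reasons.put("127.0.0.1:50055",
        "Node /default-rack/127.0.0.1:50055 is not chosen because the node is (being) decommissioned");
    System.out.println(buildMessage(reasons));
  }
}
```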

        Issue Links

          Activity

          Harsh J added a comment -

          Also, I think we should include what the target total replica count was (the message only tells us the number it could not replicate right now).

          Ted Yu added a comment -

          throwNotEnoughReplicasExceptionWithReason() composes a string from the flattened (datanode, reason string) pairs. It returns a DatanodeDescriptor so that the following method compiles:

            private DatanodeDescriptor chooseRandom(
                                                    String nodes,
                                                    HashMap<Node, Node> excludedNodes,
                                                    long blocksize,
                                                    int maxNodesPerRack,
                                                    List<DatanodeDescriptor> results) 
          

          isGoodTarget() now returns a reason string if the datanode isn't a good target; otherwise it returns null.
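A simplified stand-in for that return-a-reason-or-null convention; the signature and the checks below are illustrative, not the actual HDFS code:

```java
// Sketch: isGoodTarget() returns a human-readable reason on rejection,
// and null when the node is acceptable. Checks shown are illustrative.
public class GoodTargetSketch {
  static String isGoodTarget(boolean beingDecommissioned, long remainingBytes,
                             long blockSize) {
    if (beingDecommissioned) {
      return "the node is (being) decommissioned";
    }
    if (remainingBytes < blockSize) {
      return "the node does not have enough space";
    }
    return null; // null means the node is a good target
  }

  public static void main(String[] args) {
    long blockSize = 64L * 1024 * 1024;
    System.out.println(isGoodTarget(true, blockSize, blockSize));      // rejected
    System.out.println(isGoodTarget(false, 2 * blockSize, blockSize)); // null: good target
  }
}
```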

          Ted Yu added a comment -

          Added a null check for nodeDescToReasonMap.

          Ted Yu added a comment -

          When I ran TestDecommission.testDecommission(), I saw the contents of nodeDescToReasonMap:

          {127.0.0.1:49443=Node /default-rack/127.0.0.1:49443 is not chosen because the node is (being) decommissioned}
          

          in:

          BlockPlacementPolicyDefault.chooseRandom(int, String, HashMap<Node,Node>, long, int, List<DatanodeDescriptor>) line: 386	
          
          Todd Lipcon added a comment -

          I think there's at least one more case worth adding "explanation" to. In testing this, I tried creating a file when there were no datanodes running:

          java.io.IOException: File /user/todd/x2 could only be replicated to 0 nodes, instead of 1

          This is another case that new users run into a lot. It would be great to change the code in FSNamesystem which throws this exception to check the total number of available nodes in the system, and if it's < minReplication, include that in the message. Perhaps something like:

          "File /foo/bar could only be replicated to 0 nodes, instead of 1, because there are 0 datanodes running."
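Todd's suggested check could look roughly like this; the method and parameter names below are hypothetical, not the actual FSNamesystem code:

```java
import java.io.IOException;

// Illustrative sketch of the suggested message: when placement falls short of
// minReplication, also report how many datanodes are actually running.
public class ReplicationMessageSketch {
  static IOException notEnoughReplicas(String src, int placed, int minReplication,
                                       int liveDatanodes) {
    String msg = "File " + src + " could only be replicated to " + placed
        + " nodes, instead of " + minReplication;
    if (liveDatanodes < minReplication) {
      // Include the datanode count so new users see the real cause.
      msg += ", because there are " + liveDatanodes + " datanodes running";
    }
    return new IOException(msg);
  }

  public static void main(String[] args) {
    System.out.println(notEnoughReplicas("/foo/bar", 0, 1, 0).getMessage());
  }
}
```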

          Also, I tried testing this code by putting a file with a very large blocksize and my DN configured on a small partition. I got the following NPE:

          java.io.IOException: java.lang.NullPointerException
          at org.apache.hadoop.hdfs.server.namenode.BlockPlacementPolicyDefault.throwNotEnoughReplicasExceptionWithReason(BlockPlacementPolicyDefault.java:324)
          at org.apache.hadoop.hdfs.server.namenode.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:367)
          at org.apache.hadoop.hdfs.server.namenode.BlockPlacementPolicyDefault.chooseLocalRack(BlockPlacementPolicyDefault.java:262)
          at org.apache.hadoop.hdfs.server.namenode.BlockPlacementPolicyDefault.chooseLocalNode(BlockPlacementPolicyDefault.java:236)
          at org.apache.hadoop.hdfs.server.namenode.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:171)
          at org.apache.hadoop.hdfs.server.namenode.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:139)
          at org.apache.hadoop.hdfs.server.namenode.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:89)
          at org.apache.hadoop.hdfs.server.namenode.BlockPlacementPolicy.chooseTarget(BlockPlacementPolicy.java:78)
          at org.apache.hadoop.hdfs.server.namenode.BlockPlacementPolicy.chooseTarget(BlockPlacementPolicy.java:234)
          at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1637)
          at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:841)

          Ted Yu added a comment -

          I added a new test in TestDecommission:

            @Test
            public void testDecommissionForReasonableExceptionMsg() throws IOException {
              startCluster(1, 2, conf);
              FileSystem fileSys = cluster.getFileSystem(0);
          
              // Decommission one node. Verify that node is decommissioned.
              DatanodeInfo decomNode = decommissionNode(0, null,
                  AdminStates.DECOMMISSIONED);
              Path file1 = new Path("testDecommission.dat");
              try {
                writeFile(fileSys, file1, 2);
              } catch (Exception nere) {
                String msg = nere.getMessage();
                assertTrue("Expected detail about not enough replicas",
                    msg.contains("is not chosen"));
              }
            }
          

          But the NotEnoughReplicasException is caught by chooseTarget().
          At least I saw the following:

          11/05/11 15:54:58 WARN namenode.FSNamesystem: Not able to place enough replicas, still in need of 1 to reach 2
          Not able to place enough replicas.[127.0.0.1:50055: Node /default-rack/127.0.0.1:50055 is not chosen because the node is (being) decommissioned ]
          
          Todd Lipcon added a comment -

          I'm referring to this code in FSNamesystem:

              DatanodeDescriptor targets[] = blockManager.replicator.chooseTarget(
                  src, replication, clientNode, excludedNodes, blockSize);
              if (targets.length < blockManager.minReplication) {
                throw new IOException("File " + src + " could only be replicated to " +
                                      targets.length + " nodes, instead of " +
                                      blockManager.minReplication);
              }
          
          Ted Yu added a comment -

          Enhanced exception message for getAdditionalBlock() as suggested by Todd.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12478909/HDFS-1332.patch
          against trunk revision 1102118.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these core unit tests:
          org.apache.hadoop.cli.TestHDFSCLI
          org.apache.hadoop.hdfs.TestDFSShell
          org.apache.hadoop.hdfs.TestDFSStorageStateRecovery
          org.apache.hadoop.hdfs.TestFileConcurrentReader
          org.apache.hadoop.tools.TestJMXGet

          +1 contrib tests. The patch passed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/489//testReport/
          Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/489//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/489//console

          This message is automatically generated.

          Todd Lipcon added a comment -

          With the new null check, it's possible that the exception will say "Detail " and then have no details included. That seems like a bad error message.

          Under what circumstances will the map be null but have failed to find any replicas?

          Ted Yu added a comment -

          When debugging testDecommissionForReasonableExceptionMsg(), I saw the following scenario.
          In BlockPlacementPolicyDefault.chooseRandom(int, String, HashMap<Node,Node>, long, int, List<DatanodeDescriptor>), numOfReplicas was 1 and numOfAvailableNodes was 0. So the while loop in chooseRandom() at line 387 wasn't executed.
          In this case nodeDescToReasonMap was null, but no datanode had even been considered as a good target.

          The current patch removed the term 'Detail' in the message for NotEnoughReplicasException.

          Ted Yu added a comment -

          For https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/489//testReport/org.apache.hadoop.hdfs/TestDFSStorageStateRecovery/testBlockPoolStorageStates/, I ran TestDFSStorageStateRecovery in Eclipse and the tests passed.

          Ted Yu added a comment -

          For https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/489//testReport/org.apache.hadoop.hdfs/TestFileConcurrentReader/testUnfinishedBlockCRCErrorNormalTransferVerySmallWrite/, I ran TestFileConcurrentReader in Eclipse and there was no timeout.

          The other three test failures happened in the current trunk build as well.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12478925/HDFS-1332.patch
          against trunk revision 1102153.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these core unit tests:
          org.apache.hadoop.cli.TestHDFSCLI
          org.apache.hadoop.hdfs.server.namenode.TestOverReplicatedBlocks
          org.apache.hadoop.hdfs.TestDFSShell
          org.apache.hadoop.hdfs.TestDFSStorageStateRecovery
          org.apache.hadoop.hdfs.TestFileConcurrentReader
          org.apache.hadoop.hdfs.TestHDFSTrash
          org.apache.hadoop.tools.TestJMXGet

          +1 contrib tests. The patch passed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/493//testReport/
          Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/493//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/493//console

          This message is automatically generated.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12478925/HDFS-1332.patch
          against trunk revision 1102153.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these core unit tests:
          org.apache.hadoop.cli.TestHDFSCLI
          org.apache.hadoop.hdfs.server.namenode.TestNodeCount
          org.apache.hadoop.hdfs.TestDFSShell
          org.apache.hadoop.hdfs.TestDFSStorageStateRecovery
          org.apache.hadoop.hdfs.TestFileConcurrentReader
          org.apache.hadoop.hdfs.TestHDFSTrash
          org.apache.hadoop.tools.TestJMXGet

          +1 contrib tests. The patch passed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/496//testReport/
          Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/496//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/496//console

          This message is automatically generated.

          Tsz Wo Nicholas Sze added a comment -
          • Why not create the reason string directly instead of first creating a HashMap?
          • "reason" does not seem like a good variable name. How about "failingReason"?
          • Have you tested your patch?
          Ted Yu added a comment -

          I created the HashMap because there could be multiple datanodes that were not good targets.

          When I tried to access https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/496/, it seemed to be stuck, so I couldn't see the exact cause of each individual test failure.
          I ran all the newly reported failed tests in Eclipse: org.apache.hadoop.hdfs.server.namenode.TestNodeCount and TestHDFSTrash, along with the TestFileConcurrentReader and TestDFSStorageStateRecovery that I mentioned yesterday.

          I have renamed the reason variable in my next patch.

          Thanks for the review.

          Ted Yu added a comment -

          Updated the name of the reason variable according to Nicholas' comment.

          Tsz Wo Nicholas Sze added a comment -

          It seems that logging the reasons is very expensive, so we should only log them when debug is enabled.

          Ted Yu added a comment -

          For TestHDFSTrash, I removed my changes in BlockPlacementPolicyDefault, recompiled, and reran the test on the command line on my MacBook.
          I got:

          Testcase: testTrashEmptier took 0.001 sec
                  Caused an ERROR
          Timeout occurred. Please note the time in the report does not reflect the time until the timeout.
          junit.framework.AssertionFailedError: Timeout occurred. Please note the time in the report does not reflect the time until the timeout.
          
          Tsz Wo Nicholas Sze added a comment -

          Hi Ted, although your patch only adds log messages, it may actually cause significant performance degradation in the namenode, since BlockPlacementPolicyDefault is invoked frequently. Creating the HashMap and the strings for logging seems too expensive to do unconditionally. So, all such work should be done only in debug mode.
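The debug-only guard being suggested might look like the sketch below. In the real code the condition would be the class's LOG.isDebugEnabled(); a plain boolean stands in for it here so the example is self-contained, and all names are illustrative:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: only allocate the reason map (and later the strings) when
// debug logging is on, so the hot path pays nothing.
public class DebugGuardSketch {
  static boolean debugEnabled = false; // stand-in for LOG.isDebugEnabled()
  static int mapsAllocated = 0;

  static Map<String, String> maybeNewReasonMap() {
    if (!debugEnabled) {
      return null; // hot path: skip the HashMap allocation entirely
    }
    mapsAllocated++;
    return new HashMap<>();
  }

  public static void main(String[] args) {
    maybeNewReasonMap();      // debug off: nothing allocated
    debugEnabled = true;
    maybeNewReasonMap();      // debug on: reason map created
    System.out.println("maps allocated: " + mapsAllocated);
  }
}
```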

          Ted Yu added a comment -

          How about adding a static boolean, blockPlacementDebug, in BlockPlacementPolicyDefault which is set to true when System.getenv("BLOCK_PLACEMENT_DEBUG") returns "true"?
          Then creating the HashMap and the strings for logging would be done only if this boolean is true.
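Sketched minimally, that proposal might look like this; only the BLOCK_PLACEMENT_DEBUG variable name comes from the comment above, and the class name is illustrative:

```java
// Sketch of the env-var flag idea: a static boolean read once at class load,
// gating all reason-tracking work.
public class EnvFlagSketch {
  static final boolean blockPlacementDebug =
      "true".equals(System.getenv("BLOCK_PLACEMENT_DEBUG"));

  public static void main(String[] args) {
    System.out.println("blockPlacementDebug=" + blockPlacementDebug);
  }
}
```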

          Todd Lipcon added a comment -

          Hey Nicholas. I thought about the performance impact as well, but I came to the conclusion that the node-selection code is not a hot code path. In my experience, the NN spends much, much more time on read operations than on block allocation. For example, one production NN whose metrics I have access to has performed 3.6M addBlock operations vs 105M FileInfoOps, 30M GetListing ops, and 27M GetBlockLocations ops.

          Additionally, the new code will only get run for nodes which are decommissioning, out of space, or highly loaded. Thus it's not likely that it will add any appreciable overhead to most chooseTarget operations.

          Looking at the existing code, it's hardly optimized at all. For example, each invocation of chooseRandom() invokes countNumOfAvailableNodes which takes and releases locks, computes String substrings, etc.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12478995/HDFS-1332.patch
          against trunk revision 1102153.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these core unit tests:
          org.apache.hadoop.cli.TestHDFSCLI
          org.apache.hadoop.hdfs.server.namenode.metrics.TestNameNodeMetrics
          org.apache.hadoop.hdfs.TestDFSShell
          org.apache.hadoop.hdfs.TestDFSStorageStateRecovery
          org.apache.hadoop.hdfs.TestFileConcurrentReader
          org.apache.hadoop.tools.TestJMXGet

          +1 contrib tests. The patch passed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/504//testReport/
          Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/504//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/504//console

          This message is automatically generated.

          Tsz Wo Nicholas Sze added a comment -

          How about adding BlockPlacementPolicyDefault.LOG and use it to print the messages when debug is enabled?

          Todd Lipcon added a comment -

          I don't think restricting nice error messages to the case when the NN is in debug mode is a good idea. We should endeavor to always have error messages that provide enough information to the user to understand and rectify the problem. New users are unlikely to know the tricks to switch over to debug mode using the cryptic "daemonlog" interface, and new users are the ones who need nice errors the most.

          Tsz Wo Nicholas Sze added a comment -

          Hi Todd, it is questionable whether this is a nice error message. Too many error messages confuse users.

          Replication also uses BlockPlacementPolicy. Have you counted it? Also, your example is just one example. It may not be a good representative.

          Moreover, the performance degradation is twofold:

          1. it takes time to create the messages/objects, and
          2. it creates additional objects for GC.
          Ted Yu added a comment -

          Todd suggested using the NNThroughputBenchmark to show that the patch doesn't cause significant degradation.
          How does that sound?

          Tsz Wo Nicholas Sze added a comment -

          Ted, please correct me if I am wrong: NNThroughputBenchmark does not count replication and GC.

          > How about adding a static boolean, blockPlacementDebug, ...

          I actually like your idea. In Hadoop, we use log levels for this purpose. That's why I suggested adding BlockPlacementPolicyDefault.LOG (it is probably better to add the log in BlockPlacementPolicy.LOG). How does it sound to you?

          Ted Yu added a comment -

          In this method:

            private void chooseRandom(int numOfReplicas,
                                      String nodes,
                                      HashMap<Node, Node> excludedNodes,
                                      long blocksize,
                                      int maxNodesPerRack,
                                      List<DatanodeDescriptor> results)
          

          it is possible that numOfAvailableNodes is greater than numOfReplicas. Meaning, even if we see a bad target, NotEnoughReplicasException may not be thrown at the end of the method.
          Using LOG directly, without collecting the reasons in a map first, would raise unnecessary alarms.

          Todd Lipcon added a comment -

          I think there's some confusion what Nicholas's suggestion is. Is it (a) that we use the log level to decide whether to add detail to the exception message? or (b) that we don't change the exception message at all, but rather add debug level messages for each of the cases where a node is ignored?

          Tsz Wo Nicholas Sze added a comment -

          From the summary and description, I think this is about log messages. So my suggestion is

          (c) If BlockPlacementPolicy.LOG.isDebugEnabled(), construct a log message and then print it before throwing the exception. Otherwise, no HashMap nor strings should be created.
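          A minimal sketch of option (c), under the assumption that the detail string is built only when debug logging is on; the class, method names, and the debugEnabled flag (a stand-in for BlockPlacementPolicy.LOG.isDebugEnabled()) are illustrative, not the committed patch:

          ```java
          // Option (c) sketch: the per-node detail is appended to the message only when
          // debug logging is enabled, so the common path allocates no HashMap and no
          // extra strings. debugEnabled stands in for LOG.isDebugEnabled().
          public class GuardedDetail {
              static boolean debugEnabled = false;

              static String notEnoughReplicasMessage(int stillNeeded, int expected, String nodeDetail) {
                  StringBuilder msg = new StringBuilder("Not able to place enough replicas, still in need of ")
                          .append(stillNeeded).append(" to reach ").append(expected);
                  if (debugEnabled && nodeDetail != null) {
                      // per-node exclusion reasons appear only in debug mode
                      msg.append('\n').append(nodeDetail);
                  }
                  return msg.toString();
              }
          }
          ```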

          Todd Lipcon added a comment -

          Hey Nicholas. How do you feel about the following compromise:

          • For the simple case that there are no datanodes in the cluster, we include some additional detail in the exception message indicating as much. This will help the common case of a new user whose datanodes failed to start and is confused why he can't write blocks. This should be in the IOException itself so that it propagates to the client.
          • if debug is enabled, we construct the HashMap as above, and log the "failure to allocate block" type messages at WARN level
          • if debug is not enabled, then we log a message that says something like "failure to allocate block ... For more information, please enable DEBUG level logging on the o.a.h.BlockPlacementPolicyDefault logger."

          This should avoid any performance impact, but also point users down the right path to solving the issues.

          Tsz Wo Nicholas Sze added a comment -

          Todd, it sounds good to me.

          Some optional optimizations:

          • I believe the use of HashMap can be eliminated since whenever we put an entry to the map, we may simply append it to the string builder.
          • Also, we may make the string builder ThreadLocal to minimize object creation.
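          The ThreadLocal idea above can be sketched as follows; each handler thread reuses one builder instead of allocating per call. The class and method names here are illustrative assumptions, not code from the patch:

          ```java
          // Sketch of a ThreadLocal<StringBuilder>: one builder per NameNode handler
          // thread, reused across chooseTarget calls to minimize object creation.
          public class ReasonBuffer {
              private static final ThreadLocal<StringBuilder> BUILDER =
                      ThreadLocal.withInitial(() -> new StringBuilder(256));

              static void append(String node, String reason) {
                  BUILDER.get().append(node).append(": ").append(reason).append(' ');
              }

              static String drain() {
                  StringBuilder b = BUILDER.get();
                  String s = b.toString();
                  b.setLength(0);   // clear for reuse; the underlying buffer is kept
                  return s;
              }
          }
          ```

          As Ted notes below, the trade-off is that every thread that ever enters the code path keeps its builder alive; with a bounded IPC handler pool this stays small.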
          Todd Lipcon added a comment -

          Good ideas, both. Ted, want to update the patch?

          Ted Yu added a comment -

          From http://sbdevel.wordpress.com/2009/03/12/threadlocal-stringbuilders-for-fast-text-processing/, Caveat Emptor:

          So what’s the catch? You will be keeping one string builder around for each new thread that ever enters processRecord(). This could potentially end up as lots of string builders ...

          Todd Lipcon added a comment -

          Just a sample of why this JIRA is important, from #hadoop IRC a minute ago:

          15:23 < user> Hi all. I have a strange problem - I can't seem to run a simple copyFromLocal command from 
                                my local machine, but everything works fine if i log into the master as the same user and 
                                issue command
          15:25 < user> the error it throws is cryptic - java.io.IOException: File /foo/bar could only be 
                                replicated to 0 nodes, instead of 1
          15:26 < user> i checked the namenode and tasktracker logs - nothing of interest there (except the error i 
                                mentioned, of course)
          
          Todd Lipcon added a comment -

          So what’s the catch? You will be keeping one string builder around for each new thread that ever enters processRecord(). This could potentially end up as lots of string builders ...

          In this case, there is a bounded set of IPC handler threads in the NameNode. So each of those will keep around a StringBuilder, which might have a 1KB buffer. So total memory usage might be a few hundred KB.

          Ted Yu added a comment -

          Updated patch that removes the HashMap and subjects additional logging to LOG.isDebugEnabled()

          Ted Yu added a comment -

          Clears the StringBuilder after we have used its contents.

          Ted Yu added a comment -

          This version is concise in that chooseRandom() attaches details for bad datanodes to NotEnoughReplicasException only if LOG.isDebugEnabled() returns true.

          This reduces the amount of logging.

          Todd Lipcon added a comment -

          Hi Ted - I think Nicholas's concern here is about the performance impact of constructing the reason strings. This patch still constructs those strings even if they won't be used, right?

          Hairong Kuang added a comment -

          Constructing a HashMap to keep track of error messages on a critical path of the NameNode seems like overkill to me. Imagine running this on a 5000-node cluster: what's the GC and CPU overhead on the NN?

          How about adding this to the error message: "The cluster is most likely running out of space." I think this covers 99% of the cases from my experience.

          Ted Yu added a comment -

          This version moves construction of the failingReason String into the if block of FSNamesystem.LOG.isDebugEnabled().
          It also removes the newly introduced but unused logger.
          Now additional logging hinges on FSNamesystem.LOG.isDebugEnabled().

          Ted Yu added a comment -

          @Hairong:
          I understand BlockPlacementPolicyDefault is on the critical path, so no new HashMap is constructed in either of the two current versions in this ticket.
          All additional strings depend on FSNamesystem.LOG.isDebugEnabled() returning true.

          Thanks

          Tsz Wo Nicholas Sze added a comment -

          We should not change isGoodTarget(..) to return a string. Similar to the HashMap case, whenever the string returned by isGoodTarget(..) is not null, we may simply append it to the StringBuilder.

          Tsz Wo Nicholas Sze added a comment -

          > ... whenever the string returned by isGoodTarget(..) is not null, we may simply append it to the StringBuilder.

          Let me clarify it: whenever the string returned by isGoodTarget(..) is not null in the current patch, we may simply return false and then append the string to the StringBuilder if debug is enabled.
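          A sketch of that clarified suggestion, keeping isGoodTarget's boolean return type and appending the reason only when debug is enabled; the names, the single rejection case, and the debugEnabled flag are illustrative assumptions, not the patch itself:

          ```java
          // isGoodTarget keeps returning boolean (no API change); the rejection
          // reason is appended to a builder only when debug logging is enabled.
          // debugEnabled stands in for FSNamesystem.LOG.isDebugEnabled().
          public class TargetCheck {
              static boolean debugEnabled = true;
              static final StringBuilder reasons = new StringBuilder();

              static boolean isGoodTarget(String node, boolean decommissioned) {
                  if (decommissioned) {
                      if (debugEnabled) {
                          reasons.append(node)
                                 .append(": node is (being) decommissioned ");
                      }
                      return false;   // reject without changing the return type
                  }
                  return true;
              }
          }
          ```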

          Ted Yu added a comment -

          Updated patch according to Nicholas' comments.

          Tsz Wo Nicholas Sze added a comment -

          Thanks Ted, the new patch is much better. Some comments.

          • The following should be removed.
            +    @Override
            +    public StringBuilder get() {
            +      StringBuilder b = super.get();
            +      return b;
            +    }
            
          • FSNamesystem.LOG.isDebugEnabled() is a dynamic property. So the following does not work.
            +  static boolean isDebugEnabled = FSNamesystem.LOG.isDebugEnabled();
            
          • How about changing "\n" + e.getMessage() to "; " + e?
            +               + numOfReplicas + " to reach " + totalReplicasExpected + "\n"
            +               + e.getMessage());
            
          • The first two lines below should be moved inside the if-statement.
            +      String failingReason = "Node "+NodeBase.getPath(node)+
            +            " is not chosen because the rack has too many chosen nodes";
                   if(FSNamesystem.LOG.isDebugEnabled()) {
            -        FSNamesystem.LOG.debug("Node "+NodeBase.getPath(node)+
            -            " is not chosen because the rack has too many chosen nodes");
            +        FSNamesystem.LOG.debug(failingReason);
            +        threadLocalBuilder.get().append(node.toString()).append(": ")
            +          .append(failingReason).append(" ");
                   }
            
          • I think it is hard to create new tests for this. Could you test it manually and post some sample log messages?
          Ted Yu added a comment -

          Updated again according to code review.
          I kept "\n" in the catch block of chooseTarget() because the exception message that follows could be very long. However, I don't have a strong opinion on this delimiter.

          Tsz Wo Nicholas Sze added a comment -

          +1 patch looks good. Please test it manually and post some sample log messages.

          Ted Yu added a comment -

          Running TestDecommission, I found this in TEST-org.apache.hadoop.hdfs.TestDecommission.txt when DEBUG logging is enabled for FSNamesystem:

          2011-05-16 16:37:04,230 WARN  namenode.FSNamesystem (BlockPlacementPolicyDefault.java:chooseTarget(212)) - Not able to place enough replicas, still in need of 1 to reach 2
          Not able to place enough replicas.[127.0.0.1:49864: Node /default-rack/127.0.0.1:49864 is not chosen because the node is (being) decommissioned ]
          
          Tsz Wo Nicholas Sze added a comment -
          • Just found one problem: In the two chooseRandom(..) methods (line 333-363 and line 378-422), if the first FSNamesystem.LOG.isDebugEnabled() returns false and the second FSNamesystem.LOG.isDebugEnabled() returns true, we will get a NullPointerException since builder is null. We should check for null in the second if-statement.
          • There are repeated "Not able to place enough replicas" strings in the log message. I think we might use new NotEnoughReplicasException(detail) and then "\n" + e instead of "\n" + e.getMessage() in LOG.warn(..). Then, the log will become something like
            2011-05-16 16:37:04,230 WARN  namenode.FSNamesystem (BlockPlacementPolicyDefault.java:chooseTarget(212)) - Not able to place enough replicas, still in need of 1 to reach 2
            NotEnoughReplicasException: [127.0.0.1:49864: Node /default-rack/127.0.0.1:49864 is not chosen because the node is (being) decommissioned ]
            
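          The null check described above can be sketched as follows; the helper name is hypothetical, and the point is only that the second use of the builder must tolerate null when the log level changed between the two isDebugEnabled() checks:

          ```java
          // Null-safe use of the detail builder: it may be null if debug logging
          // was disabled when chooseRandom(..) first checked isDebugEnabled().
          public class NullSafeDetail {
              static String detailOrEmpty(StringBuilder builder) {
                  return (builder != null) ? builder.toString() : "";
              }
          }
          ```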
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12479403/HDFS-1332-concise.patch
          against trunk revision 1103920.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these core unit tests:
          org.apache.hadoop.cli.TestHDFSCLI
          org.apache.hadoop.hdfs.TestDFSStorageStateRecovery
          org.apache.hadoop.hdfs.TestFileConcurrentReader
          org.apache.hadoop.tools.TestJMXGet

          +1 contrib tests. The patch passed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/529//testReport/
          Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/529//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/529//console

          This message is automatically generated.

          Ted Yu added a comment -

          Using "\n" + e would show the following:

          2011-05-16 19:37:09,605 WARN  namenode.FSNamesystem (BlockPlacementPolicyDefault.java:chooseTarget(212)) - Not able to place enough replicas, still in need of 1 to reach 2
          org.apache.hadoop.hdfs.server.namenode.BlockPlacementPolicy$NotEnoughReplicasException:
          
          Ted Yu added a comment -

          Here is the log for the latest patch:

          2011-05-16 19:42:29,292 WARN  namenode.FSNamesystem (BlockPlacementPolicyDefault.java:chooseTarget(212)) - Not able to place enough replicas, still in need of 1 to reach 2
          [127.0.0.1:50989: Node /default-rack/127.0.0.1:50989 is not chosen because the node is (being) decommissioned ]
          
          Ted Yu added a comment -

          Modified according to Nicholas' comments.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12479425/HDFS-1332-concise.patch
          against trunk revision 1103987.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these core unit tests:
          org.apache.hadoop.cli.TestHDFSCLI
          org.apache.hadoop.hdfs.server.datanode.TestBlockRecovery
          org.apache.hadoop.hdfs.TestDFSStorageStateRecovery
          org.apache.hadoop.hdfs.TestFileConcurrentReader
          org.apache.hadoop.hdfs.TestHDFSTrash
          org.apache.hadoop.tools.TestJMXGet

          +1 contrib tests. The patch passed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/538//testReport/
          Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/538//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/538//console

          This message is automatically generated.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12479425/HDFS-1332-concise.patch
          against trunk revision 1103987.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these core unit tests:
          org.apache.hadoop.hdfs.TestDatanodeReport
          org.apache.hadoop.hdfs.TestDFSStorageStateRecovery
          org.apache.hadoop.hdfs.TestFileConcurrentReader
          org.apache.hadoop.hdfs.TestHDFSTrash
          org.apache.hadoop.tools.TestJMXGet

          +1 contrib tests. The patch passed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/540//testReport/
          Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/540//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/540//console

          This message is automatically generated.

          Tsz Wo Nicholas Sze added a comment -

          Hi Ted, could you also show the other log messages you changed/added?

          Ted Yu added a comment -

          Here is the updated log from TestFileCreation for getAdditionalBlock():

          2011-05-17 13:51:20,203 INFO  ipc.Server (Server.java:run(1434)) - IPC Server handler 2 on 61961, call addBlock(/filestatus.dat, DFSClient_NONMAPREDUCE_-1833786488_1, null, [Lorg.apache.hadoop.hdfs.protocol.DatanodeInfo;@59727745), rpc version=1, client version=67, methodsFingerPrint=-1645111634 from 127.0.0.1:61967: error: java.io.IOException: File /filestatus.dat could only be replicated to 0 nodes, instead of 1, because there are 1 datanodes running.
          
          Ted Yu added a comment -

          Here is the log if DEBUG wasn't enabled:

          2011-05-17 13:58:21,942 WARN  namenode.FSNamesystem (BlockPlacementPolicyDefault.java:chooseTarget(212)) - Not able to place enough replicas, still in need of 1 to reach 2
          For more information, please enable DEBUG level logging on the org.apache.hadoop.hdfs.server.namenode.FSNamesystem logger.
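
The hint in that message can be acted on with a single log4j configuration line (a config sketch — the logger name is taken verbatim from the log output above; your log4j.properties path depends on your installation):

```properties
# Enable per-node exclusion reasons in the NameNode log
log4j.logger.org.apache.hadoop.hdfs.server.namenode.FSNamesystem=DEBUG
```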
          
          Todd Lipcon added a comment -

          java.io.IOException: File /filestatus.dat could only be replicated to 0 nodes, instead of 1, because there are 1 datanodes running

          hmm, that makes no sense... why does this happen?

          Tsz Wo Nicholas Sze added a comment -

          Thanks Ted,

          • This part looks mysterious. Since the entire theme of this issue is to support new users, they may ask "Why can it only be replicated to 0 nodes when there is 1 datanode?"
            ... File /filestatus.dat could only be replicated to 0 nodes, instead of 1, because there are 1 datanodes running.
            
          • We only need to append the failingReason to the string builder. The debug log call should be removed; otherwise we will have repeated messages.
            +          FSNamesystem.LOG.debug(failingReason);
            
          Ted Yu added a comment -

          Here is the log snippet leading to the IOException:

          2011-05-17 13:51:20,193 DEBUG hdfs.DFSClient (DFSOutputStream.java:run(452)) - Allocating new block
          2011-05-17 13:51:20,197 INFO  hdfs.StateChange (FSNamesystem.java:allocateBlock(1867)) - BLOCK* NameSystem.allocateBlock: /filestatus.dat. BP-498881698-10.32.41.28-1305665466806 blk_7545771604732668451_1001{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[127.0.0.1:61963|RBW]]}
          2011-05-17 13:51:20,197 DEBUG hdfs.DFSClient (DFSOutputStream.java:createBlockOutputStream(997)) - pipeline = 127.0.0.1:61963
          2011-05-17 13:51:20,198 DEBUG hdfs.DFSClient (DFSOutputStream.java:createSocketForPipeline(1152)) - Connecting to datanode 127.0.0.1:61963
          2011-05-17 13:51:20,198 INFO  hdfs.DFSClient (DFSOutputStream.java:createBlockOutputStream(1044)) - Exception in createBlockOutputStream java.net.ConnectException: Connection refused
          2011-05-17 13:51:20,198 INFO  hdfs.DFSClient (DFSOutputStream.java:nextBlockOutputStream(974)) - Abandoning block BP-498881698-10.32.41.28-1305665466806:blk_7545771604732668451_1001
          2011-05-17 13:51:20,200 INFO  hdfs.DFSClient (DFSOutputStream.java:nextBlockOutputStream(977)) - Excluding datanode 127.0.0.1:61963
          2011-05-17 13:51:20,202 WARN  namenode.FSNamesystem (BlockPlacementPolicyDefault.java:chooseTarget(212)) - Not able to place enough replicas, still in need of 1 to reach 1
          
          2011-05-17 13:51:20,203 INFO  ipc.Server (Server.java:run(1434)) - IPC Server handler 2 on 61961, call addBlock(/filestatus.dat, DFSClient_NONMAPREDUCE_-1833786488_1, null, [Lorg.apache.hadoop.hdfs.protocol.DatanodeInfo;@59727745), rpc version=1, client version=67, methodsFingerPrint=-1645111634 from 127.0.0.1:61967: error: java.io.IOException: File /filestatus.dat could 
          only be replicated to 0 nodes, instead of 1, because there are 1 datanodes running.
          
          Tsz Wo Nicholas Sze added a comment -

          > hmm, that makes no sense... why does this happen?

          It is because there are excluded datanodes.

          Ted Yu added a comment -

          Remove duplicate debug log

          Tsz Wo Nicholas Sze added a comment -
          • For the mysterious message,
                   throw new IOException("File " + src + " could only be replicated to " +
                                         targets.length + " nodes, instead of " +
            -                            blockManager.minReplication);
            +                            blockManager.minReplication + ", because there are "
            +                            + clusterMap.getNumOfLeaves() + " datanodes running.");
            

            How about changing it to the following?

            +                            blockManager.minReplication + ".  There are "
            +                            + clusterMap.getNumOfLeaves() + " datanode(s) running but "
            +                            + excludedNodes.size() + " node(s) are excluded in this operation.");
            
          • For the log: since failingReason is only used once, we should eliminate it; otherwise there is unnecessary object creation.
            -          FSNamesystem.LOG.debug("Node "+NodeBase.getPath(node)+
            -              " is not chosen because the node is too busy");
            +          String failingReason = "Node "+NodeBase.getPath(node)+
            +              " is not chosen because the node is too busy";
            +          threadLocalBuilder.get().append(node.toString()).append(": ")
            +            .append(failingReason).append(" ");
            

            BTW, we don't need the explicit toString() call; append(node) converts the node to a string for us.
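
The two points above can be combined into a small standalone sketch (hypothetical class and method names — the real patch does this inside BlockPlacementPolicyDefault's isGoodTarget() and chooseRandom(), and the reason strings below are copied from the logs in this thread):

```java
import java.util.List;
import java.util.Set;

public class ReplicaPlacementSketch {
    // One builder per handler thread, mirroring the threadLocalBuilder in the patch.
    private static final ThreadLocal<StringBuilder> threadLocalBuilder =
        ThreadLocal.withInitial(StringBuilder::new);

    // Stands in for isGoodTarget(): null means the node is usable, otherwise
    // a human-readable reason it was excluded.
    static String checkTarget(String node, Set<String> busy, Set<String> decommissioned) {
        if (busy.contains(node))
            return "Node /default-rack/" + node + " is not chosen because the node is too busy";
        if (decommissioned.contains(node))
            return "Node /default-rack/" + node + " is not chosen because the node is (being) decommissioned";
        return null;
    }

    // Try each candidate; if none is usable, return the flattened
    // (node, reason) pairs in the bracketed form shown in the logs above.
    static String chooseOrExplain(List<String> nodes, Set<String> busy, Set<String> decommissioned) {
        StringBuilder sb = threadLocalBuilder.get();
        sb.setLength(0);
        for (String node : nodes) {
            String reason = checkTarget(node, busy, decommissioned);
            if (reason == null) return node;  // found a good target
            sb.append(node).append(": ").append(reason).append(" ");
        }
        return "[" + sb.toString().trim() + "]";
    }
}
```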

          Ted Yu added a comment -

          This time the log is clearer:

          2011-05-17 16:20:28,920 INFO  ipc.Server (Server.java:run(1434)) - IPC Server handler 2 on 64506, call addBlock(/filestatus.dat, DFSClient_NONMAPREDUCE_-2042930756_1, null, [Lorg.apache.hadoop.hdfs.protocol.DatanodeInfo;@27cc7f4b), rpc version=1, client version=67, methodsFingerPrint=-1645111634 from 127.0.0.1:64512: error: java.io.IOException: File /filestatus.dat could only be replicated to 0 nodes, instead of 1. There are 1 datanode(s) running but 1 node(s) are excluded in this operation.
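
For reference, that agreed wording can be reproduced with a trivial helper (a sketch — ReplicationErrorMessage and build are hypothetical names; the real code constructs the string inline when throwing the IOException):

```java
public class ReplicationErrorMessage {
    // Mirrors the message format agreed on in this thread: report both how
    // many datanodes are running and how many were excluded from the operation.
    static String build(String src, int placed, int minReplication,
                        int datanodesRunning, int excluded) {
        return "File " + src + " could only be replicated to " + placed
            + " nodes, instead of " + minReplication + ". There are "
            + datanodesRunning + " datanode(s) running but " + excluded
            + " node(s) are excluded in this operation.";
    }
}
```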
          
          Ted Yu added a comment -

          Removed failingReason and folded it into StringBuilder operation.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12479534/HDFS-1332-concise.patch
          against trunk revision 1104579.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these core unit tests:
          org.apache.hadoop.hdfs.TestDFSStorageStateRecovery
          org.apache.hadoop.hdfs.TestFileConcurrentReader
          org.apache.hadoop.tools.TestJMXGet

          +1 contrib tests. The patch passed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/551//testReport/
          Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/551//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/551//console

          This message is automatically generated.

          Tsz Wo Nicholas Sze added a comment -

          +1 patch looks good.

          I have a feeling that this is misclassified as a "newbie" issue. However, Ted, you did a great job! Thanks.

          Tsz Wo Nicholas Sze added a comment -

          I have committed this. Thanks, Ted!

          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk-Commit #665 (See https://builds.apache.org/hudson/job/Hadoop-Hdfs-trunk-Commit/665/)
          HDFS-1332. Include more information in exceptions and debug messages when BlockPlacementPolicy cannot be satisfied. Contributed by Ted Yu

          szetszwo : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1104649
          Files :

          • /hadoop/hdfs/trunk/src/java/org/apache/hadoop/hdfs/server/namenode/BlockPlacementPolicyDefault.java
          • /hadoop/hdfs/trunk/src/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java
          • /hadoop/hdfs/trunk/CHANGES.txt
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk #673 (See https://builds.apache.org/hudson/job/Hadoop-Hdfs-trunk/673/)

          Tsz Wo Nicholas Sze added a comment -

          It turned out that this introduced a NullPointerException since excludedNodes can possibly be null; see HDFS-2245.
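
A minimal sketch of the shape of that follow-up fix (hypothetical class and method names — see HDFS-2245 for the actual change): the excludedNodes map may be null, so size() must not be called on it unconditionally when building the exception message.

```java
import java.util.HashMap;

public class ExcludedNodesGuard {
    // Null-safe count of excluded nodes for use in the exception message.
    static int excludedCount(HashMap<?, ?> excludedNodes) {
        return excludedNodes == null ? 0 : excludedNodes.size();
    }
}
```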


            People

            • Assignee: Ted Yu
            • Reporter: Todd Lipcon
            • Votes: 0
            • Watchers: 5