Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.0.4, 1.1.1, 2.0.2-alpha
    • Fix Version/s: 2.1.0-beta, 1.2.1
    • Component/s: balancer
    • Labels:
      None

      Description

      When I ran TestBalancerWithNodeGroup manually, it always timed out on my machine. In the Jenkins report for build #3573, TestBalancerWithNodeGroup was somehow skipped, so the problem was not detected.

      1. test-balancer-with-node-group-timeout.txt
        58 kB
        Aaron T. Myers
      2. org.apache.hadoop.hdfs.server.balancer.TestBalancerWithNodeGroup-output.txt.win
        232 kB
        Chris Nauroth
      3. org.apache.hadoop.hdfs.server.balancer.TestBalancerWithNodeGroup-output.txt.mac
        256 kB
        Chris Nauroth
      4. jstack-win-5488
        89 kB
        Chris Nauroth
      5. jstack-mac-18567
        88 kB
        Chris Nauroth
      6. HDFS-4261-v8.patch
        11 kB
        Junping Du
      7. HDFS-4261-v7.patch
        10 kB
        Junping Du
      8. HDFS-4261-v6.patch
        9 kB
        Junping Du
      9. HDFS-4261-v5.patch
        8 kB
        Junping Du
      10. HDFS-4261-v4.patch
        8 kB
        Junping Du
      11. HDFS-4261-v3.patch
        7 kB
        Junping Du
      12. HDFS-4261-v2.patch
        6 kB
        Junping Du
      13. HDFS-4261-branch-2.patch
        11 kB
        Junping Du
      14. HDFS-4261-branch-1-v2.patch
        4 kB
        Junping Du
      15. HDFS-4261-branch-1.patch
        4 kB
        Junping Du
      16. HDFS-4261.patch
        5 kB
        Junping Du

        Issue Links

          Activity

          Aaron T. Myers added a comment -

          Oof, we should really fix test-patch to notice this sort of failure. This isn't the first time that we've missed a test failure in this way. Nicholas, do you know if such a JIRA already exists? If not, I'll go ahead and file one.

          Junping Du added a comment -

          Uploaded a patch with a test case that reproduces the timeout, along with the fix.

          Tsz Wo Nicholas Sze added a comment -

          The patch does work. Some minor comments:

          • needs to import org.junit.Assert.assertTrue;
          • NO_CHANGE_ITERATION_MAP should be non-static. After removing "static", please rename it to noChangeIterationMap.
          Junping Du added a comment -

          Thanks, Nicholas, for the review. I addressed your comment on assertTrue in the v2 patch. However, the non-static change won't work, because the caller creates a new Balancer for every iteration. Thoughts?

          Junping Du added a comment -

          This is a Balancer bug that causes an infinite loop (and hence the timeout) when there are no suitable candidates for balancing but the cluster is still not balanced.

          Tsz Wo Nicholas Sze added a comment -

          You are right. It cannot be changed to non-static. How about moving notChangedIterations to NameNodeConnector? If we do it, we may as well move the related code.

          //NameNodeConnector
            static final int MAX_NOT_CHANGED_ITERATIONS = 5;
            private int notChangedIterations = 0;
          
            boolean shouldContinue(long dispatchBlockMoveBytes) {
              if (dispatchBlockMoveBytes > 0) {
                notChangedIterations = 0;
              } else {
                notChangedIterations++;
                if (notChangedIterations >= MAX_NOT_CHANGED_ITERATIONS) {
                  System.out.println("No block has been moved for "
                      + notChangedIterations + " iterations. Exiting...");
                  return false;
                }
              }
              return true;
            }
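
          To illustrate why the counter has to live in an object that outlives each iteration, here is a minimal, self-contained sketch. It is not the patch itself: Connector, NoProgressDemo and runUntilStopped are hypothetical names, and the per-iteration byte counts are made-up inputs. The only piece taken from the suggestion above is the counting rule (reset on progress, stop after 5 iterations without progress).

```java
// Sketch only. Connector stands in for the long-lived NameNodeConnector;
// all names here are hypothetical and none of this is Hadoop code.
class Connector {
  static final int MAX_NOT_CHANGED_ITERATIONS = 5;
  private int notChangedIterations = 0;

  /** Returns false once no bytes have moved for MAX_NOT_CHANGED_ITERATIONS iterations in a row. */
  boolean shouldContinue(long bytesMoved) {
    if (bytesMoved > 0) {
      notChangedIterations = 0;   // progress was made, so reset the counter
      return true;
    }
    notChangedIterations++;
    return notChangedIterations < MAX_NOT_CHANGED_ITERATIONS;
  }
}

class NoProgressDemo {
  /** Feeds one byte count per iteration to the connector; returns how many iterations ran. */
  static int runUntilStopped(Connector connector, long[] bytesMovedPerIteration) {
    int iterations = 0;
    for (long bytesMoved : bytesMovedPerIteration) {
      iterations++;
      // A new per-iteration worker object would be created here; only the
      // connector persists, so only it can remember earlier idle iterations.
      if (!connector.shouldContinue(bytesMoved)) {
        break;
      }
    }
    return iterations;
  }

  public static void main(String[] args) {
    long[] moved = {1024, 0, 0, 0, 0, 0, 0, 0};
    // One productive iteration, then five idle ones before the loop stops.
    System.out.println(runUntilStopped(new Connector(), moved)); // prints 6
  }
}
```

          If the counter were an instance field of an object that is recreated on every iteration, it would never exceed 1 and the loop would never stop, which matches the objection to the non-static field above.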
          
          Junping Du added a comment -

          Yes, that looks like a better idea. I have addressed it in the v3 patch. I hope the pre-commit test is working now (it seemed to be down yesterday).
          BTW, do you think we should replace "System.out" with LOG.warn() here (and in some other places in the code base) so that users won't miss important messages in their logs? If so, I will file a separate JIRA to track this.

          Tsz Wo Nicholas Sze added a comment -

          +1 patch looks good.

          Balancer is a shell command, so it uses System.out/err directly. It also has a LOG so that the log level can be changed. We should think carefully about when to use System.out/err and when to use LOG for each message. Let's not change the messages here. We can change them in another JIRA if that is desirable.

          BTW, the test case is excellent!

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12556191/HDFS-4261-v3.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs:

          org.apache.hadoop.hdfs.server.balancer.TestBalancerWithNodeGroup
          org.apache.hadoop.hdfs.server.namenode.TestEditLog

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/3603//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3603//console

          This message is automatically generated.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12556191/HDFS-4261-v3.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs:

          org.apache.hadoop.hdfs.server.namenode.TestEditLog
          org.apache.hadoop.hdfs.server.balancer.TestBalancerWithNodeGroup

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/3608//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3608//console

          This message is automatically generated.

          Aaron T. Myers added a comment -

          Hey Junping, though not changed by the current patch, can you comment on whether or not this change to the balancer was correct?

          -    if (BlockPlacementPolicy.getInstance(conf, null, null).getClass() != 
          -        BlockPlacementPolicyDefault.class) {
          -      throw new UnsupportedActionException("Balancer without BlockPlacementPolicyDefault");
          +    if (BlockPlacementPolicy.getInstance(conf, null, null) instanceof 
          +        BlockPlacementPolicyDefault) {
          +      throw new UnsupportedActionException(
          +          "Balancer without BlockPlacementPolicyDefault");
          

          Whereas before the balancer would throw an error if the BlockPlacementPolicy was not the default, the balancer will now throw an error if the BlockPlacementPolicy is the default. That doesn't seem quite right to me, but perhaps I'm missing something.

          I realize this isn't related to this patch per se, but it was committed recently as a part of HDFS-3495, so perhaps we could address it in this JIRA?

          Junping Du added a comment -

          Hi, ATM. You are right, this is a bug introduced in HDFS-3495. The intent of that change was to be compatible with any subclass of BlockPlacementPolicyDefault (such as BlockPlacementPolicyDefaultWithNodeGroup). Yes, we can address it in this patch as well. I will put together a new patch soon. Thanks!
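
          The difference between the three possible checks can be sketched with stand-in classes. Everything below is hypothetical (Policy, DefaultPolicy, NodeGroupPolicy, CustomPolicy and the method names are illustrative, not Hadoop classes), and the negated instanceof form is only one presumed way to get the intended behavior; the thread does not show how the patch actually fixes it.

```java
// Hypothetical stand-ins for the placement policy hierarchy; not Hadoop code.
class Policy {}
class DefaultPolicy extends Policy {}
class NodeGroupPolicy extends DefaultPolicy {}  // a subclass of the default policy
class CustomPolicy extends Policy {}            // a policy outside the default hierarchy

class PolicyCheckDemo {
  // Original check: rejects everything except the exact default class,
  // so subclasses of the default are rejected too.
  static boolean rejectsByExactClass(Policy p) {
    return p.getClass() != DefaultPolicy.class;
  }

  // Check as quoted in the diff: rejects the default policy and its subclasses.
  static boolean rejectsByInstanceof(Policy p) {
    return p instanceof DefaultPolicy;
  }

  // Negated form (an assumption about the intended behavior): rejects only
  // policies that are not the default or one of its subclasses.
  static boolean rejectsByNegatedInstanceof(Policy p) {
    return !(p instanceof DefaultPolicy);
  }

  public static void main(String[] args) {
    Policy[] policies = {new DefaultPolicy(), new NodeGroupPolicy(), new CustomPolicy()};
    for (Policy p : policies) {
      System.out.println(p.getClass().getSimpleName()
          + " exactClass=" + rejectsByExactClass(p)
          + " instanceof=" + rejectsByInstanceof(p)
          + " negatedInstanceof=" + rejectsByNegatedInstanceof(p));
    }
  }
}
```

          Only the negated form accepts both the default policy and its subclass while still rejecting the unrelated policy.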

          Aaron T. Myers added a comment -

          Thanks, Junping. Sounds good.

          Junping Du added a comment -

          Addressed ATM's comment by fixing the bug introduced by the earlier JIRA patch.

          Aaron T. Myers added a comment -

          Thanks, Junping. The way you addressed my comment looks good to me. I'll defer to Nicholas regarding the rest of the patch.

          Junping Du added a comment -

          Thanks ATM for the important reminder!

          Chris Nauroth added a comment -

          Thanks for working on a fix for this. This will partially fix the Windows test failures mentioned in HDFS-4275.

          The problem is even worse on Windows, because a timed-out test leaves a MiniDFSCluster running with file handles open on the test data directory, and Windows file locking then prevents subsequent tests from deleting and reinitializing that directory. This means that on Windows, not only did this test fail, but all subsequent MiniDFSCluster-based tests failed too.

          I tested your patch on Windows, and it did prevent the infinite loop/timeout condition, which made the overall test run more successful. However, I now see a failure in TestBalancerWithNodeGroup#testBalancerWithRackLocality on Windows. It's unexpectedly returning ReturnStatus.NO_MOVE_PROGRESS. I do not see this failure when running on Mac.

          Running org.apache.hadoop.hdfs.server.balancer.TestBalancerWithNodeGroup
          Tests run: 3, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 105.156 sec <<< FAILURE!
          testBalancerWithRackLocality(org.apache.hadoop.hdfs.server.balancer.TestBalancerWithNodeGroup)  Time elapsed: 35171 sec  <<< FAILURE!
          junit.framework.AssertionFailedError: expected:<1> but was:<-3>
                  at junit.framework.Assert.fail(Assert.java:47)
                  at junit.framework.Assert.failNotEquals(Assert.java:283)
                  at junit.framework.Assert.assertEquals(Assert.java:64)
                  at junit.framework.Assert.assertEquals(Assert.java:195)
                  at junit.framework.Assert.assertEquals(Assert.java:201)
                  at org.apache.hadoop.hdfs.server.balancer.TestBalancerWithNodeGroup.runBalancer(TestBalancerWithNodeGroup.java:170)
                  at org.apache.hadoop.hdfs.server.balancer.TestBalancerWithNodeGroup.testBalancerWithRackLocality(TestBalancerWithNodeGroup.java:232)
                  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
                  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
                  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
                  at java.lang.reflect.Method.invoke(Method.java:597)
                  at junit.framework.TestCase.runTest(TestCase.java:168)
                  at junit.framework.TestCase.runBare(TestCase.java:134)
                  at junit.framework.TestResult$1.protect(TestResult.java:110)
                  at junit.framework.TestResult.runProtected(TestResult.java:128)
                  at junit.framework.TestResult.run(TestResult.java:113)
                  at junit.framework.TestCase.run(TestCase.java:124)
                  at junit.framework.TestSuite.runTest(TestSuite.java:243)
                  at junit.framework.TestSuite.run(TestSuite.java:238)
                  at org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:83)
                  at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:252)
                  at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:141)
                  at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:112)
                  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
                  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
                  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
                  at java.lang.reflect.Method.invoke(Method.java:597)
                  at org.apache.maven.surefire.util.ReflectionUtils.invokeMethodWithArray(ReflectionUtils.java:189)
                  at org.apache.maven.surefire.booter.ProviderFactory$ProviderProxy.invoke(ProviderFactory.java:165)
                  at org.apache.maven.surefire.booter.ProviderFactory.invokeProvider(ProviderFactory.java:85)
                  at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:115)
                  at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:75)
          
          
          Results :
          
          Failed tests:   testBalancerWithRackLocality(org.apache.hadoop.hdfs.server.balancer.TestBalancerWithNodeGroup): expected:<1> but was:<-3>
          
          Tests run: 3, Failures: 1, Errors: 0, Skipped: 0
          
          Chris Nauroth added a comment -

          I reviewed the Windows failure more closely and found this:

          java.io.IOException: THIS IS NOT SUPPOSED TO HAPPEN: replica.getBytesOnDisk() != block.getNumBytes(), block=BP-TEST:blk_1000_2000, replica=ReplicaUnderRecovery, blk_1000_2000, RUR
          

          That came from this check in FsDatasetImpl#updateReplicaUnderRecovery:

              //check replica's byte on disk
              if (replica.getBytesOnDisk() != oldBlock.getNumBytes()) {
                throw new IOException("THIS IS NOT SUPPOSED TO HAPPEN:"
                    + " replica.getBytesOnDisk() != block.getNumBytes(), block="
                    + oldBlock + ", replica=" + replica);
              }
          

          This is causing the current balancer iteration to move 0 bytes. Then, the new logic returns NO_MOVE_PROGRESS after exceeding the maximum iterations.

          This looks to be an unrelated Windows-specific issue, so I have filed a separate jira to track it: HDFS-4289.

          Aaron T. Myers added a comment -

          Thanks a lot for the investigation, Chris. I agree that tracking this specific Windows issue warrants another JIRA.

          Nicholas: does Junping's latest patch look OK to you? If so, I'll go ahead and commit it.

          Tsz Wo Nicholas Sze added a comment -

          Yes, the patch looks good to me. Good catch on the bug, thanks!

          I have just started a Jenkins build. Let's wait for it.

          Junping Du added a comment -

          Great. Thanks to Chris for the investigation, and to ATM and Nicholas for reviewing.

          Tsz Wo Nicholas Sze added a comment -

          The Jenkins builds somehow do not work anymore. I started a few builds recently, but all of them failed with the following.

          ======================================================================
          ======================================================================
              Testing patch for HADOOP-4261.
          ======================================================================
          ======================================================================
          
          
          At revision 1419190.
          HADOOP-4261 is not "Patch Available".  Exiting.
          
          Tsz Wo Nicholas Sze added a comment -

          I ran TestEditLog, TestBalancer and TestBalancerWithNodeGroup manually. The patch passed all of them.

          Tsz Wo Nicholas Sze added a comment -

          I have committed this. Thanks, Junping.

          Also thanks everyone who has looked at this.

          Hudson added a comment -

          Integrated in Hadoop-trunk-Commit #3101 (See https://builds.apache.org/job/Hadoop-trunk-Commit/3101/)
          HDFS-4261. Fix bugs in Balancer that it does not terminate in some cases and it checks BlockPlacementPolicy instance incorrectly. Contributed by Junping Du (Revision 1419192)

          Result = SUCCESS
          szetszwo : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1419192
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/balancer/Balancer.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/balancer/NameNodeConnector.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/balancer/TestBalancerWithNodeGroup.java
          Hudson added a comment -

          Integrated in Hadoop-Yarn-trunk #61 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/61/)
          HDFS-4261. Fix bugs in Balancer that it does not terminate in some cases and it checks BlockPlacementPolicy instance incorrectly. Contributed by Junping Du (Revision 1419192)

          Result = SUCCESS
          szetszwo : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1419192
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/balancer/Balancer.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/balancer/NameNodeConnector.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/balancer/TestBalancerWithNodeGroup.java
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk #1250 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1250/)
          HDFS-4261. Fix bugs in Balancer that it does not terminate in some cases and it checks BlockPlacementPolicy instance incorrectly. Contributed by Junping Du (Revision 1419192)

          Result = FAILURE
          szetszwo : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1419192
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/balancer/Balancer.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/balancer/NameNodeConnector.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/balancer/TestBalancerWithNodeGroup.java
          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk #1281 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1281/)
          HDFS-4261. Fix bugs in Balancer that it does not terminate in some cases and it checks BlockPlacementPolicy instance incorrectly. Contributed by Junping Du (Revision 1419192)

          Result = FAILURE
          szetszwo : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1419192
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/balancer/Balancer.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/balancer/NameNodeConnector.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/balancer/TestBalancerWithNodeGroup.java
          Tsz Wo Nicholas Sze added a comment -

          Although TestBalancerWithNodeGroup passes on my machine, it fails in the recent Jenkins builds. I will revert the committed patch.

          Tsz Wo Nicholas Sze added a comment -

          Reverted the patch.

          Junping, could you take a look at the failure, e.g. build #3626?

          Junping Du added a comment -

          Sure. I am looking into it now.

          Hudson added a comment -

          Integrated in Hadoop-trunk-Commit #3110 (See https://builds.apache.org/job/Hadoop-trunk-Commit/3110/)
          svn -c -1419192 . for reverting HDFS-4261 since TestBalancerWithNodeGroup failed in the recent Jenkins builds. (Revision 1420010)

          Result = SUCCESS
          szetszwo : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1420010
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/balancer/Balancer.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/balancer/NameNodeConnector.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/balancer/TestBalancerWithNodeGroup.java
          Hudson added a comment -

          Integrated in Hadoop-Yarn-trunk #62 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/62/)
          svn -c -1419192 . for reverting HDFS-4261 since TestBalancerWithNodeGroup failed in the recent Jenkins builds. (Revision 1420010)

          Result = FAILURE
          szetszwo : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1420010
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/balancer/Balancer.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/balancer/NameNodeConnector.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/balancer/TestBalancerWithNodeGroup.java
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk #1251 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1251/)
          svn -c -1419192 . for reverting HDFS-4261 since TestBalancerWithNodeGroup failed in the recent Jenkins builds. (Revision 1420010)

          Result = FAILURE
          szetszwo : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1420010
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/balancer/Balancer.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/balancer/NameNodeConnector.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/balancer/TestBalancerWithNodeGroup.java
          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk #1282 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1282/)
          svn -c -1419192 . for reverting HDFS-4261 since TestBalancerWithNodeGroup failed in the recent Jenkins builds. (Revision 1420010)

          Result = SUCCESS
          szetszwo : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1420010
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/balancer/Balancer.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/balancer/NameNodeConnector.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/balancer/TestBalancerWithNodeGroup.java
          Junping Du added a comment -

          I cannot reproduce the failure in my local environment. However, I see a similar issue when I change the order of the test cases in TestBalancerWithNodeGroup: there is no cleanup if the previous test ends with NO_MOVE_PROGRESS. So I adjusted the cleanup (resetData()) in the v5 patch. Let's wait and see the pre-commit result.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12560385/HDFS-4261-v5.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs:

          org.apache.hadoop.hdfs.server.balancer.TestBalancerWithNodeGroup

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/3635//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3635//console

          This message is automatically generated.

          Junping Du added a comment -

          In the v6 patch, I removed the balanced check from the TestBalancerWithRackLocality test case.

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12560476/HDFS-4261-v6.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/3643//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3643//console

          This message is automatically generated.

          Tsz Wo Nicholas Sze added a comment -

          Question: since a Balancer object is created in each run, do we still need resetData(..)?

          Junping Du added a comment -

          Yes. A new Balancer object is created for each iteration, but the parameter p is the old one and is passed to the Balancer constructor every time. Since p contains the balancing policy, which accumulates the nodes' capacity, it needs to be cleaned up in each iteration. So maybe resetData() is still necessary here?
          Also, it seems odd that testBalancerWithRackLocality() ends with NO_MOVE_PROGRESS in Jenkins but runs perfectly in my environment. Shall we address that here or file a separate JIRA to track it?
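
To illustrate why a shared policy object needs a reset even when the Balancer itself is recreated, here is a minimal sketch. It is not the Hadoop code: the class AccumulatingPolicySketch, its fields, and runIteration(..) are hypothetical names made up for this example, and the node figures are invented. It only shows that totals carried over from an earlier iteration skew the average once the cluster changes.

```java
/** Hypothetical stand-in for a balancing policy that accumulates cluster totals. */
public class AccumulatingPolicySketch {
    private long totalCapacity;
    private long totalUsed;

    /** Adds one node's capacity and used space to the running totals. */
    public void accumulate(long capacity, long used) {
        totalCapacity += capacity;
        totalUsed += used;
    }

    /** Clears the running totals (the role resetData() plays in the discussion). */
    public void reset() {
        totalCapacity = 0;
        totalUsed = 0;
    }

    /** Average utilization in percent over everything accumulated so far. */
    public double avgUtilization() {
        return totalCapacity == 0 ? 0.0 : 100.0 * totalUsed / totalCapacity;
    }

    /**
     * Simulates one iteration: optionally reset the shared policy, then
     * accumulate every node given as a {capacity, used} pair.
     */
    public static double runIteration(AccumulatingPolicySketch policy,
                                      long[][] nodes, boolean resetFirst) {
        if (resetFirst) {
            policy.reset();
        }
        for (long[] node : nodes) {
            policy.accumulate(node[0], node[1]);
        }
        return policy.avgUtilization();
    }

    public static void main(String[] args) {
        long[][] firstRound = {{100, 80}};             // one node, 80% full
        long[][] secondRound = {{100, 80}, {100, 0}};  // an empty node was added

        AccumulatingPolicySketch withReset = new AccumulatingPolicySketch();
        runIteration(withReset, firstRound, true);
        System.out.println("with reset:    " + runIteration(withReset, secondRound, true));

        AccumulatingPolicySketch withoutReset = new AccumulatingPolicySketch();
        runIteration(withoutReset, firstRound, false);
        System.out.println("without reset: " + runIteration(withoutReset, secondRound, false));
    }
}
```

In this sketch the reused policy without a reset still counts the first round's node, so the average it reports no longer matches the current cluster.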

          Chris Nauroth added a comment -

          I've been running with v6 of the patch, and I've occasionally seen it time out waiting for move completion. I've seen it happen on both Mac and Windows, though it seems to be much more prevalent on Windows. I captured thread dumps. If it's helpful to see those, let me know, and I'll post them.

          Junping Du added a comment -

          Chris, thanks for the comments. Can you tell me whether the occasional timeout happened before or after applying the v6 patch?

          Chris Nauroth added a comment -

          This happened after I applied the v6 patch.

          Junping Du added a comment -

          Thanks. It would be great if you could post the thread dumps.

          Chris Nauroth added a comment -

          I'm attaching multiple files.

          There are thread dumps from 2 separate runs that timed out with the v6 patch, one on Mac and one on Windows. Both thread dumps show the same thing: stuck in Balancer#waitForMoveCompletion waiting for the pending queue to reach empty. Perhaps there is a race condition preventing the queue from getting drained?

          I've also attached the log output from each test run.
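
For readers unfamiliar with this kind of hang, the following sketch shows a wait loop on a pending queue with a deadline added, so that a queue that never drains produces a failure instead of blocking forever. It is not the Balancer#waitForMoveCompletion implementation: BoundedWaitSketch, waitUntilEmpty(..), and the timeout and poll values are all assumptions made for illustration.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

/** Hypothetical sketch: wait for a pending queue to drain, but give up after a deadline. */
public class BoundedWaitSketch {

    /**
     * Polls the queue until it is empty or the timeout elapses.
     * Returns true if the queue drained, false on timeout or interruption.
     */
    public static boolean waitUntilEmpty(Queue<?> pending, long timeoutMs, long pollMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (!pending.isEmpty()) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // deadline passed while items were still pending
            }
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for callers
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Queue<String> pending = new ConcurrentLinkedQueue<>();
        pending.add("block-move-that-never-completes");
        System.out.println("drained: " + waitUntilEmpty(pending, 200, 20));
    }
}
```

A bounded wait like this does not fix a race that leaves items in the queue; it only turns a silent hang into a visible timeout that is easier to diagnose.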

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12560600/org.apache.hadoop.hdfs.server.balancer.TestBalancerWithNodeGroup-output.txt.win
          against trunk revision .

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3649//console

          This message is automatically generated.

          Chris Nauroth added a comment -

          Junping, I also wanted to mention that I can reproduce this fairly consistently. It happens for ~20% of my test runs. If you're having trouble getting a repro in your own environment, then I'd be happy to test new patches for you.

          Junping Du added a comment -

          Thanks, Chris. The jstack output you attached is very helpful. As you said, it seems to be stuck at org.apache.hadoop.hdfs.server.balancer.Balancer.waitForMoveCompletion(Balancer.java:1139).
          I initially suspected a problem with the pendingBlocks list, since it might add the same PendingBlockMove multiple times but delete it only once. However, the add/delete logic there is right (even for adding a pendingBlock to the SourceProxy). So I am looking for other hints.

          Junping Du added a comment -

          I think I found the reason. There is another Balancer bug (a corner case only) revealed by TestBalancerWithNodeGroup.testBalancerWithNodeGroup(). In the current code base, when a source node (> avg) has a NodeTask to move some amount of data to a target node (< avg), it falls into an infinite loop if no block can be moved because of restrictions from the replica placement policy. This is not caused by the v6 patch; I can reproduce it on my Linux system even without the v6 patch (10%-20% of runs). I will post a quick fix for it (v7 patch). Chris, please help verify that the v7 patch works for you (Mac and Windows). Thanks!
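
The loop described above can be sketched as follows. This is only an illustration of one common way to guarantee termination, by counting consecutive attempts that schedule nothing; it is not necessarily how the v7 patch does it. NoProgressGuardSketch, dispatch(..), and MAX_NO_PROGRESS_ITERATIONS are hypothetical names, and the limit of 5 is an assumed value.

```java
import java.util.function.LongSupplier;

/** Hypothetical sketch: stop dispatching when repeated attempts schedule no block. */
public class NoProgressGuardSketch {
    /** Assumed limit on consecutive attempts that move nothing. */
    static final int MAX_NO_PROGRESS_ITERATIONS = 5;

    /**
     * Tries to schedule moves until bytesToMove is covered.
     * tryScheduleMove returns the number of bytes scheduled in one attempt,
     * or 0 when no block is eligible (for example, rejected by a placement policy).
     * Returns true if everything was scheduled, false if the loop gave up.
     */
    public static boolean dispatch(long bytesToMove, LongSupplier tryScheduleMove) {
        long remaining = bytesToMove;
        int idleAttempts = 0;
        while (remaining > 0) {
            long scheduled = tryScheduleMove.getAsLong();
            if (scheduled > 0) {
                remaining -= scheduled;
                idleAttempts = 0; // progress was made, so start counting again
            } else if (++idleAttempts >= MAX_NO_PROGRESS_ITERATIONS) {
                return false;     // without this guard the loop would never end
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("movable blocks:    " + dispatch(10, () -> 4L));
        System.out.println("no movable blocks: " + dispatch(10, () -> 0L));
    }
}
```

Without the idle counter, a supplier that always returns 0 keeps the while loop spinning indefinitely, which matches the symptom reported in this issue.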

          Junping Du added a comment -

          The v7 patch fixes another infinite loop, which occurs when no blocks can be moved in a source BalancerNode's node task.

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12561504/HDFS-4261-v7.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/3679//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3679//console

          This message is automatically generated.

          Suresh Srinivas added a comment -

          "Chris, please help to verify v7 patch do works for you"

          Chris is off for the next few days, so testing has to be done without his help.

          Junping Du added a comment -

          Thanks for the comments, Suresh. I think I have already done enough testing in my local environment, and I believe the previous bug is not platform-specific (Windows or Mac). I pinged Chris only because he volunteered to double-check the patch.

          Aaron T. Myers added a comment -

          Hey Junping, I just applied the latest patch and looped it on my machine a few times and in one of the test runs I saw the following failure:

          Running org.apache.hadoop.hdfs.server.balancer.TestBalancerWithNodeGroup
          Tests run: 3, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 54.806 sec <<< FAILURE!
          testBalancerWithNodeGroup(org.apache.hadoop.hdfs.server.balancer.TestBalancerWithNodeGroup)  Time elapsed: 37809 sec  <<< ERROR!
          java.util.concurrent.TimeoutException: Rebalancing expected avg utilization to become 0.16, but on datanode 127.0.0.1:44127 it remains at 0.04 after more than 20000 msec.
          	at org.apache.hadoop.hdfs.server.balancer.TestBalancerWithNodeGroup.waitForBalancer(TestBalancerWithNodeGroup.java:149)
          	at org.apache.hadoop.hdfs.server.balancer.TestBalancerWithNodeGroup.runBalancer(TestBalancerWithNodeGroup.java:176)
          	at org.apache.hadoop.hdfs.server.balancer.TestBalancerWithNodeGroup.testBalancerWithNodeGroup(TestBalancerWithNodeGroup.java:300)
          	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
          	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
          	at java.lang.reflect.Method.invoke(Method.java:597)
          	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44)
          	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
          	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41)
          	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
          	at org.junit.runners.BlockJUnit4ClassRunner.runNotIgnored(BlockJUnit4ClassRunner.java:79)
          	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:71)
          	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:49)
          	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193)
          	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52)
          	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191)
          	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42)
          	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184)
          	at org.junit.runners.ParentRunner.run(ParentRunner.java:236)
          	at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:252)
          	at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:141)
          	at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:112)
          	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
          	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
          	at java.lang.reflect.Method.invoke(Method.java:597)
          	at org.apache.maven.surefire.util.ReflectionUtils.invokeMethodWithArray(ReflectionUtils.java:189)
          	at org.apache.maven.surefire.booter.ProviderFactory$ProviderProxy.invoke(ProviderFactory.java:165)
          	at org.apache.maven.surefire.booter.ProviderFactory.invokeProvider(ProviderFactory.java:85)
          	at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:115)
          	at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:75)
          

          Entirely possible this again isn't actually due to your changes, but have you seen this error before?

          Junping Du added a comment -

          Hi ATM, thanks for your input. I ran more than 20 rounds of the test in my local environment but haven't seen this error before.
          In general, this error happens when the cluster is still not balanced after the balancer has run. We expect this to happen in testBalancerEndInNoMoveProgress(), but it shouldn't happen in TestBalancerWithNodeGroup.testBalancerWithNodeGroup(). It may be related to my latest changes: to get rid of the infinite loop, the SourceBalancerNode thread now exits if no blocks can be moved to the target node (which is possible in this boundary test case). That can leave some balancer nodes unbalanced at the end of an iteration, but they should get balanced in the next balancing iteration (unless the same target node is chosen every time).
          I need to investigate this further.
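The failure above comes from a wait loop that polls datanode utilization until the cluster looks balanced or a deadline passes. The following is only a rough sketch of that kind of check, not the actual TestBalancerWithNodeGroup.waitForBalancer code: the class name, the method names, the boolean return value (the real test throws a TimeoutException) and the threshold parameter are all assumptions made for illustration.

```java
import java.util.function.Supplier;

// Hypothetical sketch; not the code in TestBalancerWithNodeGroup.
class BalanceWaitSketch {

  /** True if every node's utilization is within threshold of the expected average. */
  static boolean isBalanced(double[] utilizations, double expectedAvg, double threshold) {
    for (double u : utilizations) {
      if (Math.abs(u - expectedAvg) > threshold) {
        return false;
      }
    }
    return true;
  }

  /**
   * Polls the cluster until it is balanced or the timeout expires.
   * Returns true when balanced, false on timeout or interruption.
   */
  static boolean waitForBalance(Supplier<double[]> poll, double expectedAvg,
      double threshold, long timeoutMs, long pollIntervalMs) {
    long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
    while (true) {
      if (isBalanced(poll.get(), expectedAvg, threshold)) {
        return true;
      }
      if (System.nanoTime() - deadline >= 0) {
        return false; // still unbalanced after the timeout
      }
      try {
        Thread.sleep(pollIntervalMs);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return false;
      }
    }
  }
}
```

With the values from the stack trace (expected average 0.16, one datanode stuck at 0.04) and an assumed threshold of 0.1, such a check never succeeds, which matches the reported timeout.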

          Chris Nauroth added a comment -

          I just tried testing the v7 patch applied to current branch-trunk-win. I couldn't repro on Mac, but I'm still seeing the infinite loop in about 50% of test runs on Windows. The thread dumps look similar to last time: stuck in Balancer#waitForMoveCompletion. (The line numbers are different since Balancer.java has changed since last time.)

          Eli Collins added a comment -

          Any update, Junping? TestBalancerWithNodeGroup currently fails 100% of the time on my local Jenkins slave running trunk. We should annotate these test methods with timeouts, à la HDFS-4061 and HDFS-4008, so that we get clean test failures if this regresses.
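In JUnit 4, the timeout annotation mentioned here is the timeout parameter of @Test, for example @Test(timeout = 60000). As a rough, JDK-only sketch of what such a limit provides, the example below runs a task on a separate thread and turns a hang into a clean "did not finish" result. The class and method names are hypothetical and are not part of Hadoop or JUnit.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper; illustrates the idea behind a per-test timeout.
class TimeLimitedRunner {

  /** Returns true if the task completed within timeoutMs, false if it was cut off. */
  static boolean finishesWithin(Callable<?> task, long timeoutMs) {
    ExecutorService executor = Executors.newSingleThreadExecutor();
    Future<?> future = executor.submit(task);
    try {
      future.get(timeoutMs, TimeUnit.MILLISECONDS);
      return true;
    } catch (TimeoutException e) {
      future.cancel(true); // interrupt the hung task instead of waiting forever
      return false;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      future.cancel(true);
      return false;
    } catch (ExecutionException e) {
      throw new IllegalStateException("task failed", e.getCause());
    } finally {
      executor.shutdownNow();
    }
  }
}
```

A test that loops forever then shows up as an ordinary failure with a name attached, rather than as a run that silently never reports.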

          Junping Du added a comment -

          Hi Eli, with the v7 patch, TestBalancerWithNodeGroup always succeeds in my local environment, and I cannot reproduce the issues ATM and Chris saw (I have already tried 30+ times). I think at least 4 issues in the balancer have been identified and fixed here:
          1. NoChangeIterations (which counts iterations with no block movement) was not working. Compared with branch-1, the breakage seems to have been introduced by NameNode Federation.
          2. The balancer's balancing policy is static, so we need to clean it up (reset it) in every balancing iteration even though we create a new balancer instance.
          3. The checkReplicaPlacementPolicy() issue identified by ATM.
          4. The loop in dispatchBlocks() could become infinite in some occasional cases.
          +1 on adding the timeout annotation; I will add it in the v8 patch.
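Item 1 refers to a counter that stops the balancer once several consecutive iterations move no data. A minimal sketch of that guard, assuming a limit of 5 iterations and a hypothetical callback that reports how many bytes each iteration moved (neither is taken from the patch), might look like this:

```java
import java.util.function.LongSupplier;

// Hypothetical sketch of a "no progress" guard; not the actual Balancer code.
class BalancerLoopSketch {
  // Assumed value for illustration only.
  static final int MAX_NOT_CHANGED_ITERATIONS = 5;

  /** Runs balancing iterations and returns how many were executed before stopping. */
  static int run(LongSupplier bytesMovedInIteration, int maxIterations) {
    int notChangedIterations = 0;
    int iterations = 0;
    while (iterations < maxIterations) {
      iterations++;
      long moved = bytesMovedInIteration.getAsLong();
      if (moved > 0) {
        notChangedIterations = 0; // progress was made, so start counting again
      } else {
        notChangedIterations++;
        if (notChangedIterations >= MAX_NOT_CHANGED_ITERATIONS) {
          break; // give up instead of looping forever
        }
      }
    }
    return iterations;
  }
}
```

If the counter is never incremented or is reset in the wrong place, the exit condition can never fire, which is one way a loop like this ends up running indefinitely.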

          Junping Du added a comment -

          Chris, can you help verify it again in your environment? If the issue only happens on a specific platform, I think we can file a separate JIRA to track it, since issue #3 above is a blocking issue.

          Colin Patrick McCabe added a comment -

          I put up a patch just to add the junit timeout. That way we can at least identify Jenkins failures that are due to this issue. (there have been a bunch lately)

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12563782/HDFS-4261-v8.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/3794//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3794//console

          This message is automatically generated.

          Chris Nauroth added a comment -

          +1 for the v8 patch.

          I tested it on Windows, and I couldn't repro the infinite loop this time. I don't know that it's completely resolved, but it's certainly passing more consistently than current trunk.

          Aaron T. Myers added a comment -

          I just looped all of the balancer tests on my machine for an hour and a half and did end up with one timeout in TestBalancerWithNodeGroup. I'm attaching the thread dump to this JIRA.

          Despite this, I think we should probably go ahead and commit this patch and file a new JIRA for this intermittent failure. This latest patch definitely fixes a few issues in the balancer, improves the balancer tests, and makes the tests fail much less frequently.

          Unless anyone objects, I'll commit this patch later today and file a new JIRA for the intermittent failure.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12563827/test-balancer-with-node-group-timeout.txt
          against trunk revision .

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3801//console

          This message is automatically generated.

          Junping Du added a comment -

          Thanks, Chris and Aaron, for the verification. +1 to opening another JIRA to track this very occasional timeout (only one occurrence in 1.5 hours of looped tests).

          Tsz Wo Nicholas Sze added a comment -

          +1 patch looks good.

          Tsz Wo Nicholas Sze added a comment -

          I have committed this. Thanks, Junping!

          Also, thanks everyone for helping out here.

          Hudson added a comment -

          Integrated in Hadoop-trunk-Commit #3202 (See https://builds.apache.org/job/Hadoop-trunk-Commit/3202/)
          HDFS-4261. Fix bugs in Balaner causing infinite loop and TestBalancerWithNodeGroup timeing out. Contributed by Junping Du (Revision 1430917)

          Result = SUCCESS
          szetszwo : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1430917
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/balancer/Balancer.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/balancer/NameNodeConnector.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/balancer/TestBalancerWithNodeGroup.java
          Aaron T. Myers added a comment -

          Thanks for committing this, Nicholas. I've filed this JIRA to track the intermittent timeout which still occurs: HDFS-4376.

          Hudson added a comment -

          Integrated in Hadoop-Yarn-trunk #92 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/92/)
          HDFS-4261. Fix bugs in Balaner causing infinite loop and TestBalancerWithNodeGroup timeing out. Contributed by Junping Du (Revision 1430917)

          Result = SUCCESS
          szetszwo : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1430917
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/balancer/Balancer.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/balancer/NameNodeConnector.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/balancer/TestBalancerWithNodeGroup.java
          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk #1309 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1309/)
          HDFS-4261. Fix bugs in Balaner causing infinite loop and TestBalancerWithNodeGroup timeing out. Contributed by Junping Du (Revision 1430917)

          Result = FAILURE
          szetszwo : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1430917
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/balancer/Balancer.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/balancer/NameNodeConnector.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/balancer/TestBalancerWithNodeGroup.java
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk #1281 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1281/)
          HDFS-4261. Fix bugs in Balaner causing infinite loop and TestBalancerWithNodeGroup timeing out. Contributed by Junping Du (Revision 1430917)

          Result = FAILURE
          szetszwo : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1430917
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/balancer/Balancer.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/balancer/NameNodeConnector.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/balancer/TestBalancerWithNodeGroup.java
          Ted Yu added a comment -

          Minor:

          +      if (notChangedIterations >= MAX_NOT_CHANGED_INTERATIONS) {
          

          Looks like the constant is misspelled: there is an extra N following the I.

          Aaron T. Myers added a comment -

          Whoops! Good catch, Ted. Want to file a JIRA?

          Ted Yu added a comment -

          Created HDFS-4382.

          Will upload a patch soon.

          Is there a plan to fix the hanging TestBalancerWithNodeGroup in Hadoop 1.1?

          See HBase QA report:
          https://issues.apache.org/jira/browse/HBASE-7529?focusedCommentId=13549790&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13549790

          -1 core zombie tests. There are 1 zombie test(s): at org.apache.hadoop.hdfs.server.balancer.TestBalancerWithNodeGroup.testBalancerWithRackLocality(TestBalancerWithNodeGroup.java:220)

          Junping Du added a comment -

          Thanks, Ted, for fixing the typo. I will file a JIRA to backport this patch to branch-1.

          Suresh Srinivas added a comment -

          Please attach a patch to this Jira. I will commit it. No need for a separate Jira for such a simple change.

          Junping Du added a comment -

          Oops. I already filed it before your comments...

          Aaron T. Myers added a comment -

          Have any of the node group-related balancer changes been back-ported to branch-1? I was under the impression that none of them have even been back-ported to branch-2, let alone branch-1.

          Junping Du added a comment -

          Hi Aaron, Yes. The backport work is tracked by: https://issues.apache.org/jira/browse/HADOOP-8817.

          Aaron T. Myers added a comment -

          Gotcha, OK. I wasn't aware of that JIRA. Thanks for pointing it out.

          Please make sure that whatever gets back-ported to branch-1 also gets back-ported to branch-2. I was under the impression that all of this stuff was only going to trunk.

          Junping Du added a comment -

          Attached the backport patch for branch-1.

          Junping Du added a comment -

          I manually added TestBalancerWithNodeGroup to my test-commit run, and the following are the results:

          +1 overall.  
          
              +1 @author.  The patch does not contain any @author tags.
          
              +1 tests included.  The patch appears to include 3 new or modified tests.
          
              +1 javadoc.  The javadoc tool did not generate any warning messages.
          
              +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
          
              +1 findbugs.  The patch does not introduce any new Findbugs (version 2.0.1) warnings.
          
              +1 commit tests.  The patch passed commit unit tests.
          
          Junping Du added a comment -

          I plan to launch test-commit (with TestBalancerWithNodeGroup) 30 times to see the results. My local environment has Jenkins backed by 4 VMs, so it can do 4 runs in parallel, but it still needs more time.

          Tsz Wo Nicholas Sze added a comment -
          • The branch-1 patch also has the typo MAX_NO_PENDING_BLOCK_INTERATIONS. Let's fix it here so that we don't have to backport HDFS-4382.
          • Should resetData() be inside the while-loop?
          Junping Du added a comment -

          Oh, sorry, I forgot to correct that typo here.
          Yes, resetData() should be inside the while-loop to work with initNodes(). However, initNodes() here seems to rebuild everything from scratch (unlike in trunk), so the previous tests didn't complain. I moved resetData() inside the while-loop but put it at the front, in case we later want to change the behaviour of initNodes() to match trunk.
          Updated in the v2 patch. Thanks, Nicholas, for the careful review!

          Junping Du added a comment -

          To Aaron's previous comments:

          Please make sure that whatever gets back-ported to branch-1 also gets back-ported to branch-2. I was under the impression that all of this stuff was only going to trunk
          

          Thanks for the reminder. We can start the backport work soon. BTW, shall we reuse JIRAs in HADOOP-8817 (or in HADOOP-8468) or file new ones?

          Aaron T. Myers added a comment -

          BTW, shall we reuse JIRAs...

          I think we should use the same JIRAs as were used for the trunk work, and just change the "fix version" field appropriately when the change is committed to branch-2. (Not reusing the JIRAs is the reason why I wasn't aware of the branch-1 back-ports.)

          Suresh Srinivas added a comment -

          I think we should use the same JIRAs as were used for the trunk work, and just change the "fix version" field appropriately when the change is committed to branch-2.

          I agree, when patches apply to earlier branches with trivial changes. However, when a significant rework is needed to port a patch back, I have been asking folks to create a separate jira, with an appropriate link between the jiras.

          Aaron T. Myers added a comment -

          I agree, when patches apply to earlier branches with trivial changes. However, when a significant rework is needed to port a patch back, I have been asking folks to create a separate jira, with an appropriate link between the jiras.

          That's totally reasonable. trunk/branch-2 are so close to each other that the back-port is almost always clean or trivial, but you're right that branch-1 back-ports often warrant a separate JIRA.

          Junping Du added a comment -

          I see. That makes sense. Thanks, ATM and Suresh.

          Tsz Wo Nicholas Sze added a comment -

          +1 HDFS-4261-branch-1-v2.patch looks good. Please post the results once you have finished running all the tests. Thanks.

          Junping Du added a comment -

          Nicholas, thanks for your review. However, I saw about 4 failures (not timeouts; TestBalancerWithNodeGroup ended in NoMovementInProgress) in about 30 runs. I need some time to figure out the reason and will update with another patch.

          Suresh Srinivas added a comment -

          Junping, can we wrap this up for branch-1, given that the rest of the code has already gone into that branch?

          Junping Du added a comment -

          Suresh, I am fine with getting this into branch-1; we can figure out the very occasional failure later.

          Matt Foley added a comment -

          Re-opening for branch-1 fix. Please target 1.2.0 (branch-1). Thanks.

          Tsz Wo Nicholas Sze added a comment -

          Hi Junping, TestBalancerWithNodeGroup timed out in build #3889. Could you take a look?

          Junping Du added a comment -

          Sure, Nicholas. I will work on it soon.

          Tsz Wo Nicholas Sze added a comment -

          Hi Junping, any update on this?

          Matt Foley added a comment -

          Changed Target Version to 1.3.0 upon release of 1.2.0. Please change to 1.2.1 if you intend to submit a fix for branch-1.2.

          Junping Du added a comment -

          OK, I will try to get this into 1.2.1. Thanks, Matt!

          Junping Du added a comment -

          Backported the v8 patch to branch-2.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12585723/HDFS-4261-branch-2.patch
          against trunk revision .

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/4467//console

          This message is automatically generated.

          Junping Du added a comment -

          Hi Nicholas, I am backporting the NodeGroup-layer topology to branch-2. Given that the patch here already fixes several issues, shall we backport it to branch-2 first to shorten the gap between branch-2 and trunk? I think we can address the intermittent failure in HDFS-4376. What do you think?

          Tsz Wo Nicholas Sze added a comment -

          Sure, let's fix the failure in HDFS-4376. Thanks for the update.

          Tsz Wo Nicholas Sze added a comment -

          Merged this to branch-2 and also committed the branch-1 patch. Thanks, Junping!

          Junping Du added a comment -

          Thanks Nicholas!

          Suresh Srinivas added a comment -

          I merged this to branch-1.2 for release 1.2.1

          Junping Du added a comment -

          Ok. Thanks Suresh!


            People

            • Assignee:
              Junping Du
              Reporter:
              Tsz Wo Nicholas Sze
            • Votes:
              0
              Watchers:
              14

                Development