Hadoop HDFS / HDFS-6506

Newly moved block replicas are invalidated and deleted in TestBalancer

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.6.0
    • Component/s: balancer & mover, test
    • Labels:
      None
    • Target Version/s:
    • Hadoop Flags:
      Reviewed

      Description

      TestBalancerWithNodeGroup#testBalancerWithNodeGroup has been failing recently:
      https://builds.apache.org/job/PreCommit-HDFS-Build/7045//testReport/
      According to the error log, the cause appears to be that newly moved block replicas are invalidated and deleted, so some of the balancer's work is reversed.

      2014-06-06 18:15:51,681 INFO  balancer.Balancer (Balancer.java:dispatch(370)) - Successfully moved blk_1073741834_1010 with size=100 from 127.0.0.1:49159 to 127.0.0.1:55468 through 127.0.0.1:49159
      2014-06-06 18:15:51,683 INFO  balancer.Balancer (Balancer.java:dispatch(370)) - Successfully moved blk_1073741833_1009 with size=100 from 127.0.0.1:49159 to 127.0.0.1:55468 through 127.0.0.1:49159
      2014-06-06 18:15:51,683 INFO  balancer.Balancer (Balancer.java:dispatch(370)) - Successfully moved blk_1073741830_1006 with size=100 from 127.0.0.1:49159 to 127.0.0.1:55468 through 127.0.0.1:49159
      2014-06-06 18:15:51,683 INFO  balancer.Balancer (Balancer.java:dispatch(370)) - Successfully moved blk_1073741831_1007 with size=100 from 127.0.0.1:49159 to 127.0.0.1:55468 through 127.0.0.1:49159
      2014-06-06 18:15:51,682 INFO  balancer.Balancer (Balancer.java:dispatch(370)) - Successfully moved blk_1073741832_1008 with size=100 from 127.0.0.1:49159 to 127.0.0.1:55468 through 127.0.0.1:49159
      2014-06-06 18:15:54,702 INFO  balancer.Balancer (Balancer.java:dispatch(370)) - Successfully moved blk_1073741827_1003 with size=100 from 127.0.0.1:49159 to 127.0.0.1:55468 through 127.0.0.1:49159
      2014-06-06 18:15:54,702 INFO  balancer.Balancer (Balancer.java:dispatch(370)) - Successfully moved blk_1073741828_1004 with size=100 from 127.0.0.1:49159 to 127.0.0.1:55468 through 127.0.0.1:49159
      2014-06-06 18:15:54,701 INFO  balancer.Balancer (Balancer.java:dispatch(370)) - Successfully moved blk_1073741829_1005 with size=100 fr
      2014-06-06 18:15:54,706 INFO  BlockStateChange (BlockManager.java:chooseExcessReplicates(2711)) - BLOCK* chooseExcessReplicates: (127.0.0.1:55468, blk_1073741833_1009) is added to invalidated blocks set
      2014-06-06 18:15:54,709 INFO  BlockStateChange (BlockManager.java:chooseExcessReplicates(2711)) - BLOCK* chooseExcessReplicates: (127.0.0.1:55468, blk_1073741834_1010) is added to invalidated blocks set
      2014-06-06 18:15:56,421 INFO  BlockStateChange (BlockManager.java:invalidateWorkForOneNode(3242)) - BLOCK* BlockManager: ask 127.0.0.1:55468 to delete [blk_1073741833_1009, blk_1073741834_1010]
      2014-06-06 18:15:57,717 INFO  BlockStateChange (BlockManager.java:chooseExcessReplicates(2711)) - BLOCK* chooseExcessReplicates: (127.0.0.1:55468, blk_1073741832_1008) is added to invalidated blocks set
      2014-06-06 18:15:57,720 INFO  BlockStateChange (BlockManager.java:chooseExcessReplicates(2711)) - BLOCK* chooseExcessReplicates: (127.0.0.1:55468, blk_1073741827_1003) is added to invalidated blocks set
      2014-06-06 18:15:57,721 INFO  BlockStateChange (BlockManager.java:chooseExcessReplicates(2711)) - BLOCK* chooseExcessReplicates: (127.0.0.1:55468, blk_1073741830_1006) is added to invalidated blocks set
      2014-06-06 18:15:57,722 INFO  BlockStateChange (BlockManager.java:chooseExcessReplicates(2711)) - BLOCK* chooseExcessReplicates: (127.0.0.1:55468, blk_1073741831_1007) is added to invalidated blocks set
      2014-06-06 18:15:57,723 INFO  BlockStateChange (BlockManager.java:chooseExcessReplicates(2711)) - BLOCK* chooseExcessReplicates: (127.0.0.1:55468, blk_1073741829_1005) is added to invalidated blocks set
      2014-06-06 18:15:59,422 INFO  BlockStateChange (BlockManager.java:invalidateWorkForOneNode(3242)) - BLOCK* BlockManager: ask 127.0.0.1:55468 to delete [blk_1073741827_1003, blk_1073741829_1005, blk_1073741830_1006, blk_1073741831_1007, blk_1073741832_1008]
      2014-06-06 18:16:02,423 INFO  BlockStateChange (BlockManager.java:invalidateWorkForOneNode(3242)) - BLOCK* BlockManager: ask 127.0.0.1:55468 to delete [blk_1073741845_1021]
      

      Normally this should not happen: when a block is moved from src to dest, the replica on src should be invalidated, not the one on dest. There is probably a bug in the related logic.
      I don't think TestBalancerWithNodeGroup#testBalancerWithNodeGroup itself causes this.

      1. HDFS-6506.v1.patch
        2 kB
        Binglin Chang
      2. HDFS-6506.v2.patch
        3 kB
        Binglin Chang
      3. HDFS-6506.v3.patch
        3 kB
        Binglin Chang

          Activity

          Binglin Chang added a comment -

          Thanks for the review Chris and Junping.

          Chris Nauroth added a comment -

          I committed this to trunk and branch-2. Binglin, thank you for contributing the patch. Junping, thank you for doing code review.

          Chris Nauroth added a comment -

          +1 for the patch. Junping Du, do you have any other feedback? I plan to commit this on Tuesday, 9/9, unless I hear otherwise.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12667086/HDFS-6506.v3.patch
          against trunk revision a23144f.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs:

          org.apache.hadoop.hdfs.web.TestWebHdfsFileSystemContract
          org.apache.hadoop.hdfs.server.namenode.ha.TestPipelinesFailover

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/7940//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/7940//console

          This message is automatically generated.

          Binglin Chang added a comment -

          Rebased the patch onto the latest trunk.

          Chris Nauroth added a comment -

          Unfortunately, it appears this patch has gone stale. Binglin Chang, would you mind updating the patch? Junping Du, would you mind +1'ing a new patch quickly if you don't have any other feedback? I'm happy to take care of the commit if you're busy. It would be nice to get this in and hopefully put an end to the spurious failures in the balancer tests. Thanks!

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12651956/HDFS-6506.v2.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/7577//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/7577//console

          This message is automatically generated.

          Junping Du added a comment -

          Sorry for the late response. The patch looks good to me overall. Kicking off the Jenkins test again, since the patch has not been synced for a long time.

          Junping Du added a comment -

          Sure. I will be around to review soon. Thanks, Binglin!

          Binglin Chang added a comment -

          Hi Junping Du, this bug is related to TestBalancerWithNodeGroup. Could you help review it? Thanks.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12651956/HDFS-6506.v2.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs:

          org.apache.hadoop.TestRefreshCallQueue

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/7210//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/7210//console

          This message is automatically generated.

          Binglin Chang added a comment -

          Updated the patch to include a fix for the bug in HDFS-6586: TestBalancer is affected by the balancer.id file.

          Binglin Chang added a comment -

          The failed test is unrelated and is tracked in HDFS-3930; a recent build also failed because of it.
          https://builds.apache.org/job/Hadoop-Hdfs-trunk/1770/consoleText

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12649548/HDFS-6506.v1.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs:

          org.apache.hadoop.hdfs.server.datanode.TestBPOfferService

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/7072//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/7072//console

          This message is automatically generated.

          Binglin Chang added a comment -

          The balancer already sleeps 2*DFS_HEARTBEAT_INTERVAL seconds between rounds, but TestBalancer.java sets:

              conf.setLong(DFSConfigKeys.DFS_HEARTBEAT_INTERVAL_KEY, 1L);
          

          The speed at which replica state is updated also depends on DFS_NAMENODE_REPLICATION_INTERVAL, which is 3 by default.
          TestBalancer changes only the heartbeat interval (which affects both the heartbeat and the balancer's sleep time between iterations) but not the ReplicationMonitor check interval, so the sleep is too short for the block movements to be committed.
          In addition, 2*DFS_HEARTBEAT_INTERVAL still seems a little risky; it could be changed to 2*DFS_HEARTBEAT_INTERVAL + DFS_NAMENODE_REPLICATION_INTERVAL.
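
          The arithmetic behind this suggestion can be sketched in plain Java. This is only an illustration, not code from the patch: the class and method names are hypothetical, and the values (heartbeat interval 1 second as set in TestBalancer, replication interval 3 seconds by default) are the ones quoted in this comment.

```java
// Hypothetical sketch; names are illustrative and do not come from the Hadoop code base.
final class BalancerSleepSketch {
    private BalancerSleepSketch() {}

    /** Sleep between balancer rounds as described above: 2 * heartbeat interval. */
    static long currentSleepMillis(long heartbeatSec) {
        return 2 * heartbeatSec * 1000L;
    }

    /** Suggested sleep: 2 * heartbeat interval + replication monitor interval. */
    static long proposedSleepMillis(long heartbeatSec, long replicationSec) {
        return (2 * heartbeatSec + replicationSec) * 1000L;
    }

    public static void main(String[] args) {
        long heartbeatSec = 1L;    // value set by TestBalancer
        long replicationSec = 3L;  // default replication interval quoted above

        long current = currentSleepMillis(heartbeatSec);
        long proposed = proposedSleepMillis(heartbeatSec, replicationSec);

        // With these values the current sleep (2000 ms) is shorter than one
        // replication monitor interval (3000 ms); the proposed sleep is 5000 ms.
        System.out.println("current=" + current + " ms, proposed=" + proposed + " ms");
    }
}
```

          With the test's settings, the current sleep is shorter than a single ReplicationMonitor interval, which is consistent with the failure described here.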

          Binglin Chang added a comment -

          I looked at the log and code more thoroughly. Some block replicas are invalidated for the following reason:
          1. Balancer round 1: blk0 is moved from dn0 to dn1. At this point the block map has not been updated yet (so dn0 still appears to have blk0).
          2. Balancer round 2 starts and tries to move blk0 from dn0 to dn2.
          3. dn2 copies the data from dn0.
          4. dn0 heartbeats and receives the command to delete blk0.
          5. The move of blk0 from dn0 to dn2 is processed; the replica on dn0 can no longer be found, but a replica still has to be deleted, so the one on dn1 is deleted.

          To prevent this, the balancer needs to wait long enough to make sure the block movements of the previous round are fully committed; otherwise those movements may be invalidated.
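
          The sequence above can be reproduced with a toy model. This is a simplified sketch, not BlockManager's actual logic: the ToyBlockMap class is hypothetical, a replication factor of 1 is assumed, and when the hinted source replica is missing the model simply picks the oldest remaining location as the excess replica.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical toy model of the race; not taken from the Hadoop code base.
final class StaleMoveSketch {
    private StaleMoveSketch() {}

    /** Toy stand-in for the replica locations of a single block. */
    static final class ToyBlockMap {
        final Set<String> locations = new LinkedHashSet<>();
        final int replication;

        ToyBlockMap(int replication, String... initial) {
            this.replication = replication;
            for (String dn : initial) {
                locations.add(dn);
            }
        }

        /**
         * Records a finished move and trims the excess replica.
         * Returns the node whose replica is invalidated, or null if none.
         */
        String reportMove(String source, String target) {
            locations.add(target);
            if (locations.size() <= replication) {
                return null;
            }
            // Prefer deleting the hinted source; if it is already gone,
            // some other replica has to be deleted instead (assumption: the oldest).
            String victim = locations.contains(source) ? source : locations.iterator().next();
            locations.remove(victim);
            return victim;
        }
    }

    /** Runs two rounds; the second round uses a stale view in which dn0 still holds blk0. */
    static String[] run() {
        ToyBlockMap blk0 = new ToyBlockMap(1, "dn0");
        String first = blk0.reportMove("dn0", "dn1");   // round 1: dn0 is invalidated
        String second = blk0.reportMove("dn0", "dn2");  // round 2: dn0 is gone, dn1 is invalidated
        return new String[] {first, second};
    }

    public static void main(String[] args) {
        String[] victims = run();
        System.out.println("round 1 invalidated " + victims[0] + ", round 2 invalidated " + victims[1]);
    }
}
```

          In this model the second round deletes the replica that the first round had just placed on dn1, so the first round's work is undone. Waiting until the first move is fully committed before starting the next round avoids selecting dn0 as a source again.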


            People

            • Assignee:
              Binglin Chang
            • Reporter:
              Binglin Chang
            • Votes:
              0
            • Watchers:
              6
