HDFS-11377: Balancer hung due to no available mover threads

    Details

      Description

      When running the balancer on a large cluster with more than 3000 DataNodes, it may hang after logging "No mover threads available".
      The stack trace shows the main thread waiting indefinitely, as below.

      "main" #1 prio=5 os_prio=0 tid=0x00007ff6cc014800 nid=0x6b2c waiting on condition [0x00007ff6d1bad000]
         java.lang.Thread.State: TIMED_WAITING (sleeping)
              at java.lang.Thread.sleep(Native Method)
              at org.apache.hadoop.hdfs.server.balancer.Dispatcher.waitForMoveCompletion(Dispatcher.java:1043)
              at org.apache.hadoop.hdfs.server.balancer.Dispatcher.dispatchBlockMoves(Dispatcher.java:1017)
              at org.apache.hadoop.hdfs.server.balancer.Dispatcher.dispatchAndCheckContinue(Dispatcher.java:981)
              at org.apache.hadoop.hdfs.server.balancer.Balancer.runOneIteration(Balancer.java:611)
              at org.apache.hadoop.hdfs.server.balancer.Balancer.run(Balancer.java:663)
              at org.apache.hadoop.hdfs.server.balancer.Balancer$Cli.run(Balancer.java:776)
              at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
              at org.apache.hadoop.hdfs.server.balancer.Balancer.main(Balancer.java:905)
      

      The log contains many WARN messages about "No mover threads available".

      2017-01-26 15:36:40,085 WARN org.apache.hadoop.hdfs.server.balancer.Dispatcher: No mover threads available: skip moving blk_13700554102_1112815018180 with size=268435456 from 10.115.67.137:50010:DISK to 10.140.21.55:50010:DISK through 10.115.67.137:50010
      2017-01-26 15:36:40,085 WARN org.apache.hadoop.hdfs.server.balancer.Dispatcher: No mover threads available: skip moving blk_4009558842_1103118359883 with size=268435456 from 10.115.67.137:50010:DISK to 10.140.21.55:50010:DISK through 10.115.67.137:50010
      2017-01-26 15:36:40,085 WARN org.apache.hadoop.hdfs.server.balancer.Dispatcher: No mover threads available: skip moving blk_13881956058_1112996460026 with size=133509566 from 10.115.67.137:50010:DISK to 10.140.21.55:50010:DISK through 10.115.67.36:50010

      When no mover threads are available, the skipped move stays in the pending queue, so DDatanode.isPendingQEmpty() keeps returning false and the Balancer hangs.
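
      A minimal sketch of this failure mode, assuming a heavily simplified model: NodeModel, skipMove and waitForDrain are hypothetical stand-ins, not the real Dispatcher classes or methods. It shows a skipped move that is never dequeued, so a poll on the pending queue never sees it empty.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Simplified, hypothetical model of the hang. This is NOT the real Dispatcher code.
class PendingQueueHangSketch {

  // Stand-in for a DataNode's pending-move queue (class and method names are assumptions).
  static final class NodeModel {
    private final Queue<String> pending = new ArrayDeque<>();

    synchronized void addPending(String blockId) {
      pending.add(blockId);
    }

    synchronized boolean isPendingQEmpty() {
      return pending.isEmpty();
    }
  }

  // Models the unpatched skip path: warn and return, leaving the queue entry behind.
  static void skipMove(NodeModel target, String blockId) {
    System.out.println("No mover threads available: skip moving " + blockId);
  }

  // Polls until the queue drains. The maxPolls bound exists only so this demo
  // terminates; the issue description reports the real wait never returning.
  static boolean waitForDrain(NodeModel target, int maxPolls) {
    for (int i = 0; i < maxPolls; i++) {
      if (target.isPendingQEmpty()) {
        return true;
      }
      try {
        Thread.sleep(10);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return false;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    NodeModel target = new NodeModel();
    target.addPending("blk_1");   // the move is queued as pending
    skipMove(target, "blk_1");    // no thread available: skipped, but never dequeued
    System.out.println("drained=" + waitForDrain(target, 5));
  }
}
```

      In this model the final line prints drained=false however large the poll bound is, because nothing ever removes the skipped entry.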

      1. HDFS-11377.001.patch
        0.8 kB
        yunjiong zhao
      2. HDFS-11377.002.patch
        1 kB
        yunjiong zhao

          Activity

          zhaoyunjiong yunjiong zhao added a comment -

          This patch removes the PendingMove after "No mover threads available" is logged.

          Setting dfs.balancer.moverThreads to a value larger than dfs.datanode.balance.max.concurrent.moves * <numberOfDatanodes> also works.
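
          To make the workaround concrete, here is a small sketch of the sizing arithmetic. The helper name minMoverThreads and the value 5 for dfs.datanode.balance.max.concurrent.moves are assumptions for illustration only; substitute the values from your own cluster configuration.

```java
// Hypothetical sizing helper; the cluster numbers below are made up for illustration.
class MoverThreadSizing {

  // Smallest thread count strictly larger than maxConcurrentMoves * numDatanodes.
  // The exact-arithmetic methods throw ArithmeticException on overflow.
  static long minMoverThreads(int maxConcurrentMovesPerNode, int numDatanodes) {
    long product = Math.multiplyExact((long) maxConcurrentMovesPerNode, (long) numDatanodes);
    return Math.addExact(product, 1L);
  }

  public static void main(String[] args) {
    int maxConcurrentMoves = 5;  // assumed value of dfs.datanode.balance.max.concurrent.moves
    int numDatanodes = 3000;     // cluster size mentioned in the description
    System.out.println("dfs.balancer.moverThreads=" + minMoverThreads(maxConcurrentMoves, numDatanodes));
  }
}
```

          With these assumed inputs the sketch prints dfs.balancer.moverThreads=15001. A thread pool of that size has its own resource cost, which is one reason to prefer the patch over the workaround.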

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 13s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          -1 test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
          +1 mvninstall 13m 33s trunk passed
          +1 compile 0m 51s trunk passed
          +1 checkstyle 0m 28s trunk passed
          +1 mvnsite 1m 5s trunk passed
          +1 mvneclipse 0m 15s trunk passed
          +1 findbugs 2m 0s trunk passed
          +1 javadoc 0m 42s trunk passed
          +1 mvninstall 0m 58s the patch passed
          +1 compile 0m 52s the patch passed
          +1 javac 0m 52s the patch passed
          +1 checkstyle 0m 31s the patch passed
          +1 mvnsite 1m 1s the patch passed
          +1 mvneclipse 0m 13s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 1m 57s the patch passed
          +1 javadoc 0m 36s the patch passed
          -1 unit 79m 40s hadoop-hdfs in the patch failed.
          +1 asflicense 0m 27s The patch does not generate ASF License warnings.
          106m 41s



          Reason Tests
          Failed junit tests hadoop.hdfs.TestAclsEndToEnd
            hadoop.hdfs.server.datanode.TestDataNodeVolumeFailure
            hadoop.hdfs.server.namenode.TestCacheDirectives



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:a9ad5d6
          JIRA Issue HDFS-11377
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12849622/HDFS-11377.001.patch
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux fcf3ffa082ad 3.13.0-103-generic #150-Ubuntu SMP Thu Nov 24 10:34:17 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 7bc333a
          Default Java 1.8.0_121
          findbugs v3.0.0
          unit https://builds.apache.org/job/PreCommit-HDFS-Build/18280/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
          Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/18280/testReport/
          modules C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
          Console output https://builds.apache.org/job/PreCommit-HDFS-Build/18280/console
          Powered by Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          manojg Manoj Govindassamy added a comment - - edited

          Thanks yunjiong zhao for reporting this and for the patch.

          What happened here is, when there are no mover threads available, DDatanode.isPendingQEmpty() will return false, so Balancer hung.

               if (moveExecutor == null) {
                 LOG.warn("No mover threads available: skip moving " + p);
          +      targetDn.removePendingBlock(p);
          +      p.proxySource.removePendingBlock(p);
                 return;
               }
          

          Removing the pending blocks when no threads are available lets Dispatcher.waitForMoveCompletion() unblock and return false. But the caller, Dispatcher.dispatchBlockMoves(), does not handle the failure return code from waitForMoveCompletion(). Is that safe?
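
          The effect of the two added lines can be sketched with a simplified model. NodeModel and dispatch are hypothetical, and the assumption that a move is registered on both the target and the proxy source before dispatch is inferred only from the two removals in the patch.

```java
import java.util.HashSet;
import java.util.Set;

// Simplified, hypothetical model of the patched skip path. NOT the real Dispatcher code.
class PatchedSkipSketch {

  // Stand-in for a DataNode's pending-move bookkeeping (names are assumptions).
  static final class NodeModel {
    private final Set<String> pending = new HashSet<>();

    synchronized void addPendingBlock(String blockId) {
      pending.add(blockId);
    }

    synchronized boolean removePendingBlock(String blockId) {
      return pending.remove(blockId);
    }

    synchronized boolean isPendingQEmpty() {
      return pending.isEmpty();
    }
  }

  // executorAvailable stands in for the "moveExecutor == null" check in the patch.
  static void dispatch(NodeModel target, NodeModel proxySource,
                       String blockId, boolean executorAvailable) {
    // Assumption: the move is registered on both nodes before dispatch.
    target.addPendingBlock(blockId);
    proxySource.addPendingBlock(blockId);

    if (!executorAvailable) {
      System.out.println("No mover threads available: skip moving " + blockId);
      // Patched behaviour: undo the registration so waiters see an empty queue.
      target.removePendingBlock(blockId);
      proxySource.removePendingBlock(blockId);
      return;
    }
    // A real dispatcher would submit the move here; completion handling is omitted.
  }

  public static void main(String[] args) {
    NodeModel target = new NodeModel();
    NodeModel proxySource = new NodeModel();
    dispatch(target, proxySource, "blk_1", false);
    System.out.println("targetEmpty=" + target.isPendingQEmpty()
        + " proxyEmpty=" + proxySource.isPendingQEmpty());
  }
}
```

          In this model both queues are empty after a skipped move, so a loop polling isPendingQEmpty() can finish.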

          zhaoyunjiong yunjiong zhao added a comment -

          Thanks Manoj Govindassamy for reviewing this issue.
          Regarding

          the caller Dispatcher.dispatchBlockMoves() is not handling the failure return code from waitForMoveCompletion()

          I think it is safe, because a few DataNode failures during balancing should be OK.

          manojg Manoj Govindassamy added a comment -

          Thanks yunjiong zhao. In Dispatcher#dispatchBlockMoves, even though the failure return code from waitForMoveCompletion is not handled, the method returns the total bytes moved, and the no-bytes-moved case is handled by the caller. LGTM. +1 (non-binding).
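
          As an illustration of why an unhandled failure flag can still be safe, here is one way a caller could react to the bytes-moved count alone. This is a hypothetical model, not the Balancer's actual loop: runUntilIdle and maxIdleIterations are names invented for this sketch.

```java
// Hypothetical caller-side loop: stop once too many consecutive iterations move nothing.
class IdleIterationSketch {

  // Returns how many iterations ran before the idle limit was hit
  // (or the input was exhausted).
  static int runUntilIdle(long[] bytesMovedPerIteration, int maxIdleIterations) {
    int idle = 0;
    int ran = 0;
    for (long moved : bytesMovedPerIteration) {
      ran++;
      if (moved > 0) {
        idle = 0;   // progress was made, reset the idle counter
      } else {
        idle++;     // nothing moved, e.g. every move was skipped
      }
      if (idle >= maxIdleIterations) {
        break;      // give up instead of spinning forever
      }
    }
    return ran;
  }

  public static void main(String[] args) {
    long[] moved = {1024L, 0L, 0L, 0L, 512L};
    System.out.println("iterations=" + runUntilIdle(moved, 3));
  }
}
```

          With the sample input the loop stops after the fourth iteration, since three consecutive iterations moved zero bytes.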

          linyiqun Yiqun Lin added a comment -

          Thanks yunjiong zhao for reporting this and thanks Manoj Govindassamy for the review. It's a nice finding. The patch also looks good to me. One minor comment: can you also remove the unused variable MAX_NO_PENDING_MOVE_ITERATIONS in Dispatcher? This hardcoded value has been replaced by the -idleiterations option. +1 once addressed.

          zhaoyunjiong yunjiong zhao added a comment - - edited

          Removed unused variable MAX_NO_PENDING_MOVE_ITERATIONS.
          Thanks Yiqun Lin for your time.

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 18s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          -1 test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
          +1 mvninstall 13m 31s trunk passed
          +1 compile 0m 51s trunk passed
          +1 checkstyle 0m 28s trunk passed
          +1 mvnsite 0m 57s trunk passed
          +1 mvneclipse 0m 12s trunk passed
          +1 findbugs 1m 49s trunk passed
          +1 javadoc 0m 41s trunk passed
          +1 mvninstall 0m 51s the patch passed
          +1 compile 0m 46s the patch passed
          +1 javac 0m 46s the patch passed
          +1 checkstyle 0m 28s the patch passed
          +1 mvnsite 0m 58s the patch passed
          +1 mvneclipse 0m 12s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 1m 58s the patch passed
          +1 javadoc 0m 41s the patch passed
          -1 unit 112m 25s hadoop-hdfs in the patch failed.
          +1 asflicense 0m 25s The patch does not generate ASF License warnings.
          138m 56s



          Reason Tests
          Failed junit tests hadoop.hdfs.server.namenode.TestNameNodeMetadataConsistency
            hadoop.hdfs.TestDFSClientExcludedNodes
            hadoop.hdfs.server.namenode.ha.TestHAAppend
          Timed out junit tests org.apache.hadoop.hdfs.server.datanode.TestDataNodeVolumeFailureReporting
            org.apache.hadoop.hdfs.server.datanode.TestDataNodeVolumeFailure



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:a9ad5d6
          JIRA Issue HDFS-11377
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12850459/HDFS-11377.002.patch
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux b53c73150f1c 3.13.0-96-generic #143-Ubuntu SMP Mon Aug 29 20:15:20 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 59c5f18
          Default Java 1.8.0_121
          findbugs v3.0.0
          unit https://builds.apache.org/job/PreCommit-HDFS-Build/18307/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
          Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/18307/testReport/
          modules C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
          Console output https://builds.apache.org/job/PreCommit-HDFS-Build/18307/console
          Powered by Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          linyiqun Yiqun Lin added a comment -

          LGTM, +1. Thanks yunjiong zhao for updating the patch. I will commit it on the weekend, in case someone has other comments on this.

          linyiqun Yiqun Lin added a comment -

          The remove operation should be safe since the method removePendingBlock is synchronized. The failed tests are not related. Committed to trunk and branch-2. Thanks yunjiong zhao for the contribution and thanks Manoj Govindassamy for the review!

          hudson Hudson added a comment -

          SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #11213 (See https://builds.apache.org/job/Hadoop-trunk-Commit/11213/)
          HDFS-11377. Balancer hung due to no available mover threads. Contributed (yqlin: rev 9cbbd1eae893b21212c9bc9e6745c6859317a667)

          • (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/balancer/Dispatcher.java
          jojochuang Wei-Chiu Chuang added a comment -

          The relevant code was introduced by HDFS-8818, which was included in 2.7.x, 2.8.x, 2.9.0 and 3.0.0
          Shouldn't this be backported into 2.7.x and 2.8.x?

          shv Konstantin Shvachko added a comment -

          Merged this into branch-2.8 and branch-2.7. Changing fix version.

          brahmareddy Brahma Reddy Battula added a comment -

          Konstantin Shvachko can you please update the CHANGES.txt in branch-2.7?


            People

            • Assignee: yunjiong zhao (zhaoyunjiong)
            • Reporter: yunjiong zhao (zhaoyunjiong)
            • Votes: 0
            • Watchers: 16
