  1. Hadoop HDFS
  2. HDFS-11377

Balancer hung due to no available mover threads

Details

    Description

      When running the Balancer on a large cluster with more than 3000 DataNodes, it may hang while logging "No mover threads available".
      The stack trace shows the main thread waiting forever, as below.

      "main" #1 prio=5 os_prio=0 tid=0x00007ff6cc014800 nid=0x6b2c waiting on condition [0x00007ff6d1bad000]
         java.lang.Thread.State: TIMED_WAITING (sleeping)
              at java.lang.Thread.sleep(Native Method)
              at org.apache.hadoop.hdfs.server.balancer.Dispatcher.waitForMoveCompletion(Dispatcher.java:1043)
              at org.apache.hadoop.hdfs.server.balancer.Dispatcher.dispatchBlockMoves(Dispatcher.java:1017)
              at org.apache.hadoop.hdfs.server.balancer.Dispatcher.dispatchAndCheckContinue(Dispatcher.java:981)
              at org.apache.hadoop.hdfs.server.balancer.Balancer.runOneIteration(Balancer.java:611)
              at org.apache.hadoop.hdfs.server.balancer.Balancer.run(Balancer.java:663)
              at org.apache.hadoop.hdfs.server.balancer.Balancer$Cli.run(Balancer.java:776)
              at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
              at org.apache.hadoop.hdfs.server.balancer.Balancer.main(Balancer.java:905)
      

      The log contains many WARN messages about "No mover threads available".

      2017-01-26 15:36:40,085 WARN org.apache.hadoop.hdfs.server.balancer.Dispatcher: No mover threads available: skip moving blk_13700554102_1112815018180 with size=268435456 from 10.115.67.137:50010:DISK to 10.140.21.55:50010:DISK through 10.115.67.137:50010
      2017-01-26 15:36:40,085 WARN org.apache.hadoop.hdfs.server.balancer.Dispatcher: No mover threads available: skip moving blk_4009558842_1103118359883 with size=268435456 from 10.115.67.137:50010:DISK to 10.140.21.55:50010:DISK through 10.115.67.137:50010
      2017-01-26 15:36:40,085 WARN org.apache.hadoop.hdfs.server.balancer.Dispatcher: No mover threads available: skip moving blk_13881956058_1112996460026 with size=133509566 from 10.115.67.137:50010:DISK to 10.140.21.55:50010:DISK through 10.115.67.36:50010

      What happens here is that, when no mover threads are available, DDatanode.isPendingQEmpty() keeps returning false, so the wait in Dispatcher.waitForMoveCompletion (see the stack trace above) never finishes and the Balancer hangs.
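
The failure mode described above can be reproduced in a small, self-contained Java sketch. This is not the Hadoop Dispatcher code and not necessarily what the attached patches do: the class PendingQueueSketch, its run method, and the block names are hypothetical, and the single-thread pool only stands in for an exhausted mover pool. The sketch assumes that a move is recorded as pending before it is handed to the executor, and that a rejected move is only forgotten if the caller removes it explicitly.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical model of a pending-move queue; not the actual Hadoop classes.
class PendingQueueSketch {

    /**
     * Submits two "moves" to a pool that can only run one at a time.
     * Returns true if the pending queue drains within 500 ms, false if it never drains.
     */
    static boolean run(boolean removeOnReject) {
        Queue<String> pending = new ConcurrentLinkedQueue<>();
        // One worker and no buffering: a second submission is rejected while the first runs.
        ThreadPoolExecutor movers = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.SECONDS, new SynchronousQueue<Runnable>());
        CountDownLatch release = new CountDownLatch(1);
        try {
            for (String block : new String[] {"blk_1", "blk_2"}) {
                pending.add(block); // recorded as pending before a thread is known to exist
                try {
                    movers.execute(() -> {
                        try {
                            release.await(); // keep the only mover thread busy
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        } finally {
                            pending.remove(block); // a completed move leaves the queue
                        }
                    });
                } catch (RejectedExecutionException e) {
                    // "No mover threads available": the move is skipped.
                    if (removeOnReject) {
                        pending.remove(block); // possible remedy: forget the skipped move
                    }
                }
            }
            release.countDown(); // let the running move finish

            // Bounded version of a wait-until-pending-queue-is-empty loop.
            long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(500);
            while (!pending.isEmpty()) {
                if (System.nanoTime() - deadline > 0) {
                    return false; // an unbounded loop would wait here forever
                }
                try {
                    Thread.sleep(10);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
            return true;
        } finally {
            movers.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("skipped move left pending, drained = " + run(false));
        System.out.println("skipped move removed, drained = " + run(true));
    }
}
```

In this model, run(false) leaves the rejected move in the pending queue, so the wait loop only ends because of the 500 ms bound added for the demonstration, while run(true) drains normally once the running move completes.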

      Attachments

        1. HDFS-11377.002.patch
          1 kB
          yunjiong zhao
        2. HDFS-11377.001.patch
          0.8 kB
          yunjiong zhao

            People

              zhaoyunjiong yunjiong zhao