  Hadoop HDFS / HDFS-8825 Enhancements to Balancer / HDFS-8278

HDFS Balancer should consider remaining storage % when checking for under-utilized machines

    Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.8.0
    • Fix Version/s: 2.8.0, 3.0.0-alpha1
    • Component/s: balancer & mover
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      The HDFS Balancer mistakenly identifies a node with very little storage space remaining as "under-utilized" and tries to move large amounts of data to it.

      None of these block moves succeed, because for this node the utilization percentage matters far less than the DFS remaining storage.
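To illustrate the percentage-only view, here is a minimal Python sketch that classifies a node by comparing its utilization with the cluster average plus or minus a threshold. It is loosely modelled on the Balancer's over-/under-utilized groups. The function name, the 10% default threshold and the 85% cluster average are assumptions for illustration. The average in particular is not taken from this cluster.

```python
# Hypothetical sketch: percentage-only classification of a DataNode,
# loosely modelled on the Balancer's utilization groups.

def classify(used_pct: float, avg_pct: float, threshold_pct: float = 10.0) -> str:
    """Classify a node purely by its utilization percentage.

    Free space is never consulted here, which is the flaw described
    in this issue.
    """
    if used_pct > avg_pct + threshold_pct:
        return "over-utilized"
    if used_pct > avg_pct:
        return "above-average"
    if used_pct >= avg_pct - threshold_pct:
        return "below-average"
    return "under-utilized"


# DFS Used% of 172.19.1.46 from the report in this issue; the 85% cluster
# average is a hypothetical value chosen for illustration.
print(classify(used_pct=73.62, avg_pct=85.0))  # under-utilized
```

The node is labelled a target even though, as the report below shows, it has almost no room left.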

      15/04/24 04:25:55 INFO balancer.Balancer: 0 over-utilized: []
      15/04/24 04:25:55 INFO balancer.Balancer: 1 underutilized: [172.19.1.46:50010:DISK]
      15/04/24 04:25:55 INFO balancer.Balancer: Need to move 47.68 GB to make the cluster balanced.
      15/04/24 04:25:55 INFO balancer.Balancer: Decided to move 413.08 MB bytes from 172.19.1.52:50010:DISK to 172.19.1.46:50010:DISK
      15/04/24 04:25:55 INFO balancer.Balancer: Will move 413.08 MB in this iteration
      15/04/24 04:25:55 WARN balancer.Dispatcher: Failed to move blk_1078689321_1099517353638 with size=131146 from 172.19.1.52:50010:DISK to 172.19.1.46:50010:DISK through 172.19.1.53:50010: Got error, status message opReplaceBlock BP-942051088-172.18.1.41-1370508013893:blk_1078689321_1099517353638 received exception org.apache.hadoop.util.DiskChecker$DiskOutOfSpaceException: Out of space: The volume with the most available space (=225042432 B) is less than the block size (=268435456 B)., block move is failed
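The DiskOutOfSpaceException above is raised while placing the replica on the target node, which needs a single volume with room for a whole block. The simplified first-fit sketch below is not Hadoop's volume-choosing code, and its names are hypothetical. It only shows why a node whose DFS Remaining is non-zero can still reject a block when no individual volume has 268435456 B free. Note that the check in the log uses the 256 MB block size, not the 131146-byte size of the block being moved.

```python
# Simplified, hypothetical sketch of a per-volume space check on the
# receiving DataNode; Hadoop's real volume-choosing policy differs.

class DiskOutOfSpaceError(Exception):
    """Stand-in for org.apache.hadoop.util.DiskChecker$DiskOutOfSpaceException."""


def choose_volume(volume_free_bytes: list[int], block_size: int) -> int:
    """Return the index of the first volume that can hold block_size bytes."""
    for i, free in enumerate(volume_free_bytes):
        if free >= block_size:
            return i
    most = max(volume_free_bytes, default=0)
    raise DiskOutOfSpaceError(
        f"The volume with the most available space (={most} B) "
        f"is less than the block size (={block_size} B)."
    )


# The largest free volume (225042432 B) comes from the log; the second
# volume's free space is a hypothetical value.
try:
    choose_volume([225_042_432, 150_000_000], block_size=268_435_456)
except DiskOutOfSpaceError as err:
    print(err)
```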
      

      The machine in question is under-utilized in terms of block pool (BP) utilization, but has very little free space available for new blocks.

      Decommission Status : Normal
      Configured Capacity: 3826907185152 (3.48 TB)
      DFS Used: 2817262833664 (2.56 TB)
      Non DFS Used: 1000621305856 (931.90 GB)
      DFS Remaining: 9023045632 (8.40 GB)
      DFS Used%: 73.62%
      DFS Remaining%: 0.24%
      Configured Cache Capacity: 8589934592 (8 GB)
      Cache Used: 0 (0 B)
      Cache Remaining: 8589934592 (8 GB)
      Cache Used%: 0.00%
      Cache Remaining%: 100.00%
      Xceivers: 3
      Last contact: Fri Apr 24 04:28:36 PDT 2015
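The report's figures can be checked directly. The short Python snippet below uses only the byte counts printed above. It shows that DFS Used, Non DFS Used and DFS Remaining add up exactly to the configured capacity. It also shows that about 26% of the node is taken by non-DFS data, which a utilization figure based only on DFS data does not reveal.

```python
# Byte counts copied from the report for 172.19.1.46 above.
capacity = 3_826_907_185_152
dfs_used = 2_817_262_833_664
non_dfs_used = 1_000_621_305_856
dfs_remaining = 9_023_045_632

# The three components account for the whole configured capacity.
assert dfs_used + non_dfs_used + dfs_remaining == capacity


def pct(part: int, whole: int) -> float:
    """Percentage of whole, rounded to two decimals like the report."""
    return round(100.0 * part / whole, 2)


print(pct(dfs_used, capacity))       # 73.62 (DFS Used%)
print(pct(non_dfs_used, capacity))   # 26.15 (not counted as DFS usage)
print(pct(dfs_remaining, capacity))  # 0.24  (DFS Remaining%)
```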
      

      The node has only 8.40 GB of DFS storage remaining (0.24% of its capacity), and even its emptiest volume has less free space than a single 256 MB block, so it is futile to attempt to move any blocks to that machine.

      The same problem arises when a machine loses disks, because utilization is always compared as a per-node percentage. In that scenario as well, the amount of data moved to a node needs to be capped by its "DFS Remaining %".

      Trying to move any more data than that to a given node will always fail.
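Below is a minimal sketch of the cap described above. The number of bytes scheduled to a receiving node is limited by its DFS Remaining space as well as by the utilization gap. The helper name, its parameters and the 10 GB per-iteration limit are hypothetical, and this is not the committed patch (h8278_20150817.patch).

```python
def max_bytes_to_move(capacity: int, remaining: int,
                      utilization_diff_pct: float, max_per_iteration: int) -> int:
    """Hypothetical helper: bytes the balancer may schedule for one node.

    utilization_diff_pct is the node's utilization minus the cluster
    average, in percent; a negative value means the node receives data.
    """
    size = int(abs(utilization_diff_pct) * capacity / 100)
    if utilization_diff_pct < 0:
        # A receiving node can never take more than its DFS Remaining bytes.
        size = min(size, remaining)
    return min(size, max_per_iteration)


capacity = 3_826_907_185_152   # Configured Capacity of 172.19.1.46
remaining = 9_023_045_632      # DFS Remaining of 172.19.1.46
ten_gb = 10 * 1024**3          # hypothetical per-iteration limit

# 73.62% used against a hypothetical 85% cluster average.
print(max_bytes_to_move(capacity, remaining, 73.62 - 85.0, ten_gb))  # 9023045632
```

Without the remaining-space check, the same call would schedule the full 10 GB limit, which is more than the node can accept.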

        Attachments

        1. h8278_20150817.patch (3 kB, Tsz Wo Nicholas Sze)


              People

              • Assignee: Tsz Wo Nicholas Sze (szetszwo)
              • Reporter: Gopal V (gopalv)
              • Votes: 0
              • Watchers: 6
