HBase / HBASE-14309

Allow load balancer to operate when there are regions in transition by adding a force flag


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.3.0, 2.0.0
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed
    • Release Note:
      This issue adds a boolean parameter, force, to the 'balancer' command so that an admin can force region balancing even when regions (other than hbase:meta) are in transition, on the assumption that the RIT state is transient.
      If hbase:meta is in transition, the balancer command returns false.

      WARNING: For experts only. Forcing a balance may do more damage than repair when assignment is confused.
      Note: enclose the force parameter in double quotes.
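
The gating behaviour described above can be sketched as a small, self-contained model. This is not the HBase master code: the class `ForceBalanceSketch`, the method `mayBalance`, and the representation of regions in transition as a list of table names are all hypothetical, chosen only to illustrate the rule that hbase:meta in transition always blocks balancing, while other regions in transition block it only when force is false.

```java
import java.util.List;

// Illustrative model only; names here are assumptions, not HBase APIs.
public class ForceBalanceSketch {
    static final String META = "hbase:meta";

    /**
     * Returns true if a balance request would be allowed to proceed.
     * tablesWithRit holds the table name of every region in transition.
     */
    static boolean mayBalance(List<String> tablesWithRit, boolean force) {
        // hbase:meta in transition: never balance, even when forced.
        if (tablesWithRit.contains(META)) {
            return false;
        }
        // Other regions in transition: balance only when forced.
        if (!tablesWithRit.isEmpty() && !force) {
            return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(mayBalance(List.of(), false));          // no RIT
        System.out.println(mayBalance(List.of("t1"), false));      // RIT, not forced
        System.out.println(mayBalance(List.of("t1"), true));       // RIT, forced
        System.out.println(mayBalance(List.of(META, "t1"), true)); // meta in transition
    }
}
```

The model covers only the region-in-transition check; a real balancer run can also decline for unrelated reasons, which are outside the scope of this issue.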

    Description

      This issue adds a boolean parameter, force, to the 'balancer' command so that an admin can force region balancing even when regions are in transition, on the assumption that the RIT state is transient.

      This enhancement was requested by a customer.

      The assumption of this change is that the operator has run hbck and has a reasonable idea why regions are stuck in transition before using the force flag.

      In a recent incident at the customer, a cluster ended up with a small number of regionservers hosting most of the regions on the cluster (one regionserver had 50% of the roughly 20,000 regions). The balancer couldn't be run because a small number of regions were stuck in transition. The admin ended up killing the regionservers so that reassignment would yield a more equitable distribution of the regions.

      On a different cluster, a single store file had corrupt HDFS blocks (the SSDs on the cluster were known to lose data). Because this one region (out of tens of thousands of regions on the cluster) was stuck in transition, the balancer couldn't run.
      State keeping in HBase is not yet good enough for the admin to kick off the balancer automatically in such scenarios, knowing when it is safe to do so and when it is not. Until then, having this option available for the operator to use as he or she sees fit seems prudent.

      Attachments

        1. 14309-branch-1.1.txt
          36 kB
          Ted Yu
        2. 14309-v1.txt
          41 kB
          Ted Yu
        3. 14309-v2.txt
          40 kB
          Ted Yu
        4. 14309-v3.txt
          40 kB
          Ted Yu
        5. 14309-v4.txt
          40 kB
          Ted Yu
        6. 14309-v5.txt
          40 kB
          Ted Yu
        7. 14309-v5.txt
          40 kB
          Ted Yu
        8. 14309-v5-branch-1.txt
          39 kB
          Ted Yu
        9. 14309-v6.txt
          41 kB
          Ted Yu
        10. 14309-v7.txt
          41 kB
          Ted Yu
        11. 14309-v7-branch-1.txt
          41 kB
          Ted Yu

          People

            Assignee: Ted Yu (yuzhihong@gmail.com)
            Reporter: Ted Yu (yuzhihong@gmail.com)
            Votes: 0
            Watchers: 8
