HBASE-4191

hbase load balancer needs locality awareness

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Balancer
    • Labels: None

      Description

      Previously, HBASE-4114 implemented the metrics for HFile HDFS block locality, which provide HFile-level locality information.
      But in order to work with the load balancer and region assignment, we need region-level locality information.

      Let's define the region locality information first; it is almost the same as the HFile locality index.

      HRegion locality index (HRegion A, RegionServer B) =
      (number of HDFS blocks of HRegion A that can be retrieved locally by RegionServer B) / (total number of HDFS blocks of HRegion A)
      So the HRegion locality index tells us how much locality we would get if the HMaster assigned HRegion A to RegionServer B.
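The definition above can be illustrated with a minimal sketch. The block-location map, server names, and function name below are made up for the example; in HBase the per-block replica locations would come from the HDFS block distribution.

```python
def region_locality_index(block_locations, region_server):
    """Fraction of a region's HDFS blocks with a replica local to region_server.

    block_locations maps each HDFS block id of the region to the set of
    hosts holding a replica (hypothetical names for illustration).
    """
    if not block_locations:
        return 0.0
    local = sum(1 for hosts in block_locations.values() if region_server in hosts)
    return local / len(block_locations)

# Example: 3 of HRegion A's 4 blocks have a replica on RegionServer "rsB".
locations = {
    "blk_1": {"rsB", "rsC", "rsD"},
    "blk_2": {"rsB", "rsD", "rsE"},
    "blk_3": {"rsC", "rsD", "rsE"},
    "blk_4": {"rsB", "rsC", "rsE"},
}
print(region_locality_index(locations, "rsB"))  # 0.75
```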

      Assigning regions based on locality involves two steps:
      1) At cluster startup, the master scans HDFS to calculate the "HRegion locality index" for each pair of HRegion and region server. Scanning the DFS is expensive, so we only need to do this once, at startup.

      2) At cluster run time, each region server updates the "HRegion locality index" periodically as a metric, as HBASE-4114 does. The region server can expose it to the master through ZK, the meta table, or plain RPC messages.

      Based on the "HRegion locality index", the assignment manager in the master would have global knowledge of the region locality distribution and could run a MIN COST MAXIMUM FLOW solver to reach a global optimum.

      Let's construct the graph first:
      [Graph]
      Imagine a bipartite graph where the left side is the set of regions and the right side is the set of region servers.
      There is a source node which links itself to each node in the region set.
      There is a sink node which is linked from each node in the region server set.

      [Capacity]
      The capacity between the source node and region nodes is 1.
      And the capacity between the region nodes and region server nodes is also 1.
      (This ensures each region can be assigned to only ONE region server at a time.)

      The capacity between each region server node and the sink node is the average number of regions that should be assigned to each region server.
      (This balances the load across region servers.)

      [Cost]
      The cost of the edge between each region and region server is the negated locality index: the higher the locality of region A on region server B, the lower the cost of assigning A to B.
      The cost function could become more sophisticated as we take more metrics into account.

      So after running the min-cost max-flow solver, the master could assign the regions according to the globally optimal locality.
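To make the construction above concrete, here is a sketch: a small successive-shortest-path min-cost max-flow solver applied to a toy cluster of 4 regions and 2 region servers. The solver choice, all names, and the toy locality matrix are illustrative assumptions, not the actual HBase implementation.

```python
def add_edge(graph, u, v, cap, cost):
    """Add a forward edge and its zero-capacity reverse edge (residual graph)."""
    graph[u].append([v, cap, cost, len(graph[v])])
    graph[v].append([u, 0, -cost, len(graph[u]) - 1])

def min_cost_max_flow(n, graph, source, sink):
    """Successive shortest paths; Bellman-Ford handles the negative edge costs."""
    flow, cost = 0, 0.0
    while True:
        dist = [float("inf")] * n
        dist[source] = 0.0
        parent = [None] * n
        for _ in range(n - 1):
            updated = False
            for u in range(n):
                if dist[u] == float("inf"):
                    continue
                for i, (v, cap, c, _) in enumerate(graph[u]):
                    if cap > 0 and dist[u] + c < dist[v]:
                        dist[v] = dist[u] + c
                        parent[v] = (u, i)
                        updated = True
            if not updated:
                break
        if dist[sink] == float("inf"):
            return flow, cost          # no augmenting path left
        # Find the bottleneck capacity along the shortest path, then push.
        push, v = float("inf"), sink
        while v != source:
            u, i = parent[v]
            push = min(push, graph[u][i][1])
            v = u
        v = sink
        while v != source:
            u, i = parent[v]
            graph[u][i][1] -= push
            graph[v][graph[u][i][3]][1] += push
            cost += push * graph[u][i][2]
            v = u
        flow += push

# Toy locality[r][s]: HRegion locality index of region r on server s (made up).
locality = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.2, 0.6]]
regions, servers = 4, 2
source, sink = 0, 1 + regions + servers
graph = [[] for _ in range(sink + 1)]
for r in range(regions):
    add_edge(graph, source, 1 + r, 1, 0.0)        # each region assigned once
    for s in range(servers):
        add_edge(graph, 1 + r, 1 + regions + s, 1, -locality[r][s])
for s in range(servers):
    # Capacity to the sink = avg regions per server, to balance the load.
    add_edge(graph, 1 + regions + s, sink, regions // servers, 0.0)

flow, cost = min_cost_max_flow(sink + 1, graph, source, sink)
print(flow, round(-cost, 6))  # 4 regions assigned, total locality 3.0
```

Here the solver places regions 0 and 1 on server 0 and regions 2 and 3 on server 1, which maximizes total locality while keeping each server at the average load.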

      Also, the master should share this global view with the secondary master in case a master failover happens.
      In addition, HBASE-4491 (Locality Checker) is a tool, based on the same metrics, that proactively scans the DFS to calculate the global locality information in the cluster. It will help us verify data locality information at run time.

        Issue Links

        There are no Sub-Tasks for this issue.

          Activity

          Liyin Tang added a comment -

          Hi Ted, have you started working on this?
          I have a similar feature to do.

          Ted Yu added a comment -

          @Liyin:
          Go ahead with your implementation.

          Karthik Ranganathan added a comment -

          I think we should also go one step further and store the preferred assignments in a column-family in META. That way, upon restarts the locality is automatically preserved.

          Ted Yu added a comment -

          Currently assignAllUserRegions() reads "hbase.master.startup.retainassign" and, if the value is true, balancer.retainAssignment() is called.
          It looks like a new config parameter should be introduced, e.g. "hbase.master.startup.policy", to accommodate the above and future enhancements.

          Liyin Tang added a comment -

          I have updated the description of the task for open discussion.
          Any comments are very welcome.

          Ted Yu added a comment -

          I am not sure about the scanning in proposal #1. Master startup should be quick.

          The MAXIMUM FLOW solver is a good fit for the HBase Locality Checker. For dynamic load balancing, we may not want to reassign a large number of regions, which would be disruptive to scanner performance.

          I think HBASE-4491 can be implemented first. Its output can be readily used by master startup.

          Mikhail Bautin added a comment -

          @Ted: could you please elaborate on how you express the region assignment problem as a Max Flow problem? If we define the "cost" of assigning a region to a server based on locality, and define a constraint of "load balancedness" to be such that each regionserver is assigned no more than approximately ceil(numRegions / numServers) + C regions for some small value of C, then I can see how the problem becomes a min-cost max flow (http://en.wikipedia.org/wiki/Minimum_cost_flow_problem). However, I don't see how we could reduce the assignment problem to the max-flow problem directly (http://en.wikipedia.org/wiki/Maximum_flow_problem).

          Ted Yu added a comment -

          Liyin modified JIRA description and put down max-flow problem.
          Original description didn't mention it.

          Liyin Tang added a comment -

          As Mikhail points out, the problem should be a min-cost flow problem instead of a max-flow problem. I have updated the description to provide more details.

          Karthik Ranganathan added a comment -

          A couple of initial thoughts on things we would need to consider:

          1. I think there should be a weight for node-locality, rack-locality and cross-rack reads while computing the flow.
          2. Also, I think we need one more constraint - we want the final state to have roughly the same number of regions per regionserver (within the slop). So we would need a decision tree or some such.

          Ted Yu added a comment -

          @Liyin:
          Thanks for formulating the requirement.
          I think the description should be modified after consensus is reached. In the meantime, feel free to make comments in this JIRA.
          I think the following goals may not be achieved at the same time:
          1. maximum (node/rack) locality
          2. the same number of regions on each live region server
          As Karthik said, slop is an important factor in decision making.

          gaojinchao added a comment -

          @Liyin
          This is a good feature. How is your progress on it now?

          Elliott Clark added a comment -

          It seems like the stochastic load balancer gives HBase the locality awareness when balancing.

          stack added a comment -

          What do you think of Liyin's costing vs what you have in the Stochastic balancer, Elliott? (Do you think the HRegion#computeHDFSBlocksDistribution call will happen often? Seems like its value is cached for a period of time.)


            People

            • Assignee: Liyin Tang
            • Reporter: Ted Yu
            • Votes: 1
            • Watchers: 22
