Uploaded image for project: 'HBase'
  1. HBase
  2. HBASE-3609

Improve the selection of regions to balance; part 2

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.92.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      Balancer improvements over the previous purely random assignment balancer; distributes evenly from oldest and newest regions and does round robin loading a newly added server.

      Description

      See 'HBASE-3586 Improve the selection of regions to balance' for discussion of algorithms that improve on current random assignment.

      1. 3609-double-alternation.txt
        17 kB
        Ted Yu
      2. 3609-empty-RS.txt
        14 kB
        Ted Yu
      3. hbase-3609-by-region-age.txt
        5 kB
        Ted Yu
      4. hbase-3609.txt
        3 kB
        Ted Yu

        Issue Links

          Activity

          Hide
          lars_francke Lars Francke added a comment -

          This issue was closed as part of a bulk closing operation on 2015-11-20. All issues that have been resolved and where all fixVersions have been released have been closed (following discussions on the mailing list).

          Show
          lars_francke Lars Francke added a comment - This issue was closed as part of a bulk closing operation on 2015-11-20. All issues that have been resolved and where all fixVersions have been released have been closed (following discussions on the mailing list).
          Hide
          stack stack added a comment -

          Committed. Thanks for the patch Ted Yu

          Show
          stack stack added a comment - Committed. Thanks for the patch Ted Yu
          Hide
          stack stack added a comment -

          Marking patch available

          Show
          stack stack added a comment - Marking patch available
          Hide
          stack stack added a comment -

          I am +1 on committing. The doc is good. Will wait to commit for a day or so in case others have opinion on this patch. If you get feedback from others on how this balance change works for them, include it in the issue Ted. Good stuff.

          Show
          stack stack added a comment - I am +1 on committing. The doc is good. Will wait to commit for a day or so in case others have opinion on this patch. If you get feedback from others on how this balance change works for them, include it in the issue Ted. Good stuff.
          Hide
          yuzhihong@gmail.com Ted Yu added a comment -

          Added detailed javadoc for balanceCluster() describing what this JIRA has done.

          Show
          yuzhihong@gmail.com Ted Yu added a comment - Added detailed javadoc for balanceCluster() describing what this JIRA has done.
          Hide
          stack stack added a comment -

          @Ted I'm game for committing this to branch if you add javadoc explaining your new balance algorithm; the rules you've implemented. Boil down your blog and use that. I'm uneasy about the lack of unit test but if its too hard to do, I think its OK getting the balancer changes in now into TRUNK before we start in trying to harden 0.92.

          Good work.

          Show
          stack stack added a comment - @Ted I'm game for committing this to branch if you add javadoc explaining your new balance algorithm; the rules you've implemented. Boil down your blog and use that. I'm uneasy about the lack of unit test but if its too hard to do, I think its OK getting the balancer changes in now into TRUNK before we start in trying to harden 0.92. Good work.
          Hide
          yuzhihong@gmail.com Ted Yu added a comment -
          Show
          yuzhihong@gmail.com Ted Yu added a comment - A brief summary can be found here: http://zhihongyu.blogspot.com/2011/04/load-balancer-in-hbase-090.html
          Hide
          yuzhihong@gmail.com Ted Yu added a comment -

          Made changes to utilize the randomizer which shuffles the list of underloaded region servers.
          This way we can avoid distributing offloaded regions to few region servers.

          Show
          yuzhihong@gmail.com Ted Yu added a comment - Made changes to utilize the randomizer which shuffles the list of underloaded region servers. This way we can avoid distributing offloaded regions to few region servers.
          Hide
          yuzhihong@gmail.com Ted Yu added a comment -

          From Stan Barton who helps me experiment with my changes:

          There is no easy
          way to check how many regions are assigned to particular RS, so will
          probably need to write some small parser to prove that.

          I think we should backport HBASE-3704 (at least Regions by Region Server) to 0.90.3 so that people can easily tell how (un)even the load is distributed.

          Show
          yuzhihong@gmail.com Ted Yu added a comment - From Stan Barton who helps me experiment with my changes: There is no easy way to check how many regions are assigned to particular RS, so will probably need to write some small parser to prove that. I think we should backport HBASE-3704 (at least Regions by Region Server) to 0.90.3 so that people can easily tell how (un)even the load is distributed.
          Hide
          yuzhihong@gmail.com Ted Yu added a comment -

          Removed the changes in Master.
          The boolean flag can be computed in the first loop of balanceCluster()

          Show
          yuzhihong@gmail.com Ted Yu added a comment - Removed the changes in Master. The boolean flag can be computed in the first loop of balanceCluster()
          Hide
          stack stack added a comment -

          Empty server can be detected within balanceCluster(). But this detection has been performed by Master, hence the flag.

          Thats fair. Its nice having the balance invocation method simple as possible though.

          The static regionId helps make each region Id unique. I actually utilized this fact to debug my code.

          I missed that it was being used.

          Preliminary response from Stan Barton showed improvement over random selector.

          I am waiting for further feedback from gaojinchao@huawei.com and Stan.

          I think that if you get good feedback from others, that'll help getting this patch committed.

          Good stuff Ted.

          Show
          stack stack added a comment - Empty server can be detected within balanceCluster(). But this detection has been performed by Master, hence the flag. Thats fair. Its nice having the balance invocation method simple as possible though. The static regionId helps make each region Id unique. I actually utilized this fact to debug my code. I missed that it was being used. Preliminary response from Stan Barton showed improvement over random selector. I am waiting for further feedback from gaojinchao@huawei.com and Stan. I think that if you get good feedback from others, that'll help getting this patch committed. Good stuff Ted.
          Hide
          yuzhihong@gmail.com Ted Yu added a comment -

          Empty server can be detected within balanceCluster(). But this detection has been performed by Master, hence the flag.

          The static regionId helps make each region Id unique. I actually utilized this fact to debug my code.

          Let me think more about writing unit test(s) that verifies the distribution over underloaded servers.

          Both TestLoadBalancer and TestAdmin pass.

          Preliminary response from Stan Barton showed improvement over random selector.
          I am waiting for further feedback from gaojinchao@huawei.com and Stan.

          Show
          yuzhihong@gmail.com Ted Yu added a comment - Empty server can be detected within balanceCluster(). But this detection has been performed by Master, hence the flag. The static regionId helps make each region Id unique. I actually utilized this fact to debug my code. Let me think more about writing unit test(s) that verifies the distribution over underloaded servers. Both TestLoadBalancer and TestAdmin pass. Preliminary response from Stan Barton showed improvement over random selector. I am waiting for further feedback from gaojinchao@huawei.com and Stan.
          Hide
          stack stack added a comment -

          Ted:

          Why do we pass the emptyRS flag? Didn't we just insert an HServerInfo for the server with not regions into the assignments Map? Isn't an HSI that has an empty array of Regions enough of a flag such that you don't need this extra boolean?

          Is this used?

          +  static int regionId = 0;
          

          The new javadoc helps.

          Otherwise patch looks good Ted. It seems to be an ornamentation on what we had previous; rather than taking regionservers at random, it alternatively takes the newest and then the oldest off the regionserver – is that right? (Is that explained in the patch? I don't think I can see it). You've also added this enhancement: "Basically I find the new regions and put them on different underloaded servers. Previously one underloaded server would be filled up before the next underloaded server is considered." Any chance of your proving the last enhancement with a unit test?

          All the other load balancer tests pass?

          Thanks.

          Show
          stack stack added a comment - Ted: Why do we pass the emptyRS flag? Didn't we just insert an HServerInfo for the server with not regions into the assignments Map? Isn't an HSI that has an empty array of Regions enough of a flag such that you don't need this extra boolean? Is this used? + static int regionId = 0; The new javadoc helps. Otherwise patch looks good Ted. It seems to be an ornamentation on what we had previous; rather than taking regionservers at random, it alternatively takes the newest and then the oldest off the regionserver – is that right? (Is that explained in the patch? I don't think I can see it). You've also added this enhancement: "Basically I find the new regions and put them on different underloaded servers. Previously one underloaded server would be filled up before the next underloaded server is considered." Any chance of your proving the last enhancement with a unit test? All the other load balancer tests pass? Thanks.
          Hide
          yuzhihong@gmail.com Ted Yu added a comment -

          Removed statements used for debugging.

          Show
          yuzhihong@gmail.com Ted Yu added a comment - Removed statements used for debugging.
          Hide
          yuzhihong@gmail.com Ted Yu added a comment -

          Modified unit test so that regions generated have unique regionIds.
          This is first step toward better verification of balancing results.
          Also fixed a bug according to Stanislav Barton's feedback.

          Show
          yuzhihong@gmail.com Ted Yu added a comment - Modified unit test so that regions generated have unique regionIds. This is first step toward better verification of balancing results. Also fixed a bug according to Stanislav Barton's feedback.
          Hide
          yuzhihong@gmail.com Ted Yu added a comment -

          This patch combines 3609-empty-RS.txt with my earlier enhancement.
          Basically I find the new regions and put them on different underloaded servers. Previously one underloaded server would be filled up before the next underloaded server is considered.

          Show
          yuzhihong@gmail.com Ted Yu added a comment - This patch combines 3609-empty-RS.txt with my earlier enhancement. Basically I find the new regions and put them on different underloaded servers. Previously one underloaded server would be filled up before the next underloaded server is considered.
          Hide
          yuzhihong@gmail.com Ted Yu added a comment -

          Add javadoc for emptyRegionServerPresent.
          Also explained how the new mechanism works.

          Show
          yuzhihong@gmail.com Ted Yu added a comment - Add javadoc for emptyRegionServerPresent. Also explained how the new mechanism works.
          Hide
          yuzhihong@gmail.com Ted Yu added a comment -

          For unit test, the region Ids created by mockClusterServers() have the same value - because all the regions are created at almost the same time.
          It will take a little more time to devise appropriate test case generation and validate the balancing moves.

          For the moment, I still base my improvement on the existing framework where region count plays dominant factor.

          One of the goals of this JIRA is to remove randomness from LoadBalancer so that we can deterministically produce near-optimal balancing actions.
          The new parameter, emptyRegionServerPresent, helps decide whether we should move old and new regions to other servers.

          I will upload a new patch where I describe the above in detail. I even plan to blog about the history of HBASE-3586 and this JIRA.

          To validate my latest patch, I need a little help from community participants. Our use case creates hbase tables frequently in our flow with pre-split regions. Since those regions get round-robin assigned initially, it is not easy to reproduce what Stan experienced.

          Thanks for the review and suggestion, Stack.

          Show
          yuzhihong@gmail.com Ted Yu added a comment - For unit test, the region Ids created by mockClusterServers() have the same value - because all the regions are created at almost the same time. It will take a little more time to devise appropriate test case generation and validate the balancing moves. For the moment, I still base my improvement on the existing framework where region count plays dominant factor. One of the goals of this JIRA is to remove randomness from LoadBalancer so that we can deterministically produce near-optimal balancing actions. The new parameter, emptyRegionServerPresent, helps decide whether we should move old and new regions to other servers. I will upload a new patch where I describe the above in detail. I even plan to blog about the history of HBASE-3586 and this JIRA. To validate my latest patch, I need a little help from community participants. Our use case creates hbase tables frequently in our flow with pre-split regions. Since those regions get round-robin assigned initially, it is not easy to reproduce what Stan experienced. Thanks for the review and suggestion, Stack.
          Hide
          stack stack added a comment -

          Looking at patch, I see removal of testRandomizer. Would be nice if a new test of new functionality took its place.

          I like how you use regionid figuring region age. Would suggest you add a little documentation that you depend on regionid being a timestamp (would also clarify what you are up to here).

          Do you have to do this:

          +      boolean emptyRegionServerPresent) 

          So this algorithm balances exclusively by age? No randomness?

          Would be nice if you described the algo in a comment included in the patch.

          How we know this algo better than what we have?

          Patch looks good otherwise Ted.

          Show
          stack stack added a comment - Looking at patch, I see removal of testRandomizer. Would be nice if a new test of new functionality took its place. I like how you use regionid figuring region age. Would suggest you add a little documentation that you depend on regionid being a timestamp (would also clarify what you are up to here). Do you have to do this: + boolean emptyRegionServerPresent) So this algorithm balances exclusively by age? No randomness? Would be nice if you described the algo in a comment included in the patch. How we know this algo better than what we have? Patch looks good otherwise Ted.
          Hide
          yuzhihong@gmail.com Ted Yu added a comment -

          More background for Stan's cluster:
          There're at least 600 regions on each region server. When 30 new regions were created on the same region server, random selector only chose 3 out of the 30 new regions for reassignment. The other region selection was from inactive (old) regions. This is expected behavior because new and old regions were selected equally probably.

          Basically we traded some optimization for safety of not overloading a newly discovered region server.

          My latest patch avoids the above behavior. At the same time, it also can deal with a region server which just joined the cluster and should be assigned both old and new regions.

          Show
          yuzhihong@gmail.com Ted Yu added a comment - More background for Stan's cluster: There're at least 600 regions on each region server. When 30 new regions were created on the same region server, random selector only chose 3 out of the 30 new regions for reassignment. The other region selection was from inactive (old) regions. This is expected behavior because new and old regions were selected equally probably. Basically we traded some optimization for safety of not overloading a newly discovered region server. My latest patch avoids the above behavior. At the same time, it also can deal with a region server which just joined the cluster and should be assigned both old and new regions.
          Hide
          stack stack added a comment -

          I wonder why all regions of a table are assigned to one regionserver when we added randomization to the balancer in 0.90.2? Didn't we? (i.e. HBASE-3586)

          Show
          stack stack added a comment - I wonder why all regions of a table are assigned to one regionserver when we added randomization to the balancer in 0.90.2? Didn't we? (i.e. HBASE-3586 )
          Hide
          yuzhihong@gmail.com Ted Yu added a comment -

          From Stan Barton:
          Apr 8th:
          I can see that the regionserver that gets all the inserts makes pauses
          time to time - I checked the log and it is because (I assume) the
          regions of other tables from this RS are re-assigned elsewhere. When I
          started inserting, the table was empty, now it has ~20 regions, all
          assigned to one RS, that kicks the scale out property out of the
          system. The insertion process is still running, if you are interested
          in some other info.

          Apr 11th:
          I have downloaded the 0.90.2 candidate but I regret to inform you,
          that it did not help to solve the issue. I am re-doing the tests and
          again, all the newly created regions are being assigned to a single
          RS. It is a real performance killer.

          Show
          yuzhihong@gmail.com Ted Yu added a comment - From Stan Barton: Apr 8th: I can see that the regionserver that gets all the inserts makes pauses time to time - I checked the log and it is because (I assume) the regions of other tables from this RS are re-assigned elsewhere. When I started inserting, the table was empty, now it has ~20 regions, all assigned to one RS, that kicks the scale out property out of the system. The insertion process is still running, if you are interested in some other info. Apr 11th: I have downloaded the 0.90.2 candidate but I regret to inform you, that it did not help to solve the issue. I am re-doing the tests and again, all the newly created regions are being assigned to a single RS. It is a real performance killer.
          Hide
          yuzhihong@gmail.com Ted Yu added a comment -

          This version takes into account whether a new region server has just been discovered by master.

          Show
          yuzhihong@gmail.com Ted Yu added a comment - This version takes into account whether a new region server has just been discovered by master.
          Hide
          yuzhihong@gmail.com Ted Yu added a comment -

          Use latest Guava release.

          Show
          yuzhihong@gmail.com Ted Yu added a comment - Use latest Guava release.
          Hide
          yuzhihong@gmail.com Ted Yu added a comment -

          The goal of 3609-alternate.txt is to balance the 'ages' (reflected by sum of region Ids) of regions across region servers.
          Currently HServerLoad doesn't contain HRegionInfo information. This makes validating the above goal less straightforward.

          Show
          yuzhihong@gmail.com Ted Yu added a comment - The goal of 3609-alternate.txt is to balance the 'ages' (reflected by sum of region Ids) of regions across region servers. Currently HServerLoad doesn't contain HRegionInfo information. This makes validating the above goal less straightforward.
          Hide
          yuzhihong@gmail.com Ted Yu added a comment -

          Since request count (HBASE-3507) is contained in HRegion, not HRegionInfo, there is more work to be done so that decaying moving average of request rate (in Ryan's term) is available to LoadBalancer.
          This can be done in other JIRAs.

          I suggest 3609-alternate.txt be checked in first.

          Show
          yuzhihong@gmail.com Ted Yu added a comment - Since request count ( HBASE-3507 ) is contained in HRegion, not HRegionInfo, there is more work to be done so that decaying moving average of request rate (in Ryan's term) is available to LoadBalancer. This can be done in other JIRAs. I suggest 3609-alternate.txt be checked in first.
          Hide
          yuzhihong@gmail.com Ted Yu added a comment -

          Depending on the chosen policy, I plan to introduce interface, RegionPlanHolder, which abstracts the concrete data structures holding RegionPlans:

            interface RegionPlanHolder {
          	  RegionPlanHolder create();
          	  int size();
          	  void add(RegionPlan rp);
          	  // removes the head element
          	  RegionPlan remove();
          	  // removes the tail element
          	  RegionPlan removeLast();
          	  // retrieves element at index 
          	  RegionPlan get(int index);
            }
          

          Then regionsToMove would be declared as RegionPlanHolder so that implementation detail such as the following would be hidden in balanceCluster().

              MinMaxPriorityQueue<RegionPlan> regionsToMove = MinMaxPriorityQueue.orderedBy(rpComparator).create();
          
          Show
          yuzhihong@gmail.com Ted Yu added a comment - Depending on the chosen policy, I plan to introduce interface, RegionPlanHolder, which abstracts the concrete data structures holding RegionPlans: interface RegionPlanHolder { RegionPlanHolder create(); int size(); void add(RegionPlan rp); // removes the head element RegionPlan remove(); // removes the tail element RegionPlan removeLast(); // retrieves element at index RegionPlan get( int index); } Then regionsToMove would be declared as RegionPlanHolder so that implementation detail such as the following would be hidden in balanceCluster(). MinMaxPriorityQueue<RegionPlan> regionsToMove = MinMaxPriorityQueue.orderedBy(rpComparator).create();
          Hide
          yuzhihong@gmail.com Ted Yu added a comment -

          I am thinking about adding this enum:

            public static enum BalancerPolicy {
          	  BALANCE_REGION_COUNT,
          	  BALANCE_REQUESTS_PER_MINUTE,
            }
          

          Shall we introduce hbase.balancer.policy in hbase-default.xml which reflects the above enum ?

          Show
          yuzhihong@gmail.com Ted Yu added a comment - I am thinking about adding this enum: public static enum BalancerPolicy { BALANCE_REGION_COUNT, BALANCE_REQUESTS_PER_MINUTE, } Shall we introduce hbase.balancer.policy in hbase-default.xml which reflects the above enum ?
          Hide
          yuzhihong@gmail.com Ted Yu added a comment -

          I was looking for a class which can store moving average of requests/second.
          Shall we add that policy in another JIRA ?

          Show
          yuzhihong@gmail.com Ted Yu added a comment - I was looking for a class which can store moving average of requests/second. Shall we add that policy in another JIRA ?
          Hide
          stack stack added a comment -

          Patch looks good Ted. How you propose we add it in? It'd be put in place instead of the randomizing balancing? Or should it be a configuration where we load different balancing algoritms. For example, I can imagine that some would like to balance based off requests/second.... rather than region count.

          Oh, you think a test would make sense for your stuff?

          Good stuff.

          Show
          stack stack added a comment - Patch looks good Ted. How you propose we add it in? It'd be put in place instead of the randomizing balancing? Or should it be a configuration where we load different balancing algoritms. For example, I can imagine that some would like to balance based off requests/second.... rather than region count. Oh, you think a test would make sense for your stuff? Good stuff.
          Hide
          stack stack added a comment -

          On upgrading guava, thats no problem. Just include in your patch a change to pom.xml that moves the guava version forward Ted.

          Show
          stack stack added a comment - On upgrading guava, thats no problem. Just include in your patch a change to pom.xml that moves the guava version forward Ted.
          Hide
          yuzhihong@gmail.com Ted Yu added a comment -

          This is close to what I described in HBASE-3586.
          I used MinMaxPriorityQueue to alternately choose regions from head and tail of regionsToMove.

          Both TestLoadBalancer and TestAdmin pass.

          Show
          yuzhihong@gmail.com Ted Yu added a comment - This is close to what I described in HBASE-3586 . I used MinMaxPriorityQueue to alternately choose regions from head and tail of regionsToMove. Both TestLoadBalancer and TestAdmin pass.
          Hide
          yuzhihong@gmail.com Ted Yu added a comment -

          Improved region selection using Double-Ended Priority Queues

          Show
          yuzhihong@gmail.com Ted Yu added a comment - Improved region selection using Double-Ended Priority Queues
          Show
          yuzhihong@gmail.com Ted Yu added a comment - Can we upgrade Guava so that I can use this ? http://guava-libraries.googlecode.com/svn/trunk/javadoc/com/google/common/collect/MinMaxPriorityQueue.html
          Show
          yuzhihong@gmail.com Ted Yu added a comment - I will try http://download.oracle.com/javase/6/docs/api/java/util/Deque.html
          Hide
          yuzhihong@gmail.com Ted Yu added a comment -

          If someone can point me to an implementation of Double-Ended Priority Queue in Java, I can use it in the three loops involving regionsToMove.

          Show
          yuzhihong@gmail.com Ted Yu added a comment - If someone can point me to an implementation of Double-Ended Priority Queue in Java, I can use it in the three loops involving regionsToMove.
          Hide
          yuzhihong@gmail.com Ted Yu added a comment -

          hbase-3609-by-region-age.txt is the refined version.
          This accounts for the out-of-band regions which were assigned to the server after some other region server crashed.
          In that scenario, random selection may not work very well because the regions on individual region server are not sorted.
          Both TestLoadBalancer and TestAdmin pass.

          Show
          yuzhihong@gmail.com Ted Yu added a comment - hbase-3609-by-region-age.txt is the refined version. This accounts for the out-of-band regions which were assigned to the server after some other region server crashed. In that scenario, random selection may not work very well because the regions on individual region server are not sorted. Both TestLoadBalancer and TestAdmin pass.
          Hide
          yuzhihong@gmail.com Ted Yu added a comment -

          Sort regions before fetching any region

          Show
          yuzhihong@gmail.com Ted Yu added a comment - Sort regions before fetching any region
          Hide
          yuzhihong@gmail.com Ted Yu added a comment -

          hbase-3609.txt implemented first part of the idea expressed in HBASE-3586 @ 06/Mar/11 07:10
          The alternation between head and tail of regions list happens after each addition to regionsToMove. So this should provide good distribution of young and old regions.

          The implementation for second part needs more refinement because regionsToMove is manipulated in three loops in balanceCluster().

          TestLoadBalancer passes.

          Please review.

          Show
          yuzhihong@gmail.com Ted Yu added a comment - hbase-3609.txt implemented first part of the idea expressed in HBASE-3586 @ 06/Mar/11 07:10 The alternation between head and tail of regions list happens after each addition to regionsToMove. So this should provide good distribution of young and old regions. The implementation for second part needs more refinement because regionsToMove is manipulated in three loops in balanceCluster(). TestLoadBalancer passes. Please review.
          Hide
          yuzhihong@gmail.com Ted Yu added a comment -

          Initial attempt.

          Show
          yuzhihong@gmail.com Ted Yu added a comment - Initial attempt.

            People

            • Assignee:
              yuzhihong@gmail.com Ted Yu
              Reporter:
              stack stack
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development