Uploaded image for project: 'HBase'
  1. HBase
  2. HBASE-3586

Improve the selection of regions to balance

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.90.1
    • Fix Version/s: 0.90.2
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      Currently LoadBalancer goes through the list of regions per RS and grabs the few first ones to balance. This is not bad, but that list is often sorted naturally since the a RS that boots will open the regions in a sequential and sorted order (since it comes from .META.) which means that we're balancing regions starting in an almost sorted fashion.

      We discovered that because one of our internal users created a new table starting with letter "p" which has now grown to 100 regions in the last few hours and they are all served by 1 region server. Looking at the master's log, the balancer has moved as many regions from that region server but they are all from the same table that starts with letter "a" (and the regions that were moved all come one after the other).

      The part of the code that should be modified is:

      for (HRegionInfo hri: regions) {
        // Don't rebalance meta regions.
        if (hri.isMetaRegion()) continue; 
        regionsToMove.add(new RegionPlan(hri, serverInfo, null));
        numTaken++;
        if (numTaken >= numToOffload) break;
      }
      
      1. 3586-randomize-v2.txt
        6 kB
        stack
      2. 3586-randomize.txt
        9 kB
        stack
      3. hbase-3586-table-creation.txt
        1 kB
        Ted Yu
      4. hbase-3586-with-sort.txt
        2 kB
        Ted Yu
      5. HBASE-3586-by-region-age.patch
        2 kB
        Andrew Purtell
      6. HBASE-3586-by-region-age.patch
        2 kB
        Andrew Purtell

        Issue Links

          Activity

          Hide
          lars_francke Lars Francke added a comment -

          This issue was closed as part of a bulk closing operation on 2015-11-20. All issues that have been resolved and where all fixVersions have been released have been closed (following discussions on the mailing list).

          Show
          lars_francke Lars Francke added a comment - This issue was closed as part of a bulk closing operation on 2015-11-20. All issues that have been resolved and where all fixVersions have been released have been closed (following discussions on the mailing list).
          Hide
          yuzhihong@gmail.com Ted Yu added a comment -

          Right.
          I am continuing my effort toward better load balancing through HBASE-3507, etc.

          Show
          yuzhihong@gmail.com Ted Yu added a comment - Right. I am continuing my effort toward better load balancing through HBASE-3507 , etc.
          Hide
          stack stack added a comment -

          @Ted So you tested the patch that was submitted here? The random assignment? The balancer only does count of regions, not load on the regions, so yes, I'd imagine that its possible some regions would be taking no load.

          Show
          stack stack added a comment - @Ted So you tested the patch that was submitted here? The random assignment? The balancer only does count of regions, not load on the regions, so yes, I'd imagine that its possible some regions would be taking no load.
          Hide
          yuzhihong@gmail.com Ted Yu added a comment -

          I ran the random region selector on staging cluster (15 region servers, 4600 regions).
          In terms of region count, load is balanced.
          I observed that a few region servers received around 0 requests even though they carried some regions of the table which was actively written to.

          Show
          yuzhihong@gmail.com Ted Yu added a comment - I ran the random region selector on staging cluster (15 region servers, 4600 regions). In terms of region count, load is balanced. I observed that a few region servers received around 0 requests even though they carried some regions of the table which was actively written to.
          Hide
          hudson Hudson added a comment -

          Integrated in HBase-TRUNK #1781 (See https://hudson.apache.org/hudson/job/HBase-TRUNK/1781/)

          Show
          hudson Hudson added a comment - Integrated in HBase-TRUNK #1781 (See https://hudson.apache.org/hudson/job/HBase-TRUNK/1781/ )
          Hide
          stack stack added a comment -

          Thanks for review Ted.

          Show
          stack stack added a comment - Thanks for review Ted.
          Hide
          stack stack added a comment -

          I opened new issue for continuing the above balancer explorations: HBASE-3609

          Show
          stack stack added a comment - I opened new issue for continuing the above balancer explorations: HBASE-3609
          Hide
          stack stack added a comment -

          Committed patch to branch and trunk. Thanks for review Ryan. Lets open new issue Ted to do other balancing algorithms.

          Show
          stack stack added a comment - Committed patch to branch and trunk. Thanks for review Ryan. Lets open new issue Ted to do other balancing algorithms.
          Hide
          yuzhihong@gmail.com Ted Yu added a comment -

          I feel sweat on my back

          I am still a beginner. Let's make HBase better together.

          Show
          yuzhihong@gmail.com Ted Yu added a comment - I feel sweat on my back I am still a beginner. Let's make HBase better together.
          Hide
          apurtell Andrew Purtell added a comment -

          @Ted: Thank you for your contributions.

          Show
          apurtell Andrew Purtell added a comment - @Ted: Thank you for your contributions.
          Hide
          yuzhihong@gmail.com Ted Yu added a comment -

          Overall +1

          For TestLoadBalancer.java, I would use simpler form because LoadBalancer.randomize() is only called once for each server in balanceCluster():

                  List<HRegionInfo> copy = new ArrayList<HRegionInfo>(original);
                  List<HRegionInfo> randomized = LoadBalancer.randomize(copy);
                  if (original.equals(randomized)) {
                  	assertFalse(e.getKey().toString() + " has identical region list", true);
                  }
          

          I removed some logging which should have been done using LOG.info().
          I also ran TestAdmin which passed.

          This approach is smart and should avoid common pitfalls statistically.

          Let's observe its efficacy in production.

          Show
          yuzhihong@gmail.com Ted Yu added a comment - Overall +1 For TestLoadBalancer.java, I would use simpler form because LoadBalancer.randomize() is only called once for each server in balanceCluster(): List<HRegionInfo> copy = new ArrayList<HRegionInfo>(original); List<HRegionInfo> randomized = LoadBalancer.randomize(copy); if (original.equals(randomized)) { assertFalse(e.getKey().toString() + " has identical region list" , true ); } I removed some logging which should have been done using LOG.info(). I also ran TestAdmin which passed. This approach is smart and should avoid common pitfalls statistically. Let's observe its efficacy in production.
          Hide
          stack stack added a comment -

          Here is version that does not include pollution patching FSUtils.

          Show
          stack stack added a comment - Here is version that does not include pollution patching FSUtils.
          Hide
          ryanobjc ryan rawson added a comment -

          +1 lgtm

          Show
          ryanobjc ryan rawson added a comment - +1 lgtm
          Hide
          stack stack added a comment -

          How is this. Adds a test too.

          Show
          stack stack added a comment - How is this. Adds a test too.
          Hide
          yuzhihong@gmail.com Ted Yu added a comment -

          Random selection approach would make debugging harder.

          A patch using random selection is welcome.

          Show
          yuzhihong@gmail.com Ted Yu added a comment - Random selection approach would make debugging harder. A patch using random selection is welcome.
          Hide
          hudson Hudson added a comment -

          Integrated in HBase-TRUNK #1771 (See https://hudson.apache.org/hudson/job/HBase-TRUNK/1771/)

          Show
          hudson Hudson added a comment - Integrated in HBase-TRUNK #1771 (See https://hudson.apache.org/hudson/job/HBase-TRUNK/1771/ )
          Hide
          stack stack added a comment -

          I'd suggest we checkin random for now then experiment thereafter?

          Show
          stack stack added a comment - I'd suggest we checkin random for now then experiment thereafter?
          Hide
          yuzhihong@gmail.com Ted Yu added a comment -

          Although random selection statistically achieves balance, I still prefer a deterministic approach.
          Consider the following variant for my patches:
          Instead of using the following loop to fill out regionsToMove:

                for (int i = sz-1; i >= 0; i--) {
              	  HRegionInfo hri = regions.get(i);
          

          We alternate between the head and tail of regions (picking both young and old ones).

          We still keep the following sort():

              Collections.sort(regionsToMove, rpComparator);
          

          We then iterate through underloadedServers repeatedly, doing the following action alternately:
          picking one region from head of regionsToMove in passes 1, 3, 5, etc
          picking one region from tail of regionsToMove in passes 2, 4, 6, etc.

          E.g. suppose RS1 has regions with region Ids of 42, 54, 105 and 201
          RS2 has regions with region Ids of 34, 104, 110 and 154
          Suppose we need to offload some regions to RS3 and RS4 which didn't carry regions.

          regionsToMove would contain regions (201, 42, 154, 34) before sorting and regions (201, 154, 42, 34) after sorting.
          Then we assign
          region 201 to RS3, region 154 to RS4
          region 34 to RS3, region 42 to RS4

          Please comment.

          Show
          yuzhihong@gmail.com Ted Yu added a comment - Although random selection statistically achieves balance, I still prefer a deterministic approach. Consider the following variant for my patches: Instead of using the following loop to fill out regionsToMove: for ( int i = sz-1; i >= 0; i--) { HRegionInfo hri = regions.get(i); We alternate between the head and tail of regions (picking both young and old ones). We still keep the following sort(): Collections.sort(regionsToMove, rpComparator); We then iterate through underloadedServers repeatedly, doing the following action alternately: picking one region from head of regionsToMove in passes 1, 3, 5, etc picking one region from tail of regionsToMove in passes 2, 4, 6, etc. E.g. suppose RS1 has regions with region Ids of 42, 54, 105 and 201 RS2 has regions with region Ids of 34, 104, 110 and 154 Suppose we need to offload some regions to RS3 and RS4 which didn't carry regions. regionsToMove would contain regions (201, 42, 154, 34) before sorting and regions (201, 154, 42, 34) after sorting. Then we assign region 201 to RS3, region 154 to RS4 region 34 to RS3, region 42 to RS4 Please comment.
          Hide
          yuzhihong@gmail.com Ted Yu added a comment -

          I assume the above situation happened without my first patch
          Can you describe how that server had lower than average number of regions ?

          There're two scenarios which would make my first patch suboptimal:
          1. After cluster starts up, some region servers aren't immediately detected by master as being online. Therefore they carry no regions.
          2. When region server crashes and is brought back up back monitoring script.
          If one or more tables get created immediately after either of the above two scenarios, balancer would assign (young) regions from the new table to the few region servers which didn't carry regions.

          Show
          yuzhihong@gmail.com Ted Yu added a comment - I assume the above situation happened without my first patch Can you describe how that server had lower than average number of regions ? There're two scenarios which would make my first patch suboptimal: 1. After cluster starts up, some region servers aren't immediately detected by master as being online. Therefore they carry no regions. 2. When region server crashes and is brought back up back monitoring script. If one or more tables get created immediately after either of the above two scenarios, balancer would assign (young) regions from the new table to the few region servers which didn't carry regions.
          Hide
          stack stack added a comment -

          Oh, just to say that at our house, we had a pathological condition where balancer ended up assigning all regions for a table to one server, then as it split, the balancer kept landing the new splits back to the same origin server. Chatting, random would have done for our case.

          Show
          stack stack added a comment - Oh, just to say that at our house, we had a pathological condition where balancer ended up assigning all regions for a table to one server, then as it split, the balancer kept landing the new splits back to the same origin server. Chatting, random would have done for our case.
          Hide
          stack stack added a comment -

          Yeah, agree w/ Ryan that we should try random as baseline. Lets do other policies in a different issue (Thanks for testing your patch Ted and finding that it may be suboptimal).

          Show
          stack stack added a comment - Yeah, agree w/ Ryan that we should try random as baseline. Lets do other policies in a different issue (Thanks for testing your patch Ted and finding that it may be suboptimal).
          Hide
          yuzhihong@gmail.com Ted Yu added a comment -

          One of the goals of load balancer is to minimize the number of region movements so that scanner timeout is minimized.
          The current approach to balancing young regions aligns better with this goal.
          For random approach, we may need to constantly move some regions to achieve balanced load.

          Show
          yuzhihong@gmail.com Ted Yu added a comment - One of the goals of load balancer is to minimize the number of region movements so that scanner timeout is minimized. The current approach to balancing young regions aligns better with this goal. For random approach, we may need to constantly move some regions to achieve balanced load.
          Hide
          yuzhihong@gmail.com Ted Yu added a comment -

          @Ryan: I think you're referring to my first patch. I partially agree. We may provide (at least two) policies - one favoring moving young regions and the other doing random region selection.

          I think my second patch establishes condition for the first to function as expected.
          I would like to hear about other use patterns which are not covered by both patches.

          Show
          yuzhihong@gmail.com Ted Yu added a comment - @Ryan: I think you're referring to my first patch. I partially agree. We may provide (at least two) policies - one favoring moving young regions and the other doing random region selection. I think my second patch establishes condition for the first to function as expected. I would like to hear about other use patterns which are not covered by both patches.
          Hide
          ryanobjc ryan rawson added a comment -

          Can you try a random approach? Often random can be more predictable and not
          have weak edge cases that different use patterns can tickle.

          Show
          ryanobjc ryan rawson added a comment - Can you try a random approach? Often random can be more predictable and not have weak edge cases that different use patterns can tickle.
          Hide
          yuzhihong@gmail.com Ted Yu added a comment -

          Although balanceCluster() can be made more complex by considering both old and new regions, the new patch achieves the same effect.
          When creating a table with multiple regions, I check whether there're online region servers which don't carry region (this can be relaxed by introducing a threshold which separates overloaded and underloaded servers). If there're such servers, balance() is called to balance the (relatively old) regions.
          Since assignmentManager.assignUserRegions() uses round-robin assignment, cluster would still be balanced when createTable() returns.

          Show
          yuzhihong@gmail.com Ted Yu added a comment - Although balanceCluster() can be made more complex by considering both old and new regions, the new patch achieves the same effect. When creating a table with multiple regions, I check whether there're online region servers which don't carry region (this can be relaxed by introducing a threshold which separates overloaded and underloaded servers). If there're such servers, balance() is called to balance the (relatively old) regions. Since assignmentManager.assignUserRegions() uses round-robin assignment, cluster would still be balanced when createTable() returns.
          Hide
          yuzhihong@gmail.com Ted Yu added a comment -

          The patch actually aggravates unbalanced load in a situation where one or more region servers weren't assigned (any) region and a table with multiple regions is created.
          All the new regions from this table would be assigned to the region servers which didn't carry any region.

          Show
          yuzhihong@gmail.com Ted Yu added a comment - The patch actually aggravates unbalanced load in a situation where one or more region servers weren't assigned (any) region and a table with multiple regions is created. All the new regions from this table would be assigned to the region servers which didn't carry any region.
          Hide
          apurtell Andrew Purtell added a comment -

          Committed to trunk and 0.90 branch.

          Show
          apurtell Andrew Purtell added a comment - Committed to trunk and 0.90 branch.
          Hide
          yuzhihong@gmail.com Ted Yu added a comment -

          Added import statement.

          Show
          yuzhihong@gmail.com Ted Yu added a comment - Added import statement.
          Hide
          stack stack added a comment -

          +1 on Ted's patch. LGTM.

          Show
          stack stack added a comment - +1 on Ted's patch. LGTM.
          Hide
          yuzhihong@gmail.com Ted Yu added a comment -

          Per Andrew's request, here is my patch.
          Pardon me for possible white space issue.

          Show
          yuzhihong@gmail.com Ted Yu added a comment - Per Andrew's request, here is my patch. Pardon me for possible white space issue.
          Hide
          apurtell Andrew Purtell added a comment -

          I wouldn't wait for comment, you have someone giving you a +1 on approach. Next we need a patch to test.

          Show
          apurtell Andrew Purtell added a comment - I wouldn't wait for comment, you have someone giving you a +1 on approach. Next we need a patch to test.
          Hide
          yuzhihong@gmail.com Ted Yu added a comment -

          I can try.
          But I need to trim down my changes for HBASE-3373
          I don't want to repeat the experience of my previous submission. So I would wait for other dev's comment first.

          Show
          yuzhihong@gmail.com Ted Yu added a comment - I can try. But I need to trim down my changes for HBASE-3373 I don't want to repeat the experience of my previous submission. So I would wait for other dev's comment first.
          Hide
          apurtell Andrew Purtell added a comment -

          @Ted: So maybe you should post a patch here?

          Show
          apurtell Andrew Purtell added a comment - @Ted: So maybe you should post a patch here?
          Hide
          yuzhihong@gmail.com Ted Yu added a comment -

          I also have the following:

              // put young regions at the beginning of regionsToMove
              Collections.sort(regionsToMove, rpComparator);
          
              // Walk down least loaded, filling each to the min
          

          where RegionPlanComparator is defined as:

            static class RegionPlanComparator implements Comparator<RegionPlan> {
                @Override
                public int compare(RegionPlan l, RegionPlan r) {
              	  long diff = r.getRegionInfo().getRegionId() - l.getRegionInfo().getRegionId();
              	  if (diff < 0) return -1;
              	  if (diff > 0) return 1;
              	  return 0;
                }	  
            }
            static RegionPlanComparator rpComparator = new RegionPlanComparator();
          
          Show
          yuzhihong@gmail.com Ted Yu added a comment - I also have the following: // put young regions at the beginning of regionsToMove Collections.sort(regionsToMove, rpComparator); // Walk down least loaded, filling each to the min where RegionPlanComparator is defined as: static class RegionPlanComparator implements Comparator<RegionPlan> { @Override public int compare(RegionPlan l, RegionPlan r) { long diff = r.getRegionInfo().getRegionId() - l.getRegionInfo().getRegionId(); if (diff < 0) return -1; if (diff > 0) return 1; return 0; } } static RegionPlanComparator rpComparator = new RegionPlanComparator();
          Hide
          apurtell Andrew Purtell added a comment -

          Sure that's lighter weight. Your idea as patch.

          Show
          apurtell Andrew Purtell added a comment - Sure that's lighter weight. Your idea as patch.
          Hide
          yuzhihong@gmail.com Ted Yu added a comment -

          Looking at how HRegionInfo gets added into AssignmentManager.servers:

           
            private void addToServers(final HServerInfo hsi, final HRegionInfo hri) {
              List<HRegionInfo> hris = servers.get(hsi);
              if (hris == null) {
                hris = new ArrayList<HRegionInfo>();
                servers.put(hsi, hris);
              }
              hris.add(hri);
            }
          

          I think we can traverse List<HRegionInfo> in reverse order because young regions are added to the tail of the List.
          I have this:

                int sz = regions.size();
                for (int i = sz-1; i >= 0; i--) {
              	  HRegionInfo hri = regions.get(i);
          
          Show
          yuzhihong@gmail.com Ted Yu added a comment - Looking at how HRegionInfo gets added into AssignmentManager.servers: private void addToServers( final HServerInfo hsi, final HRegionInfo hri) { List<HRegionInfo> hris = servers.get(hsi); if (hris == null ) { hris = new ArrayList<HRegionInfo>(); servers.put(hsi, hris); } hris.add(hri); } I think we can traverse List<HRegionInfo> in reverse order because young regions are added to the tail of the List. I have this: int sz = regions.size(); for ( int i = sz-1; i >= 0; i--) { HRegionInfo hri = regions.get(i);
          Hide
          apurtell Andrew Purtell added a comment -

          Something like so?

          Show
          apurtell Andrew Purtell added a comment - Something like so?
          Hide
          jdcryans Jean-Daniel Cryans added a comment -

          Ah ok that wasn't clear, well let's try it out!

          Show
          jdcryans Jean-Daniel Cryans added a comment - Ah ok that wasn't clear, well let's try it out!
          Hide
          yuzhihong@gmail.com Ted Yu added a comment -

          That's what I meant. LoadBalancer should select/move young regions.

          Show
          yuzhihong@gmail.com Ted Yu added a comment - That's what I meant. LoadBalancer should select/move young regions.
          Hide
          jdcryans Jean-Daniel Cryans added a comment -

          You're talking of sorting regionsToMove by creation time, in this jira I'm talking about filling this structure by choosing regions in a slightly different way. Unless that's what you meant too, that we should choose regions from those lists according to their creation time?

          Show
          jdcryans Jean-Daniel Cryans added a comment - You're talking of sorting regionsToMove by creation time, in this jira I'm talking about filling this structure by choosing regions in a slightly different way. Unless that's what you meant too, that we should choose regions from those lists according to their creation time?
          Hide
          yuzhihong@gmail.com Ted Yu added a comment -

          See my comment at 29/Jan/11 00:53 in HBASE-3373

          Show
          yuzhihong@gmail.com Ted Yu added a comment - See my comment at 29/Jan/11 00:53 in HBASE-3373

            People

            • Assignee:
              yuzhihong@gmail.com Ted Yu
              Reporter:
              jdcryans Jean-Daniel Cryans
            • Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development