HBase / HBASE-3373

Allow regions to be load-balanced by table

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.20.6
    • Fix Version/s: 0.94.0
    • Component/s: master
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      There is a parameter, hbase.regions.slop, with a default value of 0.2.
      This parameter allows the actual region count to deviate by this percentage from the (ideal) average region count.

      Description

      From our experience, the cluster can be well balanced overall and yet one table's regions may be badly concentrated on a few region servers.
      For example, one table has 839 regions (380 regions at the time of table creation), of which 202 are on one server.

      It would be desirable for the load balancer to distribute the regions of specified tables evenly across the cluster. Each such table has a number of regions many times the cluster size.

      Attachments

      1. HbaseBalancerTest2.java
        3 kB
        Matt Corgan
      2. 3373.txt
        6 kB
        Ted Yu

        Issue Links

          Activity

          Jonathan Gray added a comment -

          On cluster startup in 0.90, regions are assigned in one of two ways. By default, it will attempt to retain the previous assignment of the cluster. The other option which I've also used is round-robin. This will evenly distribute each table.

          That plus the change to do round-robin on table create should probably cover per-table distribution fairly well.

          I think the next step in the load balancer is a major effort to switch to something with more of a cost-based approach. I think ideally you don't need even distribution of each table, you want even distribution of load. If there is one hot table, it will get evenly balanced anyway.

          One thing we could do is get rid of all random assignments and always try to do some kind of quick load balance or round-robin. It does seem like randomness always leads to one guy who gets an unfair share.

          Ted Yu added a comment -

          If a table is heavily written to, its regions split over a relatively long period of time.
          The new daughter regions may not be well distributed.

          E.g. after a region server comes back online from a crash, it takes time for the balancer to assign it regions from other servers. The new regions described above tend to get assigned to this region server.

          Ted Yu added a comment -

          A list of regions for the table can be given to AssignmentManager so that the regions can be evenly distributed.
          We need to consider the regions in the table that are actively splitting: these regions and their daughters would lead to imbalance after the above round-robin assignment.

          Ted Yu added a comment -

          For environments where a lot of tables are created, the client can specify to HBase a list of tables whose regions should be load balanced. A time-to-live is specified along with the table names, after which the load balancer gets to balance all tables.

          Ted Yu added a comment -

          Currently, regions offloaded from the most overloaded server would be assigned to the most underloaded server first. When some regions are actively splitting on the most overloaded server, this arrangement is sub-optimal.
          A better way is to round-robin assign regionsToMove over the underloaded servers.

          stack added a comment -

          Moving 'feature' out of bug-fix only release.

          Matt Corgan added a comment -

          Have you guys considered using a consistent hashing method to choose which server a region belongs to? You would create ~50 buckets for each server by hashing serverName_port_bucketNum, and then hash the start key of each region into the buckets.

          There are a few benefits:

          • when you add a server it takes an equal load from all existing servers
          • if you remove a server it distributes its regions equally to the remaining servers
          • adding a server does not cause all regions to shuffle like round robin assignment would
          • assignment is nearly random, but repeatable, so no hot spots
          • when a region splits the front half will stay on the same server, but the back half will usually be sent to another server

          And a few drawbacks:

          • each server wouldn't end up with exactly the same number of regions, but they would be close
          • if a hot spot does end up developing, you can't do anything about it, at least not unless it supported a list of manual overrides
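          For illustration, a minimal sketch of the bucket scheme described above (~50 virtual buckets per server, regions mapped by the hash of their start key). The class and method names are invented for this example; a real implementation would also need server removal and the manual overrides mentioned in the last drawback:

            import java.nio.charset.Charset;
            import java.security.MessageDigest;
            import java.util.SortedMap;
            import java.util.TreeMap;

            /** Toy consistent-hash ring: ~50 buckets per server, keyed by a 63-bit hash. */
            public class ConsistentHashRingSketch {
              private static final int BUCKETS_PER_SERVER = 50;
              private final TreeMap<Long, String> ring = new TreeMap<Long, String>();

              /** Hash a string to a non-negative long (first 8 bytes of its MD5 digest). */
              private static long hash(String key) {
                try {
                  byte[] d = MessageDigest.getInstance("MD5").digest(key.getBytes(Charset.forName("UTF-8")));
                  long h = 0;
                  for (int i = 0; i < 8; i++) {
                    h = (h << 8) | (d[i] & 0xffL);
                  }
                  return h & Long.MAX_VALUE;  // keep it in the positive 63-bit range
                } catch (Exception e) {
                  throw new RuntimeException(e);
                }
              }

              /** Add ~50 buckets for a server by hashing serverName_port_bucketNum. */
              public void addServer(String serverName, int port) {
                for (int b = 0; b < BUCKETS_PER_SERVER; b++) {
                  ring.put(hash(serverName + "_" + port + "_" + b), serverName + ":" + port);
                }
              }

              /** A region belongs to the first bucket at or after the hash of its start key. */
              public String serverForRegion(String regionStartKey) {
                SortedMap<Long, String> tail = ring.tailMap(hash(regionStartKey));
                return tail.isEmpty() ? ring.get(ring.firstKey()) : tail.get(tail.firstKey());
              }
            }
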
          Jonathan Gray added a comment -

          I think consistent hashing would be a major step backwards for us and unnecessary because there is no cost of moving bits around in HBase. The primary benefit of consistent hashing is that it reduces the amount of data you have to physically move around. Because of our use of HDFS, we never have to move physical data around.

          In your benefit list, we are already implementing almost all of these features, or if not, it is possible in the current architecture. In addition, our architecture is extremely flexible and we can do all kinds of interesting load balancing techniques related to actual load profiles not just #s of shards/buckets as we do today or as would be done with consistent hashing.

          Matt Corgan added a comment -

          Gotcha. I guess I was thinking of it more as a quick upgrade to the current load balancer, which only looks at region count. We store a lot of time-series data, and regions that split were left on the same server while it moved cold regions off. I wrote a little client-side consistent hashing balancer that solved the problem in our case, but there are definitely better ways. Consistent hashing also binds regions to servers across cluster restarts, which helps keep regions near their last major-compacted hdfs file.

          Whatever balancing scheme you do use, don't you need some starting point for randomly distributing the regions? If no other data is available or you need a tie breaker, maybe consistent hashing is better than round robin or purely random placement.

          Ted Yu added a comment -

          This is what I added in HMaster:

            /**
             * Evenly distributes the regions of the tables (assuming the number of regions is much bigger
             *  than the number of region servers)
             * @param tableNames tables to load balance
             * @param ttl Time-to-live for load balance request. If negative, request is withdrawn
             * @throws IOException e
             */
            public void loadBalanceTable(final byte [][] tableNames, long ttl) throws IOException {
          

          Our production environment has 150 to 300 tables. We run flows sequentially. Each flow creates about 10 new tables.
          The above API would allow the load balancer to distribute hot (recently split) regions off certain region server(s).
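          For concreteness, a hypothetical caller of the proposed method (the flow table names are made up, and the TTL unit of milliseconds is an assumption since the javadoc above does not specify it):

            import java.io.IOException;
            import java.util.concurrent.TimeUnit;
            import org.apache.hadoop.hbase.master.HMaster;
            import org.apache.hadoop.hbase.util.Bytes;

            public class TableBalanceRequestExample {
              // Ask the master to keep these tables balanced for the next hour (hypothetical API).
              public static void request(HMaster master) throws IOException {
                byte[][] tables = new byte[][] {
                    Bytes.toBytes("flow_intermediate_1"),
                    Bytes.toBytes("flow_intermediate_2")
                };
                master.loadBalanceTable(tables, TimeUnit.HOURS.toMillis(1));
              }
            }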

          Jonathan Gray added a comment -

          Both of your solutions are rather specialized and I'm not sure generally applicable. I would much prefer spending effort on improving our current load balancer and it seems to me that it would be possible to implement similar behaviors in a more generalized way.

          Also, the addition of an HBaseAdmin region move API makes it so you don't need to muck with HBase server code to do specialized balancing logic. With the current APIs, it's possible to basically push the balancer out into your own client.

          @Matt, I don't think I'm really understanding how you upgrade our load balancer w/ consistent hashing?

          The fact that split regions open back up on the same server is actually an optimization in many cases because it reduces the amount of time the regions are offline, and when they come back online and do a compaction to drop references, all the files are more likely to be on the local DataNode rather than remote. In some cases, like time-series, you may want the splits to move to different servers. I could imagine some configurable logic in there to ensure the bottom half goes to a different server (or maybe the top half would actually be more efficient to move away since most of the time you'll write more to the bottom half and thus want the data locality / quick turnaround). There's likely going to be a bit of split rework in 0.92 to make it more like the ZK-based regions-in-transition.

          As far as binding regions to servers between cluster restarts, this is already implemented and on by default in 0.90.

          Consistent hashing also requires a fixed keyspace (right?) and that's a mismatch for HBase's flexibility in this regard.

          Do you have any code for this client-side consistent hashing balancer? I'm confused about how that could be implemented without knowing a lot about your data, the regions, the servers available, etc.

          Ted Yu added a comment -

          Hive, for instance, may create new (intermediate) tables for its map/reduce jobs.
          The add-on I propose would be beneficial for that scenario.

          Also, in LoadBalancer.balanceCluster(), line 210:

                while(numTaken < numToTake && regionidx < regionsToMove.size()) {
                  regionsToMove.get(regionidx).setDestination(server.getKey());
                  numTaken++;
                  regionidx++;
                }
          

          The above code would offload regions from the most loaded server to the most underloaded server.
          It is more desirable to round-robin the regions from the loaded server(s) over the underloaded server(s) so that the new daughter regions don't stay on the same server.
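          A minimal sketch of that round-robin alternative (the Plan interface and the server type parameter are simplified stand-ins for the real balancer structures, not the actual HBase classes):

            import java.util.List;

            /** Sketch: spread the regions to move round-robin across the underloaded servers
             *  instead of filling the most underloaded server first, so daughters of a recent
             *  split are less likely to land on the same destination. */
            class RoundRobinAssignSketch {
              interface Plan<S> {
                void setDestination(S server);
              }

              static <S> void assign(List<? extends Plan<S>> regionsToMove, List<S> underloadedServers) {
                int serverIdx = 0;
                for (Plan<S> plan : regionsToMove) {
                  plan.setDestination(underloadedServers.get(serverIdx));
                  serverIdx = (serverIdx + 1) % underloadedServers.size();
                }
              }
            }

          A real version would still respect each destination's numToTake cap; the sketch only shows the rotation.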

          Jonathan Gray added a comment -

          Round-robin assignment at table creation is fine. Bypassing the load balancer and doing your own thing is fine. Adding intelligence into the balancer to get good balance of load is great.

          I'm -1 on adding these kinds of specialized hooks into HBase proper. They should either be an external component (seems that they can be) or we should make the balancer pluggable and you could provide alternative/configurable balancer implementations.

          Assigning overloaded regions in a round-robin way to underloaded servers does make sense. Would be happy to take a contribution to do that. I'm not sure there's a very strong correlation between that and splitting up of daughter regions. It certainly could be the case, but selection of which regions to move off an overloaded server is rather dumb, so there are no guarantees that recently split regions get reassigned.

          Matt Corgan added a comment -

          I can't really post my client code since it's intertwined with a bunch of other stuff, but I extracted the important parts into a junit test that I attached to this issue. We run java (tomcat) so it's fairly easy to talk directly to hbase and integrate a few features into our admin console: printing friendly record names rather than escaped bytes, triggering backups, moving regions, etc. I don't think it requires knowing the keyspace ahead of time, just that you hash into a known output range, a 63-bit long in my example.

          I think the consistent hashing scheme may be a good out-of-the-box methodology. Even with something smarter, I'd worry about the underlying algorithms getting off course and starting a death spiral as bad outputs are fed back in creating even worse outputs. Something like consistent hashing could be a good beacon to always be steering towards so things don't get too far off course.

          I have about 20 tables with many different access patterns and I can't envision an algorithm that balances them truly well. Everything could be going fine until I kick off a MR job that randomly digs up 100 very cold regions and find that they're all on the same server.

          I'm thinking of a system where each region is either at home (its consistent hash destination) or visiting another server because the balancer decided its home was too hot. Each regionserver could identify its hotter regions, and the balancer could move these around in an effort to smooth out the load. In the mean time, colder regions would stay well distributed based on how good the hashing mechanism is. If a regionserver cools down, the master brings home its vacationing regions first, and if it's still cool, then it borrows someone else's hotter home regions. Without an underlying scheme, I can envision things getting extremely chaotic, especially with regard to cold regions of a single table getting bundled up since they're being overlooked. With this method, you're never too far from safely hitting the reset button.

          ...

          Regarding your comment about moving the top or bottom child off the parent server after a split, I tend to prefer moving the bottom one. With time series data it will keep writing to the bottom child, so if you don't move the bottom child then a single server will end up doing the appending forever. I prefer to rotate the server that's doing the work even though it's not quite as efficient and may cause a longer split pause; it makes for a more balanced cluster.

          Matt Corgan added a comment -

          Sample consistent hashing balancing

          Matt Corgan added a comment -

          removed dependency

          Ted Yu added a comment -

          We should sort regionsToMove by the creation time of regions. The rationale is that new regions tend to be the hot ones and should be round-robin assigned to underloaded servers.
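          A sketch of that ordering, using the region id (which is derived from the region's creation timestamp) as the creation-time proxy; sorting a plain list of HRegionInfo here stands in for sorting the balancer's regionsToMove entries:

            import java.util.Collections;
            import java.util.Comparator;
            import java.util.List;
            import org.apache.hadoop.hbase.HRegionInfo;

            class SortByCreationTimeSketch {
              /** Newest regions first: a larger region id means created later, i.e. a hotter candidate. */
              static void newestFirst(List<HRegionInfo> regions) {
                Collections.sort(regions, new Comparator<HRegionInfo>() {
                  public int compare(HRegionInfo a, HRegionInfo b) {
                    return b.getRegionId() < a.getRegionId() ? -1
                         : (b.getRegionId() > a.getRegionId() ? 1 : 0);
                  }
                });
              }
            }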

          Ted Yu added a comment -

          @Matt:
          The following code can be improved through randomization in case of empty tail to avoid clustering at consistentHashRing.firstKey()

          regionHash = tail.isEmpty() ? consistentHashRing.firstKey() : tail.firstKey();
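          For concreteness, one way to realize that suggestion, as a fragment against the quoted line (consistentHashRing and tail are the structures from the attached test; the Random instance is introduced here). Note that it trades away the repeatability of plain consistent hashing for the wrapped regions:

            long regionHash;
            if (tail.isEmpty()) {
              // wrap-around case: pick a random bucket instead of always firstKey(),
              // so these regions don't all pile onto the server owning the first bucket
              java.util.List<Long> bucketKeys = new java.util.ArrayList<Long>(consistentHashRing.keySet());
              regionHash = bucketKeys.get(new java.util.Random().nextInt(bucketKeys.size()));
            } else {
              regionHash = tail.firstKey();
            }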
          
          Ted Yu added a comment -

          The loop in balanceCluster(Map<HServerInfo, List<HRegionInfo>>) which fills out destination servers for regionsToMove currently iterates underloadedServers in the same direction (from index 0 up).
          This works when the load metric is the number of regions.
          If we use the number of requests as the load metric, we should iterate from index 0 up, then back down to index 0, and so forth. This would give us better load distribution.
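          A small sketch of that back-and-forth (serpentine) index order over the underloaded servers; the helper name is invented, and it simply produces the sequence 0, 1, ..., n-1, n-1, ..., 1, 0, 0, 1, ... for as many assignments as are needed:

            class SerpentineOrderSketch {
              /** Index order that bounces between both ends of the underloaded-server list. */
              static int[] order(int numServers, int numAssignments) {
                int[] idx = new int[numAssignments];
                int i = 0, step = 1;
                for (int k = 0; k < numAssignments; k++) {
                  idx[k] = i;
                  if (i + step < 0 || i + step >= numServers) {
                    step = -step;   // turn around at either end (the endpoint repeats once)
                  } else {
                    i += step;
                  }
                }
                return idx;
              }
            }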

          Ted Yu added a comment -

          In preparation for HBASE-3616, I think we need to enhance this method in SplitTransaction:

            HRegion createDaughterRegion(final HRegionInfo hri,
                final FlushRequester flusher)
          

          requestsCount is initially 0 for the daughter regions, which doesn't reflect the fact that the parent region has been heavily accessed.
          Maybe we can assign half of the parent region's requestsCount to each daughter region?

          gaojinchao added a comment -

          In HBase 0.20.6, contiguous (adjacent) regions were not assigned to the same region server. That kept daughters of a split off the same region server, avoided hot spots, and improved performance.

          In 0.90.1, a daughter is opened on the region server where its parent was opened. When a region server has thousands of regions, its contiguous regions are unlikely to be picked by random selection, so that region server stays a hot spot.

          Should the balance method pick the contiguous regions first and then random ones, or avoid hot spots some other way (e.g. add a config parameter to choose the balancing method based on the application)?

          zhoushuaifeng added a comment -

          Agree with Gao's comments. When a region is splitting, it usually gets more write operations. It's better to assign the daughters to different region servers to avoid a hot spot.

          Ted Yu added a comment -

          Suggestion from Stan Barton:
          This JIRA can be generalized as a new policy for the load balancer: balance the number of regions per RS per table, rather than the total number of regions across all tables.

          stack added a comment -

          The need for this issue keeps coming up. I'm not sure whether the requesters are on post-0.90.2 (and HBASE-3586). I'd think the latter should have made our story better, but maybe not (we should get folks to note whether they are complaining post-0.90.2 or not).

          stack added a comment -

          Moving out of 0.92.0. Pull it back in if you think different.

          Ben West added a comment -

          We're running 0.94 and ran into this. With 4 region servers, we had one table with ~1800 regions, evenly balanced. We then used importtsv to import ~300 regions of a new table. We ended up with virtually all regions on one server; when I look at the master's log it looks like there were 159 rebalances (which makes sense); 123 were moving regions from the old table, and 26 moved new table regions. The result is that about 90% of the regions of the new table are on one server.

          When I look at DefaultLoadBalancer.balanceCluster, it has:

                  // fetch in alternate order if there is new region server
                  if (emptyRegionServerPresent) {
                    fetchFromTail = !fetchFromTail;
                  }
          

          so we're only doing the randomization stuff in HBASE-3609 if there's a new region server? Is there a reason we don't do this all the time?

          Ted Yu added a comment -

          @Ben:
          Thanks for trying out 0.94

          The code snippet above deals with a region server which recently joined the cluster. Its goal is to avoid a hot region server that receives above-average load.
          This is part of the changes from HBASE-3609. The randomization is done on this line:

              Collections.shuffle(sns, RANDOM);
          

          where we schedule regions to region servers which are shuffled randomly.

          Your observation about unbalanced table(s) in the cluster is valid. This is due to the master not passing per-table region distribution to balanceCluster().
          I have a patch in an internal repository where the master calls balanceCluster() for each table.
          Once we test it in a production cluster, I should be able to contribute it back.

          Ben West added a comment -

          @Ted: Thanks! I think I was looking at trunk instead of .94, I see that in .94 it should be random:

          List<HRegionInfo> regions = randomize(server.getValue());
          

          Your per-table balance will be very useful for this case. Look forward to seeing it!

          Jean-Daniel Cryans added a comment -

          Just to clear up some confusion: trunk is going to be 0.94, so what you're playing with is probably 0.90.4.

          Ted Yu added a comment -

          The following data structure is introduced:
          Map<String, Map<ServerName, List<HRegionInfo>>> tableRegionDistribution
          The String key is the name of the table. The inner list holds that table's regions on the given ServerName.

          In HMaster.balance(), before calling balancer.balanceCluster(assignments), we translate (regroup) the Map<ServerName, List<HRegionInfo>> returned by assignmentManager.getAssignments() into Map<String, Map<ServerName, List<HRegionInfo>>>. Then we iterate over the tables and call balancer.balanceCluster() on each table's per-server map.

          Additionally, a new HBaseAdmin method can be added to filter out tables which don't need to be balanced.

          The ultimate goal is to distribute each table's regions evenly.
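          A compact sketch of the regrouping step described above (the class name is invented and the committed patch's exact shape may differ; getTableName()/Bytes.toString are used here just to derive the String key):

            import java.util.ArrayList;
            import java.util.HashMap;
            import java.util.List;
            import java.util.Map;
            import org.apache.hadoop.hbase.HRegionInfo;
            import org.apache.hadoop.hbase.ServerName;
            import org.apache.hadoop.hbase.util.Bytes;

            class PerTableGroupingSketch {
              /** Regroup server -> regions into table -> (server -> regions of that table). */
              static Map<String, Map<ServerName, List<HRegionInfo>>> byTable(
                  Map<ServerName, List<HRegionInfo>> assignments) {
                Map<String, Map<ServerName, List<HRegionInfo>>> result =
                    new HashMap<String, Map<ServerName, List<HRegionInfo>>>();
                for (Map.Entry<ServerName, List<HRegionInfo>> e : assignments.entrySet()) {
                  for (HRegionInfo region : e.getValue()) {
                    String table = Bytes.toString(region.getTableName());
                    Map<ServerName, List<HRegionInfo>> perServer = result.get(table);
                    if (perServer == null) {
                      perServer = new HashMap<ServerName, List<HRegionInfo>>();
                      result.put(table, perServer);
                    }
                    List<HRegionInfo> regions = perServer.get(e.getKey());
                    if (regions == null) {
                      regions = new ArrayList<HRegionInfo>();
                      perServer.put(e.getKey(), regions);
                    }
                    regions.add(region);
                  }
                }
                return result;
              }
            }

          HMaster.balance() would then loop over byTable(...).values() and feed each per-table map to balancer.balanceCluster().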

          Lars Hofhansl added a comment -

          Looks good to me (as far as I understand the balancing code).
          Minor nits:

          • Should "ensemble" be "dummytable" (or something) and be defined as a constant?
          • Why:
            if (!svrToRegions.containsKey(e.getKey())) {
              regions = new ArrayList<HRegionInfo>();
              svrToRegions.put(e.getKey(), regions);
            } else {
              regions = svrToRegions.get(e.getKey());
            }
            

            instead of:

            regions = svrToRegions.get(e.getKey());
            if (regions == null) {
              regions = new ArrayList<HRegionInfo>();
              svrToRegions.put(e.getKey(), regions);
            }
            

            serverName can't be null here, correct?

          Ted Yu added a comment -

          Since the table name in the outer map is only used for grouping purposes, I think "ensemble" is a better name than "dummytable".
          The code snippets in the second comment are equivalent. I may keep my code since this is patch 1 in a series of at least 3 patches.
          The other two are: utilizing block location affinity, and better load-balancing heuristics for tables with presplit regions.

          Ted Yu added a comment -

          Patch which Hadoop QA can use.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12509174/3373.txt
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          -1 javadoc. The javadoc tool appears to have generated -151 warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 77 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests:
          org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat
          org.apache.hadoop.hbase.mapred.TestTableMapReduce
          org.apache.hadoop.hbase.mapreduce.TestImportTsv

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/652//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/652//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/652//console

          This message is automatically generated.

          Ted Yu added a comment -

          @Ben, @Jonathan Gray:
          What do you think of my patch?

          Thanks

          Lars Hofhansl added a comment -

          I forgot to +1 it above.

          Ted Yu added a comment -

          This issue was opened more than a year ago.
          If there is no objection, I will get it in TRUNK.

          Ben West added a comment -

          @Zhihong: Thank you for looking into this issue.

          Just to verify (since I'm not too familiar with the code): with this change, the balancing will be done at the per-table level, right?

          So if I have one big table, this may not address the issue (since a randomly chosen region from the table is unlikely to be the newly split one) but with lots of small tables the issue will be addressed. Correct?

          Ted Yu added a comment -

          @Ben:
          Your observation is correct.
          If you have a big table which has newly split regions, the load balancer in 0.92 (and TRUNK) is already able to pick the newly split ones and balance them.
          If the big table has presplit regions only, I have another feature (to be open sourced) which utilizes region load information for balancing decisions.

          Ben West added a comment -

          OK, thanks Zhihong, I was confusing which JIRA this was.

          +1 from me.

          Ted Yu added a comment -

          Integrated to TRUNK.

          Thanks for the review, Lars and Ben.

          More enhancements for load balancer to come.

          Hudson added a comment -

          Integrated in HBase-TRUNK #2613 (See https://builds.apache.org/job/HBase-TRUNK/2613/)
          HBASE-3373 Allow regions to be load-balanced by table

          tedyu :
          Files :

          • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
          • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
          • /hbase/trunk/src/test/java/org/apache/hadoop/hbase/TestRegionRebalancing.java
          Hudson added a comment -

          Integrated in HBase-TRUNK-security #65 (See https://builds.apache.org/job/HBase-TRUNK-security/65/)
          HBASE-3373 Allow regions to be load-balanced by table

          tedyu :
          Files :

          • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
          • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
          • /hbase/trunk/src/test/java/org/apache/hadoop/hbase/TestRegionRebalancing.java
          Hudson added a comment -

          Integrated in HBase-0.92 #257 (See https://builds.apache.org/job/HBase-0.92/257/)
          HBASE-5231 Backport HBASE-3373 (per-table load balancing) to 0.92

          tedyu :
          Files :

          • /hbase/branches/0.92/CHANGES.txt
          • /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
          • /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/DefaultLoadBalancer.java
          • /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
          • /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/TestRegionRebalancing.java
          Hudson added a comment -

          Integrated in HBase-0.92-security #88 (See https://builds.apache.org/job/HBase-0.92-security/88/)
          HBASE-5231 Backport HBASE-3373 (per-table load balancing) to 0.92

          tedyu :
          Files :

          • /hbase/branches/0.92/CHANGES.txt
          • /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
          • /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/DefaultLoadBalancer.java
          • /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
          • /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/TestRegionRebalancing.java

            People

            • Assignee:
              Ted Yu
            • Reporter:
              Ted Yu
            • Votes:
              3
            • Watchers:
              17

              Dates

              • Created:
                Updated:
                Resolved:

                Development