HBASE-3679

Provide cost information associated with moving region(s)

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Implemented
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      In order for the load balancer to make better decisions, we need to establish a cost model for moving region(s).
      One factor would be the number of active scanners on a particular region.
      This count is easy to maintain at the HRegion level: instantiateInternalScanner() can increment the counter and RegionScanner.close() can decrement it.
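
The counting scheme described above could look roughly like the sketch below. It is only an illustration: `RegionScannerCount` and `ScannerHandle` are hypothetical stand-ins, not the real HRegion or RegionScanner classes, and the idempotent close is an added assumption so that a double close cannot push the count below its true value.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for per-region bookkeeping; not the real HRegion. */
final class RegionScannerCount {
    private final AtomicInteger activeScanners = new AtomicInteger();

    /** Would be called where a scanner is created, e.g. instantiateInternalScanner(). */
    ScannerHandle openScanner() {
        activeScanners.incrementAndGet();
        return new ScannerHandle(this);
    }

    /** Number of scanners opened on this region and not yet closed. */
    int getActiveScannerCount() {
        return activeScanners.get();
    }

    private void scannerClosed() {
        activeScanners.decrementAndGet();
    }

    /** Hypothetical scanner handle; plays the role of RegionScanner.close(). */
    static final class ScannerHandle implements AutoCloseable {
        private final RegionScannerCount owner;
        private final AtomicBoolean closed = new AtomicBoolean(false);

        private ScannerHandle(RegionScannerCount owner) {
            this.owner = owner;
        }

        /** Decrements the counter once, even if close() is called repeatedly. */
        @Override
        public void close() {
            if (closed.compareAndSet(false, true)) {
                owner.scannerClosed();
            }
        }
    }
}
```

A balancer could then read `getActiveScannerCount()` as one input to a move cost, alongside other measures.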

        Issue Links

          Activity

          Ted Yu created issue -
          Ted Yu added a comment -

          From Jonathan:
          One of the hardest parts of load balancing based on request count and other dynamic/transient measures is that you can get some pretty pathological conditions where you are always moving stuff around.

          To guard against it, I think we'll need to move to more of a cost-based algorithm that is taking not just the difference in request counts into account but also a baseline "cost" of moving a region. The cost difference in load between two unbalanced servers would have to outweigh the cost associated with moving a region. As you say, looking at the number of live operations to a given region could contribute to the cost of moving that region, but the best measure for that is probably just looking at request count (it's all requests that incur a cost, not just active scanners).
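
The rule in the paragraph above (a move only happens when the load difference outweighs the cost of moving) might be sketched as follows. `MoveCostGuard`, its parameters, and the use of raw request counts as the unit of cost are all assumptions made for illustration; the comment does not specify how the two quantities would be scaled against each other.

```java
/** Hypothetical guard: move only when the load gap outweighs the move cost. */
final class MoveCostGuard {
    private final double baselineMoveCost; // fixed cost charged for any move (assumed unit: requests)
    private final double costPerRequest;   // extra cost per request served by the region (assumed)

    MoveCostGuard(double baselineMoveCost, double costPerRequest) {
        this.baselineMoveCost = baselineMoveCost;
        this.costPerRequest = costPerRequest;
    }

    /** Cost of moving a region: baseline plus a share that grows with its request count. */
    double moveCost(long regionRequestCount) {
        return baselineMoveCost + costPerRequest * regionRequestCount;
    }

    /** True only if the request-count gap between the two servers exceeds the move cost. */
    boolean shouldMove(long sourceServerRequests, long targetServerRequests, long regionRequestCount) {
        double loadDifference = (double) sourceServerRequests - targetServerRequests;
        return loadDifference > moveCost(regionRequestCount);
    }
}
```

With a non-zero baseline, two servers whose request counts differ only slightly never trade regions, which is the guard against constant reshuffling that the comment asks for.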

          Ted Yu added a comment -

          From Ryan:
          It would make sense to avoid moving regions, so the more
          recently a region was moved, the less likely we should be to move it.

          You could imagine a hypothetical perfect 'region move cost' function
          that might look like:

          F(r) = timeSinceMoved(r) + size(r) + loadAvg(r)

          The component functions should probably be normalized to [0,1], so the range of
          F would be [0,3], with 3 == 'don't move' and 0 == 'move first'.

          The goal is to minimize the F(r[i]) of the regions chosen for the moves.
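
A minimal sketch of such a cost function follows. Several things here are assumptions rather than part of the comment: the class `RegionMoveCost`, the normalization against a time horizon and cluster-wide maxima, and above all the reading of the first term as a recency score (1 for a region that was just moved, 0 for one moved long ago), chosen so that a recently moved region lands nearer to 'don't move'.

```java
/** Hypothetical region move cost in [0,3]; 3 means "don't move", 0 means "move first". */
final class RegionMoveCost {
    private RegionMoveCost() {}

    /** Clamps a value into [0,1]. */
    static double clamp01(double v) {
        return Math.max(0.0, Math.min(1.0, v));
    }

    /** Recency score: 1.0 right after a move, falling to 0.0 once horizonMillis has passed. */
    static double recency(long millisSinceMoved, long horizonMillis) {
        return 1.0 - clamp01((double) millisSinceMoved / horizonMillis);
    }

    /**
     * Sum of three normalized terms. The horizon and the two maxima must be positive;
     * they are assumed to be supplied by the caller (for example, cluster-wide maxima).
     */
    static double cost(long millisSinceMoved, long horizonMillis,
                       long sizeBytes, long maxSizeBytes,
                       double loadAvg, double maxLoadAvg) {
        double recencyTerm = recency(millisSinceMoved, horizonMillis);
        double sizeTerm = clamp01((double) sizeBytes / maxSizeBytes);
        double loadTerm = clamp01(loadAvg / maxLoadAvg);
        return recencyTerm + sizeTerm + loadTerm;
    }
}
```

A balancer using this sketch would sort candidate regions by ascending cost and pick from the front of the list.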

          dhruba borthakur added a comment -

          What would we do in the following case:

          We have two region servers, A and B, in the cluster. A is doing 20K ops/sec while B is doing 5K ops/sec. A client sees 10 ms latency per call from A and 30 ms latency per call from B.

          Are we going to move load from A to B or vice-versa?

          Ted Yu added a comment -

          What would be a good way to measure the aggregate latency incurred on the client side?

          stack made changes -
          Link This issue is part of HBASE-3724 [ HBASE-3724 ]
          Ted Yu added a comment -

          From gaojinchao@huawei.com:
          A short-lived hot spot should not cause a region to be moved.

          The number of scan requests is larger than the number of write requests. Do you take that into account?

          Ted Yu added a comment -

          From Stack:

          I think the next step would be adding more smarts to balancer so it
          could make calls on how loaded a regionserver was and so it had an
          idea of how long a region had been open on a particular regionserver.
          Anything that was open < 5 minutes or so would not be moved.
          Something like that. If a regionserver is taking lots of load, move a
          selection of regions to the least loaded.
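
The two rules in this comment (leave recently opened regions alone, and send moved regions to the least loaded server) might be sketched as below. The class `MinOpenAgeFilter`, the map-based inputs, and the use of request counts as the measure of load are assumptions for illustration; the five-minute threshold comes from the comment, which itself gives it only as "5 minutes or so".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for picking movable regions and a target server. */
final class MinOpenAgeFilter {
    /** Regions opened more recently than this are left where they are. */
    static final long MIN_OPEN_MILLIS = 5 * 60 * 1000L;

    private MinOpenAgeFilter() {}

    /** Returns the regions that have been open on their server for at least MIN_OPEN_MILLIS. */
    static List<String> movable(Map<String, Long> openedAtMillis, long nowMillis) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Long> e : openedAtMillis.entrySet()) {
            if (nowMillis - e.getValue() >= MIN_OPEN_MILLIS) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    /** Returns the server with the lowest load, or null if the map is empty. */
    static String leastLoaded(Map<String, Long> loadPerServer) {
        String best = null;
        long bestLoad = Long.MAX_VALUE;
        for (Map.Entry<String, Long> e : loadPerServer.entrySet()) {
            if (best == null || e.getValue() < bestLoad) {
                best = e.getKey();
                bestLoad = e.getValue();
            }
        }
        return best;
    }
}
```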

          Stack's comment from HBASE-3799:
          A general comment on balancing (that probably fits better elsewhere than as a comment on this issue) is that we need 'smoothing' of region moves. Yesterday we brought a regionserver back online into a smallish cluster that was under load, and the balance run unloaded a bunch of regions all in one go, which put a dent in the throughput; it'd be sweet if the balancer ran at an appropriate 'rate'. When under load, it should move regions 'gently' rather than all as a big bang (the decommission script will move a region at a time, verifying it deployed in its new location before moving another... this can take ages to complete but it's proven minimally disruptive to load).
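
The "one region at a time, verify, then continue" behavior described for the decommission script could be sketched like this. `GentleMover` and the `moveAndVerify` callback are hypothetical; the callback stands in for whatever actually moves a region and confirms it is deployed at its new location, and stopping at the first failure is an assumed policy, not something the comment states.

```java
import java.util.List;
import java.util.function.BiPredicate;

/** Hypothetical sequential mover: never starts a move before the previous one is verified. */
final class GentleMover {
    private GentleMover() {}

    /**
     * Moves regions to the target one at a time. moveAndVerify must move the region
     * and return true only once it is confirmed deployed on the target.
     * Stops at the first failure and returns how many regions were moved.
     */
    static int moveOneAtATime(List<String> regions, String target,
                              BiPredicate<String, String> moveAndVerify) {
        int moved = 0;
        for (String region : regions) {
            if (!moveAndVerify.test(region, target)) {
                break; // leave the remaining regions in place rather than pile on
            }
            moved++;
        }
        return moved;
    }
}
```

A pause between iterations would be one way to set the 'rate' mentioned in the comment; it is left out here to keep the sketch short.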

          Ted Yu made changes -
          Link This issue is blocked by HBASE-3811 [ HBASE-3811 ]
          Ted Yu added a comment -

          From Anty Rao:
          When balancing, can we take the compaction status of regions into account?
          Currently, even a region that is compacting can be interrupted
          in order to respond to a reassignment.

          Ted Yu made changes -
          Link This issue relates to HBASE-3943 [ HBASE-3943 ]
          stack added a comment -

          See StochasticBalancer. Resolving as implemented.

          stack made changes -
          Status: Open [ 1 ] -> Resolved [ 5 ]
          Resolution: Implemented [ 10 ]
          Transition: Open -> Resolved
          Time In Source Status: 1041d 23h 32m
          Execution Times: 1
          Last Executer: stack
          Last Execution Date: 26/Jan/14 22:45

            People

            • Assignee:
              Unassigned
              Reporter:
              Ted Yu
            • Votes:
              0
              Watchers:
              12

              Dates

              • Created:
                Updated:
                Resolved:

                Development