Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.92.0
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      The current blocking we do when we are close to some limits (memstores over the multiplier factor, too many store files, global memstore memory) is bad, too coarse and confusing. After hitting HBASE-5161, it really becomes obvious that we need something better.

      I did a little brainstorm with Stack, we came up quickly with two solutions:

      • Send some exception to the client, like OverloadedException, that's thrown when some situation happens like getting past the low memory barrier. It would be thrown when the client gets a handler and does some check while putting or deleting. The client would treat this as a retryable exception but ideally wouldn't check .META. for a new location. It could be fancy and have multiple levels of pushback, like sending the exception to 25% of the clients, and then going up if the situation persists. Should be "easy" to implement but we'll be using a lot more IO to send the payload over and over again (but at least it wouldn't sit in the RS's memory).
      • Send a message alongside a successful put or delete to tell the client to slow down a little; this way we don't have to do back and forth with the payload between the client and the server. It's a cleaner (I think) but more involved solution.

      In every case the RS should do very obvious things to notify the operators of this situation, through logs, web UI, metrics, etc.

      Other ideas?
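      A minimal sketch of how the first option might look, purely for illustration; the exception name, its pushback field, and the client-side retry loop below are hypothetical, not an existing HBase API:

      import java.io.IOException;

      import org.apache.hadoop.hbase.client.HTable;
      import org.apache.hadoop.hbase.client.Put;

      // Hypothetical pushback exception thrown by the RS when past the low memory barrier.
      class OverloadedException extends IOException {
        final int pushbackLevel; // e.g. 25, 50, 100: percentage of clients being pushed back
        OverloadedException(int pushbackLevel) {
          super("region server overloaded, pushback level " + pushbackLevel);
          this.pushbackLevel = pushbackLevel;
        }
      }

      class PushbackClient {
        // Retry in place: the region has not moved, so do not go back to .META.
        static void putWithPushback(HTable table, Put put, long basePauseMs)
            throws IOException, InterruptedException {
          for (int tries = 0; ; tries++) {
            try {
              table.put(put);
              return;
            } catch (OverloadedException e) {
              Thread.sleep(basePauseMs * (tries + 1)); // simple linear backoff per retry
            }
          }
        }
      }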

        Issue Links

          Activity

          Jean-Daniel Cryans created issue -
          Lars Hofhansl added a comment -

          The exception part is nice, because it works without touching RPC.

          I wonder if that should be a phased approach?
          I.e. if there is light pressure, just delay the operations in the server threads (uses up server threads and memory, but will be transparent to the client, who will just see a slightly less performant server).
          When the pressure gets higher despite the server delays, we'd throw exceptions back to the client, and if that is not enough we block as we do now.

          The interesting part is: how exactly would we measure the pressure? Based on the number of storefiles and the current rate of updates?

          Jean-Daniel Cryans added a comment -

          Yeah to me the most interesting/challenging part is figuring out what's considered a "heavy load" plus all the nuances for the phases.

          Right now we have a pattern of "soft" and "hard" limits:

          • memstore multiplier: memstores can grow up to 2x their configured size by default before blocking.
          • compaction threshold and blocking values: the former is when we should kick off a compaction and the latter is when we are in a "too many store files" mode.
          • global memstore barriers: the lower one starts force-flushing the bigger regions and the upper one blocks inserts.

          I may be missing some others.

          Do we want to keep them or piggyback on them? In all three cases we don't tell the client anything until it blocks; instead we just try to take action ASAP. One way could be to send resets once we're past the soft limit... but I'm not sure it's really going to help.
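          For reference, a sketch of where those soft/hard knobs live in the configuration; the property names are the era-appropriate ones and the defaults shown are assumptions, so check hbase-default.xml rather than trusting this snippet:

          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.hbase.HBaseConfiguration;

          public class LimitDump {
            public static void main(String[] args) {
              Configuration conf = HBaseConfiguration.create();
              // Memstore multiplier: a region may grow to multiplier x flush size before blocking.
              int memstoreMultiplier = conf.getInt("hbase.hregion.memstore.block.multiplier", 2);
              // Compaction threshold (soft) vs. blocking store files (hard).
              int compactionThreshold = conf.getInt("hbase.hstore.compactionThreshold", 3);
              int blockingStoreFiles = conf.getInt("hbase.hstore.blockingStoreFiles", 7);
              // Global memstore barriers: lower limit force-flushes big regions, upper limit blocks writes.
              float globalLower = conf.getFloat("hbase.regionserver.global.memstore.lowerLimit", 0.35f);
              float globalUpper = conf.getFloat("hbase.regionserver.global.memstore.upperLimit", 0.4f);
              System.out.printf("multiplier=%d compactionThreshold=%d blockingStoreFiles=%d global=[%.2f, %.2f]%n",
                  memstoreMultiplier, compactionThreshold, blockingStoreFiles, globalLower, globalUpper);
            }
          }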

          Jesse Yates added a comment -

          Worked up an initial implementation that doesn't actually break the RPC (I think, even if the client or the server has monitoring turned on), but avoids passing exceptions back to the client, which IMHO is the least clean way to handle this since exceptions are for exceptional cases, not the common case.

          Right now the implementation only covers Puts, but could definitely be extended to cover other writes; I wanted to get some feedback on general style, etc. before going for the full blown implementation (and all the general cleanup associated with a 'real' patch).

          As far as how it is put together...
          Client has a backoff policy to take into consideration what the server is saying.
          This policy can take into account the current, max, and growth size of the write buffer for figuring out how long to sleep. The server pressure is unwrapped from the server in the Result as a MonitoredResult and then updated on the client. The next put will then take into account that pressure when attempting a put. The problem here is that the server can't tell all clients that the pressure has gone down, but that trade-off is common given traditional collision/backoff situations.

          The client has also been given permission to grow to a multiplier of its writeBufferSize, similar to the memstore multiplier, allowing it to buffer more writes. If a write is within the expansion range, we want to allow the client to accept more writes while waiting/backing off, so we launch a flusher thread that, after waiting the backoff time, will flush the writes (singleton). This gives us back-off as well as some flexibility on the client as to how much we buffer. To disable the backoff behavior, it's as simple as setting the multiplier to 1, so the expansion = max.

          Similarly, on the RS, we make the policy for monitoring pluggable and optional to enable the blocking for a given threshold. The policy here can take into account the current number of store files, the number where we will get blocking and the current % full of the memstore (total size and max size). This allows us to ensure that we only slow down based on a given policy (yet to be implemented) but do get protection for the RS from malicious/over-eager clients.

          This impl may be a little over the top, but it does cover both sides of the on-the-wire story. The only things left to do are to extend this to other writes as well as creating at least one (if not two or three) tunable policies for figuring out blocking.

          Thoughts?
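          A rough sketch of how the pieces described above could fit together; the type names mirror the files in the attached patch, but the method signatures and fields below are assumptions, not the actual patch:

          import org.apache.hadoop.hbase.client.Result;

          // Client-side policy: given the server-reported pressure and the local write buffer
          // state, decide how long the flusher thread should wait before flushing.
          interface BackoffPolicy {                        // mirrors BackoffPolicy.java in the patch
            long getBackoffTimeMs(float serverPressure,    // 0.0 (idle) to 1.0 (at the blocking limit)
                                  long currentBufferSize,
                                  long maxBufferSize);
          }

          // Result subclass that carries the region server's pressure back with a successful
          // write, so no exception and no extra round trip is needed.
          class MonitoredResult extends Result {           // mirrors MonitoredResult.java
            private final float pressure;
            MonitoredResult(float pressure) { this.pressure = pressure; }
            float getPressure() { return pressure; }
          }

          // A "do nothing" policy, equivalent to setting the write-buffer multiplier to 1.
          class NoServerBackoffPolicy implements BackoffPolicy {
            public long getBackoffTimeMs(float pressure, long current, long max) { return 0L; }
          }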

          Jesse Yates made changes -
          Field Original Value New Value
          Attachment java_HBASE-5162.patch [ 12514851 ]
          Lars Hofhansl added a comment -

          Might be good to publish it on review board; that patch is a bit big to read as a diff.

          Jesse Yates added a comment -

          @lars - can do. I didn't want it up there initially as that tends to find a lot of nitpicks, which I'm sure this patch has lots of.

          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/3930/
          -----------------------------------------------------------

          Review request for hbase, Michael Stack, Jean-Daniel Cryans, and Lars Hofhansl.

          Summary
          -------

          Under heavy write load, HBase will create a saw-tooth pattern in accepting writes. This is due to the I/O in minor compactions not being able to keep up with the write load. Specifically, the memstore is attempting to flush while we are attempting to do a minor compaction, leading to blocking all writes. Instead, we need the option of a graceful degradation mechanism.

          This patch supports both a short-term, adjustable server-side write blocking as well as client-side back-off to help alleviate temporary memstore pressure.

          This addresses bug HBASE-5162.
          https://issues.apache.org/jira/browse/HBASE-5162

          Diffs


          src/main/java/org/apache/hadoop/hbase/HConstants.java 763fe89
          src/main/java/org/apache/hadoop/hbase/client/BackoffPolicy.java PRE-CREATION
          src/main/java/org/apache/hadoop/hbase/client/HTable.java 57605e6
          src/main/java/org/apache/hadoop/hbase/client/MonitoredResult.java PRE-CREATION
          src/main/java/org/apache/hadoop/hbase/client/NoServerBackoffPolicy.java PRE-CREATION
          src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java 25cb31d
          src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java 7d7be3c
          src/main/java/org/apache/hadoop/hbase/regionserver/MemstorePressureMonitor.java PRE-CREATION
          src/main/java/org/apache/hadoop/hbase/regionserver/OperationStatus.java 1b94ab5
          src/main/java/org/apache/hadoop/hbase/regionserver/PressureMonitor.java PRE-CREATION
          src/main/java/org/apache/hadoop/hbase/regionserver/SimpleMemStorePressureMonitor.java PRE-CREATION
          src/main/java/org/apache/hadoop/hbase/regionserver/StoreUtils.java PRE-CREATION
          src/test/java/org/apache/hadoop/hbase/client/TestClientBackOff.java PRE-CREATION
          src/test/java/org/apache/hadoop/hbase/regionserver/TestMemStorePressureMonitor.java PRE-CREATION

          Diff: https://reviews.apache.org/r/3930/diff

          Testing
          -------

          Thanks,

          Jesse

          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/3930/#review5209
          -----------------------------------------------------------

          src/main/java/org/apache/hadoop/hbase/client/HTable.java
          <https://reviews.apache.org/r/3930/#comment11397>

          I find this a bit dubious.
          This won't actually slow the client thread down, but just accumulate more data and reduce the number of RPCs. In the end it might lead to more load on the server, because we can deliver more puts with fewer but larger batches.

          I'd rather just rely on the server sleeping the thread for a bit (as you do later).

          src/main/java/org/apache/hadoop/hbase/client/HTable.java
          <https://reviews.apache.org/r/3930/#comment11399>

          What if the flusher is not null? Should we re-calculate the wait time?

          src/main/java/org/apache/hadoop/hbase/client/HTable.java
          <https://reviews.apache.org/r/3930/#comment11400>

          If sleepTime is 0 (for example from NoServerBackoffPolicy), we should probably not create the thread and flush right here.

          (But as I said in the comment above, I'd probably not bother with this extra thread to begin with )

          src/main/java/org/apache/hadoop/hbase/client/HTable.java
          <https://reviews.apache.org/r/3930/#comment11401>

          Was this a problem before? Or only now because of the background thread?

          src/main/java/org/apache/hadoop/hbase/client/HTable.java
          <https://reviews.apache.org/r/3930/#comment11398>

          This will break backwards compatibility, right? (Not saying that's not ok, just calling it out)

          I'd almost rather have the client not know about this, until we reach a bad spot (in which case we can throw back retryable exceptions).

          src/main/java/org/apache/hadoop/hbase/regionserver/MemstorePressureMonitor.java
          <https://reviews.apache.org/r/3930/#comment11402>

          Ah ok, this is where we gracefully delay the server thread a bit.
          Seems this would need to be tweaked carefully to make it effective while not slowing normal operations.

          Should the serverPauseTime be somehow related to the amount of pressure?
          I.e. wait a bit more if the pressure is higher?
          Maybe the pause time calculation should be part of the pluggable policy?

          Also in the jira there was some discussion about throwing (presumably retryable) exceptions back to the client if the pressure gets too high. That would slow the client, without consuming server resources (beyond multiple requests).

          src/main/java/org/apache/hadoop/hbase/regionserver/StoreUtils.java
          <https://reviews.apache.org/r/3930/#comment11403>

          General comment: Where are we on putting/documenting these things in hbase-defaults.xml?

          • Lars
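          One way the "pause time scales with pressure" idea could look as part of the pluggable policy; the class name and the formula below are only an illustration, not anything in the patch:

          // Illustrative only: scale the server-side pause with the measured pressure instead
          // of using a fixed serverPauseTime.
          class PressureScaledPause {
            private final long maxPauseMs;

            PressureScaledPause(long maxPauseMs) { this.maxPauseMs = maxPauseMs; }

            // pressure in [0, 1]: 0 = idle, 1 = at the hard/blocking limit.
            long pauseFor(float pressure) {
              if (pressure <= 0f) return 0L;                    // normal operation: no delay at all
              if (pressure >= 1f) return maxPauseMs;            // at the hard limit: maximum pause (or block)
              return (long) (maxPauseMs * pressure * pressure); // quadratic ramp: gentle at low pressure
            }
          }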

          jiraposter@reviews.apache.org added a comment -

          On 2012-02-18 04:58:21, Lars Hofhansl wrote:

          > src/main/java/org/apache/hadoop/hbase/regionserver/StoreUtils.java, line 7

          > <https://reviews.apache.org/r/3930/diff/1/?file=75496#file75496line7>

          >

          > General comment: Where are we on putting/documenting these things in hbase-defaults.xml?

          That is already in there, but having the constants just reinforces the 'correct' behavior. Though I think there are some situations where the code is different from defaults.xml

          On 2012-02-18 04:58:21, Lars Hofhansl wrote:

          > src/main/java/org/apache/hadoop/hbase/regionserver/MemstorePressureMonitor.java, line 124

          > <https://reviews.apache.org/r/3930/diff/1/?file=75492#file75492line124>

          >

          > Ah ok, this is where we gracefully delay the server thread a bit.

          > Seems this would need to be tweaked carefully to make it effective while not slowing normal operations.

          >

          > Should the serverPauseTime be somehow related to the amount of pressure.

          > I.e. wait a bit more if the pressure is higher?

          > Maybe the pausetime calculation should be part of the pluggable policy?

          >

          > Also in the jira there was some discussion about throwing (presumably retryable) exceptions back to the client if the pressure gets too high. That would slow the client, without consuming server resources (beyond multiple requests).

          Definitely needs some knobs with the policy here and tuning.

          It would be possible to tie the pause time to the amount of pressure (some sort of multiplier effect). Would have to think about the implications of that

          Throwing the exception back to the client here means you get a lot of bandwidth overhead as the client passes back a bunch of puts again, which is rough. Otherwise, we are definitely going to be messing with wire compatibility in a functional (if not strictly RPC) sense.

          On 2012-02-18 04:58:21, Lars Hofhansl wrote:

          > src/main/java/org/apache/hadoop/hbase/client/HTable.java, line 979

          > <https://reviews.apache.org/r/3930/diff/1/?file=75487#file75487line979>

          >

          > This will break backwards compatibility, right? (Not saying that's not ok, just calling it out)

          >

          > I'd almost rather have the client not know about this, until we reach a bad spot (in which case we can throw back retryable exceptions).

          I don't think so. If the server isn't upgraded to send MonitoredResults, then this bit of code won't matter. If the client isn't upgraded, then it won't even notice that the MonitoredResult isn't just a Result.

          On 2012-02-18 04:58:21, Lars Hofhansl wrote:

          > src/main/java/org/apache/hadoop/hbase/client/HTable.java, line 958

          > <https://reviews.apache.org/r/3930/diff/1/?file=75487#file75487line958>

          >

          > Was this a problem before? Or only now because of the background thread?

          Just b/c of the background thread.

          On 2012-02-18 04:58:21, Lars Hofhansl wrote:

          > src/main/java/org/apache/hadoop/hbase/client/HTable.java, line 792

          > <https://reviews.apache.org/r/3930/diff/1/?file=75487#file75487line792>

          >

          > If sleepTime is 0 (for example from NoServerBackoffPolicy), we should probably not create the thread and flush right here.

          >

          > (But as I said in the comment above, I'd probably not bother with this extra thread to begin with )

          +1 fixing in next version, assuming the client stuff is kept

          On 2012-02-18 04:58:21, Lars Hofhansl wrote:

          > src/main/java/org/apache/hadoop/hbase/client/HTable.java, line 791

          > <https://reviews.apache.org/r/3930/diff/1/?file=75487#file75487line791>

          >

          > What if the flusher is not null? Should we re-calculate the wait time?

          We are temporarily growing the write buffer, so if we get another write, I was thinking we would just let it accumulate too - it will get flushed soon enough. Only corner case here is if the size grows beyond the allowed, in which case there will be a forced flush.

          On 2012-02-18 04:58:21, Lars Hofhansl wrote:

          > src/main/java/org/apache/hadoop/hbase/client/HTable.java, line 790

          > <https://reviews.apache.org/r/3930/diff/1/?file=75487#file75487line790>

          >

          > I find this a bit dubious.

          > This won't actually slow the client thread down, but just accumulate more data and reduce the number of RPCs. In the end it might lead to more load on the server, because we can deliver more puts with fewer but larger batches.

          >

          > I'd rather just rely on the server sleeping the thread for a bit (as you do later).

          The question here is what we consider the 'client'. To me, the client was the client-side HTable, not the using application. So yes, this will not slow the using application, but that was the intention... keeping HBase seeming just as responsive while letting the messaging layer handle the queuing and buffering.

          Technically, this will actually slow the rate of writes to the server, which is what we really are worrying about. Essentially, this implementation is such that we temporarily increase batch size so the server has time to finish its compactions and flushing.

          The server can handle the larger batch sizes (except for extreme cases) since the assumption has to be clientMemory << serverMemory.

          This is just experimental; I'll need to do some stress testing to actually see the effects.

          • Jesse
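          To make the "temporarily bigger batches" idea concrete, a sketch of the client-side expansion check; the class and field names are assumptions for illustration, not the patch itself:

          // Under pressure the client lets its write buffer grow up to multiplier * writeBufferSize
          // and defers the flush to a background flusher, instead of flushing immediately.
          class ExpandingWriteBuffer {
            private final long writeBufferSize; // normal flush threshold
            private final double multiplier;    // >= 1; setting it to 1 disables the expansion
            private long currentSize;

            ExpandingWriteBuffer(long writeBufferSize, double multiplier) {
              this.writeBufferSize = writeBufferSize;
              this.multiplier = multiplier;
            }

            /** @return true if the caller must flush right now, false if buffering may continue. */
            boolean add(long mutationSize, boolean underPressure) {
              currentSize += mutationSize;
              long hardLimit = (long) (writeBufferSize * multiplier);
              if (currentSize >= hardLimit) {
                return true;                    // past the expansion range: forced flush
              }
              if (currentSize >= writeBufferSize) {
                // Past the normal threshold: flush now unless the server asked us to back off,
                // in which case the delayed background flusher will handle it.
                return !underPressure;
              }
              return false;
            }
          }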

          Lars Hofhansl added a comment -

          Let's aim for 0.94.1 with this.

          Lars Hofhansl made changes -
          Fix Version/s 0.94.1 [ 12320257 ]
          Fix Version/s 0.94.0 [ 12316419 ]
          Lars Hofhansl added a comment -

          Is this ready? Moving to 0.96. Feel free to pull back.

          Lars Hofhansl made changes -
          Fix Version/s 0.96.0 [ 12320040 ]
          Fix Version/s 0.94.1 [ 12320257 ]
          Jesse Yates added a comment -

          @Lars not yet - it needs some more testing and validation on a real cluster. I've been a bit busy and haven't gotten a chance to swing back around to it.

          Todd Lipcon made changes -
          Link This issue is related to HBASE-6423 [ HBASE-6423 ]
          Ian Varley made changes -
          Link This issue relates to HBASE-5333 [ HBASE-5333 ]
          Ian Varley made changes -
          Link This issue is duplicated by HBASE-5333 [ HBASE-5333 ]
          Ian Varley made changes -
          Link This issue relates to HBASE-5333 [ HBASE-5333 ]
          Todd Lipcon made changes -
          Link This issue is related to HBASE-2981 [ HBASE-2981 ]
          Jimmy Xiang added a comment -

          Isn't the exception way much cleaner and simpler? I think the exception way is greedy, and the HBase client code needs minimal change. Based on the retry count, it can adjust the delay time as it goes.

          For the load monitoring, we assume the load trend remains the same, which may not actually be the case. The client side has to track the regions whose memstore is under pressure. Every client needs to do the same tracking.

          Jesse Yates added a comment -

          Jimmy Xiang not really - the exception causes old clients to fail and uses exception type inheritance + instanceof checking to get the information across. The client code does need minimal change (see my attached patch), but it just feels dirty, especially when we have PBs for everything. What would be far better is to just pass that info back as an optional PB parameter and have the client monitor it or not. This both works with older versions AND allows optionally turning it off on the client if your workload needs it. The workload doesn't stay the same - the client gets constant information (at least with each mutation across the wire) back from the region and adjusts as it's updated.

          So yeah, every client that has this enabled would have to track the information. But you could make it so you just track the most recent 'n' regions. And an advanced version would also do some load half-life mechanism on the client if it hasn't talked to that region in a while.
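          A sketch of that client-side tracking; the class name, the region cap, and the half-life constant are assumptions for illustration: keep the most recently used 'n' regions and decay their reported pressure over time.

          import java.util.LinkedHashMap;
          import java.util.Map;

          class RegionPressureTracker {
            private static final int MAX_REGIONS = 256;       // only track the most recent 'n' regions
            private static final long HALF_LIFE_MS = 30000L;  // reported pressure halves every 30s

            // regionName -> {pressure percent, timestamp}; access-ordered so eldest entries age out.
            private final Map<String, long[]> pressure =
                new LinkedHashMap<String, long[]>(16, 0.75f, true) {
                  protected boolean removeEldestEntry(Map.Entry<String, long[]> eldest) {
                    return size() > MAX_REGIONS;
                  }
                };

            synchronized void update(String regionName, int pressurePct) {
              pressure.put(regionName, new long[] { pressurePct, System.currentTimeMillis() });
            }

            synchronized int currentPressure(String regionName) {
              long[] entry = pressure.get(regionName);
              if (entry == null) return 0;
              double halfLives = (System.currentTimeMillis() - entry[1]) / (double) HALF_LIFE_MS;
              return (int) (entry[0] * Math.pow(0.5, halfLives)); // exponential decay ("half-life")
            }
          }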

          Lars Hofhansl added a comment -

          We could just throw a ServerTooBusyException (or something), which would extend IOException, so the client would just retry. However, after a while the client would fail, when it should just wait longer.
          Not sure how much better that would be than what we have now, though. It would be better to gradually slow clients down, so that they still make progress, but only as fast as the system can absorb it.

          Jesse Yates added a comment -

          The problem with throwing just another IOException with the default client is that the client thinks its writes failed, in which case it retries, causing MORE load on the already overloaded server. You'd have to hack the client a bit, which means that on 0.94 you still get the retry behaviour.

          Jesse Yates added a comment -

          I tried to tackle this the PB way a little while ago (though it may have been in conjunction with some other ticket that I don't recall at the moment) and you basically need to rewrite how the client does batch puts to allow you to actually get at the returned PB object, rather than having it pass the underlying errors/rows to the callback. IIRC this wasn't super clean as the code there is very convoluted. However, you can tie it into the other mutation methods really easily as just a hook on the serverCallable.

          Jimmy Xiang added a comment -

          I was thinking RegionTooBusyException, so close. As long as the server releases the IPC handler in such a scenario, accessing other regions should not be blocked. The point is that we don't want a busy region to block a whole region server.

          As to the old clients, right, the behavior is a little different. But they should not fail, as they already expect exceptions and handle them properly, for example by retrying, although it may not be as efficient as delaying more when the retry count is higher.

          As to measuring the load, I think it is a good idea. I'm just concerned about spending too much effort on it without first trying the simple approach, which is known to work.

          Jimmy Xiang added a comment -

          For the extra load due to retries, it is mostly on the network, not very much on the server. The retry time is configurable based on the number of tries:

          ConnectionUtils.getPauseTime(pause, tries)

          As long as the retry is gradually slowing down, is it acceptable?
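          For reference, a minimal sketch of the retry pacing Jimmy is describing, assuming the ConnectionUtils.getPauseTime(pause, tries) helper and a retryable "too busy" exception as proposed above; the surrounding loop is an illustration, not the actual client code:

          import java.io.IOException;

          import org.apache.hadoop.hbase.RegionTooBusyException;
          import org.apache.hadoop.hbase.client.ConnectionUtils;
          import org.apache.hadoop.hbase.client.HTable;
          import org.apache.hadoop.hbase.client.Put;

          class BusyRetry {
            // Each retry sleeps getPauseTime(pause, tries), so the delay grows with the retry count.
            static void putWithRetry(HTable table, Put put, long pauseMs, int maxTries)
                throws IOException, InterruptedException {
              for (int tries = 0; tries < maxTries; tries++) {
                try {
                  table.put(put);
                  return;
                } catch (RegionTooBusyException e) {
                  Thread.sleep(ConnectionUtils.getPauseTime(pauseMs, tries));
                }
              }
              throw new IOException("region still too busy after " + maxTries + " tries");
            }
          }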

          Jesse Yates added a comment -

          Jimmy Xiang my patch is up there, if you want to run with it... I'm a bit busy at the moment.

          The client has also been given permission to grow to a multiplier of its writeBufferSize, similar to the memstore multiplier, allowing it to buffer more writes. If a write is within the expansion range, we want to allow the client to accept more writes while waiting/backing off, so we launch a flusher thread that, after waiting the backoff time, will flush the writes (singleton). This gives us back-off as well as some flexibility on the client as to how much we buffer. To disable the backoff behavior, it's as simple as setting the multiplier to 1, so the expansion = max.

          -Jesse

          I don't think this is really the right way to handle this; instead, just plug in a per-region wait object that ties in the most recent information. Also, from the above, using a MonitoredResult rather than throwing an exception ensures the same behavior on old and new clients. It's ugly, but as long as you put in a TODO, I'm okay with it.

          You might need to rewrite a bit of my patch - there are probably some good hints as to the right places to hook in code, but I think just having the clients that 'care' wait will be good enough for now. The addition of the server waiting is a bad idea as it has a much larger effect on the system as a whole.

          Jesse Yates added a comment -

          My concern is that the retry is completely a waste - just accept the write and then slow it down later. Otherwise you are resending the data a lot more than necessary, which is going to put more network load (especially with large puts) than you need, possibly multiple times over.

          stack made changes -
          Fix Version/s 0.95.0 [ 12324094 ]
          Fix Version/s 0.96.0 [ 12320040 ]
          Lars Hofhansl made changes -
          Fix Version/s 0.94.0 [ 12316419 ]
          Lars Hofhansl made changes -
          Fix Version/s 0.94.1 [ 12320257 ]
          Fix Version/s 0.94.0 [ 12316419 ]
          Lars Hofhansl made changes -
          Fix Version/s 0.94.1 [ 12320257 ]
          stack made changes -
          Fix Version/s 0.95.1 [ 12324288 ]
          Fix Version/s 0.95.0 [ 12324094 ]
          stack added a comment -

          Important issue but won't be done for 0.95

          stack made changes -
          Fix Version/s 0.95.1 [ 12324288 ]
          Nicolas Liochon added a comment -

          Arriving from HBASE-11208, we could:

          • accept the write, as mentioned by Jesse
          • return a load status (something like a number between 0 and 100, maybe two figures: one for the region and one for the region server); the client can then use this to slow down. The AsyncProcess would ease this. A rough sketch of the client side follows the list.
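          In that sketch, all names and the scaling are assumptions: the server returns a 0-100 load figure with each write result and the client turns it into a delay for its next batch.

          // Hypothetical client-side handling of the load-status idea above.
          class LoadStatusBackoff {
            private final long maxDelayMs;
            private volatile int lastRegionLoad;   // 0-100, last value reported for this region
            private volatile int lastServerLoad;   // 0-100, last value reported for the region server

            LoadStatusBackoff(long maxDelayMs) { this.maxDelayMs = maxDelayMs; }

            // Called when a write result (e.g. via AsyncProcess) carries the two load figures.
            void onWriteResult(int regionLoad, int serverLoad) {
              this.lastRegionLoad = regionLoad;
              this.lastServerLoad = serverLoad;
            }

            // Delay the next batch proportionally to the worse of the two figures.
            long nextDelayMs() {
              int load = Math.max(lastRegionLoad, lastServerLoad);
              return maxDelayMs * load / 100;
            }
          }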

            People

            • Assignee: Unassigned
            • Reporter: Jean-Daniel Cryans
            • Votes: 0
            • Watchers: 17
