HBASE-2981: Improve consistency of performance during heavy load

    Details

    • Type: Brainstorming
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: regionserver
    • Labels: None

      Description

      Currently when running load tests like YCSB, we experience periods of complete cluster inactivity while all clients are blocked on a single region which is unavailable.

      As discussed recently on the list, we should brainstorm some ideas for how to improve the situation: the goal here is to (a) minimize the amount of time when a region is inaccessible, and (b) minimize the impact that one inaccessible region has on other operations in the cluster.


          Activity

          Todd Lipcon added a comment -

          One case of blocking is when a region hits the blocking store files limit. In current versions, once a region hits the limit, it enters the compaction queue at the back. So if there are other compactions already in the queue, it can take some time before the blocked region gets unblocked, since it's waiting on all the other compactions.

          I think a similar thing can happen with the flushes that happen around a close or split.

          If we prioritize compactions (HBASE-2646) we can make these compactions that block progress take precedence over normal "maintenance" compactions.
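
          (A rough sketch of the priority idea, using hypothetical class names rather than the actual HBASE-2646 implementation: requests from regions that have hit the blocking store-file limit jump ahead of routine maintenance compactions.)

          import java.util.concurrent.PriorityBlockingQueue;

          // Hypothetical sketch: a compaction queue ordered by priority so that
          // "blocking" requests are served before routine maintenance compactions.
          public class PrioritizedCompactionQueue {

              enum Priority { BLOCKING, NORMAL }  // BLOCKING sorts ahead of NORMAL

              static class CompactionRequest implements Comparable<CompactionRequest> {
                  final String regionName;
                  final Priority priority;
                  final long enqueueTime = System.nanoTime();

                  CompactionRequest(String regionName, Priority priority) {
                      this.regionName = regionName;
                      this.priority = priority;
                  }

                  @Override
                  public int compareTo(CompactionRequest other) {
                      // Higher priority first; within the same priority, FIFO.
                      int byPriority = priority.compareTo(other.priority);
                      return byPriority != 0 ? byPriority : Long.compare(enqueueTime, other.enqueueTime);
                  }
              }

              private final PriorityBlockingQueue<CompactionRequest> queue = new PriorityBlockingQueue<>();

              /** Called when a flush pushes a store past the blocking file count. */
              void requestBlockingCompaction(String region) {
                  queue.add(new CompactionRequest(region, Priority.BLOCKING));
              }

              /** Called for routine size- or age-triggered compactions. */
              void requestMaintenanceCompaction(String region) {
                  queue.add(new CompactionRequest(region, Priority.NORMAL));
              }

              /** The compaction thread always pulls the most urgent request next. */
              CompactionRequest take() throws InterruptedException {
                  return queue.take();
              }
          }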

          stack added a comment -

          Just for the record, hbase has worked in this "stutter"-y manner since time immemorial.

          Jean-Daniel Cryans added a comment -

          Well, I do remember a time when we would just grow to 200 store files or OOME on memstore size.

          Todd Lipcon added a comment -

          Linked HBASE-2782 to add some QoS to META. We currently have a situation where if we have some RS that's hosting META as well as some other region, and the other region gets blocked, all of the handlers can fill up waiting for that region. Then all other clients end up getting blocked from making progress since they can't reach META (or ROOT).
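
          (A rough sketch of the QoS idea, using hypothetical class names: reserve a few handlers for catalog regions so that a blocked user region cannot starve .META. or -ROOT- lookups.)

          import java.util.concurrent.ExecutorService;
          import java.util.concurrent.Executors;

          // Hypothetical sketch: route catalog-region requests to a small reserved
          // handler pool so they keep flowing even when the general pool is full of
          // calls stuck on a blocked user region.
          public class CatalogQosDispatcher {

              private final ExecutorService catalogHandlers = Executors.newFixedThreadPool(2);
              private final ExecutorService userHandlers = Executors.newFixedThreadPool(25);

              void dispatch(String regionName, Runnable call) {
                  if (isCatalogRegion(regionName)) {
                      catalogHandlers.submit(call);   // never blocked by user-region traffic
                  } else {
                      userHandlers.submit(call);
                  }
              }

              private boolean isCatalogRegion(String regionName) {
                  // Catalog regions at the time were -ROOT- and .META.
                  return regionName.startsWith("-ROOT-") || regionName.startsWith(".META.");
              }
          }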

          Todd Lipcon added a comment -

          Another idea that has been bandied about is changing the current "blocking store files" behavior to something that's more like a squishy pillow instead of a brick wall. That is to say, as we start to approach the "blocking" limit number of store files, we add some "friction" to writes going into that region. This way writes will slow down and give the compaction queue some time to catch up. Basically we want to smoothly throttle the overall request rate to something we can consistently handle, rather than letting everyone write like crazy and then throw up a wall while we catch up.
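
          (A rough sketch of the "friction" curve, with illustrative thresholds rather than real HBase configuration values: once a store passes a soft limit, each write pays a delay that grows linearly until the hard blocking limit is reached.)

          // Hypothetical sketch: smooth per-write throttling instead of a hard block.
          public class WriteFriction {

              private static final int SOFT_LIMIT = 7;       // start adding friction here
              private static final int BLOCKING_LIMIT = 12;  // only hard-block at this point
              private static final long MAX_DELAY_MS = 500;  // worst-case delay per write

              /** Delay (ms) a write should pay given the store's current file count. */
              static long delayForStoreFileCount(int storeFiles) {
                  if (storeFiles <= SOFT_LIMIT) {
                      return 0;
                  }
                  // Ramp the delay linearly between the soft and blocking limits so the
                  // write rate degrades smoothly while the compaction queue catches up.
                  double fraction = Math.min(1.0,
                      (storeFiles - SOFT_LIMIT) / (double) (BLOCKING_LIMIT - SOFT_LIMIT));
                  return (long) (fraction * MAX_DELAY_MS);
              }

              static void maybeThrottle(int storeFiles) throws InterruptedException {
                  long delay = delayForStoreFileCount(storeFiles);
                  if (delay > 0) {
                      Thread.sleep(delay);  // the "squishy pillow" in place of a brick wall
                  }
              }
          }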

          stack added a comment -

          I made it into the new 'Brainstorming' issue type.


            People

            • Assignee: Unassigned
            • Reporter: Todd Lipcon
            • Votes: 0
            • Watchers: 8
