Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.89.20100621
    • Fix Version/s: 0.90.0
    • Component/s: regionserver
    • Labels: None

      Description

      I'd like to brainstorm some ideas on how we can prioritize reads and writes to META above reads and writes to other tables. I've noticed that if the regionserver hosting META is under heavy load, then lots of other operations take much longer than they should. For example, I'm currently running 120 threads of YCSB across 3 client nodes hitting a 5-node cluster. Doing a full scan of META (only 600 rows) takes upwards of 30 seconds in the shell, since all of the handler threads are tied up and there's a long RPC queue.

      Attachments

      1. HBASE-2782.txt
         22 kB
         ryan rawson

          Activity

          Todd Lipcon added a comment -

          Here are a couple quick ideas off the top of my head:

          1) Use a priority queue inside the RPC server, and allow the constructor of the server to specify a Comparator<Invocation> to decide which calls go first. We can then introspect the RPCs while they're in the queue to put META requests first.

          2) Allow the RPC server to specify a Function<Invocation, Long> which returns a priority level for any incoming RPC. This would allow not just ordering, as with the comparator above, but also the ability to reserve pools of handlers for different RPC tiers (e.g. any request to META has priority 1000, anything to normal tables is < 500, and we always reserve a pool of 5 handlers for META access); a rough sketch follows below.

          3) Give region servers TWO RPC ports, with separate handler pools - probably a big mess since we currently assume only one RPC port per server, but this is what HDFS is doing now and it seems to work.

          I kind of like option 2 above, and it doesn't seem like it would be incredibly difficult.
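
          For concreteness, here is a minimal sketch of what option 2 might look like; the names RpcPriorityFunction and MetaFirstPriority, the priority constants, and the region-name check are illustrative assumptions, not part of any patch on this issue.

          import java.lang.reflect.Method;

          // Option 2 sketch: map each incoming RPC to a numeric priority level.
          interface RpcPriorityFunction {
            long getPriority(Method method, Object[] params); // higher values are served first
          }

          class MetaFirstPriority implements RpcPriorityFunction {
            static final long META_PRIORITY = 1000L;   // META/ROOT requests jump the queue
            static final long NORMAL_PRIORITY = 100L;  // regular user-table traffic

            public long getPriority(Method method, Object[] params) {
              // Assumption: region operations carry the region name as their first
              // parameter, and catalog region names start with "-ROOT-" or ".META.".
              if (params != null && params.length > 0 && params[0] instanceof byte[]) {
                String regionName = new String((byte[]) params[0]);
                if (regionName.startsWith("-ROOT-") || regionName.startsWith(".META.")) {
                  return META_PRIORITY;
                }
              }
              return NORMAL_PRIORITY;
            }
          }

          A reserved pool could then be fed only with calls whose priority is at or above META_PRIORITY, while the remaining handlers drain everything else.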

          Jonathan Gray added a comment -

          This specialness of META, and the fact that ideally it should be distributed, is why I'd like to eventually move it (or a mirror of it) into ZK.

          Andrew Purtell added a comment -

           This specialness of META and that ideally it should be distributed is why I'd like to eventually move it (or a mirror of it) into zk.

          +1

          Todd Lipcon added a comment -

          Ehh, I'm still unconvinced. Moving META to ZK means extra work for things like snapshots, backups, etc., where currently we can use the same mechanisms for user tables as for meta tables. Plus these issues that we see on META are issues that we'll also see on user tables. For example, right now it's very easy for an MR job to completely monopolize the capacity of a cluster to the point that interactive queries start having 30sec+ latencies. Really good QoS is hard, but I think a simple solution like the one above can get us a lot of benefit for not much work. Especially if we can make the QoS policy pluggable, maybe someone will just write a really good one and contribute it back.

          Multi-tenancy is a huge problem we haven't even begun to tackle, and QoS is just a tiny bit of it, but my point is that we need to solve this problem regardless of what happens to META.

          Jonathan Gray added a comment -

          Hmm, I'm not sure I see snapshots and backups as issues with moving META to ZK (those things do not really exist today). I also think mirroring META into ZK but retaining a table for persistence could make good sense and also solve those issues, but let's not totally sidetrack this jira.

          I agree completely that we need to do work around QoS, and what you're proposing makes sense for META in the short term and helps build towards QoS for user tables. I'm still +1 on this jira but think we should not limit this to special-casing META.

          Todd Lipcon added a comment -

          Agree, not going to sidetrack. I'll bring up my gripes re snapshot/backup/etc. in the "move META to ZK" jira if/when it gets filed.

          Also agree that we should not special-case META in the framework. After sleeping on it, I continue to like option 2. All we should have to do for an initial cut is (a rough sketch follows below):
          1) Define an interface like CallPrioritizer that takes a call object (method name/params) and produces a PrioritizedCall object (probably just a wrapper for a Long plus the original call object).
          2) Internal to RPC, change callQueue over to a priority queue, where we compare by the call priorities. If no CallPrioritizer is specified, we just prioritize on insertion time like we do now.
          3) Provide a default implementation which prioritizes META requests above user requests. Make it a configurable class so people can experiment with fancier QoS. If this turns out to work well, we turn it on by default; otherwise we leave the "in-order execution" as the default.

          Then we can do other stuff like "reserved handler pools" or other ways of guaranteeing execution of high-priority requests in a followup JIRA.
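
          To make that plan concrete, here is a rough sketch; CallPrioritizer and PrioritizedCall are the names from the steps above, while the fields, the comparator, and the queue wiring are assumptions rather than the actual patch.

          import java.util.Comparator;
          import java.util.concurrent.BlockingQueue;
          import java.util.concurrent.PriorityBlockingQueue;

          // Step 1: the prioritizer turns a raw RPC call into a prioritized wrapper.
          interface CallPrioritizer {
            PrioritizedCall prioritize(Object call); // "Object" stands in for the RPC server's Call type
          }

          class PrioritizedCall {
            final long priority;       // higher runs first
            final long insertionTime;  // tie-breaker preserves FIFO order within a priority level
            final Object call;

            PrioritizedCall(long priority, Object call) {
              this.priority = priority;
              this.insertionTime = System.nanoTime();
              this.call = call;
            }
          }

          // Step 2: callQueue becomes a priority queue ordered by priority, then arrival time.
          class PrioritizedCallQueue {
            final BlockingQueue<PrioritizedCall> callQueue =
                new PriorityBlockingQueue<PrioritizedCall>(100, new Comparator<PrioritizedCall>() {
                  public int compare(PrioritizedCall a, PrioritizedCall b) {
                    if (a.priority != b.priority) {
                      return a.priority > b.priority ? -1 : 1;          // higher priority first
                    }
                    return a.insertionTime < b.insertionTime ? -1
                         : a.insertionTime > b.insertionTime ? 1 : 0;   // otherwise FIFO
                  }
                });
          }

          Step 3 would then be a default CallPrioritizer along the lines of the META-first function sketched earlier, selected via configuration so it can be swapped for fancier policies.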

          ryan rawson added a comment -

          I have a working patch for this, and it makes an amazing difference! I tested it on a version of HBase with that META deadlock and it cleared it right up. I'll post a patch soon.

          stack added a comment -

          This patch changes the API for RPC. It likely messes up Gary's work on making RPC pluggable, i.e. secure and non-secure RPC. Let me ask him what he thinks of this.

          What does this change mean?

          - private static final int MAX_QUEUE_SIZE_PER_HANDLER = 100;
          + private static final int MAX_QUEUE_SIZE_PER_HANDLER = 1000;

          QoSFunction has to be in HRS? We've been doing work to break up the massive classes.

          How about a unit test?

          ryan rawson added a comment -

          Where else would QoSFunction go? It has to know intimate details about the regionserver in order to discern whether a scan belongs to the META table or not.

          I tweaked the queue size per handler so that I would have less blocking, although I'm not sure how good an idea this is. I can revert on commit.

          There are only 2 API changes:

          • Add new parameters for high priority RPC pools and levels
          • Add new setter for QosFunction

          That is about it, IIRC... it's all internal otherwise?
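
          The shape of those two API changes might look roughly like this; all names and signatures below are guesses for illustration, not the committed ones.

          // Sketch of the two API-surface changes described above.
          class HBaseRpcServerSketch {
            interface QosFunction {
              int apply(Object call); // maps an incoming call to a priority level
            }

            private final int numHandlers;              // regular handler pool size
            private final int numHighPriorityHandlers;  // reserved pool for high-priority calls
            private final int highPriorityLevel;        // calls at or above this level use the reserved pool
            private QosFunction qosFunction;            // injected by the region server after construction

            HBaseRpcServerSketch(int numHandlers, int numHighPriorityHandlers, int highPriorityLevel) {
              this.numHandlers = numHandlers;
              this.numHighPriorityHandlers = numHighPriorityHandlers;
              this.highPriorityLevel = highPriorityLevel;
            }

            // Only the region server can tell whether a call touches META, so the QoS
            // function is set after the server is constructed rather than passed in.
            void setQosFunction(QosFunction f) {
              this.qosFunction = f;
            }
          }

          The setter keeps the RPC layer free of regionserver internals: only the region server knows how to map a call to a region, so it hands the mapping in after construction.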

          Todd Lipcon added a comment -

          Yeah, -1 on increasing the queue size - those calls can eat a lot of memory, e.g. if they're bulk puts.

          Over in Hadoop land they solved the pooling issue by having the NN listen on multiple ports, where each port is dedicated to a different "priority level". It's sort of nice since you can do network-level QoS as well to help things out, plus you don't have the issue of having to read a call in order to know whether it's high priority. The downside, of course, is that you expose many more ports. Thoughts?
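
          For reference, a bare-bones sketch of that multi-port approach; the port numbers, pool sizes, and class name are made up, and this is not taken from HDFS or HBase code.

          import java.io.IOException;
          import java.net.ServerSocket;
          import java.net.Socket;
          import java.util.concurrent.ExecutorService;
          import java.util.concurrent.Executors;

          // One listener per priority tier, each backed by its own handler pool, so a
          // flood of user-table calls on the normal port cannot starve META traffic.
          class TieredRpcPortsSketch {

            static void serve(final ServerSocket port, final ExecutorService handlers) {
              new Thread(new Runnable() {
                public void run() {
                  while (true) {
                    try {
                      final Socket conn = port.accept();
                      handlers.submit(new Runnable() {
                        public void run() {
                          // read and dispatch the RPC from conn; omitted in this sketch
                        }
                      });
                    } catch (IOException e) {
                      return; // socket closed, stop accepting
                    }
                  }
                }
              }).start();
            }

            public static void main(String[] args) throws Exception {
              // Port numbers and pool sizes are illustrative only.
              serve(new ServerSocket(60020), Executors.newFixedThreadPool(10)); // user tables
              serve(new ServerSocket(60025), Executors.newFixedThreadPool(5));  // META / high priority
            }
          }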

          ryan rawson added a comment -

          Let's go with the working code we have now... it isn't perfect, but it improves things substantially.

          stack added a comment -

          Shall we close this issue since the patch was committed and then open another for Todd's idea of another port to handle priority messages?

          ryan rawson added a comment -

          Ports for QOS: HBASE-3069


            People

            • Assignee: ryan rawson
            • Reporter: Todd Lipcon
            • Votes: 0
            • Watchers: 4
