HBase
HBASE-5353

HA/Distributed HMaster via RegionServers

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 0.94.0
    • Fix Version/s: None
    • Component/s: master, regionserver
    • Labels:
      None

      Description

      Currently, the HMaster node(s) must be considered 'special' (though not a single point of failure), meaning that those nodes must be protected more than the other cluster machines, or at least specially monitored. Minimally, we always need to ensure that a master is running, rather than letting the system handle that internally. It should be possible instead to make the HMaster much more available, either in a distributed sense (which would mean a larger rewrite) or via multiple, dynamically created instances combined with hot failover of masters.

        Activity

        Jesse Yates added a comment -

        I was thinking about this and it seems like it wouldn't be that hard to have each of the regionservers doing leader election via ZK to select the one (or top 'n' RSs) that would spin up master instances on their local machines. Those new masters could do their own leader election in ZK to determine who is the current 'official' HMaster, and the others would act as hot failovers. If a master dies, the next RS in the list would spin up a master instance, ensuring that we always have a certain number of hot masters (clearly cascading failure here is a problem, but if that happens, you have bigger problems). Running the master in the same JVM is probably a bad idea, but you could potentially even use the startup scripts to spin up a separate JVM for the master.

        This also means some modification to the client to keep track of the current master, but that should be fairly trivial, as it already has the ZK connection (or can fail and re-look it up).

        stack added a comment -

        I'd say just run the master in-process w/ the regionserver. Master doesn't do much (it used to be heavily loaded when we did log splitting, but that's distributed now, or done on startup... but even then it should be fine).

        Client already tracks master location as you say, though we need to undo this... and just have the client do a read of ZK to find the master location when it needs it.

        Regarding the UI, we'd collapse it so there'd be a single webapp rather than the two we have now. There'd be a 'master' link. If the current regionserver were not the master, the master link would redirect you to the current master.

        Jesse Yates added a comment -

        I'd say just run the master in-process w/ the regionserver. Master doesn't do much (it used to be heavily loaded when we did log splitting, but that's distributed now, or done on startup... but even then it should be fine).

        I worry about putting too much in the same JVM; especially when you have a heavily loaded RS, you could be seriously hurt by JVM pauses when you up the heap to accommodate the master (it could be bad too when you have the larger JVM but no master running). Since, initially, this is going to be enabled via a configuration option, another option would be to choose between starting it in the same JVM vs. outside the JVM; seems to me it would work either way.

        Client already tracks master location as you say, though we need to undo this... and just have the client do a read of ZK to find the master location when it needs it.

        Talked with Lars H about doing this fix too - the client really doesn't need the long-running ZK connection, but should just hit ZK when it needs the master info. So that could be part of this ticket too.

        Regarding the UI, we'd collapse it so there'd be a single webapp rather than the two we have now.

        That only works if we have the same-JVM setup. I would argue that it should have an RS link rather than a master link (smile). But for the initial patch I would say the UI stuff should be on hold until the actual implementation gets worked out.

        Jimmy Xiang added a comment -

        Another option is not to have a master at all; every region server can do the work a master currently does, just using ZK to coordinate them.
        For example, once a region server dies, all the other region servers know about it and all try to run the dead-server cleanup, but only one will actually do it. The drawback here is too much ZK interaction.

        Todd Lipcon added a comment -

        Given that we already have failover support for the master, I'm skeptical that adding any complexity here is a good idea. If you want to colocate masters and RS, you can simply run a master process on a few of your RS nodes, and basically have the same behavior.

        What's the compelling use case? The master is not a SPOF since we already have hot failover support.

        Jesse Yates added a comment -

        have failover support for the master,

        Oh right. But this means you don't need to worry about where you run your master - the system takes care of all of that for you, making the startup process easier. It works well here b/c we have the master down to such a lightweight process.

        Todd Lipcon added a comment -

        But this means you don't need to worry about where you run your master

        Except it opens a new can of worms: where do you find the master UI? how do you monitor your master if it moves around? how do you easily find the master logs when it could be anywhere in the cluster?

        If your goal is to automatically pick a system to run a master on, you could have your cluster management software do that, but I only see additional complexity being introduced if you add this to HBase proper.

        Jesse Yates added a comment -

        where do you find the master UI? how do you monitor your master if it moves around? how do you easily find the master logs when it could be anywhere in the cluster?

        The cluster knows about it, so you can have a link on the web UI to the master or any of the region servers. As stack was saying above, the region server page would have a link to the master page. Same deal with the logs (or using something like the hbscan stuff from fbook).

        if your goal is to automatically pick a system to run a master on, you could have your cluster management software do that

        True, but if those masters fail over, then your cluster management needs to be aware enough of that to provision more, on different servers; afaik, this is a pain to do in a really 'cluster aware' sense. This way, it's all handled under the covers.

        Todd Lipcon added a comment -

        The cluster knows about it, so you can have a link on the webui to the master or any of the region servers

        And each of the potential masters publishes metrics to ganglia, so if you want to find the master metrics, you have to hunt around in the ganglia graphs for which master was active at that time?
        And any cron jobs or nagios alerts you write need to first call some HBase utility to find the active master's IP via ZK in order to get to it?

        True, but if those masters fail over, then your cluster management needs to be aware enough of that to provision more, on different servers

        If you have two masters on separate racks, and you have any reasonable monitoring, then your ops team will restart or provision a new one when they fail. I've never ever heard of this kind of scenario being a major cause of downtime.

        The whole thing seems like a bad idea to me. I won't -1 but consider me -0.5

        stack added a comment -

        Except it opens a new can of worms: where do you find the master UI? how do you monitor your master if it moves around? how do you easily find the master logs when it could be anywhere in the cluster?

        It's not a new can of worms, right? We have the above (mostly unsolved) problems now if you run with more than one master.

        And any cron jobs or nagios alerts you write need to first call some HBase utility to find the active master's IP via ZK in order to get to it?

        They should be doing this now, if multiple masters?

        If the master function were lightweight enough, it'd be kinda sweet having one daemon type only, I'd think; there'd no longer be a need for special treatment of the master. Might be tricky having them running in the same JVM, what w/ all the executors afloat and RPCs (I'd rather do it all in the one JVM than have the RS start/stop separate Master processes if we were going to go this route).

        Todd Lipcon added a comment -

        We have the above (mostly unsolved) problems now if you run with more than one master.

        They should be doing this now, if multiple masters?

        Sort of - except when you have two masters, you just set up nagios alerts and metrics to point to both, and you only need to look two places if you have an issue. If you have no idea where the master is, you have to hunt around the cluster to find it.

        If the master function were lightweight enough, it'd be kinda sweet having one daemon type only I'd think

        Except we'd still have multiple daemon types, logically; it's just that they'd be collocated inside the same process, making logs harder to de-interleave, etc.

        Plus, if your RSs are all collocated with TTs and heavily loaded, then I wouldn't want to see the master running on one of them. I'd rather just tell ops "these nodes run the important master daemons, please monitor them and any high utilization is problematic".

        Eli Collins added a comment -

        @Jesse, how is the HMaster machine a single point of failure? A SPOF is a part of the system that, if it fails, stops the entire system from working. Because there's automatic HMaster failover, that's not the case for HBase.

        Jesse Yates added a comment -

        @Eli - yeah, I already acknowledged that in my response to Todd; I'd forgotten we had added that to HBase.

        Eli Collins added a comment -

        Cool, thanks for the clarification. Mind updating the description to match the latest understanding?

        Jesse Yates added a comment -

        @Eli - done.

        stack added a comment -

        ...you only need to look two places if you have an issue. If you have no idea where the master is, you have to hunt around the cluster to find it.

        I'd imagine it'd be hard getting this patch in if there were no way to tell where the master is (And, again, don't we have this problem now if you start up three masters and one fails? You have to hunt around. We need to build the redirect piece regardless, such as a link to the master on each server page that redirects to the current master, and a history in ZK of who was master when).

        You could even make the combined master+regionserver daemon work like our current multimaster system by having there be affinity for a certain set of servers.

        What kind of nagios alerts would be master-specific? We need to add indirection to these now anyway – ask ZK who the master is – if more than one master is running. Metrics could be a little complicated, especially if the master moved servers over the period of interest, but aren't master metrics generally of less interest, since they are mostly just aggregates and ganglia or opentsdb do a better job of this anyway?

        Logs don't have to be interleaved. That's just a bit of log4j config?

        Yes, it could be an issue if the daemon is bogged down. The master would be less responsive, which should be fine for short periods, but if sustained it could be an issue.

        I'm not going to work on this. I do see it as something that could simplify our deploy story.

        Nicolas Spiegelberg added a comment -

        There's a lot of discussion here, so I might have missed something... How is searching for the master a problem? We already have bin/get-active-master.rb, a JRuby script that will search ZK for the current owner of the master lock. Combine that with 'bin/hbase start master --backup', and you have a full solution to let your monitoring scripts ping the correct server for UI information. The downtime for the master is dominated by the ZK ephemeral node timeout more than by process setup.

        The common 'annoyance' that we see internally is on the developer side (not the monitoring side). You see that servers are down, so you check out the web UI. The master went down, the backup master took over, then auto-remediation restarted the downed server in backup mode. This means that the UI is inaccessible from the normal location. DNS propagation takes a lot longer than restarting a process, so that's not really an option for us. Because of this, I think a more important feature is to have the backup masters set up a web server with an HTTP redirect to the active master's UI.

        Todd Lipcon added a comment -

        Because of this, I think a more important feature is to have the backup masters set up a web server with an HTTP redirect to the active master's UI.

        +1

        Jesse Yates added a comment -

        To follow up on Nicolas's comment: this would also be just a configuration option, not changing the way you have to run HBase. If it makes more sense for your setup to have a couple of known masters that you monitor (the current impl), then you can do it that way. Alternatively, if you just want one daemon type running for HBase - a region/master server - then you can do that too. A lot of the motivation around this JIRA was to try and make lives easier by having less stuff to worry about.

        Jonathan Hsieh added a comment -

        @Todd @Nicolas: a while back I started some work on HBASE-5083 but haven't posted the patch yet since it has a pretty nasty hack in it (adding 10 to the master port to get the info/http port). Some cluster status things have been changing recently – I'll wait for that to settle before I finish that patch.


          People

          • Assignee: Unassigned
          • Reporter: Jesse Yates
          • Votes: 0
          • Watchers: 5
