HBASE-5075

regionserver crashed and failover

    Details

    • Hadoop Flags:
      Incompatible change

      Description

      When a regionserver crashes, it takes too long to notify the HMaster, and once the HMaster knows about the regionserver's shutdown, it takes a long time to recover the HLog's lease.
      HBase is an online DB, so availability is very important.
      I have an idea to improve availability: a monitor node checks the regionserver's PID. If the PID no longer exists, I consider the RS down, delete its znode, and force-close the HLog file.
      With this, the detection period could be about 100ms.
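
      As a rough sketch of the idea (not the attached patch; the class name and znode handling here are illustrative assumptions), a monitor could poll the regionserver's PID and delete its ephemeral znode once the process disappears:

        import java.io.File;

        import org.apache.zookeeper.KeeperException;
        import org.apache.zookeeper.ZooKeeper;

        // Hypothetical watchdog loop: poll the RS pid, and if the process is gone,
        // delete the RS ephemeral znode so the master notices the failure without
        // waiting for the ZooKeeper session timeout.
        public class RegionServerWatchdog {
          public static void watch(ZooKeeper zk, int pid, String rsZnode)
              throws InterruptedException, KeeperException {
            while (true) {
              // On Linux, /proc/<pid> exists only while the process is alive.
              if (!new File("/proc/" + pid).exists()) {
                zk.delete(rsZnode, -1); // -1 = any version
                return;
              }
              Thread.sleep(100); // the ~100ms polling period mentioned above
            }
          }
        }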

      1. HBase-5075-shell.patch
        1 kB
        zhiyuan.dai
      2. Degion of Failure Detection.pdf
        291 kB
        zhiyuan.dai
      3. HBase-5075-src.patch
        40 kB
        zhiyuan.dai

        Issue Links

          Activity

          zhiyuan.dai added a comment -

          HBase is an online DB; its availability is very important.
          If a regionserver crashes, we can't wait too long to recover service.
          So my patch improves HBase's failover ability.

          I come from Alipay in China. We run several HBase clusters online, and we have the second-largest cluster in China.

          junhua yang added a comment -

          Hi,
          I think it is very important to shorten the recovery time.
          Right now, waiting for regionserver recovery can take a very long time, which is not acceptable for an online service.
          Lots of errors will be thrown to clients and affect customers.

          So could you help provide your solution? @stack, what do you think about the HBase failover approach now?

          Do you have any plan to improve it?

          zhiyuan.dai added a comment -

          Patch and development plan.

          Ted Yu added a comment -

          One solution is to expire the region server znode more quickly.

          zhiyuan.dai added a comment -

          @Zhihong Yu
          Expiring the region server znode depends on ZK's session timeout, so we can instead use a ping and a PID check to detect the RS crash, and then delete the znode ourselves.

          ronghai.ma added a comment -

          Is it possible to use a packet socket to check the status?

          stack added a comment -

          @ronghai.ma Put up a patch so we can see what you are thinking; are you talking about a supervisor-like process that will remove the regionserver ephemeral node if the pid goes missing? Thanks.

          zhiyuan.dai added a comment -

          @stack I have submitted the patch. Thanks.

          zhiyuan.dai added a comment -

          @stack You are right, I am indeed considering a supervisor-like process that removes the regionserver ephemeral node if the PID goes missing and a ping fails (new Socket - Connection refused). I am now translating our design documents.

          stack added a comment -

          Thanks for doing this. It looks very interesting.

          Please do not reformat existing code. It bloats your patch and makes reviews take longer; reviewer attention span is short (at least in this case) and it's a shame to spend it going over code reformats.

          On the patch, is this necessary: + public String getRSPidAndRsZknode();

          Can't you get the pid from a process listing? Or do you want us to publish it via JMX? Or it looks like it is already published via JMX. Can your tool pick it up there? On the znode, can't you get the regionserver servername and then do the lookup in ZK directly?

          Can't you have supervisor do this? Are there no existing utilities that watch a pid and let you do stuff when it's gone? Or is it that you'd kill the server on a long GC pause?

          Do you have a bit of documentation on how this new utility works?

          Thanks.
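
          (As an aside, one hedged sketch of how an external tool could pick up the pid without a new RPC method: on HotSpot JVMs the runtime MXBean name is conventionally "<pid>@<hostname>", so something like the following works, though that format is a JVM convention rather than a guaranteed contract.)

            import java.lang.management.ManagementFactory;

            // Illustrative only: read the JVM's own pid from the RuntimeMXBean name.
            public class PidFromJmx {
              public static void main(String[] args) {
                String name = ManagementFactory.getRuntimeMXBean().getName(); // e.g. "12345@host1"
                String pid = name.split("@")[0];
                System.out.println("pid=" + pid);
              }
            }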

          Lars Hofhansl added a comment -

          I would second Stack's request to create a patch without the format changes. Also, there are some author tags in the javadoc (which we don't do with Apache code).

          Is this guarding against just the RegionServer process dying (but its machine staying up), or also against the machine dying? (I know I could take a closer look at the patch, but it's easier if you just tell me.)

          zhiyuan.dai added a comment -

          @Lars Hofhansl
          I am sorry that I reformatted the existing code; I will prepare another patch.

          There are two versions of this work. We have implemented version 1, which can detect a crashed process but not a dead machine. The next version will cover both.

          Thanks for your reply.

          zhiyuan.dai added a comment -

          @stack
          I am sorry that I reformatted the existing code; I will prepare another patch.

          Thanks for the design points you've mentioned. I am working on the design documents, which include the answers to your questions, and I will upload them as soon as possible.

          zhiyuan.dai added a comment -

          @stack @Lars Hofhansl
          First, the RPC method getRSPidAndRsZknode fetches the PID and the znode, which includes the domain and service port; this is reliable. If we used the process listing, there could be some misjudgment.

          Second, there is a supervisor called RegionServerFailureDetection. We first start the regionserver and then start RegionServerFailureDetection, which is a watchdog of the RegionServer.

          The supervisor (RegionServerFailureDetection) then fetches the PID and znode via getRSPidAndRsZknode.

          RegionServerFailureDetection has no interaction with long GC pauses.

          RegionServerFailureDetection first checks whether the PID is alive and then checks whether the service port is alive.
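
          A hedged sketch of what the port-liveness half of such a check could look like (class and method names here are illustrative, not taken from the patch):

            import java.io.IOException;
            import java.net.InetSocketAddress;
            import java.net.Socket;

            // Illustrative liveness probe: try to open a TCP connection to the RS
            // service port with a short timeout. A refused connection, unreachable
            // host, or timeout is treated as the server being down.
            public class PortProbe {
              public static boolean isPortAlive(String host, int port, int timeoutMs) {
                try (Socket socket = new Socket()) {
                  socket.connect(new InetSocketAddress(host, port), timeoutMs);
                  return true;
                } catch (IOException e) {
                  return false;
                }
              }
            }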

          ZhengBowen added a comment -

          The shell script that starts and stops the monitors.

          zhiyuan.dai added a comment -

          @Lars Hofhansl
          We have made the shell patch, which includes starting and stopping the supervisor.
          Can you review HBASE-5075?
          Thank you very much.

          Lars Hofhansl added a comment -

          @zhiyuan.dai:
          Would you mind if I uploaded your patch to "review board"? It's easier to review there.
          You put a lot of work into this; thanks for the documentation.

          I am a bit worried about maintaining an additional process on every machine (if I understand this correctly).

          zhiyuan.dai added a comment -

          @Lars Hofhansl
          Thank you very much for your attention.
          I don't mind uploading it to review board.

          zhiyuan.dai added a comment -

          @Lars Hofhansl
          What's the URL of this patch on review board?

          Lars Hofhansl added a comment -

          Actually, the patches do not apply cleanly to HBase trunk.

          Jesse Yates added a comment -

          Haven't had a chance to look at the latest patch yet, but I have read through the docs. I have the same concern as Lars, namely:

          "a bit worried about maintaining an additional process on every machine"

          What about doing something a bit simpler, like adding a runtime shutdown hook to the RS, such that the region server will update ZK or the master when it decides to bail out? Even something as simple as just removing your own znode on failure would be sufficient to cover this use case, correct?
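
          A minimal sketch of that idea (assuming the RS keeps a handle to its ZooKeeper connection and knows its own znode path; the class and method names are illustrative, not existing HBase API):

            import org.apache.zookeeper.ZooKeeper;

            // Illustrative only: register a JVM shutdown hook that removes the RS's
            // own ephemeral znode, so a normal process exit is noticed without
            // waiting for the ZK session timeout. A kill -9 still bypasses this.
            public class ZnodeCleanupHook {
              public static void install(final ZooKeeper zk, final String myZnode) {
                Runtime.getRuntime().addShutdownHook(new Thread(new Runnable() {
                  public void run() {
                    try {
                      zk.delete(myZnode, -1); // best effort; session may already be gone
                    } catch (Exception e) {
                      // ignore: nothing useful can be done during shutdown
                    }
                  }
                }));
              }
            }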

          stack added a comment -

          "Even something as simple as just removing your own znode on failure would be sufficient to cover this use case, correct?"

          Let's do that regardless. Good idea.

          stack added a comment -

          This issue seems to be like 'HBASE-2342 Consider adding a watchdog node next to region server'

          Jesse Yates added a comment -

          Yeah, very similar. Same issues with that ticket as before, namely wanting to keep HBase as simple and minimal as we can justify.

          stack added a comment -

          Rather than write a new supervisor, why not use something old school like http://supervisord.org/? A wrapper script could clear the old znode from ZK before restarting a new RS instance.

          stack added a comment -

          Looking in HRegionServer code, it looks like we delete our znode on the way out already. Someone had your idea already Jesse:

              try {
                deleteMyEphemeralNode();
              } catch (KeeperException e) {
                LOG.warn("Failed deleting my ephemeral node", e);
              }
          

          Maybe this is broke?

          Lars Hofhansl added a comment -

          Maybe have this in a shutdownHook as well?

          Of course that does not help if the RegionServer is "kill -9'd", or the RegionServer's machine just dies, or there's a network partition, etc, in which case we'd need to rely on the ZK timeout.

          Lars Hofhansl added a comment -

          In fact there is a shutdown hook installed in HRegionServer already, which calls stop() on the HRegionServer. We should get better mileage if we also remove the ephemeral node in stop().

          stack added a comment -

          Agreed

          Lars Hofhansl added a comment -

          @zhiyuan.dai: What do you think?

          Lars Hofhansl added a comment -

          On 2nd thought: the ephemeral node can only be deleted as long as the ZK connection is active, which is by no means guaranteed for any shutdown hook; I'm also not sure about causing network IO from a shutdown hook.

          Looking at HRegionServer.run(), it looks like in pretty much all cases we reach the point where deleteMyEphemeralNode is called. Hmm...

          Jesse Yates added a comment -

          I had the same concerns about the network IO and (I think) blocking call. However, with the shutdown hook, I think we can be more sure that it runs, rather than putting it after the run method. Also, the hooks run in their own thread, so on shutdown it's not going to block regular shutdown or any other synchronous operations.

          Granted, this doesn't deal with the kill -9 or network partition situation, but if that happens, you have some big problems anyway, and a minute (or whatever your ZK timeout is) of blocking probably isn't a big deal. Also note that in the latter case the daemon wouldn't be able to reach ZK anyway to remove the node, so you are back to where you were before.

          zhiyuan.dai added a comment -

          @Jesse
          Thanks for your reply.
          HBase is an online DB, so how long HBase failover takes is very important. Although a kill -9 or a network partition is a serious event, the supervisor can determine within milliseconds that its regionserver has crashed, and the HMaster can move the regions that were open on the crashed regionserver to other live regionservers. The failover time is therefore reduced to an acceptable level.

          As stack and Lars said, the shutdown hook is only called when the regionserver process is alive and program logic isn't interrupted. A kill -9 does not trigger the shutdown hook, so deleteMyEphemeralNode would not be executed, in which case we'd need to rely on the ZK timeout.

          My patch is meant to reduce the failover time, which improves the availability of HBase. We have some big online HBase clusters which all serve core applications, and the acceptable failover time for these applications is about 10s~20s, which includes splitting the HLog, recovering the HLog lease, and the ZK timeout.

          stack added a comment -

          @zhiyuan.dai What do you think of the idea of using supervisor or one of the other babysitting programs instead of writing our own from scratch? If you need the HBase regionservers to dump out their servername so you know what to kill up in ZK, that can be done easily enough....

          zhiyuan.dai added a comment -

          @stack
          First, thank you.
          Sorry, I don't quite understand your meaning. Do you mean another project instead of writing code into HBase?

          Nicolas Liochon added a comment -

          In the case that you want to handle (a region server crash without any hardware issue, i.e. a pure application bug), a possible solution as well is a restart loop in the launch script. This typically allows 20s failover (stop time + start time), and is even compatible with hot failover. It's faster and less error prone than monitoring the pid. But for HBase it would be a new start mode. It could make sense if we observe many more application bugs than hardware issues (this solution is quite common with C/C++ software, as it's easy to crash a process in those languages...).

          stack added a comment -

          "Do you mean another project instead of writing code into HBase?"

          Yes sir. Process babysitting is a pretty mature domain, with a wide variety of existing programs that have been debugged and are able to do this for you. What do you think about using one of the existing solutions rather than writing your own?

          ronghai.ma added a comment -

          If we want HA, even 1s is a long time.
          The scenario is on an individual machine: check the HRegionServer in less than 1000ms.
          A shell script alone is not enough. My solution: a tiny C/as program monitors the HRegionServer in less than 1000ms, and a shell script (using a loop or crontab) monitors the C/as program.
          Also, if we want more, we can use something like MC ServiceGuard.

          zhiyuan.dai added a comment -

          @stack
          Yes, I have understood what you mean; you are right. I will try http://supervisord.org/.
          Thank you.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12515893/HBase-5075-shell.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/3337//console

          This message is automatically generated.


            People

            • Assignee:
              Unassigned
              Reporter:
              zhiyuan.dai
            • Votes:
              2
              Watchers:
              15

              Dates

              • Created:
                Updated:

                Development