HBASE-5844

Delete the region servers znode after a regions server crash

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.95.2
    • Fix Version/s: 0.95.0
    • Component/s: regionserver, scripts
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Tags:
      0.96notable

      Description

      Today, if a region server crashes, its znode is not deleted from ZooKeeper, so the recovery process starts only after a timeout, usually 30s.

      By deleting the znode in the start script, we remove this delay and recovery starts immediately.
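The idea can be sketched as a small shell function. This is illustrative only, not the actual bin/hbase-daemon.sh code; the znode-file convention and the deleter command are assumptions for the example:

```shell
#!/usr/bin/env bash
# Sketch of the cleanup step: the running server writes the name of its
# ephemeral znode to a file; when the script starts a server after a crash,
# it deletes that znode so the master begins recovery immediately instead
# of waiting for the ZooKeeper session timeout.
clean_znode() {
  local znode_file="$1"   # e.g. /tmp/hbase-$USER-regionserver.znode (assumed name)
  local deleter="$2"      # e.g. an "hbase zkcli"-based delete command in a real setup
  if [ -f "$znode_file" ]; then
    local znode
    znode="$(cat "$znode_file")"
    # Remove the stale ephemeral znode left behind by the crashed process.
    $deleter "$znode"
    rm -f "$znode_file"
  fi
}
```

If the previous process exited cleanly, it deletes both the znode and the file itself, so the function finds nothing to do.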

      1. 5844.v1.patch
        5 kB
        Nicolas Liochon
      2. 5844.v2.patch
        5 kB
        Nicolas Liochon
      3. 5844.v3.patch
        5 kB
        stack
      4. 5844.v3.patch
        6 kB
        Nicolas Liochon
      5. 5844.v4.patch
        6 kB
        Nicolas Liochon

        Issue Links

          Activity

          stack added a comment -

          Marking closed.

          Liang Lee added a comment -

          And I still have another question: in the hbase-daemon.sh file, in the cleanZNode() function, which znode is deleted?
          Under the normal scenario, when the RegionServer starts, it creates a temporary znode on ZooKeeper under the path /hbase/rs. So what do we delete in the cleanZNode() function?
          Thank you!

          Liang Lee added a comment -

          Hi stack, could you please provide some documentation on how to use this patch, like for HBASE-7404?
          Where should the core configuration HBASE_ZNODE_FILE be set?
          Thanks

          Jean-Daniel Cryans added a comment -

          Encountered another problem that I think I can link to this jira. I was trying to run HBase from trunk without internet access and, like in my Sept 25th comment, I got an empty line after start-hbase.sh, but now nothing is running. The .log file doesn't show anything after logging ulimit, and nothing's in the .out file. After running some bash -x, I was able to figure out that the nohup output was being suppressed. See:

          jdcryans-MBPr:hbase-github jdcryans$ ./bin/start-hbase.sh 
          jdcryans-MBPr:hbase-github jdcryans$
          jdcryans-MBPr:hbase-github jdcryans$ bash -x ./bin/start-hbase.sh 
          ... some stuff then
          + /Users/jdcryans/git/hbase-github/bin/hbase-daemon.sh start master
          jdcryans-MBPr:hbase-github jdcryans$ bash -x /Users/jdcryans/git/hbase-github/bin/hbase-daemon.sh start master
          ... more stuff
          + nohup /Users/jdcryans/git/hbase-github/bin/hbase-daemon.sh --config /Users/jdcryans/git/hbase-github/bin/../conf internal_start master
          jdcryans-MBPr:hbase-github jdcryans$ nohup /Users/jdcryans/git/hbase-github/bin/hbase-daemon.sh --config /Users/jdcryans/git/hbase-github/bin/../conf internal_start master
          appending output to nohup.out
          

          So now I see that it's writing to nohup.out, which in turn tells me what really happened:

          Caused by: java.lang.ClassNotFoundException: org.apache.zookeeper.KeeperException
          	at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
          	at java.security.AccessController.doPrivileged(Native Method)
          	at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
          	at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
          	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
          	at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
          

          Reproing can be done by physically deleting any jar listed in target/cached_classpath.txt. In my case I think the jar wasn't available because I had no internet connection.

          I wonder what other errors it could hide like this.

          Nicolas Liochon added a comment -

          It's strange, I didn't reproduce it. I should have, because it seems logical. Will look into it and create jiras.
          Anyway, there are bugs around this scenario. For example, when it fails we now have a new pid file, but this pid does not match the process. This is true in 0.90 as well. If there is no process, the error for the stop (in 0.96) will be "no regionserver to stop because kill -0 of pid 49938 failed with status 1". If another process took this id (yes, it should not happen often), the kill will succeed.
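The pid-file check being discussed can be sketched as follows; the function name and file layout are illustrative, not the exact hbase-daemon.sh code:

```shell
#!/usr/bin/env bash
# Sketch of a kill -0 liveness check against a pid file. kill -0 sends no
# signal: it only reports whether the pid exists and can be signalled.
# As noted in the comment above, a stale pid file whose pid has been
# recycled by another process yields a false positive.
is_daemon_alive() {
  local pid_file="$1"
  [ -f "$pid_file" ] || return 1
  local pid
  pid="$(cat "$pid_file")"
  kill -0 "$pid" 2>/dev/null
}
```

A stop script built on this check refuses to act when the pid is gone ("no regionserver to stop"), which is the failure mode described above.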

          Jean-Daniel Cryans added a comment -

          To repro: start a local distributed cluster (without hadoop), remove the RS's pid file in /tmp, and try starting a region server again. My guess is that since the port is occupied, the second won't start, but the znode deletion still runs afterwards.

          Nicolas Liochon added a comment -

          btw: I'm having a look at this to understand what's happening.

          Nicolas Liochon added a comment -

          Humm.
          The feature is important imho: not having to wait 30s (at best) before starting a recovery is really nice.
          In an ideal world, ZooKeeper would make this less useful by detecting the dead process sooner, but even then it couldn't be faster than this.

          Note that the znode removal should occur when the process finishes, not before starting a new one. What JD describes seems like a bug to me.

          stack added a comment -

          Looking at this w/ J-D: now we no longer do nohup, so the parent process can stick around to watch for the server crash. This makes it so there are now two hbase processes listed per launched daemon. This is kinda ugly.

          When we have this bash script watching the running java process, we verge into the territory normally occupied by babysitters like supervise. Our parent bash script will always be less than a real babysitter (supervise, god, etc.), so maybe we should just ship this znode cleanup as an optional script w/ a prescription for how to set it up; e.g. run the znode remover on daemon crash before starting a new one (if we want supervise to start a new one).

          I'm thinking we should back this out since there are still open questions.

          Jean-Daniel Cryans added a comment -

          One thing that worries me about this patch is the situation where the pid file is gone and someone tries to start the region server. It has happened to me a bunch of times. I tried it with your patch, and since it removes the ephemeral znode, it kills the region server that's already running and doesn't start a new one because the ports are already occupied.

          I'm not sure if this is related to this patch, but we're now missing info when using the scripts. We used to have:

          su-jdcryans-2:0.94 jdcryans$ ./bin/start-hbase.sh 
          localhost: starting zookeeper, logging to /Users/jdcryans/Work/HBase/0.94/bin/../logs/hbase-jdcryans-zookeeper-h-25-185.sfo.stumble.net.out
          starting master, logging to /Users/jdcryans/Work/HBase/0.94/bin/../logs/hbase-jdcryans-master-h-25-185.sfo.stumble.net.out
          localhost: starting regionserver, logging to /Users/jdcryans/Work/HBase/0.94/bin/../logs/hbase-jdcryans-regionserver-h-25-185.sfo.stumble.net.out
          

          Now we have:

          su-jdcryans-2:trunk-commit jdcryans$ ./bin/start-hbase.sh 
          
          su-jdcryans-2:trunk-commit jdcryans$ 
          
          Hudson added a comment -

          Integrated in HBase-TRUNK-security #192 (See https://builds.apache.org/job/HBase-TRUNK-security/192/)
          HBASE-5844 Delete the region servers znode after a regions server crash (Revision 1334028)

          Result = SUCCESS
          stack :
          Files :

          • /hbase/trunk/bin/hbase-daemon.sh
          • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
          Hudson added a comment -

          Integrated in HBase-TRUNK #2848 (See https://builds.apache.org/job/HBase-TRUNK/2848/)
          HBASE-5844 Delete the region servers znode after a regions server crash (Revision 1334028)

          Result = FAILURE
          stack :
          Files :

          • /hbase/trunk/bin/hbase-daemon.sh
          • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
          stack added a comment -

          Committed to trunk. Thanks for the patch N.

          Nicolas Liochon added a comment -

          Ready to be committed imho.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12525420/5844.v4.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 hadoop23. The patch compiles against the hadoop 0.23.x profile.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in .

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/1743//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/1743//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1743//console

          This message is automatically generated.

          Nicolas Liochon added a comment -

          v4 should be ok.
          I will do another jira for the master.

          Nicolas Liochon added a comment -

          I found the issue, and (hopefully) a fix. I will have a new patch middle of next week, I will include the master znode in this one...

          Hudson added a comment -

          Integrated in HBase-TRUNK-security #186 (See https://builds.apache.org/job/HBase-TRUNK-security/186/)
          HBASE-5844 Delete the region servers znode after a regions server crash; REVERT (Revision 1330983)

          Result = SUCCESS
          stack :
          Files :

          • /hbase/trunk/bin/hbase-daemon.sh
          • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
          stack added a comment -

          I reverted the patch from trunk.

          Nicolas Liochon added a comment -

          There is a regression when the cluster is fully distributed: the start command hangs. I'm on it. In the meantime, would it be possible to undo the commit?

          Sorry about this.

          Hudson added a comment -

          Integrated in HBase-TRUNK-security #182 (See https://builds.apache.org/job/HBase-TRUNK-security/182/)
          HBASE-5844 Delete the region servers znode after a regions server crash (Revision 1329430)

          Result = SUCCESS
          stack :
          Files :

          • /hbase/trunk/bin/hbase-daemon.sh
          • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
          Hudson added a comment -

          Integrated in HBase-TRUNK #2801 (See https://builds.apache.org/job/HBase-TRUNK/2801/)
          HBASE-5844 Delete the region servers znode after a regions server crash (Revision 1329430)

          Result = FAILURE
          stack :
          Files :

          • /hbase/trunk/bin/hbase-daemon.sh
          • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
          stack added a comment -

          Committed to trunk. Thanks for the patch N.

          stack added a comment -

          What I committed. It's v2 + addressing Ted's comment.

          stack added a comment -

          @N Ok. I'll commit this, then in a new JIRA move the common code out to util or some place. I'll handle Ted's comment on commit.

          Ted Yu added a comment -
          +      "(Environment variable HBASE_ZNODE_FILE is no set).");
          

          'is no set' -> 'is not set'

          Nicolas Liochon added a comment -

          You're right. I propose to commit this patch, I will then generalize the solution to master in another jira.

          stack added a comment -

          Should the master write out its znode name too? If it crashes, this code could bring on the second master faster?

          Nicolas Liochon added a comment -

          v2 should be ok. It no longer includes the fix for HBASE-5666, so it cannot be tested locally, but I tried it before removing the workaround.

          stack added a comment -

          Ok on your reasoning for not using deleteOnExit. Try to have the two methods share more code, like getting the name of the file w/ the znode name in it. Otherwise, sounds good.

          Nicolas Liochon added a comment -

          For the tracker: it's my private workaround for HBASE-5666; it should not have been included in this patch. Sorry about this.

          I think it's better to delete the file explicitly, just after the znode deletion. HRegionServer#deleteMyEphemeralNode is called only once, and I added deleteMyEphemeralNodeOnDisk just after this call. If we rely on #deleteOnExit, I fear we could have the file deleted while the znode is still alive. I'm not sure and I have not tried it, but I think it's too easy to wander into jvm-specific-behavior territory here.

          I will fix the java code and try the whole fix on a real cluster for the v2.

          Thank you for the review.

          stack added a comment -

          If we go to a count > 100 we just continue the startup? Is that what you want?

          +    while (!tracker.checkIfBaseNodeAvailable() && ++count<100) {
          +      Thread.sleep(100);
          +    }
          

          Be like the rest of the code as regards spaces; i.e. spaces around operators...

          +
          + if (fileName==null){

          Maybe you don't need deleteMyEphemeralNodeOnDisk if you instead use http://docs.oracle.com/javase/6/docs/api/java/io/File.html#deleteOnExit() inside writeMyEphemeralNodeOnDisk?

          Patch looks good N.

          We upped the timeout because noobs would install hbase then run big mapreduce jobs w/o tuning the jvm, and so hit big GCs. We figured they'd rather have their regionserver ride over the big pauses than have it be 'sensitive' out of the box.

          Nicolas Liochon added a comment -

          I didn't know this parameter. It's interesting, because with ZK the default timeout is 30 seconds, but with HBase it's now 180s (from hbase-default.xml). It was first increased, to 60s, in HBASE-1772. It seems it was increased because of GC.

          But it means that immediately deleting the ZK znode represents a huge mttr improvement for the region server crash case.
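For reference, the timeout being discussed is controlled by the zookeeper.session.timeout setting; a sketch of an hbase-site.xml override (the value shown is illustrative, and the effective default has changed across releases, as this thread notes):

```xml
<!-- hbase-site.xml: upper bound on how long ZooKeeper keeps a dead
     region server's session (and thus its ephemeral znode) alive. -->
<property>
  <name>zookeeper.session.timeout</name>
  <value>30000</value><!-- milliseconds; illustrative value -->
</property>
```

Note that ZooKeeper itself bounds the negotiated session timeout based on its server-side tickTime, so the value actually in effect can be lower than what HBase requests.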

          Jean-Daniel Cryans added a comment -

          It is deleted automatically; it's an ephemeral znode, so it takes zk.session.timeout for it to disappear.

          Nicolas Liochon added a comment -

          Patch v1. Tested on a local cluster & in pseudo-distributed mode, by stopping the server with kill -9. I will do some minor improvements to the java code and then test on a real cluster, but I'm interested in feedback on the script.


            People

            • Assignee:
              Nicolas Liochon
              Reporter:
              Nicolas Liochon
            • Votes:
              0
              Watchers:
              6
