HBase
  1. HBase
  2. HBASE-4209

The HBase hbase-daemon.sh SIGKILLs master when stopping it

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.92.0
    • Component/s: master
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      There's a bit of code in hbase-daemon.sh that makes HBase master being SIGKILLed when stopping it rather than trying SIGTERM (like it does for other daemons). When HBase is executed in a standalone mode (and the only daemon you need to run is master) that causes newly created tables to go missing as unflushed data is thrown out. If there was not a good reason to kill master with SIGKILL perhaps we can take that special case out and rely on SIGTERM.

      1. 4209-v3.txt
        6 kB
        stack
      2. HBASE-4209.final.patch.txt
        6 kB
        Roman Shaposhnik
      3. HBASE-4209.patch.txt
        3 kB
        Roman Shaposhnik

        Activity

        Hide
        Hudson added a comment -

        Integrated in HBase-TRUNK #2274 (See https://builds.apache.org/job/HBase-TRUNK/2274/)
        HBASE-4209 The HBase hbase-daemon.sh SIGKILLs master when stopping it

        stack :
        Files :

        • /hbase/trunk/CHANGES.txt
        • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/LocalHBaseCluster.java
        • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/ShutdownHook.java
        • /hbase/trunk/src/main/java/org/apache/hadoop/hbase/util/JVMClusterUtil.java
        Show
        Hudson added a comment - Integrated in HBase-TRUNK #2274 (See https://builds.apache.org/job/HBase-TRUNK/2274/ ) HBASE-4209 The HBase hbase-daemon.sh SIGKILLs master when stopping it stack : Files : /hbase/trunk/CHANGES.txt /hbase/trunk/src/main/java/org/apache/hadoop/hbase/LocalHBaseCluster.java /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/ShutdownHook.java /hbase/trunk/src/main/java/org/apache/hadoop/hbase/util/JVMClusterUtil.java
        Hide
        Hudson added a comment -

        Integrated in HBase-0.92 #35 (See https://builds.apache.org/job/HBase-0.92/35/)
        HBASE-4209 The HBase hbase-daemon.sh SIGKILLs master when stopping it

        stack :
        Files :

        • /hbase/branches/0.92/CHANGES.txt
        • /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/LocalHBaseCluster.java
        • /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/regionserver/ShutdownHook.java
        • /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/util/JVMClusterUtil.java
        Show
        Hudson added a comment - Integrated in HBase-0.92 #35 (See https://builds.apache.org/job/HBase-0.92/35/ ) HBASE-4209 The HBase hbase-daemon.sh SIGKILLs master when stopping it stack : Files : /hbase/branches/0.92/CHANGES.txt /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/LocalHBaseCluster.java /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/regionserver/ShutdownHook.java /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/util/JVMClusterUtil.java
        Hide
        stack added a comment -

        Committed to branch 0.92 and trunk. Thank you for the patch Roman.

        Show
        stack added a comment - Committed to branch 0.92 and trunk. Thank you for the patch Roman.
        Hide
        stack added a comment -

        Patch with the Todd fix and I swapped an unorthodox public AFTER a static qualifier. I tried this patch and it works for me. Committing this.

        Show
        stack added a comment - Patch with the Todd fix and I swapped an unorthodox public AFTER a static qualifier. I tried this patch and it works for me. Committing this.
        Hide
        Todd Lipcon added a comment -

        + * to be executed after the last regioserver referring to a given filesystem
        typo (regioserver)

        Otherwise looks pretty good. Thanks for fixing this messy bug.

        Show
        Todd Lipcon added a comment - + * to be executed after the last regioserver referring to a given filesystem typo (regioserver) Otherwise looks pretty good. Thanks for fixing this messy bug.
        Hide
        Roman Shaposhnik added a comment -

        Latest version of patch attached. Unfortunately, it has grown since last time I've submitted it to incorporated feedback from this JIRA.

        All tests pass on my local workstation.

        Show
        Roman Shaposhnik added a comment - Latest version of patch attached. Unfortunately, it has grown since last time I've submitted it to incorporated feedback from this JIRA. All tests pass on my local workstation.
        Hide
        Roman Shaposhnik added a comment -

        @stack

        > At least log the exception.
        > What is the exception we are suppressing? Should we at least log it here too?

        Basically, the only root cause of possible exceptions here is the code in
        suppressHdfsShutdownHook() (which is quite fragile and overly aggressive in
        throwing exceptions if things don't look quite right). Now, if you look at
        how this code is executed in a non-standalone case – HBase does bail on exceptions
        thrown from there. We can adopt the same approach in a standalone and unit testing
        case, but I wasn't sure it was the right thing to do. After all, NOT bailing out
        (and lets say simply logging these things) will be no worse than what we currently
        have (not calling shutdown hooks at all). On the other hand – if we do bail out
        aggressively we get a better chance of catching incompatibilities between
        suppressHdfsShutdownHook() logic and future hadoop releases during unit tests.

        So perhaps – bailing out would make the most sense of all.

        Show
        Roman Shaposhnik added a comment - @stack > At least log the exception. > What is the exception we are suppressing? Should we at least log it here too? Basically, the only root cause of possible exceptions here is the code in suppressHdfsShutdownHook() (which is quite fragile and overly aggressive in throwing exceptions if things don't look quite right). Now, if you look at how this code is executed in a non-standalone case – HBase does bail on exceptions thrown from there. We can adopt the same approach in a standalone and unit testing case, but I wasn't sure it was the right thing to do. After all, NOT bailing out (and lets say simply logging these things) will be no worse than what we currently have (not calling shutdown hooks at all). On the other hand – if we do bail out aggressively we get a better chance of catching incompatibilities between suppressHdfsShutdownHook() logic and future hadoop releases during unit tests. So perhaps – bailing out would make the most sense of all.
        Hide
        stack added a comment -

        This is not good:

        +    } catch (Throwable ex) {}
        

        At least log the exception.

        And similar here:

        +        try {
        +          HRegionServer.startRegionServer(t.getRegionServer());
        +        } catch (Throwable ex) {}
        +        // t.start();
        

        What is the exception we are suppressing? Should we at least log it here too?

        Otherwise, I'm fine with this referencecounting stuff. Its contained to the shutdown handler class.

        Good stuff.

        Show
        stack added a comment - This is not good: + } catch (Throwable ex) {} At least log the exception. And similar here: + try { + HRegionServer.startRegionServer(t.getRegionServer()); + } catch (Throwable ex) {} + // t.start(); What is the exception we are suppressing? Should we at least log it here too? Otherwise, I'm fine with this referencecounting stuff. Its contained to the shutdown handler class. Good stuff.
        Hide
        Roman Shaposhnik added a comment -

        I was asked yesterday how this can be tested. I'm not quite sure I can come up with a traditional unit test for this change, but here's how to verify it manually:

        $ bin/hbase master start & PID=$! ; sleep 15 ; kill $PID
        

        And look for LOG messages of the following nature:

        INFO regionserver.ShutdownHook: Shutdown hook starting; hbase.shutdown.hook=true; fsShutdownHook=Thread[Thread-18,5,main]
        ..............
        INFO regionserver.ShutdownHook: Starting fs shutdown hook thread.
        INFO regionserver.ShutdownHook: Shutdown hook finished.
        

        Without the patch – the HBase master simply dies (as in – no messages get produced)

        Show
        Roman Shaposhnik added a comment - I was asked yesterday how this can be tested. I'm not quite sure I can come up with a traditional unit test for this change, but here's how to verify it manually: $ bin/hbase master start & PID=$! ; sleep 15 ; kill $PID And look for LOG messages of the following nature: INFO regionserver.ShutdownHook: Shutdown hook starting; hbase.shutdown.hook=true; fsShutdownHook=Thread[Thread-18,5,main] .............. INFO regionserver.ShutdownHook: Starting fs shutdown hook thread. INFO regionserver.ShutdownHook: Shutdown hook finished. Without the patch – the HBase master simply dies (as in – no messages get produced)
        Hide
        Roman Shaposhnik added a comment -

        I'm attaching a sample patch that seems to be working well enough. The idea behind it is to reference count the fsShutdownHooks that we have to run from shutdown sequences of multiple RegionServers and only really run each when the last registered regionserver shuts down. Please let me know what do you think.

        Show
        Roman Shaposhnik added a comment - I'm attaching a sample patch that seems to be working well enough. The idea behind it is to reference count the fsShutdownHooks that we have to run from shutdown sequences of multiple RegionServers and only really run each when the last registered regionserver shuts down. Please let me know what do you think.
        Hide
        Roman Shaposhnik added a comment -

        stack, I'm sorry for putting this on a backburner, but at least I now have a better understanding of what's going on.

        Basically I got confused in a situation where suppressHdfsShutdownHook would be called multiple times on the same filesystem object. The first call would succeed, but all the other ones would fail. This is, obviously, just a problem with my patch, not the minihdfs cluster. I'll cook up an alternative and once I run the tests will attach an updated version.

        P.S. Thanks for the encouragement!

        Show
        Roman Shaposhnik added a comment - stack, I'm sorry for putting this on a backburner, but at least I now have a better understanding of what's going on. Basically I got confused in a situation where suppressHdfsShutdownHook would be called multiple times on the same filesystem object. The first call would succeed, but all the other ones would fail. This is, obviously, just a problem with my patch, not the minihdfs cluster. I'll cook up an alternative and once I run the tests will attach an updated version. P.S. Thanks for the encouragement!
        Hide
        stack added a comment -

        suppressHdfsShutdownHook can't disable the HDFS shutdown hook of the miniDFS, simply because there's nothing to disable.

        There is no hdfs shutdown hook when running minihdfs?

        And I don't follow how testing for localfs helps. I'm not clear on why testing for local fs makes a differenc when the shutdown hook is not in minidfs . I'm missing something.

        I'd appreciate your advice, since I'm just learning my way around HBase.

        You seem to be doing just fine to me. Obviously from above exchange you know more about this area of the code than I do (and I think I wrote it?). That suppressHdfsShutdownHook is really pretty (smile). Thats some ugly gymnastics going on in there.

        Show
        stack added a comment - suppressHdfsShutdownHook can't disable the HDFS shutdown hook of the miniDFS, simply because there's nothing to disable. There is no hdfs shutdown hook when running minihdfs? And I don't follow how testing for localfs helps. I'm not clear on why testing for local fs makes a differenc when the shutdown hook is not in minidfs . I'm missing something. I'd appreciate your advice, since I'm just learning my way around HBase. You seem to be doing just fine to me. Obviously from above exchange you know more about this area of the code than I do (and I think I wrote it?). That suppressHdfsShutdownHook is really pretty (smile). Thats some ugly gymnastics going on in there.
        Hide
        Roman Shaposhnik added a comment -

        stack, I have tried it with a standalone HBase writing to a local filesystem (I think there's no way making a standalone instance of HBase use HDFS). And that works quite nicely. Tables get flushed when I SIGTERM the process, etc. Once it came to testing I had to run it with miniDFS and that's where things got interesting. suppressHdfsShutdownHook can't disable the HDFS shutdown hook of the miniDFS, simply because there's nothing to disable.

        So here's the question, should I work around this one special case by simply testing for the

        LocalHBaseCluster.isLocal(hrs.getConfiguration())
        

        or would you propose something else?

        I'd appreciate your advice, since I'm just learning my way around HBase.

        Show
        Roman Shaposhnik added a comment - stack, I have tried it with a standalone HBase writing to a local filesystem (I think there's no way making a standalone instance of HBase use HDFS). And that works quite nicely. Tables get flushed when I SIGTERM the process, etc. Once it came to testing I had to run it with miniDFS and that's where things got interesting. suppressHdfsShutdownHook can't disable the HDFS shutdown hook of the miniDFS, simply because there's nothing to disable. So here's the question, should I work around this one special case by simply testing for the LocalHBaseCluster.isLocal(hrs.getConfiguration()) or would you propose something else? I'd appreciate your advice, since I'm just learning my way around HBase.
        Hide
        stack added a comment -

        Have you tried it Roman? Do all the tests pass? Seems like an innocuous enough change.

        Show
        stack added a comment - Have you tried it Roman? Do all the tests pass? Seems like an innocuous enough change.
        Hide
        Roman Shaposhnik added a comment -

        Perhaps this will be a naive suggestion, but wouldn't it make sense to install Shutdown hooks regardless of whether the Regioserver was started in a standalone mode. Will a trivial change like this one make sense:

        --- src/main/java/org/apache/hadoop/hbase/util/JVMClusterUtil.java
        +++ src/main/java/org/apache/hadoop/hbase/util/JVMClusterUtil.java
        @@ -179,7 +179,11 @@ public class JVMClusterUtil {
             }
             if (regionservers != null) {
               for (JVMClusterUtil.RegionServerThread t: regionservers) {
        -        t.start();
        +          try {
        +              HRegionServer.startRegionServer(t.getRegionServer());
        +          } catch (IOException e) {
        +              // Nothing to do, really
        +          }
               }
             }
             if (masters == null || masters.isEmpty()) {
        
        Show
        Roman Shaposhnik added a comment - Perhaps this will be a naive suggestion, but wouldn't it make sense to install Shutdown hooks regardless of whether the Regioserver was started in a standalone mode. Will a trivial change like this one make sense: --- src/main/java/org/apache/hadoop/hbase/util/JVMClusterUtil.java +++ src/main/java/org/apache/hadoop/hbase/util/JVMClusterUtil.java @@ -179,7 +179,11 @@ public class JVMClusterUtil { } if (regionservers != null) { for (JVMClusterUtil.RegionServerThread t: regionservers) { - t.start(); + try { + HRegionServer.startRegionServer(t.getRegionServer()); + } catch (IOException e) { + // Nothing to do, really + } } } if (masters == null || masters.isEmpty()) {
        Hide
        stack added a comment -

        Is it because no shutdown hook in master and when in standalone mode all runs in the one jvm, the master's effectively?

        In start-hbase.sh, if distmode is false, we ONLY start master:

        if [ "$distMode" == 'false' ]
        then
          "$bin"/hbase-daemon.sh start master
        else
          "$bin"/hbase-daemons.sh --config "${HBASE_CONF_DIR}" start zookeeper
          "$bin"/hbase-daemon.sh --config "${HBASE_CONF_DIR}" start master
          "$bin"/hbase-daemons.sh --config "${HBASE_CONF_DIR}" \
            --hosts "${HBASE_REGIONSERVERS}" start regionserver
          "$bin"/hbase-daemons.sh --config "${HBASE_CONF_DIR}" \
            --hosts "${HBASE_BACKUP_MASTERS}" start master-backup
        fi
        

        Inside in master it will take care of starting up all the other beasties if distmode == false.

        Show
        stack added a comment - Is it because no shutdown hook in master and when in standalone mode all runs in the one jvm, the master's effectively? In start-hbase.sh, if distmode is false, we ONLY start master: if [ "$distMode" == ' false ' ] then "$bin" /hbase-daemon.sh start master else "$bin" /hbase-daemons.sh --config "${HBASE_CONF_DIR}" start zookeeper "$bin" /hbase-daemon.sh --config "${HBASE_CONF_DIR}" start master "$bin" /hbase-daemons.sh --config "${HBASE_CONF_DIR}" \ --hosts "${HBASE_REGIONSERVERS}" start regionserver "$bin" /hbase-daemons.sh --config "${HBASE_CONF_DIR}" \ --hosts "${HBASE_BACKUP_MASTERS}" start master-backup fi Inside in master it will take care of starting up all the other beasties if distmode == false.
        Hide
        Roman Shaposhnik added a comment -

        stack, thanks for the clarifications – much appreciated! Now, as I said, I only see addShutdownHook call in regionserver code:

        $ git grep addShutdownHook
        src/main/java/org/apache/hadoop/hbase/regionserver/ShutdownHook.java:    Runtime.getRuntime().addShutdownHook(t);
        

        and then

         /**
           * @param hrs
           * @param name
           * @return Thread the RegionServer is running in correctly named.
           * @throws IOException
           */
          public static Thread startRegionServer(final HRegionServer hrs,
              final String name) throws IOException {
            Thread t = new Thread(hrs);
            t.setName(name);
            t.start();
            // Install shutdown hook that will catch signals and run an orderly shutdown
            // of the hrs.
            ShutdownHook.install(hrs.getConfiguration(), FileSystem.get(hrs
                .getConfiguration()), hrs, t);
            return t;
          }
        

        Now, the behavior that I observe is that when HBase is running in a distributed fashion, flushing of the newly created tables happen when regionserve receives a SIGTERM. When it is running in a standalone configuration tables don't get flushed UNLESS I explicitly call stop-hbase.sh before sending a SIGTERM.

        Once would assume that the same code would be called in both cases since it is registering that hook unconditionally. What am I missing here?

        Show
        Roman Shaposhnik added a comment - stack, thanks for the clarifications – much appreciated! Now, as I said, I only see addShutdownHook call in regionserver code: $ git grep addShutdownHook src/main/java/org/apache/hadoop/hbase/regionserver/ShutdownHook.java: Runtime.getRuntime().addShutdownHook(t); and then /** * @param hrs * @param name * @return Thread the RegionServer is running in correctly named. * @throws IOException */ public static Thread startRegionServer(final HRegionServer hrs, final String name) throws IOException { Thread t = new Thread(hrs); t.setName(name); t.start(); // Install shutdown hook that will catch signals and run an orderly shutdown // of the hrs. ShutdownHook.install(hrs.getConfiguration(), FileSystem.get(hrs .getConfiguration()), hrs, t); return t; } Now, the behavior that I observe is that when HBase is running in a distributed fashion, flushing of the newly created tables happen when regionserve receives a SIGTERM. When it is running in a standalone configuration tables don't get flushed UNLESS I explicitly call stop-hbase.sh before sending a SIGTERM. Once would assume that the same code would be called in both cases since it is registering that hook unconditionally. What am I missing here?
        Hide
        stack added a comment -

        stop-hbase.sh does shutdown hook plus shutting down the cluster. Its different. The hbase-daemon.sh sends signal to kill the daemon (triggering shutdown hook). Thats how I remember it. If I'm off, say so and I'll dig in and give you a more detailed response.

        Show
        stack added a comment - stop-hbase.sh does shutdown hook plus shutting down the cluster. Its different. The hbase-daemon.sh sends signal to kill the daemon (triggering shutdown hook). Thats how I remember it. If I'm off, say so and I'll dig in and give you a more detailed response.
        Hide
        Roman Shaposhnik added a comment -

        stack, before I submit the patch, I would really appreciate if you could let me know the relationship between bin/stop-hbase.sh and bin/hbase-daemon.sh. I was under the impression that whatever stop-hbase.sh triggers in the master code would also be triggered by JVM shutdown hook upon receiving SIGTERM, but it doesn't seem to be that way. Do we have to call bin/stop-hbase.sh manually from within the hbase-daemon.sh before stopping daemons?

        Show
        Roman Shaposhnik added a comment - stack, before I submit the patch, I would really appreciate if you could let me know the relationship between bin/stop-hbase.sh and bin/hbase-daemon.sh. I was under the impression that whatever stop-hbase.sh triggers in the master code would also be triggered by JVM shutdown hook upon receiving SIGTERM, but it doesn't seem to be that way. Do we have to call bin/stop-hbase.sh manually from within the hbase-daemon.sh before stopping daemons?
        Hide
        stack added a comment -

        I do not not remember why it SIGKILLs (I was going to blame hadoop saying we copied all from there but I just checked and don't see it there so it must be an addition of ours). If plain kill works for you, make a patch and we'll add it. Thanks Roman.

        Show
        stack added a comment - I do not not remember why it SIGKILLs (I was going to blame hadoop saying we copied all from there but I just checked and don't see it there so it must be an addition of ours). If plain kill works for you, make a patch and we'll add it. Thanks Roman.

          People

          • Assignee:
            Roman Shaposhnik
            Reporter:
            Roman Shaposhnik
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development