Accumulo
  1. Accumulo
  2. ACCUMULO-1709

failing Master doesn't set exit code of process

    Details

    • Type: Improvement Improvement
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 1.4.4, 1.5.0
    • Fix Version/s: 1.5.1, 1.6.0
    • Component/s: None
    • Labels:
      None

      Description

      If the master can't start up due to an exception, it logs it, but doesn't exit the process with any error code -so the failure isn't picked up by any scripts looking at the exit code of bin/accumulo

        Activity

        Hide
        Steve Loughran added a comment -

        I've done something a bit like this, with an exception that can provide an exit code , a list of exceptions, which hooks up somewhat to Hadoop's ExitException and exit logic. That code is intended to provide a single exit point that tests can downgrade to exceptions, but I'd like it improved, hence HADOOP-9626.

        Key point: yes, it's nice for exceptions to serve up numbers, having some interface for this lets you retrofit it into exceptions where the base classes are not under your control. And a single exit method is good for logging, debugging and testing.

        but: with only 255 exceptions on a linux system, you are limited in what you can signal, which is a shame

        Show
        Steve Loughran added a comment - I've done something a bit like this, with an exception that can provide an exit code , a list of exceptions , which hooks up somewhat to Hadoop's ExitException and exit logic . That code is intended to provide a single exit point that tests can downgrade to exceptions, but I'd like it improved, hence HADOOP-9626 . Key point: yes, it's nice for exceptions to serve up numbers, having some interface for this lets you retrofit it into exceptions where the base classes are not under your control. And a single exit method is good for logging, debugging and testing. but: with only 255 exceptions on a linux system, you are limited in what you can signal, which is a shame
        Hide
        John Vines added a comment -

        I wonder if we want to make the exit code equivalent of Exception.hashCode() so there is some uniqueness, in the case of people wanting to do handling of it from calling scripts.

        Show
        John Vines added a comment - I wonder if we want to make the exit code equivalent of Exception.hashCode() so there is some uniqueness, in the case of people wanting to do handling of it from calling scripts.
        Hide
        Billie Rinaldi added a comment -

        I think we've turned all the ones I can find into System.exit(1) calls.

        Show
        Billie Rinaldi added a comment - I think we've turned all the ones I can find into System.exit(1) calls.
        Hide
        Steve Loughran added a comment -

        I think those exceptions shouldn't be swallowed -they do need to be turned into an exit code, otherwise callers cant distinguish exit-healthy from exit-fail

        Show
        Steve Loughran added a comment - I think those exceptions shouldn't be swallowed -they do need to be turned into an exit code, otherwise callers cant distinguish exit-healthy from exit-fail
        Hide
        Mike Drob added a comment -

        Steve Loughran/Billie Rinaldi, are you two happy with this issue? Can the JIRA be resolved?

        Show
        Mike Drob added a comment - Steve Loughran / Billie Rinaldi , are you two happy with this issue? Can the JIRA be resolved?
        Hide
        Steve Loughran added a comment -

        I assumed the main stuff was there so that if someone invoked the main directly rather than via start then an exception is caught, logged via log4j and then the exit kicks in. Which seems good to me for both the logging -it will go to syslog or wherever, and so that if I bypass start and its classloading, failures sill get picked up.

        In Hadoop 2 we have a whole class dedicated to exits, https://github.com/apache/hadoop-common/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/ExitUtil.java, which

        1. lets you turn system exiting off (so that tests don't fail, or when you run under a security manager with exit disabled)
        2. has an exit exception that returns an exit code
        3. can extract

        Somewhere there's a JIRA of mine (HADOOP-9626) to add a getExitCode() interface to get the exit code out of any exception that chooses to declare it, so that entry points can field exceptions coming up from below that declare an explicit exit code. This wouldn't impact the structure you have here, except it would give your lower-level code more control than today

        Show
        Steve Loughran added a comment - I assumed the main stuff was there so that if someone invoked the main directly rather than via start then an exception is caught, logged via log4j and then the exit kicks in. Which seems good to me for both the logging -it will go to syslog or wherever, and so that if I bypass start and its classloading, failures sill get picked up. In Hadoop 2 we have a whole class dedicated to exits, https://github.com/apache/hadoop-common/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/ExitUtil.java , which lets you turn system exiting off (so that tests don't fail, or when you run under a security manager with exit disabled) has an exit exception that returns an exit code can extract Somewhere there's a JIRA of mine ( HADOOP-9626 ) to add a getExitCode() interface to get the exit code out of any exception that chooses to declare it, so that entry points can field exceptions coming up from below that declare an explicit exit code. This wouldn't impact the structure you have here, except it would give your lower-level code more control than today
        Hide
        Billie Rinaldi added a comment -

        You mean remove the catch blocks from Master and TabletServer main, or rethrow the exceptions? I guess those catches were added for the extra logging. Do you think propagating the exception would be better than exiting immediately?

        Show
        Billie Rinaldi added a comment - You mean remove the catch blocks from Master and TabletServer main, or rethrow the exceptions? I guess those catches were added for the extra logging. Do you think propagating the exception would be better than exiting immediately?
        Hide
        John Vines added a comment -

        Can't we handle all of these in Start?

        Show
        John Vines added a comment - Can't we handle all of these in Start?
        Hide
        ASF subversion and git services added a comment -

        Commit 789c920404561521948f38642c142ec99065f50e in branch refs/heads/master from Billie Rinaldi
        [ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=789c920 ]

        ACCUMULO-1709 applied steve loughran's patch to fix process exit code

        Show
        ASF subversion and git services added a comment - Commit 789c920404561521948f38642c142ec99065f50e in branch refs/heads/master from Billie Rinaldi [ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=789c920 ] ACCUMULO-1709 applied steve loughran's patch to fix process exit code
        Hide
        ASF subversion and git services added a comment -

        Commit 742b1625ff9bdefbe8e11e79bd38789273b3763e in branch refs/heads/master from Billie Rinaldi
        [ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=742b162 ]

        ACCUMULO-1709 Merge branch '1.5.1-SNAPSHOT'

        Show
        ASF subversion and git services added a comment - Commit 742b1625ff9bdefbe8e11e79bd38789273b3763e in branch refs/heads/master from Billie Rinaldi [ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=742b162 ] ACCUMULO-1709 Merge branch '1.5.1-SNAPSHOT'
        Hide
        ASF subversion and git services added a comment -

        Commit 789c920404561521948f38642c142ec99065f50e in branch refs/heads/1.5.1-SNAPSHOT from Billie Rinaldi
        [ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=789c920 ]

        ACCUMULO-1709 applied steve loughran's patch to fix process exit code

        Show
        ASF subversion and git services added a comment - Commit 789c920404561521948f38642c142ec99065f50e in branch refs/heads/1.5.1-SNAPSHOT from Billie Rinaldi [ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=789c920 ] ACCUMULO-1709 applied steve loughran's patch to fix process exit code
        Hide
        Billie Rinaldi added a comment -

        Looks like this is an issue for TabletServer as well. monitor, gc, and tracer propagate their exceptions.

              server.run();
            } catch (Exception ex) {
              log.error("Uncaught exception in TabletServer.main, exiting", ex);
            }
        
        Show
        Billie Rinaldi added a comment - Looks like this is an issue for TabletServer as well. monitor, gc, and tracer propagate their exceptions. server.run(); } catch (Exception ex) { log.error( "Uncaught exception in TabletServer.main, exiting" , ex); }
        Hide
        Billie Rinaldi added a comment -

        Well, maybe they would, but the master is just catching the exception, logging an error, and exiting normally.

        Show
        Billie Rinaldi added a comment - Well, maybe they would, but the master is just catching the exception, logging an error, and exiting normally.
        Hide
        John Vines added a comment - - edited

        I thought java exceptions that percolate all the way up cause a non-zero exit code?

        EDIT: Hmm.. today I learned...

        Show
        John Vines added a comment - - edited I thought java exceptions that percolate all the way up cause a non-zero exit code? EDIT: Hmm.. today I learned...
        Hide
        Steve Loughran added a comment -

        There's no test for this patch, here is a before and after log in a program forking accumulo (hence the log4j in log4j trace)

         shutdown of the AM: Spawned master exited with raw 0 mapped to 0
        2013-09-11 19:07:07,489 ERROR (RunLongLivedApp.java:run(408)) - 
        2013-09-11 19:07:07,489 INFO  (RunLongLivedApp.java:run(419)) - 
        2013-09-11 19:07:07,489 INFO  (RunLongLivedApp.java:run(422)) - 2013-09-11 19:07:07,386 [zookeeper.ZooUtil] ERROR: unable obtain instance id at file:/accumulo/instance_id
        2013-09-11 19:07:07,489 INFO  (RunLongLivedApp.java:run(422)) - 2013-09-11 19:07:07,386 [master.Master] ERROR: Unexpected exception, exiting
        2013-09-11 19:07:07,489 INFO  (RunLongLivedApp.java:run(422)) - java.lang.RuntimeException: Accumulo not initialized, there is no instance id at file:/accumulo/instance_id
        2013-09-11 19:07:07,489 INFO  (RunLongLivedApp.java:run(422)) - 	at org.apache.accumulo.core.zookeeper.ZooUtil.getInstanceIDFromHdfs(ZooUtil.java:60)
        2013-09-11 19:07:07,489 INFO  (RunLongLivedApp.java:run(422)) - 	at org.apache.accumulo.server.client.HdfsZooInstance._getInstanceID(HdfsZooInstance.java:125)
        2013-09-11 19:07:07,490 INFO  (RunLongLivedApp.java:run(422)) - 	at org.apache.accumulo.server.client.HdfsZooInstance.getInstanceID(HdfsZooInstance.java:119)
        2013-09-11 19:07:07,490 INFO  (RunLongLivedApp.java:run(422)) - 	at org.apache.accumulo.server.conf.ZooConfiguration.getInstance(ZooConfiguration.java:54)
        2013-09-11 19:07:07,490 INFO  (RunLongLivedApp.java:run(422)) - 	at org.apache.accumulo.server.conf.ServerConfiguration.getZooConfiguration(ServerConfiguration.java:50)
        2013-09-11 19:07:07,490 INFO  (RunLongLivedApp.java:run(422)) - 	at org.apache.accumulo.server.conf.ServerConfiguration.getSystemConfiguration(ServerConfiguration.java:59)
        2013-09-11 19:07:07,490 INFO  (RunLongLivedApp.java:run(422)) - 	at org.apache.accumulo.server.fs.VolumeManagerImpl.get(VolumeManagerImpl.java:312)
        2013-09-11 19:07:07,490 INFO  (RunLongLivedApp.java:run(422)) - 	at org.apache.accumulo.server.master.Master.main(Master.java:1635)
        2013-09-11 19:07:07,490 INFO  (RunLongLivedApp.java:run(422)) - 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        2013-09-11 19:07:07,490 INFO  (RunLongLivedApp.java:run(422)) - 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        2013-09-11 19:07:07,490 INFO  (RunLongLivedApp.java:run(422)) - 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        2013-09-11 19:07:07,490 INFO  (RunLongLivedApp.java:run(422)) - 	at java.lang.reflect.Method.invoke(Method.java:597)
        2013-09-11 19:07:07,491 INFO  (RunLongLivedApp.java:run(422)) - 	at org.apache.accumulo.start.Main$1.run(Main.java:107)
        2013-09-11 19:07:07,491 INFO  (RunLongLivedApp.java:run(422)) - 	at java.lang.Thread.run(Thread.java:680)
        2013-09-11 19:07:07,563 [main] DEBUG appmaster.HoyaAppMaster (HoyaAppMaster.java:finish(747)) - Stopping forked process: exit code=0
        

        After

        2013-09-11 19:35:02,887 INFO  (RunLongLivedApp.java:run(419)) - 
        2013-09-11 19:35:02,888 INFO  (RunLongLivedApp.java:run(422)) - 2013-09-11 19:35:02,737 [zookeeper.ZooUtil] ERROR: unable obtain instance id at file:/accumulo/instance_id
        2013-09-11 19:35:02,888 INFO  (RunLongLivedApp.java:run(422)) - 2013-09-11 19:35:02,737 [master.Master] ERROR: Unexpected exception, exiting
        2013-09-11 19:35:02,888 INFO  (RunLongLivedApp.java:run(422)) - java.lang.RuntimeException: Accumulo not initialized, there is no instance id at file:/accumulo/instance_id
        2013-09-11 19:35:02,888 INFO  (RunLongLivedApp.java:run(422)) - 	at org.apache.accumulo.core.zookeeper.ZooUtil.getInstanceIDFromHdfs(ZooUtil.java:60)
        2013-09-11 19:35:02,888 INFO  (RunLongLivedApp.java:run(422)) - 	at org.apache.accumulo.server.client.HdfsZooInstance._getInstanceID(HdfsZooInstance.java:125)
        2013-09-11 19:35:02,888 INFO  (RunLongLivedApp.java:run(422)) - 	at org.apache.accumulo.server.client.HdfsZooInstance.getInstanceID(HdfsZooInstance.java:119)
        2013-09-11 19:35:02,888 INFO  (RunLongLivedApp.java:run(422)) - 	at org.apache.accumulo.server.conf.ZooConfiguration.getInstance(ZooConfiguration.java:54)
        2013-09-11 19:35:02,888 INFO  (RunLongLivedApp.java:run(422)) - 	at org.apache.accumulo.server.conf.ServerConfiguration.getZooConfiguration(ServerConfiguration.java:50)
        2013-09-11 19:35:02,888 INFO  (RunLongLivedApp.java:run(422)) - 	at org.apache.accumulo.server.conf.ServerConfiguration.getSystemConfiguration(ServerConfiguration.java:59)
        2013-09-11 19:35:02,888 INFO  (RunLongLivedApp.java:run(422)) - 	at org.apache.accumulo.server.fs.VolumeManagerImpl.get(VolumeManagerImpl.java:312)
        2013-09-11 19:35:02,889 INFO  (RunLongLivedApp.java:run(422)) - 	at org.apache.accumulo.server.master.Master.main(Master.java:1635)
        2013-09-11 19:35:02,889 INFO  (RunLongLivedApp.java:run(422)) - 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        2013-09-11 19:35:02,889 INFO  (RunLongLivedApp.java:run(422)) - 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        2013-09-11 19:35:02,889 INFO  (RunLongLivedApp.java:run(422)) - 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        2013-09-11 19:35:02,889 INFO  (RunLongLivedApp.java:run(422)) - 	at java.lang.reflect.Method.invoke(Method.java:597)
        2013-09-11 19:35:02,889 INFO  (RunLongLivedApp.java:run(422)) - 	at org.apache.accumulo.start.Main$1.run(Main.java:107)
        2013-09-11 19:35:02,889 INFO  (RunLongLivedApp.java:run(422)) - 	at java.lang.Thread.run(Thread.java:680)
        2013-09-11 19:35:03,522 [main] DEBUG appmaster.HoyaAppMaster (HoyaAppMaster.java:finish(747)) - Stopping forked process: exit code=1
        

        The exit code is being picked up successfully after the patch is applied

        Show
        Steve Loughran added a comment - There's no test for this patch, here is a before and after log in a program forking accumulo (hence the log4j in log4j trace) shutdown of the AM: Spawned master exited with raw 0 mapped to 0 2013-09-11 19:07:07,489 ERROR (RunLongLivedApp.java:run(408)) - 2013-09-11 19:07:07,489 INFO (RunLongLivedApp.java:run(419)) - 2013-09-11 19:07:07,489 INFO (RunLongLivedApp.java:run(422)) - 2013-09-11 19:07:07,386 [zookeeper.ZooUtil] ERROR: unable obtain instance id at file:/accumulo/instance_id 2013-09-11 19:07:07,489 INFO (RunLongLivedApp.java:run(422)) - 2013-09-11 19:07:07,386 [master.Master] ERROR: Unexpected exception, exiting 2013-09-11 19:07:07,489 INFO (RunLongLivedApp.java:run(422)) - java.lang.RuntimeException: Accumulo not initialized, there is no instance id at file:/accumulo/instance_id 2013-09-11 19:07:07,489 INFO (RunLongLivedApp.java:run(422)) - at org.apache.accumulo.core.zookeeper.ZooUtil.getInstanceIDFromHdfs(ZooUtil.java:60) 2013-09-11 19:07:07,489 INFO (RunLongLivedApp.java:run(422)) - at org.apache.accumulo.server.client.HdfsZooInstance._getInstanceID(HdfsZooInstance.java:125) 2013-09-11 19:07:07,490 INFO (RunLongLivedApp.java:run(422)) - at org.apache.accumulo.server.client.HdfsZooInstance.getInstanceID(HdfsZooInstance.java:119) 2013-09-11 19:07:07,490 INFO (RunLongLivedApp.java:run(422)) - at org.apache.accumulo.server.conf.ZooConfiguration.getInstance(ZooConfiguration.java:54) 2013-09-11 19:07:07,490 INFO (RunLongLivedApp.java:run(422)) - at org.apache.accumulo.server.conf.ServerConfiguration.getZooConfiguration(ServerConfiguration.java:50) 2013-09-11 19:07:07,490 INFO (RunLongLivedApp.java:run(422)) - at org.apache.accumulo.server.conf.ServerConfiguration.getSystemConfiguration(ServerConfiguration.java:59) 2013-09-11 19:07:07,490 INFO (RunLongLivedApp.java:run(422)) - at org.apache.accumulo.server.fs.VolumeManagerImpl.get(VolumeManagerImpl.java:312) 2013-09-11 19:07:07,490 INFO (RunLongLivedApp.java:run(422)) - at org.apache.accumulo.server.master.Master.main(Master.java:1635) 2013-09-11 19:07:07,490 INFO (RunLongLivedApp.java:run(422)) - at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 2013-09-11 19:07:07,490 INFO (RunLongLivedApp.java:run(422)) - at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) 2013-09-11 19:07:07,490 INFO (RunLongLivedApp.java:run(422)) - at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) 2013-09-11 19:07:07,490 INFO (RunLongLivedApp.java:run(422)) - at java.lang.reflect.Method.invoke(Method.java:597) 2013-09-11 19:07:07,491 INFO (RunLongLivedApp.java:run(422)) - at org.apache.accumulo.start.Main$1.run(Main.java:107) 2013-09-11 19:07:07,491 INFO (RunLongLivedApp.java:run(422)) - at java.lang. Thread .run( Thread .java:680) 2013-09-11 19:07:07,563 [main] DEBUG appmaster.HoyaAppMaster (HoyaAppMaster.java:finish(747)) - Stopping forked process: exit code=0 After 2013-09-11 19:35:02,887 INFO (RunLongLivedApp.java:run(419)) - 2013-09-11 19:35:02,888 INFO (RunLongLivedApp.java:run(422)) - 2013-09-11 19:35:02,737 [zookeeper.ZooUtil] ERROR: unable obtain instance id at file:/accumulo/instance_id 2013-09-11 19:35:02,888 INFO (RunLongLivedApp.java:run(422)) - 2013-09-11 19:35:02,737 [master.Master] ERROR: Unexpected exception, exiting 2013-09-11 19:35:02,888 INFO (RunLongLivedApp.java:run(422)) - java.lang.RuntimeException: Accumulo not initialized, there is no instance id at file:/accumulo/instance_id 2013-09-11 19:35:02,888 INFO (RunLongLivedApp.java:run(422)) - at org.apache.accumulo.core.zookeeper.ZooUtil.getInstanceIDFromHdfs(ZooUtil.java:60) 2013-09-11 19:35:02,888 INFO (RunLongLivedApp.java:run(422)) - at org.apache.accumulo.server.client.HdfsZooInstance._getInstanceID(HdfsZooInstance.java:125) 2013-09-11 19:35:02,888 INFO (RunLongLivedApp.java:run(422)) - at org.apache.accumulo.server.client.HdfsZooInstance.getInstanceID(HdfsZooInstance.java:119) 2013-09-11 19:35:02,888 INFO (RunLongLivedApp.java:run(422)) - at org.apache.accumulo.server.conf.ZooConfiguration.getInstance(ZooConfiguration.java:54) 2013-09-11 19:35:02,888 INFO (RunLongLivedApp.java:run(422)) - at org.apache.accumulo.server.conf.ServerConfiguration.getZooConfiguration(ServerConfiguration.java:50) 2013-09-11 19:35:02,888 INFO (RunLongLivedApp.java:run(422)) - at org.apache.accumulo.server.conf.ServerConfiguration.getSystemConfiguration(ServerConfiguration.java:59) 2013-09-11 19:35:02,888 INFO (RunLongLivedApp.java:run(422)) - at org.apache.accumulo.server.fs.VolumeManagerImpl.get(VolumeManagerImpl.java:312) 2013-09-11 19:35:02,889 INFO (RunLongLivedApp.java:run(422)) - at org.apache.accumulo.server.master.Master.main(Master.java:1635) 2013-09-11 19:35:02,889 INFO (RunLongLivedApp.java:run(422)) - at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 2013-09-11 19:35:02,889 INFO (RunLongLivedApp.java:run(422)) - at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) 2013-09-11 19:35:02,889 INFO (RunLongLivedApp.java:run(422)) - at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) 2013-09-11 19:35:02,889 INFO (RunLongLivedApp.java:run(422)) - at java.lang.reflect.Method.invoke(Method.java:597) 2013-09-11 19:35:02,889 INFO (RunLongLivedApp.java:run(422)) - at org.apache.accumulo.start.Main$1.run(Main.java:107) 2013-09-11 19:35:02,889 INFO (RunLongLivedApp.java:run(422)) - at java.lang. Thread .run( Thread .java:680) 2013-09-11 19:35:03,522 [main] DEBUG appmaster.HoyaAppMaster (HoyaAppMaster.java:finish(747)) - Stopping forked process: exit code=1 The exit code is being picked up successfully after the patch is applied
        Hide
        Steve Loughran added a comment -

        Patch that

        1. has master call system.exit on an exception
        2. makes sure all failure paths through start.main() exit
        3. changed exception reporting in start.main() to invoke Throwable.toString() rather than Throwable.getMessage(). Some exceptions including NullPointerException have a null message -calling toString() is the only way to reliably get details on all exceptions
        Show
        Steve Loughran added a comment - Patch that has master call system.exit on an exception makes sure all failure paths through start.main() exit changed exception reporting in start.main() to invoke Throwable.toString() rather than Throwable.getMessage() . Some exceptions including NullPointerException have a null message -calling toString() is the only way to reliably get details on all exceptions

          People

          • Assignee:
            Steve Loughran
            Reporter:
            Steve Loughran
          • Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development