Hadoop Common
  1. Hadoop Common
  2. HADOOP-7331

Make hadoop-daemon.sh to return 1 if daemon processes did not get started

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Trivial Trivial
    • Resolution: Fixed
    • Affects Version/s: 0.23.0
    • Fix Version/s: 0.23.0
    • Component/s: scripts
    • Labels:
      None
    • Hadoop Flags:
      Incompatible change, Reviewed
    • Release Note:
      hadoop-daemon.sh now returns a non-zero exit code if it detects that the daemon was not still running after 3 seconds.

      Description

      Makes hadoop-daemon.sh to return 1 if daemon processes did not get started.

      1. HADOOP-7331.2.patch
        0.6 kB
        Tanping Wang
      2. HADOOP-7331.patch
        0.7 kB
        Tanping Wang

        Activity

        Hide
        Todd Lipcon added a comment -

        I think we need something a lot more involved to really do a useful job of this. The issue is that most of our daemons start and then exit a few seconds after startup (eg once the JVM has loaded and it has had a chance to do some basic initialization). For example, with an invalid config or bind address, hadoop-daemon.sh will still return 0 in this case.

        Regarding the patch itself: no need to use wc -l on the output of ps. Instead, something like:

        if ! ps -p $! > /dev/null ; then
          exit 1
        fi
        

        since ps has a non-zero exit code if the process is running.

        Show
        Todd Lipcon added a comment - I think we need something a lot more involved to really do a useful job of this. The issue is that most of our daemons start and then exit a few seconds after startup (eg once the JVM has loaded and it has had a chance to do some basic initialization). For example, with an invalid config or bind address, hadoop-daemon.sh will still return 0 in this case. Regarding the patch itself: no need to use wc -l on the output of ps . Instead, something like: if ! ps -p $! > /dev/ null ; then exit 1 fi since ps has a non-zero exit code if the process is running.
        Hide
        Tanping Wang added a comment -

        Thanks for the comment! I agree that we need a more complex mechanism in order for hadoop-daemon.sh process to really know that daemons are not able to start due to various problems that may happen. Suppose now we want to restrict our problem set to deal with the issues like invalid config or unable to bind address in which cases the daemon should die rather quickly after starting. What about having a simple solution for now: we put the process to sleep for a few seconds before checking pid? (Made changes based the feedback.)

        Show
        Tanping Wang added a comment - Thanks for the comment! I agree that we need a more complex mechanism in order for hadoop-daemon.sh process to really know that daemons are not able to start due to various problems that may happen. Suppose now we want to restrict our problem set to deal with the issues like invalid config or unable to bind address in which cases the daemon should die rather quickly after starting. What about having a simple solution for now: we put the process to sleep for a few seconds before checking pid? (Made changes based the feedback.)
        Hide
        Todd Lipcon added a comment -

        Yep, I think sleeping is a good idea.

        Assuming this has been manually tested, +1 for this simple fix before doing something more complicated.

        Show
        Todd Lipcon added a comment - Yep, I think sleeping is a good idea. Assuming this has been manually tested, +1 for this simple fix before doing something more complicated.
        Hide
        Tanping Wang added a comment -

        This was manually tested. Three test cases were covered:

        1. make binding address troublesome – by changing the config file to bind two name node service address to the same. After starting name node, hadoop-daemon returned 1. And error message were logged in log files.
        2. make config file invalid - by renaming hdfs-site.xml to something else. After starting name node, hadoop-daemon returned 1. And error message were logged in log, i.e. "NameNode: java.lang.IllegalArgumentException: Invalid URI for NameNode address.."
        3. make some user errors - by not providing the correct options to hadoop-daemon.sh e.g.
          hadoop-daemon.sh --conf... --script hdfs
          made hdfs script does not exist. hadoop-dameon.sh returned 1 and error was output through std out.
        Show
        Tanping Wang added a comment - This was manually tested. Three test cases were covered: make binding address troublesome – by changing the config file to bind two name node service address to the same. After starting name node, hadoop-daemon returned 1. And error message were logged in log files. make config file invalid - by renaming hdfs-site.xml to something else. After starting name node, hadoop-daemon returned 1. And error message were logged in log, i.e. "NameNode: java.lang.IllegalArgumentException: Invalid URI for NameNode address.." make some user errors - by not providing the correct options to hadoop-daemon.sh e.g. hadoop-daemon.sh --conf... --script hdfs made hdfs script does not exist. hadoop-dameon.sh returned 1 and error was output through std out.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12480505/HADOOP-7331.2.patch
        against trunk revision 1127811.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 system test framework. The patch passed system test framework compile.

        Test results: https://builds.apache.org/hudson/job/PreCommit-HADOOP-Build/530//testReport/
        Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-HADOOP-Build/530//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/hudson/job/PreCommit-HADOOP-Build/530//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12480505/HADOOP-7331.2.patch against trunk revision 1127811. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 system test framework. The patch passed system test framework compile. Test results: https://builds.apache.org/hudson/job/PreCommit-HADOOP-Build/530//testReport/ Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-HADOOP-Build/530//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/hudson/job/PreCommit-HADOOP-Build/530//console This message is automatically generated.
        Hide
        Tanping Wang added a comment -

        No unit tests included since this is scripts. Manually tested.

        Show
        Tanping Wang added a comment - No unit tests included since this is scripts. Manually tested.
        Hide
        Steve Loughran added a comment -

        I understand why there aren't any tests for this, but worry about the sleep 3; statement: why choose three seconds? It seems one of those only-works-on-some-machines sttings.

        Show
        Steve Loughran added a comment - I understand why there aren't any tests for this, but worry about the sleep 3; statement: why choose three seconds? It seems one of those only-works-on-some-machines sttings.
        Hide
        Tanping Wang added a comment -

        The idea is to simply check if the PID still exists shortly after the JVM of the daemon process is created. As discussed early, we are simplifying the solution to deal with common scenarios that may cause JVM to die shortly after the JVM is created, such as invalid configuration files. And yes, this is machine specific. We do not have data to show how much exact time the JVM may die on top of commodity hardware. This is to provide the operator a quick feedback if JVM may not have been started if some straight-forward mis-configuration has happened. And regarding to why choose 3 seconds?? If you notice that there is also

         
        sleep 1; head "$log"
        

        in the same question can be asked regarding to why choose 1 second. I simply want to give a few seconds before checking the existence of JVM after its starts. And I can certainly make this sleep time configurable for the operators to adjust.

        (BTW, what are you referring to by "one of those"?)

        Show
        Tanping Wang added a comment - The idea is to simply check if the PID still exists shortly after the JVM of the daemon process is created. As discussed early, we are simplifying the solution to deal with common scenarios that may cause JVM to die shortly after the JVM is created, such as invalid configuration files. And yes, this is machine specific. We do not have data to show how much exact time the JVM may die on top of commodity hardware. This is to provide the operator a quick feedback if JVM may not have been started if some straight-forward mis-configuration has happened. And regarding to why choose 3 seconds?? If you notice that there is also sleep 1; head "$log" in the same question can be asked regarding to why choose 1 second. I simply want to give a few seconds before checking the existence of JVM after its starts. And I can certainly make this sleep time configurable for the operators to adjust. (BTW, what are you referring to by "one of those"?)
        Hide
        Tanping Wang added a comment -

        If no furture comments, would this be committed?

        Show
        Tanping Wang added a comment - If no furture comments, would this be committed?
        Hide
        Todd Lipcon added a comment -

        Steve, just to check: you're OK with committing this as is, right?

        Show
        Todd Lipcon added a comment - Steve, just to check: you're OK with committing this as is, right?
        Hide
        Todd Lipcon added a comment -

        Committed this to trunk. Thanks, Tanping!

        I marked this as an incompatible change and added a release note just in case anyone was relying on the previously broken behavior.

        Show
        Todd Lipcon added a comment - Committed this to trunk. Thanks, Tanping! I marked this as an incompatible change and added a release note just in case anyone was relying on the previously broken behavior.
        Hide
        Todd Lipcon added a comment -

        (should have mentioned, I realized Steve is on vacation at the moment. Since Tanping had addressed his comment, I decided to commit - if there's any issue we can revert or follow up later with another patch)

        Show
        Todd Lipcon added a comment - (should have mentioned, I realized Steve is on vacation at the moment. Since Tanping had addressed his comment, I decided to commit - if there's any issue we can revert or follow up later with another patch)
        Hide
        Hudson added a comment -

        Integrated in Hadoop-Common-trunk-Commit #636 (See https://builds.apache.org/hudson/job/Hadoop-Common-trunk-Commit/636/)
        HADOOP-7331. Make hadoop-daemon.sh return exit code 1 if daemon processes did not get started. Contributed by Tanping Wang.

        todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1130833
        Files :

        • /hadoop/common/trunk/CHANGES.txt
        • /hadoop/common/trunk/bin/hadoop-daemon.sh
        Show
        Hudson added a comment - Integrated in Hadoop-Common-trunk-Commit #636 (See https://builds.apache.org/hudson/job/Hadoop-Common-trunk-Commit/636/ ) HADOOP-7331 . Make hadoop-daemon.sh return exit code 1 if daemon processes did not get started. Contributed by Tanping Wang. todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1130833 Files : /hadoop/common/trunk/CHANGES.txt /hadoop/common/trunk/bin/hadoop-daemon.sh
        Hide
        Hudson added a comment -

        Integrated in Hadoop-Common-trunk #708 (See https://builds.apache.org/hudson/job/Hadoop-Common-trunk/708/)
        HADOOP-7331. Make hadoop-daemon.sh return exit code 1 if daemon processes did not get started. Contributed by Tanping Wang.

        todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1130833
        Files :

        • /hadoop/common/trunk/CHANGES.txt
        • /hadoop/common/trunk/bin/hadoop-daemon.sh
        Show
        Hudson added a comment - Integrated in Hadoop-Common-trunk #708 (See https://builds.apache.org/hudson/job/Hadoop-Common-trunk/708/ ) HADOOP-7331 . Make hadoop-daemon.sh return exit code 1 if daemon processes did not get started. Contributed by Tanping Wang. todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1130833 Files : /hadoop/common/trunk/CHANGES.txt /hadoop/common/trunk/bin/hadoop-daemon.sh

          People

          • Assignee:
            Tanping Wang
            Reporter:
            Tanping Wang
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development