Hadoop HDFS / HDFS-4426

Secondary namenode shuts down immediately after startup

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.0.3-alpha, 0.23.6
    • Fix Version/s: 2.0.3-alpha, 0.23.6, 0.23.7
    • Component/s: namenode
    • Labels:
      None

      Description

      After HADOOP-9181 went in, the secondary namenode immediately shuts down after it is started. From the startup logs:

      2013-01-22 19:54:28,826 INFO  namenode.SecondaryNameNode (SecondaryNameNode.java:initialize(299)) - Checkpoint Period   :3600 secs (60 min)
      2013-01-22 19:54:28,826 INFO  namenode.SecondaryNameNode (SecondaryNameNode.java:initialize(301)) - Log Size Trigger    :40000 txns
      2013-01-22 19:54:28,845 INFO  namenode.SecondaryNameNode (StringUtils.java:run(616)) - SHUTDOWN_MSG: 
      /************************************************************
      SHUTDOWN_MSG: Shutting down SecondaryNameNode at xx
      ************************************************************/
      

      I looked into the issue, and it's shutting down because SecondaryNameNode.main starts a bunch of daemon threads and then returns. With nothing but daemon threads remaining, the JVM sees no reason to keep going and proceeds to shut down. Apparently we were implicitly relying on the HttpServer QueuedThreadPool threads not being daemon threads to keep the secondary namenode process up.
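      A minimal, self-contained Java sketch of this failure mode. The class and thread names are hypothetical and not taken from SecondaryNameNode: main() starts a looping daemon thread and returns, and because no non-daemon thread remains, the JVM exits at once.

```java
import java.util.concurrent.TimeUnit;

/**
 * Reproduces the failure mode: main() starts a daemon worker and returns,
 * so the JVM exits even though the worker is still looping.
 * All names here are hypothetical.
 */
public class DaemonOnlyExit {

    /** Starts a thread that loops until interrupted, standing in for a checkpoint loop. */
    static Thread startWorker(boolean daemon) {
        Thread worker = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    TimeUnit.SECONDS.sleep(1); // placeholder for "wait for the next checkpoint"
                } catch (InterruptedException e) {
                    return; // interrupted: stop the loop
                }
            }
        }, "checkpoint-worker");
        // Must be set before start(); setDaemon throws IllegalThreadStateException on a live thread.
        worker.setDaemon(daemon);
        worker.start();
        return worker;
    }

    public static void main(String[] args) {
        startWorker(true);
        System.out.println("main() returning; only a daemon thread is left, so the JVM exits");
        // With startWorker(false) instead, the process would stay up until the worker ends.
    }
}
```

      Flipping the flag to false keeps the process alive, which is the behavior the 2NN had been getting for free from the non-daemon HttpServer threads.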

      1. HDFS-4426.patch
        1 kB
        Arpit Agarwal
      2. HDFS-4426.patch
        0.9 kB
        Suresh Srinivas
      3. HDFS-4426.patch
        0.9 kB
        Suresh Srinivas
      4. HDFS-4426.branch-23.patch
        18 kB
        Suresh Srinivas
      5. HDFS-4426.1.patch
        1 kB
        Arpit Agarwal

          Activity

          Suresh Srinivas added a comment -

          Jason, I will follow up on this. Thanks for filing the bug.

          Suresh Srinivas added a comment -

          Release managers for releases where this is considered blocking could consider reverting HADOOP-9181.

          Daryn Sharp added a comment -

          The 2NN creates a daemon thread of itself and implicitly relies on other threads to keep the process alive - is this a case of two wrongs make a right? Or is there a technical reason why the 2NN shouldn't simply do its work in the main thread?
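          For comparison, a sketch of the alternative raised here, assuming a hypothetical CheckpointLoop in place of the real 2NN code: the loop runs directly on the main thread, which is non-daemon, so the JVM stays up for as long as run() is executing.

```java
/**
 * Runs the service loop on the main thread instead of spawning a thread.
 * CheckpointLoop is a hypothetical stand-in, not the real 2NN class.
 */
public class RunLoopOnMain {

    static class CheckpointLoop implements Runnable {
        private final int iterations;
        int completed;   // only touched by the thread that calls run()
        Thread ranOn;    // records which thread executed the loop

        CheckpointLoop(int iterations) {
            this.iterations = iterations;
        }

        @Override
        public void run() {
            ranOn = Thread.currentThread();
            for (int i = 0; i < iterations; i++) {
                // The real loop would sleep for the checkpoint period, then checkpoint.
                completed++;
            }
        }
    }

    public static void main(String[] args) {
        CheckpointLoop loop = new CheckpointLoop(3);
        loop.run(); // blocks main() until the loop finishes; no extra thread involved
        System.out.println("ran " + loop.completed + " iterations on thread " + loop.ranOn.getName());
    }
}
```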

          Suresh Srinivas added a comment -

          > The 2NN creates a daemon thread of itself and implicitly relies on other threads to keep the process alive - is this a case of two wrongs make a right? Or is there a technical reason why the 2NN shouldn't simply do its work in the main thread?

          We should follow the same pattern as the NameNode and wait for the main service thread(s) to end. That is, the main thread should block on a join() call.
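          A sketch of the join pattern, using a hypothetical SecondaryLikeService rather than the actual patch: main() starts the service on a daemon thread and then blocks in join(), so the process lifetime is tied to the service thread instead of to unrelated non-daemon threads. join() is left package-private because only main() needs it.

```java
/**
 * main() starts the service thread, then waits on it with join().
 * SecondaryLikeService and its methods are hypothetical names.
 */
public class JoinOnServiceThread {

    static class SecondaryLikeService implements Runnable {
        private final int iterations;
        private Thread thread;
        volatile int completed;

        SecondaryLikeService(int iterations) {
            this.iterations = iterations;
        }

        void start() {
            thread = new Thread(this, "checkpointer");
            thread.setDaemon(true); // still a daemon thread, as in the 2NN
            thread.start();
        }

        /** Package-private: only main() needs to wait on the service. */
        void join() throws InterruptedException {
            thread.join();
        }

        @Override
        public void run() {
            for (int i = 0; i < iterations; i++) {
                completed++; // placeholder for one checkpoint cycle
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        SecondaryLikeService service = new SecondaryLikeService(3);
        service.start();
        service.join(); // without this, main() returns and the JVM exits immediately
        System.out.println("service finished after " + service.completed + " cycles");
    }
}
```

          Keeping the service thread a daemon is harmless here: once main() is parked in join(), the non-daemon main thread itself keeps the JVM running.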

          Liang Xie added a comment -

          Suresh Srinivas, I guess a better choice would be to let the HttpServer's daemon flag be set in its constructor, right?
          Very, very sorry for this trouble...

          Suresh Srinivas added a comment -

          > Suresh Srinivas, I guess a better choice would be to let the HttpServer's daemon flag be set in its constructor, right?

          I am not sure I follow you. How does it solve the problem?

          > Very, very sorry for this trouble...

          These things do happen. It is strange that unit tests did not catch this issue!

          Liang Xie added a comment -

          Yes, I did run the whole test suite on my dev box before, with no failures...

          Maybe we can either:
          1) add a new "isDaemon" parameter to HttpServer's constructor, with a default value of "false" (though the constructor's parameter list is already quite long), or
          2) introduce a new configuration key, since HttpServer's constructor already takes a parameter named "conf".
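          A sketch of option 2, under loud assumptions: the key name is hypothetical, java.util.Properties stands in for Hadoop's Configuration, and a plain ThreadFactory stands in for the HttpServer thread pool. As Arpit points out below, such a knob alone would not have kept the 2NN alive; it only controls whether the HTTP threads happen to do so.

```java
import java.util.Properties;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Reads the daemon flag for HTTP worker threads from configuration.
 * The key and this factory are hypothetical, not Hadoop's HttpServer API.
 */
public class ConfigurableDaemonThreads {

    static final String DAEMON_KEY = "hypothetical.http.threads.daemon";

    static ThreadFactory threadFactory(Properties conf) {
        // Default to false (non-daemon), as proposed above.
        boolean daemon = Boolean.parseBoolean(conf.getProperty(DAEMON_KEY, "false"));
        AtomicInteger counter = new AtomicInteger();
        return r -> {
            Thread t = new Thread(r, "http-worker-" + counter.incrementAndGet());
            t.setDaemon(daemon);
            return t;
        };
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(DAEMON_KEY, "true");
        Thread t = threadFactory(conf).newThread(() -> { });
        System.out.println(t.getName() + " daemon=" + t.isDaemon());
    }
}
```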

          Arpit Agarwal added a comment -

          Attached a patch to handle this like the NameNode does. Thanks to Suresh for suggesting the fix.

          I have not added a new test case as this appears non-trivial to test with JUnit. I filed HDFS-4430 to investigate adding a test.

          The existing unit tests did not catch the regression because the server did not need to survive beyond the lifetime of the calling JUnit thread.

          Liang, if I understand you correctly, adding such a configuration knob would not have helped in this situation.

          Arpit Agarwal added a comment -

          I forgot to add that I verified the patch manually.

          Suresh Srinivas added a comment -

          +1 for the change.

          Suresh Srinivas added a comment -

          I will commit it tomorrow morning.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12566081/HDFS-4426.patch
          against trunk revision .

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3869//console

          This message is automatically generated.

          Suresh Srinivas added a comment -

          rebased patch.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12566085/HDFS-4426.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          -1 javac. The patch appears to cause the build to fail.

          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3870//console

          This message is automatically generated.

          Suresh Srinivas added a comment -

          Hopefully the patch is correctly rebased this time.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12566086/HDFS-4426.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          -1 findbugs. The patch appears to introduce 2 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/3871//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/3871//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3871//console

          This message is automatically generated.

          Suresh Srinivas added a comment -

          Arpit, the findbugs warnings flagged are valid and need to be fixed. I also noticed that the method join() need not be public.

          Daryn Sharp added a comment -

          Just out of curiosity, is there benefit to creating a thread and waiting to join on it, versus not creating a thread and doing the processing in the main thread?

          Arpit Agarwal added a comment -

          Thanks Suresh. I fixed the warnings. My earlier patch was off the wrong branch (0.26 vs trunk). I hope it doesn't need to be rebased this time.

          Suresh Srinivas added a comment -

          > Just out of curiosity, is there benefit to creating a thread and waiting to join on it, versus not creating a thread and doing the processing in the main thread?

          Typically, main() starts the services as threads. To use the main thread instead, the service's run loop would have to be moved into main(), which requires either changing the service or copying code from it. That is why main() typically starts all the services and then uses join() to wait until one or all of the critical services end.
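          A sketch of the "wait until one of the critical services ends" variant. It swaps in java.util.concurrent's ExecutorCompletionService instead of the direct Thread.join() calls described above, and the service names are hypothetical: each service runs on a daemon thread, main() blocks until the first one returns, and the rest are then interrupted.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Starts several services and returns the result of whichever finishes first. */
public class WaitForFirstService {

    static String runUntilFirstExit(List<Callable<String>> services)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(services.size(), r -> {
            Thread t = new Thread(r);
            t.setDaemon(true); // the pool itself must not keep the JVM alive
            return t;
        });
        try {
            CompletionService<String> completion = new ExecutorCompletionService<>(pool);
            for (Callable<String> service : services) {
                completion.submit(service);
            }
            // take() blocks until any service finishes; get() throws ExecutionException if it failed.
            return completion.take().get();
        } finally {
            pool.shutdownNow(); // interrupt the services that are still running
        }
    }

    public static void main(String[] args) throws Exception {
        Callable<String> httpServer = () -> { Thread.sleep(10_000); return "http-server"; };
        Callable<String> checkpointer = () -> { Thread.sleep(100); return "checkpointer"; };
        List<Callable<String>> services = Arrays.asList(httpServer, checkpointer);
        System.out.println("first critical service to exit: " + runUntilFirstExit(services));
    }
}
```

          Waiting for all services instead would simply mean calling join() (or Future.get()) on each one in turn.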

          Arpit Agarwal added a comment -

          Verified the updated patch manually again.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12566145/HDFS-4426.1.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/3873//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3873//console

          This message is automatically generated.

          Suresh Srinivas added a comment -

          I have committed the patch to branch-2 and trunk.

          Thank you Arpit!

          Suresh Srinivas added a comment -

          Merging this change to 0.23 is not straightforward. Here is a merge patch, which is slightly different.

          Can someone quickly test this and +1 it?
          If no one does, I am going to commit it anyway in an hour, since it seems to be a simple change.

          Hudson added a comment -

          Integrated in Hadoop-trunk-Commit #3274 (See https://builds.apache.org/job/Hadoop-trunk-Commit/3274/)
          HDFS-4426. Secondary namenode shuts down immediately after startup. Contributed by Arpit Agarwal. (Revision 1437627)

          Result = SUCCESS
          suresh : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1437627
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/SecondaryNameNode.java
          Arpit Agarwal added a comment -

          +1 for the merge patch. Thanks Suresh!

          Suresh Srinivas added a comment -

          I committed the merge to 0.23 branch as well.

          Hudson added a comment -

          Integrated in Hadoop-Yarn-trunk #106 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/106/)
          HDFS-4426. Secondary namenode shuts down immediately after startup. Contributed by Arpit Agarwal. (Revision 1437627)

          Result = FAILURE
          suresh : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1437627
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/SecondaryNameNode.java
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-0.23-Build #504 (See https://builds.apache.org/job/Hadoop-Hdfs-0.23-Build/504/)
          HDFS-4426. Merge change 1437627 from trunk. (Revision 1437650)

          Result = FAILURE
          suresh : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1437650
          Files :

          • /hadoop/common/branches/branch-0.23/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/branches/branch-0.23/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/SecondaryNameNode.java
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk #1295 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1295/)
          HDFS-4426. Secondary namenode shuts down immediately after startup. Contributed by Arpit Agarwal. (Revision 1437627)

          Result = FAILURE
          suresh : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1437627
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/SecondaryNameNode.java
          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk #1323 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1323/)
          HDFS-4426. Secondary namenode shuts down immediately after startup. Contributed by Arpit Agarwal. (Revision 1437627)

          Result = SUCCESS
          suresh : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1437627
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/SecondaryNameNode.java
          Eli Collins added a comment -

          See HDFS-2896 (the 2NN incorrectly daemonizes).


            People

            • Assignee:
              Arpit Agarwal
              Reporter:
              Jason Lowe
            • Votes:
              0
              Watchers:
              12
