Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.21.0
    • Fix Version/s: 0.21.0
    • Component/s: namenode
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      TestBackupNode may fail for different reasons:

      • Unable to open edit log file .\build\test\data\dfs\name-backup1\current\edits (FSEditLog.java:open(371))
      • NullPointerException at org.apache.hadoop.hdfs.server.namenode.EditLogBackupOutputStream.flushAndSync(EditLogBackupOutputStream.java:163)
      • Fatal Error : All storage directories are inaccessible.
        Will provide more information in the comments.
      1. HADOOP-5573.patch
        3 kB
        Boris Shkolnik
      2. NN-EditsBug.patch
        12 kB
        Konstantin Shvachko
      3. NN-EditsBug.patch
        12 kB
        Konstantin Shvachko
      4. NN-EditsBug.patch
        12 kB
        Konstantin Shvachko
      5. NN-EditsBug.patch
        11 kB
        Konstantin Shvachko
      6. NN-EditsBug-21.patch
        13 kB
        Konstantin Shvachko
      7. TestBNFailure.log
        298 kB
        Konstantin Shvachko

        Issue Links

          Activity

          Hudson added a comment -

          Integrated in Hdfs-Patch-h2.grid.sp2.yahoo.net #83 (See http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/83/)

          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk #164 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Hdfs-trunk/164/)
          Fix TestBackupNode failures. Contributed by Konstantin Shvachko.

          Hudson added a comment -

          Integrated in Hdfs-Patch-h5.grid.sp2.yahoo.net #135 (See http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/135/)
          Fix TestBackupNode failures. Contributed by Konstantin Shvachko.

          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk-Commit #132 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Hdfs-trunk-Commit/132/)
          Fix TestBackupNode failures. Contributed by Konstantin Shvachko.

          Konstantin Shvachko added a comment -

          I just committed this.

          Konstantin Shvachko added a comment -

          Here is patch for 0.21

          Konstantin Shvachko added a comment -

          Looks like Hudson did not pick up the latest patch.
          Here are the patch results:

               [exec] There appear to be 119 release audit warnings before the patch and 119 release audit warnings after applying the patch.
               [exec] +1 overall.  
               [exec]     +1 @author.  The patch does not contain any @author tags.
               [exec]     +1 tests included.  The patch appears to include 3 new or modified tests.
               [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
               [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
               [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
               [exec]     +1 release audit.  The applied patch does not increase the total number of release audit warnings.
               [exec] ======================================================================
               [exec] ======================================================================
               [exec]     Finished build.
               [exec] ======================================================================
               [exec] ======================================================================
          BUILD SUCCESSFUL
          Total time: 14 minutes 35 seconds
          
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12427208/NN-EditsBug.patch
          against trunk revision 887413.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 1 new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/134/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/134/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/134/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/134/console

          This message is automatically generated.

          Konstantin Shvachko added a comment -

          This fixes findbugs.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12427002/NN-EditsBug.patch
          against trunk revision 887413.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 1 new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/133/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/133/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/133/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/133/console

          This message is automatically generated.

          Konstantin Shvachko added a comment -

          Here is a new patch, which addresses Suresh's comments.

          Suresh Srinivas added a comment -

          Thanks for the detailed and very good explanation of the fix.

          Here are the comments:

          1. The stop methods in BackupNode.java and NameNode.java should be synchronized. We should record the fact that shutdown happened on BackupNode (similar to NameNode.stop()). This will ensure that even if shutdown is called twice (BackupNameNode triggering CheckPointer shutdown, which in turn calls BackupNameNode shutdown), cleanup is not attempted twice.
          2. Should BackupNode.stop() set checkpointManager to null after interrupting it?
          3. The comments in BackupNode.stop() could be clearer on why checkpointManager.shouldRun is set to false first and the thread is interrupted only later. Otherwise, someone might merge the two steps.

          Otherwise the patch looks good.
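
The first two review points can be illustrated with a minimal Java sketch of an idempotent, synchronized stop(). This is not the BackupNode code: the class name, the `stopRequested` flag, the `cleanups` counter, and the plain `Thread` standing in for the checkpointer are all assumptions made for illustration.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a node with an idempotent stop(); not the actual BackupNode code. */
class IdempotentStopSketch {
    private boolean stopRequested = false;  // records that shutdown already happened
    private Thread checkpointer;            // hypothetical worker, cleared once interrupted
    final AtomicInteger cleanups = new AtomicInteger();

    IdempotentStopSketch(Thread checkpointer) {
        this.checkpointer = checkpointer;
    }

    /** Synchronized so that concurrent or re-entrant callers see a consistent flag. */
    synchronized void stop() {
        if (stopRequested) {
            return; // a second call is a no-op, so cleanup is never attempted twice
        }
        stopRequested = true;
        if (checkpointer != null) {
            checkpointer.interrupt();
            checkpointer = null; // drop the reference after interrupting it
        }
        cleanups.incrementAndGet(); // placeholder for the real cleanup work
    }

    /** Calls stop() twice and returns how many times cleanup actually ran. */
    static int demo() {
        Thread worker = new Thread(() -> {
            try {
                Thread.sleep(60_000); // pretend to be busy until interrupted
            } catch (InterruptedException e) {
                // interrupted by stop(): just exit
            }
        });
        worker.start();
        IdempotentStopSketch node = new IdempotentStopSketch(worker);
        node.stop();
        node.stop();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return node.cleanups.get();
    }

    public static void main(String[] args) {
        System.out.println("cleanups performed: " + demo());
    }
}
```

Because Java monitors are re-entrant, the guard flag also covers the case where the cleanup path calls back into stop() on the same thread.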

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12426863/NN-EditsBug.patch
          against trunk revision 886322.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/81/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/81/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/81/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/81/console

          This message is automatically generated.

          Konstantin Shvachko added a comment -

          With this patch I consistently see that if there is a race between BackupNode.stop() and Checkpointer.doCheckpoint(), then doCheckpoint() fails trying to send uploadCheckpoint(), which results in a normal shutdown of the backup node without breaking TestBackupNode.

          Konstantin Shvachko added a comment -

          Since this test has a long history, I took some time fixing it. With the current code the test fails in two different ways.

          1. This is related to a race condition between FSNamesystem.startCheckpoint() and FSNamesystem.releaseBackupNode().
          NN performs startCheckpoint() in a synchronized section, then releases the lock and does logSync().
          If releaseBackupNode() happens in between, then logSync() may not find the backup stream, because it was removed during releaseBackupNode(), and will throw an NPE.
          The patch fixes this synchronization issue.

          2. This was the hard one. The problem is that FSEditLog.processIOError() occasionally exits the JVM during backup node shutdown, which also takes down the test before it has a chance to report success, even though it was in fact successful.
          The failure is the result of a race condition between two backup node threads:

          • The BackupNode main thread, and
          • The Checkpointer thread (see doCheckpoint())

          The BackupNode thread calls stop(), which in turn interrupts the Checkpointer.
          The interrupt eventually closes the FileChannel for the underlying VERSION file in the StorageDirectory of the backup node.
          If doCheckpoint() then tries to do anything with this StorageDirectory, the file channel throws ClosedByInterruptException (see the attached log).
          This triggers FSEditLog.processIOError(). In the current implementation, processIOError() calls Runtime.exit(-1), since this is the last StorageDirectory.
          This terminates the JVM, including the test itself, before it has had a chance to complete.
          The test fails.
          I changed BackupNode.stop() so that it

          1. First prevents new checkpoints from starting (but does not interrupt the running ones yet).
          2. Then sends errorReport() to the name-node to let it know it is going to shut down. This will "unregister" the backup node, and NN will ignore any communications from the BN from now on.
          3. Interrupts the Checkpointer.
          4. Stops the rest of the node threads.

          In the current implementation, steps 2 and 3 are reversed, so if doCheckpoint() proceeds while the main BackupNode thread does errorReport(), the file channels will throw ClosedByInterruptException, which will cause Runtime.exit(-1).
          With the patch, doCheckpoint() can proceed during error reporting since the channels are not closed. If they are closed, it means errorReport() has completed, and then doCheckpoint() will not be able to send uploadCheckpoint() or endCheckpoint(), because NN is no longer accepting this BN. Thus BN will not reach the methods (like convergeJournalSpool()) that may hit ClosedByInterruptException on the channels.

          3. I fixed a bug in processIOError(). In the current implementation it works correctly only if storage directories fail one at a time. If two or more storage directories fail simultaneously, processIOError() will not shut down the node.
          The natural fix is to first remove the failed edits streams, then check the number of remaining ones and exit if there are none.

          4. Improved logging when storage directories fail. Now the logs will show the reason why storage fails, not just the fact that it failed.
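
The reordered shutdown from item 2 can be sketched in Java as follows. This is only an illustration of the ordering, not the patch itself: the `Checkpointer` stand-in, the `steps` list, and `reportShutdownToNameNode()` are hypothetical names, and the name-node RPC is replaced by a placeholder.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the reordered shutdown; not the actual BackupNode code. */
class ShutdownOrderSketch {

    /** Stand-in for the Checkpointer thread. */
    static class Checkpointer extends Thread {
        volatile boolean shouldRun = true;

        @Override
        public void run() {
            while (shouldRun) {
                try {
                    Thread.sleep(10); // placeholder for one doCheckpoint() round
                } catch (InterruptedException e) {
                    return; // interrupted during shutdown
                }
            }
        }
    }

    final List<String> steps = new ArrayList<>();
    final Checkpointer checkpointer = new Checkpointer();

    /** Placeholder for the errorReport() call that unregisters the backup node. */
    void reportShutdownToNameNode() {
        steps.add("errorReport");
    }

    void stop() {
        // 1. No new checkpoints; a running one is not interrupted yet.
        checkpointer.shouldRun = false;
        steps.add("no-new-checkpoints");
        // 2. Unregister from the name-node while file channels are still open.
        reportShutdownToNameNode();
        // 3. Only now interrupt the checkpointer.
        checkpointer.interrupt();
        steps.add("interrupt");
        // 4. Stop the remaining threads (here: just wait for the checkpointer).
        try {
            checkpointer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        steps.add("stop-rest");
    }

    /** Runs one start/stop cycle and returns the recorded order of steps. */
    static List<String> demo() {
        ShutdownOrderSketch node = new ShutdownOrderSketch();
        node.checkpointer.start();
        node.stop();
        return node.steps;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Keeping the flag change and the interrupt as two separate steps is the point of the design: the interrupt is what can close file channels, so it is deferred until the name-node has stopped accepting calls from this node.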

          steve_l added a comment -

          I'm seeing this test fail with a timeout when I test everything. When I run only this test case, all is well.

          Tsz Wo Nicholas Sze added a comment -

          Konstantin, do you think the patch is good?

          TestBackupNode.testBackupRegistration is still failing, see build #337.

          Boris Shkolnik added a comment -

          No test is needed because this patch fixes the test.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12405815/HADOOP-5573.patch
          against trunk revision 767699.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no tests are needed for this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/231/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/231/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/231/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/231/console

          This message is automatically generated.

          Boris Shkolnik added a comment -

          Added synchronization (wait/notify) so that BackupNode waits for any ongoing checkpoint to complete before stopping.
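
A minimal Java sketch of the wait/notify idea, assuming a small gate object shared by the stopping thread and the checkpointer. The class and method names are hypothetical and are not taken from HADOOP-5573.patch.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical gate that lets stop() wait for a running checkpoint; not the patch's code. */
class CheckpointGateSketch {
    private boolean checkpointInProgress = false;
    private boolean stopping = false;

    /** Called by the checkpointer; returns false once shutdown has begun. */
    synchronized boolean beginCheckpoint() {
        if (stopping) {
            return false;
        }
        checkpointInProgress = true;
        return true;
    }

    /** Called by the checkpointer when the checkpoint is done. */
    synchronized void endCheckpoint() {
        checkpointInProgress = false;
        notifyAll(); // wake up a thread blocked in awaitCheckpointAndStop()
    }

    /** Called by the stopping thread; blocks until any running checkpoint ends. */
    synchronized void awaitCheckpointAndStop() throws InterruptedException {
        stopping = true;
        while (checkpointInProgress) { // loop guards against spurious wakeups
            wait();
        }
    }

    /** Returns true if stop waited for the checkpoint and later checkpoints are refused. */
    static boolean demo() {
        CheckpointGateSketch gate = new CheckpointGateSketch();
        AtomicBoolean finished = new AtomicBoolean(false);
        if (!gate.beginCheckpoint()) {
            throw new IllegalStateException("checkpoint should be allowed before stop");
        }
        Thread checkpointer = new Thread(() -> {
            try {
                Thread.sleep(50); // placeholder for checkpoint work
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            finished.set(true);
            gate.endCheckpoint();
        });
        checkpointer.start();
        try {
            gate.awaitCheckpointAndStop(); // returns only after endCheckpoint()
            boolean waited = finished.get();
            boolean refused = !gate.beginCheckpoint();
            checkpointer.join();
            return waited && refused;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("stop waited for checkpoint: " + demo());
    }
}
```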

          Konstantin Shvachko added a comment -

          Attaching failure log.
          It looks like BackupNode fails at the very end, in processIOError()

          [junit] 2009-04-07 19:35:53,848 ERROR common.Storage (FSImage.java:resetVersion(1489)) - Cannot write file 
                  /home/hudson/hudson-slave/workspace/Hadoop-Patch-vesta.apache.org/trunk/build/test/data/dfs/name-backup1
          [junit] 2009-04-07 19:35:53,849 WARN  common.Storage (FSImage.java:processIOError(744)) - FSImage:processIOError: removing storage: 
                  /home/hudson/hudson-slave/workspace/Hadoop-Patch-vesta.apache.org/trunk/build/test/data/dfs/name-backup1
          [junit] 2009-04-07 19:35:53,849 INFO  namenode.FSNamesystem (FSEditLog.java:processIOError(471)) - current list of storage dirs:
          [junit] 2009-04-07 19:35:53,849 FATAL namenode.FSNamesystem (FSEditLog.java:processIOError(479)) - Fatal Error : All storage directories are inaccessible.
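
The FATAL line above is logged when no storage directory remains. Another comment on this issue describes the eventual fix to processIOError() as removing the failed edits streams first and only then counting what is left. A minimal Java sketch of that order, assuming streams are identified by plain strings (the class, method, and directory names are hypothetical):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

/** Hypothetical sketch of "remove first, then count"; not the actual FSEditLog code. */
class StorageFailureSketch {

    /**
     * Removes every failed stream from the active list, then reports whether
     * the node has no streams left and therefore has to shut down.
     */
    static boolean removeFailedAndCheckFatal(List<String> activeStreams, Collection<String> failed) {
        activeStreams.removeAll(failed);  // step 1: drop all failed streams at once
        return activeStreams.isEmpty();   // step 2: only then look at what remains
    }

    static String demo() {
        // Two directories fail simultaneously and nothing is left: fatal.
        List<String> twoDirs = new ArrayList<>(Arrays.asList("name1", "name2"));
        boolean allGone = removeFailedAndCheckFatal(twoDirs, Arrays.asList("name1", "name2"));

        // Two of three directories fail: one survives, so the node keeps running.
        List<String> threeDirs = new ArrayList<>(Arrays.asList("name1", "name2", "name3"));
        boolean stillFatal = removeFailedAndCheckFatal(threeDirs, Arrays.asList("name1", "name2"));

        return allGone + "," + stillFatal;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Counting after removal gives the same answer whether directories fail one at a time or several at once.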
          
          Tsz Wo Nicholas Sze added a comment -

          It seems TestBackupNode is still having problems. It failed on Hudson build #160.

          Konstantin Shvachko added a comment -

          The first two bugs (NPE) are fixed by HADOOP-5119.
          The story here is that testBackupRegistration() starts two backup nodes one after another. The first one keeps making checkpoints, but the second is just initializing. During initialization it creates a new FSNamesystem instance, which at the beginning sets the static variable fsNamesystemObject to null. It takes time to initialize the BackupNode before it sets fsNamesystemObject = this.
          In the meantime the first backup node starts a checkpoint, which accesses FSNamesystem via fsNamesystemObject. Since the variable is static, it holds the value the second node assigned to it, which is null at that moment. Hence the different NPEs, depending on the timing of the checkpoint.
          We should not see that again, since HADOOP-5119 eliminated fsNamesystemObject.
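
The hazard of a static self-reference can be reproduced deterministically in a few lines of Java. The sketch below is a simplified stand-in, not the FSNamesystem code: `Namesystem`, `instance`, `finishInit()`, and the owner names are hypothetical, and the interleaving is forced sequentially instead of relying on thread timing.

```java
/** Hypothetical reduction of the static-field hazard; not the actual FSNamesystem code. */
class StaticSingletonSketch {

    /** Stand-in for a namesystem that publishes itself through a static field. */
    static class Namesystem {
        static Namesystem instance; // shared by every node running in the same JVM
        final String owner;

        Namesystem(String owner) {
            instance = null;        // reset at the start of initialization
            this.owner = owner;
        }

        /** End of the (slow) initialization: publish this object. */
        void finishInit() {
            instance = this;
        }
    }

    /** A checkpoint that reaches the namesystem through the static field. */
    static String checkpoint() {
        return Namesystem.instance.owner; // NPE while another node is initializing
    }

    static String demo() {
        Namesystem first = new Namesystem("backup1");
        first.finishInit();
        String before = checkpoint();                  // sees backup1

        Namesystem second = new Namesystem("backup2"); // static field is null again
        String during;
        try {
            during = checkpoint();                     // first node checkpoints mid-init
        } catch (NullPointerException e) {
            during = "NPE";
        }

        second.finishInit();
        String after = checkpoint();                   // first node now sees backup2's object
        return before + "," + during + "," + after;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints backup1,NPE,backup2
    }
}
```

Besides the NPE window, the last value shows a second problem with the static field: once the second node finishes initializing, the first node silently works against the wrong object.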

          The third error is also gone, because processIOError() was recently changed by HADOOP-4045.
          But I am still looking at it. I am getting some strange asserts there.

          Tsz Wo Nicholas Sze added a comment -

          The failures can be reproduced by running TestBackupNode a few times.

          Tsz Wo Nicholas Sze added a comment -

          Here are more details:

          • Unable to open edit log file .\build\test\data\dfs\name-backup1\current\edits (FSEditLog.java:open(371))
            2009-03-24 17:36:39,421 WARN  namenode.FSNamesystem (FSEditLog.java:open(371)) - Unable to open edit log
             file d:\@sze\hadoop\latest\build\test\data\dfs\name-backup1\current\edits
            2009-03-24 17:36:39,421 ERROR namenode.Checkpointer (Checkpointer.java:run(138)) - Exception in doCheckpoint: 
            java.io.IOException: Could not locate checkpoint directories
            	at org.apache.hadoop.hdfs.server.namenode.BackupStorage.loadCheckpoint(BackupStorage.java:157)
            	at org.apache.hadoop.hdfs.server.namenode.Checkpointer.doCheckpoint(Checkpointer.java:232)
            	at org.apache.hadoop.hdfs.server.namenode.Checkpointer.run(Checkpointer.java:134)
            	at java.lang.Thread.run(Thread.java:619)
            
          • NullPointerException at org.apache.hadoop.hdfs.server.namenode.EditLogBackupOutputStream.flushAndSync(EditLogBackupOutputStream.java:163)
            2009-03-24 17:56:09,750 INFO  ipc.Server (Server.java:run(968)) - IPC Server handler 6 on 1441, call startCheckpoint(
            NamenodeRegistration(xx.xx.xx.xx:50100, role=Backup Node)) from 127.0.0.1:1485: error: java.io.IOException: java.lang.NullPointerException
            java.io.IOException: java.lang.NullPointerException
            	at org.apache.hadoop.hdfs.server.namenode.EditLogBackupOutputStream.flushAndSync(EditLogBackupOutputStream.java:163)
            	at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:83)
            	at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:989)
            	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startCheckpoint(FSNamesystem.java:4395)
            	at org.apache.hadoop.hdfs.server.namenode.NameNode.startCheckpoint(NameNode.java:440)
            	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
            	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
            	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
            	at java.lang.reflect.Method.invoke(Method.java:597)
            	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
            	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
            	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
            	at java.security.AccessController.doPrivileged(Native Method)
            	at javax.security.auth.Subject.doAs(Subject.java:396)
            	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)
            
          • Fatal Error : All storage directories are inaccessible.
            2009-03-25 14:27:06,828 INFO  namenode.FSNamesystem (FSEditLog.java:printStatistics(1044))
             - Number of transactions: 0 Total time for transactions(ms): 0Number of transactions batched in Syncs: 0 Number of syncs: 0 SyncTimes(ms): 0 
            2009-03-25 14:27:06,937 WARN  namenode.FSNamesystem (FSEditLog.java:close(420))
             - FSEditLog:close - failed to close stream d:\@sze\hadoop\testing\build\test\data\dfs\name-checkpoint1\current\edits
            2009-03-25 14:27:06,937 ERROR namenode.FSNamesystem (FSEditLog.java:processIOError(506))
             - Unable to log edits to d:\@sze\hadoop\testing\build\test\data\dfs\name-checkpoint1\current\edits
            2009-03-25 14:27:06,937 FATAL namenode.FSNamesystem (FSEditLog.java:processIOError(450))
             - Fatal Error : All storage directories are inaccessible.
            2009-03-25 14:27:06,937 INFO  namenode.NameNode (NameNode.java:errorReport(421))
             - Error report from NamenodeRegistration(servicehot-dx.ds.corp.yahoo.com:50100, role=Checkpoint Node): Shutting down.
            2009-03-25 14:27:06,953 WARN  namenode.DecommissionManager (DecommissionManager.java:run(67))
             - Monitor interrupted: java.lang.InterruptedException: sleep interrupted
            2009-03-25 14:27:06,953 WARN  namenode.FSNamesystem (FSNamesystem.java:run(2346))
             - ReplicationMonitor thread received InterruptedException.java.lang.InterruptedException: sleep interrupted
            Test org.apache.hadoop.hdfs.server.namenode.TestBackupNode FAILED (crashed)
            

            People

            • Assignee:
              Konstantin Shvachko
              Reporter:
              Tsz Wo Nicholas Sze
            • Votes:
              0
              Watchers:
              3
