Hadoop HDFS: HDFS-2011

Removal and restoration of storage directories on checkpointing failure doesn't work properly

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.23.0
    • Fix Version/s: 0.23.0
    • Component/s: namenode
    • Labels: None

      Description

      Removal and restoration of storage directories on checkpointing failure doesn't work properly. Sometimes it throws a NullPointerException, and sometimes it fails to remove a failed storage directory.

      Attachments

      1. HDFS-2011.patch
        2 kB
        Ravi Prakash
      2. HDFS-2011.patch
        6 kB
        Ravi Prakash
      3. HDFS-2011.patch
        6 kB
        Ravi Prakash
      4. HDFS-2011.3.patch
        6 kB
        Ravi Prakash
      5. HDFS-2011.4.patch
        6 kB
        Ravi Prakash
      6. HDFS-2011.5.patch
        5 kB
        Ravi Prakash
      7. HDFS-2011.6.patch
        6 kB
        Ravi Prakash
      8. HDFS-2011.7.patch
        6 kB
        Ravi Prakash
      9. HDFS-2011.8.patch
        6 kB
        Ravi Prakash
      10. elfos-close-patch-on-1073.txt
        4 kB
        Todd Lipcon
      11. elfos-close-patch-on-1073-2.txt
        5 kB
        Eli Collins
      12. elfos-close-patch-on-1073-3.txt
        5 kB
        Eli Collins

          Activity

          Transition                     Time In Source Status   Executions   Last Executer    Last Execution Date
          Open → Patch Available         4m 2s                   1            Ravi Prakash     30/May/11 17:34
          Patch Available → Resolved     31d 6h 50m              1            Matt Foley       01/Jul/11 00:25
          Resolved → Closed              137d 1h 27m             1            Arun C Murthy    15/Nov/11 00:52
          Arun C Murthy made changes -
          Status: Resolved → Closed
          Eli Collins added a comment -

          Ah, right, abort is new in 1073. I think the tests you have for trunk are fine for now. The new test with abort will come in when 1073 is merged.

          Ravi Prakash added a comment -

          @Eli - The new patch tests abort(), but I couldn't find the method in EditLogFileOutputStream.java in trunk. It's available in HDFS-1073. Could you please point me to what exactly I should incorporate into trunk?

          Hudson added a comment -

          Integrated in Hadoop-Hdfs-1073-branch #9 (See https://builds.apache.org/job/Hadoop-Hdfs-1073-branch/9/)
          Amend HDFS-2011 for HDFS-1073 branch. Update test cases for new behavior of EditLogFileOutputStream. Contributed by Todd Lipcon and Eli Collins.

          todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1146848
          Files :

          • /hadoop/common/branches/HDFS-1073/hdfs/src/test/hdfs/org/apache/hadoop/hdfs/server/namenode/TestCheckpoint.java
          • /hadoop/common/branches/HDFS-1073/hdfs/src/test/hdfs/org/apache/hadoop/hdfs/server/namenode/TestEditLogFileOutputStream.java
          Todd Lipcon added a comment -

          Committed elfos-close-patch-on-1073-3.txt to the HDFS-1073 branch to fix the test case. Thanks Eli.

          Eli Collins made changes -
          Attachment: elfos-close-patch-on-1073-3.txt [ 12486381 ]
          Eli Collins added a comment -

          Attaching patch with minor fix to comment in the double close test.

          John George added a comment -

          +1 Looks good to me. Thanks Eli.

          Eli Collins made changes -
          Attachment: elfos-close-patch-on-1073-2.txt [ 12486369 ]
          Eli Collins added a comment -

          @John. I agree. Here's an updated (1073) patch that tests the double close and asserts that we get an IOE for using an aborted stream.

          @Ravi - want to incorporate this new test into the patch for trunk?
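          (A hedged sketch of the aborted-stream assertion being described; the stream setup and the write call are illustrative assumptions, not the actual test code:)

          EditLogFileOutputStream elos = newStreamForTest(); // hypothetical helper
          elos.abort();
          try {
            elos.write(op); // any use of the aborted stream (hypothetical call)
            fail("expected an IOException when using an aborted stream");
          } catch (IOException ioe) {
            // expected, e.g. "Trying to use aborted output stream"
          }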

          John George added a comment -

          The patch looks good. Shouldn't the sequence "close() close()" be tested as well, since that's a case that could plausibly happen?

          Todd Lipcon made changes -
          Attachment: elfos-close-patch-on-1073.txt [ 12485516 ]
          Todd Lipcon added a comment -

          Here's the patch I'm planning to commit to 1073 branch. Look good?

          I will also do some stress testing similar to what Ravi described on the branch to see if I can reproduce the issue he saw.

          John George added a comment -

          I think calling
          1. abort() twice
          2. close() twice
          3. close() followed by an abort()

          would test most cases.
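          (A hedged sketch of those three sequences as assertions; newStreamForTest() is a hypothetical helper, and abort() exists only on the HDFS-1073 branch at this point:)

          EditLogFileOutputStream elos = newStreamForTest();
          elos.abort();
          elos.abort();   // 1. a second abort should be a no-op

          elos = newStreamForTest();
          elos.close();
          elos.close();   // 2. a second close should be a no-op (or a clear error; see the debate below)

          elos = newStreamForTest();
          elos.close();
          elos.abort();   // 3. an abort after a clean close should be ignored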

          Todd Lipcon added a comment -

          In the HDFS-1073 branch, EditLogOutputStream now has separate close() and abort() methods. abort() is used when there has been some error on the stream and we expect to do an "unclean" close (i.e., without flushing). close() is used for clean closes. If close() itself fails, it will then proceed to abort() when the IO error is handled.

          So, I think the correct test case on the branch is to call abort() twice and make sure that's ignored, or call close() and then abort() to make sure that's ignored. Does that sound reasonable?
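          (A minimal sketch of the state guard being described; the enum and the flushAndSync()/releaseResources() helpers are illustrative assumptions, not the actual HDFS-1073 branch code:)

          private enum State { OPEN, CLOSED, ABORTED }
          private State state = State.OPEN;

          public void close() throws IOException {
            if (state != State.OPEN) {
              return;              // already closed or aborted: ignore
            }
            flushAndSync();        // a clean close flushes buffered edits first
            releaseResources();    // release buffers, channel, and file
            state = State.CLOSED;  // only reached if the clean close succeeded
          }

          public void abort() throws IOException {
            if (state != State.OPEN) {
              return;              // double abort, or abort after close: ignore
            }
            releaseResources();    // unclean close: no flush
            state = State.ABORTED;
          }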

          John George added a comment -

          If I remember right, it was a case of an "incomplete create" as opposed to close being called twice. So, the close() was being called on a stream that was not really created...

          Ravi Prakash added a comment -

          The program below output:

          Hello World
          Hello Champaign
          
          Ravi Prakash added a comment -

          I had noticed close being called twice while testing this functionality. This was causing a NullPointerException the second time. The stack trace is given in comment https://issues.apache.org/jira/browse/HDFS-2011?focusedCommentId=13041858&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13041858

          2011-04-05 17:36:56,187 INFO org.apache.hadoop.ipc.Server: IPC Server handler 87 on 8020, call getEditLogSize() from
          98.137.97.99:35862: error: java.io.IOException: java.lang.NullPointerException
          java.io.IOException: java.lang.NullPointerException
          at org.apache.hadoop.hdfs.server.namenode.EditLogFileOutputStream.close(EditLogFileOutputStream.java:109)
          at org.apache.hadoop.hdfs.server.namenode.FSEditLog.processIOError(FSEditLog.java:299)
          at org.apache.hadoop.hdfs.server.namenode.FSEditLog.getEditLogSize(FSEditLog.java:849)
          at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getEditLogSize(FSNamesystem.java:4270)
          at org.apache.hadoop.hdfs.server.namenode.NameNode.getEditLogSize(NameNode.java:1095)
          at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
          at java.lang.reflect.Method.invoke(Method.java:597)
          at org.apache.hadoop.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:346)
          at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1399)
          at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1395)
          at java.security.AccessController.doPrivileged(Native Method)
          at javax.security.auth.Subject.doAs(Subject.java:396)
          at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1094)
          at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1393)

          The bug itself is quite hard to reproduce. I had to run my tests in an infinite loop, and the NullPointerException happened after 3-4 hours (each run of the test took maybe 2 minutes). After the NullPointerException, the namenode would essentially be useless. Even hdfs dfs -ls would throw a NullPointerException.

          I am not sure myself which philosophy would be better. FileOutputStream itself ignores a second close. I checked this with the following program:

          import java.io.*;

          public class TestJAVA {

            public static void main(String[] args) {
              System.out.println("Hello World");
              try {
                FileOutputStream fos = new FileOutputStream("/tmp/ravi.txt");
                fos.write(50);
                fos.write(50);
                fos.write(50);
                fos.write(50);
                fos.write(50);
                fos.write(50);
                fos.close();
                fos.close(); // second close is silently ignored by FileOutputStream
              } catch (IOException ioe) {
                System.out.println("Hello California"); // only reached if a write or close threw
                System.out.println(ioe);
              }
              System.out.println("Hello Champaign");
            }
          }
          Todd Lipcon added a comment -

          I'm working on merging this with HDFS-1073, and had one question: when do we expect that an editlog stream would be closed twice? In 1073 there are some extra asserts, so instead of ignoring the second close, it now throws "java.io.IOException: Trying to use aborted output stream". I'm debating whether to remove this exception like you've done in this patch, vs remove the patch, since it seems like it might be indicative of a bug to close a stream twice.

          Ravi Prakash added a comment -

          Thanks Matt, Todd and Cos! My first patch into Hadoop. Yaayyyyy!!!

          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk #712 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/712/)
          HDFS-2011. Removal and restoration of storage directories on checkpointing failure doesn't work properly. Contributed by Ravi Prakash.

          mattf : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1141748
          Files :

          • /hadoop/common/trunk/hdfs/src/java/org/apache/hadoop/hdfs/server/namenode/EditLogFileOutputStream.java
          • /hadoop/common/trunk/hdfs/CHANGES.txt
          • /hadoop/common/trunk/hdfs/src/test/hdfs/org/apache/hadoop/hdfs/server/namenode/TestCheckpoint.java
          • /hadoop/common/trunk/hdfs/src/java/org/apache/hadoop/hdfs/server/namenode/NNStorage.java
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk-Commit #771 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/771/)
          HDFS-2011. Removal and restoration of storage directories on checkpointing failure doesn't work properly. Contributed by Ravi Prakash.

          mattf : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1141748
          Files :

          • /hadoop/common/trunk/hdfs/src/java/org/apache/hadoop/hdfs/server/namenode/EditLogFileOutputStream.java
          • /hadoop/common/trunk/hdfs/CHANGES.txt
          • /hadoop/common/trunk/hdfs/src/test/hdfs/org/apache/hadoop/hdfs/server/namenode/TestCheckpoint.java
          • /hadoop/common/trunk/hdfs/src/java/org/apache/hadoop/hdfs/server/namenode/NNStorage.java
          Matt Foley made changes -
          Status: Patch Available → Resolved
          Fix Version/s: 0.23.0 [ 12315571 ]
          Resolution: Fixed
          Matt Foley added a comment -

          Committed to trunk. Thanks Ravi! And thanks to Todd and Cos for reviews.

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12484433/HDFS-2011.8.patch
          against trunk revision 1140030.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/859//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/859//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/859//console

          This message is automatically generated.

          Ravi Prakash made changes -
          Attachment: HDFS-2011.8.patch [ 12484433 ]
          Ravi Prakash added a comment -

          Hi Matt,

          I've incorporated both changes. Thanks for your insightful reviews

          Cheers
          Ravi

          Matt Foley added a comment -

          Looking great! And you're right about needing the iterator to access nnStorage.getEditsDirectories(), since it's just a Collection.

          There are just a couple of small things I'd like to fix:

          • in testSetCheckpointTimeInStorageHandlesIOException, in the new File ctor, you shouldn't need slashes around "/storageDirToCheck/"; plain "storageDirToCheck" should work.
          • and one trivial edit: The comment in EditLogFileOutputStream.close(), "// if already closed, just return" isn't correct any more. "// if already closed, just skip" would be correct.

          If that's okay, I'll commit it on the next round.

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12484007/HDFS-2011.7.patch
          against trunk revision 1140030.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/849//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/849//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/849//console

          This message is automatically generated.

          Ravi Prakash made changes -
          Attachment: HDFS-2011.7.patch [ 12484007 ]
          Ravi Prakash added a comment -

          Included fc.close() in EditLogFileOutputStream.close()

          Ravi Prakash added a comment -

          The test failed because fc.close() wasn't being called. Thanks Matt! I'm including that in the latest patch. The two failed tests now pass, as does test-patch, as well as my automated unit tests.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12483975/HDFS-2011.6.patch
          against trunk revision 1140030.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these core unit tests:
          org.apache.hadoop.hdfs.server.namenode.TestCheckpoint
          org.apache.hadoop.hdfs.TestFileAppend2

          +1 contrib tests. The patch passed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/848//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/848//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/848//console

          This message is automatically generated.

          Ravi Prakash added a comment -

          Oh, and in that last bit of code, if fc is still open I would think it should be closed after the truncate. But it isn't in the current code. Can you see a reason why?

          I don't know how File Channels work, but in the constructor you can see that fc and fp are both derived from the same RandomAccessFile (rp). Could calling fp.close() automatically close fc too?
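          (As a quick standalone check of that question: the JDK documents that closing a RandomAccessFile also closes the channel obtained from it. A minimal sketch, with an illustrative temp file:)

          import java.io.*;
          import java.nio.channels.FileChannel;

          public class ChannelCloseCheck {
            public static void main(String[] args) throws IOException {
              File f = File.createTempFile("channel-close", ".dat");
              RandomAccessFile rp = new RandomAccessFile(f, "rw");
              FileChannel fc = rp.getChannel();
              rp.close();                       // closing the file also closes its channel
              System.out.println(fc.isOpen());  // prints "false"
              f.delete();
            }
          }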

          Ravi, I don't think this collides with HDFS-988, but please check.

          The diffstats didn't show any common files.

          $ diffstat hdfs-988-7.patch 
           java/org/apache/hadoop/hdfs/DFSOutputStream.java                                |    5 
           java/org/apache/hadoop/hdfs/server/namenode/BlockManager.java                   |    5 
           java/org/apache/hadoop/hdfs/server/namenode/FSDirectory.java                    |  804 ++--
           java/org/apache/hadoop/hdfs/server/namenode/FSEditLogLoader.java                |   10 
           java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java                   | 1891 +++++-----
           java/org/apache/hadoop/hdfs/server/namenode/LeaseManager.java                   |    1 
           test/hdfs/org/apache/hadoop/cli/TestHDFSCLI.java                                |    2 
           test/hdfs/org/apache/hadoop/hdfs/DFSTestUtil.java                               |    9 
           test/hdfs/org/apache/hadoop/hdfs/TestDecommission.java                          |    6 
           test/hdfs/org/apache/hadoop/hdfs/TestSafeMode.java                              |  208 -
           test/hdfs/org/apache/hadoop/hdfs/server/namenode/NameNodeAdapter.java           |   15 
           test/hdfs/org/apache/hadoop/hdfs/server/namenode/TestDeadDatanode.java          |    9 
           test/hdfs/org/apache/hadoop/hdfs/server/namenode/TestHeartbeatHandling.java     |    8 
           test/hdfs/org/apache/hadoop/hdfs/server/namenode/TestNNThroughputBenchmark.java |    1 
           test/hdfs/org/apache/hadoop/hdfs/server/namenode/TestSafeMode.java              |   90 
           test/unit/org/apache/hadoop/hdfs/server/namenode/TestNNLeaseRecovery.java       |   39 
           16 files changed, 1676 insertions(+), 1427 deletions(-)
          $ diffstat HDFS-2011.6.patch 
           java/org/apache/hadoop/hdfs/server/namenode/EditLogFileOutputStream.java |   31 +++-
           java/org/apache/hadoop/hdfs/server/namenode/NNStorage.java               |    6 
           test/hdfs/org/apache/hadoop/hdfs/server/namenode/TestCheckpoint.java     |   67 ++++++++++
           3 files changed, 94 insertions(+), 10 deletions(-)
          

          test-patch passed and I also ran my automated test twice just to be sure. Functionality doesn't seem to have collided.

          Ravi Prakash added a comment -

          Thanks Matt,

          Incorporated all your comments

          9. In "assertTrue("List of storage directories didn't have storageDirToCheck."...), did you intend to iterate over all elements in the list? You go to the trouble of creating an iterator, and then only use the first element.

          I meant to get the first element of the Collection (since that's what nnStorage.getEditsDirectories() returns).

          Ravi Prakash made changes -
          Attachment: HDFS-2011.6.patch [ 12483975 ]
          Ravi Prakash added a comment -

          Incorporated Matt's latest comments

          Matt Foley added a comment -

          Ravi, I don't think this collides with HDFS-988, but please check.

          Matt Foley added a comment -

          Oh, and in that last bit of code, if fc is still open I would think it should be closed after the truncate. But it isn't in the current code. Can you see a reason why?

          Matt Foley added a comment -

          Hi Ravi, looking a lot better. Here are a few more.

          TestCheckpoint.testEditLogFileOutputStreamCloses():
          1. Use the two-argument form of File ctor:
          File(System.getProperty("test.build.data","/tmp"), "editLogStream.dat")
          not
          File(System.getProperty("test.build.data","/tmp") + "editLogStream.dat")
          This will ensure the path delimiter is inserted correctly.
          2. in the "finally" clause:
          Again, there's no point in catch-then-assert, unless you need to do something in between. You can just let it fail. The point of the try/catch I recommended was so that it WOULDN'T fail, because failing could prevent any prior exception info from propagating. So the "catch" clause should use println to log the problem, but not fail or otherwise cause an assert.
          3. if you want to fine-tune that a little, you could have a variable "success" which is set to false at the beginning, and set to true at the end of the main body (before the "finally" clause). Then in this "catch" clause you could throw if success==true, but just println if !success. (A sketch follows right after this list.)
          4. If you need to use println or Assert to message an exception, you can use StringUtils.stringifyException(e), which prints the whole stack trace, vs e.toString(), which only prints one line of info. But "LOG" and "throw" messages allow using Exception objects as additional arguments, giving the same result as StringUtils.stringifyException() but with cleaner syntax.
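          (A minimal sketch of the success-flag pattern from point 3; testFile and the test body are illustrative assumptions:)

          boolean success = false;
          try {
            // ... main body of the test ...
            success = true;
          } finally {
            try {
              testFile.delete();  // cleanup of the hypothetical test file
            } catch (SecurityException se) {
              if (success) {
                throw se;         // nothing else failed, so surface the cleanup problem
              }
              System.out.println("cleanup failed after an earlier error: " + se);
            }
          }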

          testSetCheckpointTimeInStorageHandlesIOException():
          5. Use File(System.getProperty("test.build.data","/tmp"), "storageDirToCheck")
          instead of File (System.getProperty("test.build.data","/tmp") + "/storageDirToCheck/")
          6. and don't put a space between "File" and the following parenthesis. A ctor is a method call.
          7. You probably want to use mkdirs() rather than mkdir().
          8. Extra blanks before and after argument lists are against the coding style standard.
          9. In "assertTrue("List of storage directories didn't have storageDirToCheck."...), did you intend to iterate over all elements in the list? You go to the trouble of creating an iterator, and then only use the first element.
          10. Doesn't this routine also need a try/catch context that cleans up the created directory if an error occurs?

          EditLogFileOutputStream.close():
          11. Interesting problem. It looks like fc and fp are not directly dependent on bufCurrent and bufReady, but simply are likely to be null if bufCurrent and bufReady end up null, therefore I think we should still treat bufCurrent and bufReady as possibly valid/invalid separately. Can you tell if the problem value of fc is null or simply a file descriptor that is already closed? What if you use the previously suggested statements for bufCurrent and bufReady, followed by:

              // remove the last INVALID marker from transaction log.
              if (fc != null && fc.isOpen()) {
                fc.truncate(fc.position());
              }
              if (fp != null) {
                fp.close();
              }
          
          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12481396/HDFS-2011.5.patch
          against trunk revision 1131124.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/701//testReport/
          Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/701//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/701//console

          This message is automatically generated.

          Todd Lipcon added a comment -

          re SecurityException: I still don't see any reason that delete() would throw such an exception. AFAIK that only happens if a security manager is installed, which we never expect in unit tests.

          Will try to look over the new patch revision later today or early next week. Thanks!

          Ravi Prakash made changes -
          Attachment: HDFS-2011.5.patch [ 12481396 ]
          Ravi Prakash added a comment -

          Hi Todd,

          Thanks a lot for reviewing the patch; I continue to learn. I have followed all of your suggestions. The only note is that I am checking for SecurityException so that, in the finally block, it doesn't mask an IOException / NullPointerException that was possibly thrown in the try block. I hope that is fine.

          Todd Lipcon added a comment -

          A few style nits on HDFS-2011.4.patch:

          • please try to keep lines under 80 columns wide where possible. If it spills over to 85 or 90 here and there, not a huge deal, but 100+ columns should be avoided
          • why catch SecurityException? that looks very much out of place, and given it's an unchecked exception, you don't need to catch it at all
          • the assertTrue around mkdir() in testSetCheckpointTimeInStorageHandlesIOException should probably check "exists() || mkdir()". Or call deleteFully on it at the top of the test
          • you construct that same file path several times in the same test. Please just make it once as a constant
          • in the error messages, better to do something like: "Couldn't remove directory " + TEST_STORAGE_DIR.getAbsoluteFile(). That way the developer can easily track down the full path
          • alignment is off in NNStorage.java change
          • the comment referring to "edit and edits.new" in ELFOS is out of place - that class shouldn't know about details of how it's used. Instead it should read something like "// if already closed, just return"
          • TestCheckpoint inherits from TestCase, so you don't need to import org.junit.Assert
          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12481349/HDFS-2011.4.patch
          against trunk revision 1130870.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 4 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/696//testReport/
          Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/696//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/696//console

          This message is automatically generated.

          Ravi Prakash added a comment -

          Hi Matt,

          Thanks a ton for your review! I learned a lot from your detailed explanations. I followed all of your suggestions. A couple of things to note:

          1. To be able to throw exceptions like you suggested, I had to make my two functions individual jUnit tests. I hope that is fine. (Earlier they were being called from testCheckpoint() throws IOException)
          2. Thanks for the tip to use toURI. However, when I used new Path(System.getProperty("test.build.data","/tmp"), "storageDirToCheck").toUri(), the test failed saying

           
          Testcase: testSetCheckpointTimeInStorageHandlesIOException took 0.077 sec
                  Caused an ERROR
          Undefined scheme for /home/raviprak/Code/hadoop/hadoop-hdfs/build/test/data/storageDirToCheck
          java.io.IOException: Undefined scheme for /home/raviprak/Code/hadoop/hadoop-hdfs/build/test/data/storageDirToCheck
                  at org.apache.hadoop.hdfs.server.namenode.NNStorage.checkSchemeConsistency(NNStorage.java:348)
                  at org.apache.hadoop.hdfs.server.namenode.NNStorage.setStorageDirectories(NNStorage.java:306)
                  at org.apache.hadoop.hdfs.server.namenode.TestCheckpoint.testSetCheckpointTimeInStorageHandlesIOException(TestCheckpoint.java:179)
          

          So I changed it to use new File(...).toURI(). I hope that is fine too.
          3. In the comment, I meant to convey that the block of code was for when writeCheckpointTime incurred an IOException. I've removed the comment, seeing that it was already covered by the comment above it. Sorry for the ambiguity.
          4. When I separated the bufCurrent and bufReady cases, the test failed saying

           
          Testcase: testEditLogFileOutputStreamCloses took 0.042 sec
                  Caused an ERROR
          Bad file descriptor
          java.io.IOException: Bad file descriptor
                  at sun.nio.ch.FileChannelImpl.position0(Native Method)
                  at sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:284)
                  at org.apache.hadoop.hdfs.server.namenode.EditLogFileOutputStream.close(EditLogFileOutputStream.java:141)
                  at org.apache.hadoop.hdfs.server.namenode.TestCheckpoint.testEditLogFileOutputStreamCloses(TestCheckpoint.java:154)
          

          This was because these lines (more specifically the 1st) were still being called.

           
              // remove the last INVALID marker from transaction log.
              fc.truncate(fc.position());
              fp.close();
          

          I've let it remain the same.

          Show
          Ravi Prakash added a comment - Hi Matt, Thanks a ton for your review! I learned a lot from your detailed explanations. I followed all of your suggestions. Couple of things to note 1. To be able to throw exceptions like you suggested, I had to make my two functions individual jUnit tests. I hope that is fine. (Earlier they were being called from testCheckpoint() throws IOException) 2. Thanks for the tip to use toURI. However, when I used new Path(System.getProperty("test.build.data","/tmp"), "storageDirToCheck").toUri(), the test failed saying Testcase: testSetCheckpointTimeInStorageHandlesIOException took 0.077 sec Caused an ERROR Undefined scheme for /home/raviprak/Code/hadoop/hadoop-hdfs/build/test/data/storageDirToCheck java.io.IOException: Undefined scheme for /home/raviprak/Code/hadoop/hadoop-hdfs/build/test/data/storageDirToCheck at org.apache.hadoop.hdfs.server.namenode.NNStorage.checkSchemeConsistency(NNStorage.java:348) at org.apache.hadoop.hdfs.server.namenode.NNStorage.setStorageDirectories(NNStorage.java:306) at org.apache.hadoop.hdfs.server.namenode.TestCheckpoint.testSetCheckpointTimeInStorageHandlesIOException(TestCheckpoint.java:179) So I changed it to use new File(...).toURI(). I hope that is fine too. 3. In the comment, I meant to convey that the block of code was for when writeCheckpointTime incurred an IOException. I've removed the comment seeing that it had been already mentioned by the comment above it. Sorry for the ambiguity. 4. When I separated the bufCurrent and bufReady cases, the test failed saying Testcase: testEditLogFileOutputStreamCloses took 0.042 sec Caused an ERROR Bad file descriptor java.io.IOException: Bad file descriptor at sun.nio.ch.FileChannelImpl.position0(Native Method) at sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:284) at org.apache.hadoop.hdfs.server.namenode.EditLogFileOutputStream.close(EditLogFileOutputStream.java:141) at org.apache.hadoop.hdfs.server.namenode.TestCheckpoint.testEditLogFileOutputStreamCloses(TestCheckpoint.java:154) This was because these lines (more specifically the 1st) were still being called. // remove the last INVALID marker from transaction log. fc.truncate(fc.position()); fp.close(); I've let it remain the same.
          Ravi Prakash made changes -
          Attachment HDFS-2011.4.patch [ 12481349 ]
          Ravi Prakash added a comment -

          Incorporated Matt's and Konstantin's comments

          Konstantin Boudnik added a comment -

          There's also this error message
          + LOG.error("Problem erroring streams " + ioe);
          which is somewhat moot.

          Matt Foley added a comment -

          Hi Ravi, the logic of your changes is fine. The following comments are almost all regarding common usage in the Hadoop code base and unit tests.

          TestCheckpoint.checkEditLogFileOutputStreamCloses():

          • To obtain the build/test/data directory correctly, use System.getProperty("test.build.data","/tmp") rather than hardcoding it; then create your desired file relative to that directory.
          • I find "elfosFile" to be a very opaque name. Would it be reasonable to use something like "editLogStream" instead?
          • Instead of Assert.assertTrue("msg",false), use Assert.fail("msg").
          • But, there is no need to catch and message exceptions that shouldn't happen. Both "catch{}" clauses add no significant value compared to the stack trace that will be printed on exception, by junit. In fact, the stack trace from the catch-and-Assert is LESS informative than the original exception stack trace would have been, because it points into the catch clause instead of into where the exception actually occurred.
          • It's good that you bracket both the beginning and end with printlns that clearly state what is being tested; if an exception occurs the developer will immediately see what went wrong (with the help of the stack trace).
          • Within the "finally{}" clause, it might be a good idea to put the delete() call in its own try/catch context. If another exception happened, you wouldn't want to interfere with the original exception message, which carries the info you created the testcase to expose.

          checkSetCheckpointTimeInStorageHandlesIOException():

          • alFS and alES are also very opaque names. Hadoop doesn't subscribe to Hungarian naming, so the "al" prefix isn't needed. "FS" usually means FileSystem, which isn't the same as a StorageDirectory. So consider renaming these, perhaps to fsImageDirs and editsDirs.
          • First try/catch context: Again, there's no need to catch-and-Assert unexpected failures. If they occur, they will be duly reported by junit.
          • As before, the place to put the directories you create should be relative to System.getProperty("test.build.data","/tmp").
          • And to create the URIs, it is probably best to do something equivalent to "new Path(System.getProperty("test.build.data","/tmp"), "storageDirToCheck").toUri()". This will work around any filesystem path naming oddities.
          • In the assert, use listRsd.get(listRsd.size()-1) instead of listRsd.get(0), because the new element would be added to the end of the list – I think
          • It might be good to use nnStorage.getEditsDirectories() and/or nnStorage.getImageDirectories() before deleting the dir, to assure that the setStorageDirectories() had the expected result, and call nnStorage.getRemovedStorageDirs() before to assure that the list initially does not contain "storageDirToCheck".

          NNStorage.setCheckpointTimeInStorage():

          • In the comment "//Since writeCheckpointTime may also encounter an IOException in case underlying storage fails" substitute "reportErrorsOnDirectories()" for "writeCheckpointTime".
          • There is a singular reportErrorsOnDirectory() method. Could you use it instead of reportErrorsOnDirectories()? Then you wouldn't need to construct the ArrayList.
          • In the LOG.error if the second IOE happens, suggest
            LOG.error("Failed to report and remove NN storage directory " + sd.getRoot().getPath(), ioe);
            Besides clarifying the msg, note that "+ ioe" uses ioe.toString(), which only prints a single line about the exception, while using ", ioe" in a LOG argument list causes the entire stack trace to be printed.
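
            A tiny sketch of that logging distinction (hypothetical class; the Log API is the commons-logging one Hadoop already uses, so commons-logging must be on the classpath):

            import org.apache.commons.logging.Log;
            import org.apache.commons.logging.LogFactory;

            // Contrast of the two LOG.error forms discussed above.
            public class LogSketch {
                private static final Log LOG = LogFactory.getLog(LogSketch.class);

                public static void main(String[] args) {
                    Exception ioe = new java.io.IOException("No space left on device");
                    // "+ ioe" calls ioe.toString(): one line, no stack trace.
                    LOG.error("Failed to report and remove NN storage directory: " + ioe);
                    // Passing the exception as a second argument logs the full trace.
                    LOG.error("Failed to report and remove NN storage directory", ioe);
                }
            }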

          EditLogFileOutputStream.close():

          • Suggest you separate the bufCurrent and bufReady cases. Do:
            if (bufCurrent != null) {
              int bufSize = bufCurrent.size();
              if (bufSize != 0) {
                throw new IOException("FSEditStream has " + bufSize
                    + " bytes still to be flushed and cannot be closed.");
              }
              bufCurrent.close();
            }
            if (bufReady != null) {
              bufReady.close();
            }
            
          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12481129/HDFS-2011.3.patch
          against trunk revision 1130262.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 8 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/676//testReport/
          Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/676//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/676//console

          This message is automatically generated.

          Ravi Prakash made changes -
          Attachment HDFS-2011.3.patch [ 12481129 ]
          Ravi Prakash added a comment -

          Updated patch. Fixed some things I had overlooked.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12481119/HDFS-2011.patch
          against trunk revision 1129942.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 8 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these core unit tests:
          org.apache.hadoop.cli.TestHDFSCLI

          +1 contrib tests. The patch passed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/674//testReport/
          Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/674//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/674//console

          This message is automatically generated.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12481110/HDFS-2011.patch
          against trunk revision 1129942.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 8 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these core unit tests:
          org.apache.hadoop.hdfs.TestDFSUpgradeFromImage
          org.apache.hadoop.hdfs.TestHFlush

          +1 contrib tests. The patch passed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/672//testReport/
          Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/672//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/672//console

          This message is automatically generated.

          Ravi Prakash made changes -
          Attachment HDFS-2011.patch [ 12481119 ]
          Ravi Prakash added a comment -

          Granting license to ASF.

          Ravi Prakash added a comment -

          I ran test-patch. I also ran ant test and no new test failures were introduced. Can someone please review / commit the patch?

          Ravi Prakash made changes -
          Attachment HDFS-2011.patch [ 12481110 ]
          Ravi Prakash added a comment -

          HDFS-2011.patch

          Ravi Prakash added a comment -

          Thanks for your comments Todd and Matt!

          I'm working on a unit test. I'm almost done.

          Sorry for junking the emails. I've edited the JIRA and shortened the description. I promise it won't happen again. Thanks for the advice.

          Ravi Prakash added a comment -

          I had been automating tests to verify the removal and restoration of storage directories. I was testing by setting up a loopback file system, using it as one of the storage directories, and filling it up to make the checkpoint writes from the Hadoop namenode fail.
          Most of the time I would see the functionality work. However, very often I would see this exception in the logs:

          2011-05-29 23:34:30,241 WARN org.mortbay.log: /getimage: java.io.IOException: GetImage failed. java.io.IOException: No space left on device
          at java.io.FileOutputStream.writeBytes(Native Method)
          at java.io.FileOutputStream.write(FileOutputStream.java:297)
          at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.getFileClient(TransferFsImage.java:224)
          at org.apache.hadoop.hdfs.server.namenode.GetImageServlet$1$1.run(GetImageServlet.java:101)
          at org.apache.hadoop.hdfs.server.namenode.GetImageServlet$1$1.run(GetImageServlet.java:98)
          at java.security.AccessController.doPrivileged(Native Method)
          at javax.security.auth.Subject.doAs(Subject.java:416)
          at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1131)
          at org.apache.hadoop.hdfs.server.namenode.GetImageServlet$1.run(GetImageServlet.java:97)
          at org.apache.hadoop.hdfs.server.namenode.GetImageServlet$1.run(GetImageServlet.java:74)
          at java.security.AccessController.doPrivileged(Native Method)
          at javax.security.auth.Subject.doAs(Subject.java:416)
          at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1131)
          at org.apache.hadoop.hdfs.server.namenode.GetImageServlet.doGet(GetImageServlet.java:74)
          at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
          at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
          at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:502)
          at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1124)
          at org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:871)
          at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1115)
          at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:361)
          at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
          at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
          at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
          at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
          at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
          at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
          at org.mortbay.jetty.Server.handle(Server.java:324)
          at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)
          at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:864)
          at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:533)
          at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:207)
          at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:403)
          at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
          at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:522)

          In this case the storage directory wasn't taken offline; it would not be removed from the list. John George figured out this was because the IOException was happening in a code path from which the function to remove the corresponding storage directory wasn't being called.

          Also, very rarely, I would see this exception

          2011-04-05 17:36:56,187 INFO org.apache.hadoop.ipc.Server: IPC Server handler 87 on 8020, call getEditLogSize() from
          98.137.97.99:35862: error: java.io.IOException: java.lang.NullPointerException
          java.io.IOException: java.lang.NullPointerException
          at org.apache.hadoop.hdfs.server.namenode.EditLogFileOutputStream.close(EditLogFileOutputStream.java:109)
          at org.apache.hadoop.hdfs.server.namenode.FSEditLog.processIOError(FSEditLog.java:299)
          at org.apache.hadoop.hdfs.server.namenode.FSEditLog.getEditLogSize(FSEditLog.java:849)
          at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getEditLogSize(FSNamesystem.java:4270)
          at org.apache.hadoop.hdfs.server.namenode.NameNode.getEditLogSize(NameNode.java:1095)
          at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
          at java.lang.reflect.Method.invoke(Method.java:597)
          at org.apache.hadoop.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:346)
          at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1399)
          at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1395)
          at java.security.AccessController.doPrivileged(Native Method)
          at javax.security.auth.Subject.doAs(Subject.java:396)
          at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1094)
          at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1393)

          After this, the Secondary Namenode and the Namenode would go into an infinite loop of these NullPointerExceptions. John George figured out this was because close() was being called on the edit stream twice (so it was trying to close an edit stream which was already closed).
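
          For readers following along, a minimal stand-alone sketch (a toy class, not the real EditLogFileOutputStream) of why a second close() throws a NullPointerException once the first close() has released its buffer, and the null-guard shape that avoids it:

          import java.io.ByteArrayOutputStream;
          import java.io.IOException;

          // Toy stand-in: the first close() releases the buffer, so an unguarded
          // second close() would dereference null.
          public class DoubleCloseSketch {
              private ByteArrayOutputStream bufCurrent = new ByteArrayOutputStream();

              // Guarded close, mirroring the shape of the fix: a repeated call
              // becomes a no-op instead of a NullPointerException.
              public void close() throws IOException {
                  if (bufCurrent != null) {
                      bufCurrent.close();
                      bufCurrent = null;
                  }
              }

              public static void main(String[] args) throws IOException {
                  DoubleCloseSketch s = new DoubleCloseSketch();
                  s.close();
                  s.close(); // safe: guarded, so no NPE on the second call
              }
          }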

          Ravi Prakash made changes -
          Description Replaced the long description (preserved verbatim in the comment above) with the short one-paragraph summary now shown at the top of this issue.
          Eli Collins made changes -
          Link This issue relates to HDFS-1602 [ HDFS-1602 ]
          Matt Foley added a comment -

          Hi Ravi, for future reference please write a short Description field, then add the long details in a first Comment. The problem is the Description gets re-sent in every Jira email about the ticket. Thanks.

          Todd Lipcon added a comment -

          Any chance of unit tests for these?

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12480858/HDFS-2011.patch
          against trunk revision 1128987.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/659//console

          This message is automatically generated.

          Ravi Prakash made changes -
          Attachment HDFS-2011.patch [ 12480858 ]
          Ravi Prakash added a comment -

          This applies to commit a8cacc60847be89b5769741f0eb5f560cdb64691

          Ravi Prakash made changes -
          Status Open [ 1 ] → Patch Available [ 10002 ]
          Ravi Prakash created issue -

            People

            • Assignee: Ravi Prakash
            • Reporter: Ravi Prakash
            • Votes: 0
            • Watchers: 7