Hadoop HDFS / HDFS-615

TestLargeDirectoryDelete fails with NullPointerException

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.21.0
    • Fix Version/s: 0.21.0
    • Component/s: namenode
    • Labels:
      None
    • Environment:

      64-bit debian 5, 64-bit sun java6, running in a single processor VM.

      Description

      I've hit the following failure two out of two times running "ant test" at rev 813587. This test doesn't appear to be failing on hudson. All other tests passed except TestHDFSFileSystemContract which timed out, so perhaps there's a race due to the test executing slowly.

      [junit] Running org.apache.hadoop.hdfs.server.namenode.TestLargeDirectoryDelete
      [junit] Exception in thread "Thread-30148" java.lang.NullPointerException
      [junit] at org.apache.hadoop.hdfs.server.namenode.NameNodeAdapter.getNamesystem(NameNodeAdapter.java:32)
      [junit] at org.apache.hadoop.hdfs.MiniDFSCluster.getNamesystem(MiniDFSCluster.java:522)
      [junit] at org.apache.hadoop.hdfs.server.namenode.TestLargeDirectoryDelete.getBlockCount(TestLargeDirectoryDelete.java:75)
      [junit] at org.apache.hadoop.hdfs.server.namenode.TestLargeDirectoryDelete.access$000(TestLargeDirectoryDelete.java:38)
      [junit] at org.apache.hadoop.hdfs.server.namenode.TestLargeDirectoryDelete$1.run(TestLargeDirectoryDelete.java:90)
      [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 94.264 sec

      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ No failures or errors?

      public static FSNamesystem getNamesystem(NameNode namenode) {
        return namenode.getNamesystem();   // <=========== NPE thrown here
      }
      Attachments

      1. TestLargeDirectoryDelete.zip (798 kB) - Anatoli Fomenko
      2. HDFS-615.patch (0.6 kB) - Konstantin Boudnik

          Activity

          Konstantin Boudnik added a comment -

          It seems to me that what MiniDFSCluster.waitActive() does contradicts the semantics of the call: it checks whether the namenode reference is null and, if it is, simply returns. As a result, the very first call that needs the namenode reference hits an NPE immediately.

          Shall it be fixed along the lines of actually waiting for the namenode to start?
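
          A minimal sketch of what such a wait might look like (illustrative only; the waitForNamenode() name, the timeout value, and direct access to a nameNode field are assumptions, not MiniDFSCluster's actual API):

             // Hypothetical sketch: poll until the namenode reference is available
             // instead of returning immediately while it is still null.
             private void waitForNamenode(long timeoutMillis) throws IOException {
               long deadline = System.currentTimeMillis() + timeoutMillis;
               while (nameNode == null) {                  // assumed MiniDFSCluster field
                 if (System.currentTimeMillis() > deadline) {
                   throw new IOException("NameNode did not start within " + timeoutMillis + " ms");
                 }
                 try {
                   Thread.sleep(100);                      // back off briefly before re-checking
                 } catch (InterruptedException ie) {
                   Thread.currentThread().interrupt();
                   throw new IOException("Interrupted while waiting for the NameNode to start");
                 }
               }
             }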

          Anatoli Fomenko added a comment -

          This issue is currently consistently observed on my Mac OS 10.5.8 under Java 1.6.0_15-b03-226 with HDFS-840.1.patch installed.

          Anatoli Fomenko added a comment -

          Stacktrace:

          java.io.IOException: Cannot lock storage ./Hadoop-HDFS/build/test/data/dfs/name1. The directory is already locked.
          	at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.lock(Storage.java:524)
          	at org.apache.hadoop.hdfs.server.namenode.FSImage.format(FSImage.java:1278)
          	at org.apache.hadoop.hdfs.server.namenode.FSImage.format(FSImage.java:1301)
          	at org.apache.hadoop.hdfs.server.namenode.NameNode.format(NameNode.java:1126)
          	at org.apache.hadoop.hdfs.server.namenode.NameNode.format(NameNode.java:175)
          	at org.apache.hadoop.hdfs.MiniDFSCluster.<init>(MiniDFSCluster.java:280)
          	at org.apache.hadoop.hdfs.MiniDFSCluster.<init>(MiniDFSCluster.java:132)
          	at org.apache.hadoop.hdfs.server.namenode.TestFsck.testFsckError(TestFsck.java:354)
          
          Anatoli Fomenko added a comment -

          My apologies, the stacktrace in my previous comment belongs to a different test.

          The correct stacktrace:

             [junit] Running org.apache.hadoop.hdfs.server.namenode.TestLargeDirectoryDelete
             [junit] Exception in thread "Thread-30151" java.lang.NullPointerException
             [junit]     at org.apache.hadoop.hdfs.server.namenode.NameNodeAdapter.getNamesystem(NameNodeAdapter.java:32)
             [junit]     at org.apache.hadoop.hdfs.MiniDFSCluster.getNamesystem(MiniDFSCluster.java:544)
             [junit]     at org.apache.hadoop.hdfs.server.namenode.TestLargeDirectoryDelete.getBlockCount(TestLargeDirectoryDelete.java:77)
             [junit]     at org.apache.hadoop.hdfs.server.namenode.TestLargeDirectoryDelete.access$000(TestLargeDirectoryDelete.java:40)
             [junit]     at org.apache.hadoop.hdfs.server.namenode.TestLargeDirectoryDelete$2.run(TestLargeDirectoryDelete.java:113)
          
          Suresh Srinivas added a comment -

          Can you please attach the whole test log to the JIRA?

          Anatoli Fomenko added a comment -

          The test log is attached.

          Konstantin Boudnik added a comment -

          Even though it looks like a threading issue, I have a strong feeling that MiniDFSCluster#waitActive() does the wrong thing, so I'd suggest including something along the lines of the attached patch.

          Konstantin Shvachko added a comment -

          Is this related to HDFS-816?

          Steve Loughran added a comment -

          I'm seeing this too today, on OS X, on my lifecycle-enhanced branch:

             [junit] Running org.apache.hadoop.hdfs.server.namenode.TestLargeDirectoryDelete
              [junit] Exception in thread "Thread-30145" java.lang.NullPointerException
              [junit] 	at org.apache.hadoop.hdfs.server.namenode.TestLargeDirectoryDelete.getBlockCount(TestLargeDirectoryDelete.java:77)
              [junit] 	at org.apache.hadoop.hdfs.server.namenode.TestLargeDirectoryDelete.access$000(TestLargeDirectoryDelete.java:40)
              [junit] 	at org.apache.hadoop.hdfs.server.namenode.TestLargeDirectoryDelete$2.run(TestLargeDirectoryDelete.java:113)
              [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 69.069 sec
          

          Interesting elapsed time: just over a minute. That's the kind of spin-for-over-a-minute situation that usually implies some timeout has happened but hasn't been fed upwards.

          Steve Loughran added a comment -

          The line in question has two ways to go null: either the minicluster reference is null, or the cluster is up but the namesystem itself is null:

          return (int)mc.getNamesystem().getBlocksTotal();
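
          A defensive variant of the helper along these lines would at least say which of the two was null (a sketch only; these checks are not in the current test):

             // Hypothetical defensive version of TestLargeDirectoryDelete.getBlockCount():
             // fail with a clear message instead of an opaque NPE from a background thread.
             protected int getBlockCount() {
               if (mc == null) {
                 throw new IllegalStateException("MiniDFSCluster reference is null");
               }
               FSNamesystem namesystem = mc.getNamesystem();
               if (namesystem == null) {
                 throw new IllegalStateException(
                     "Namesystem is null (namenode not started or already shut down)");
               }
               return (int) namesystem.getBlocksTotal();
             }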
          
          Steve Loughran added a comment -

          The reason the NPEs aren't failing the test is that they are happening in separate threads - they are not being picked up by the test runner and recognised as test failures. As a result, this test may be failing on other people's machines too; it's only on systems where people are looking at the output that the NPE is being spotted.
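
          One common way to make such failures visible (a sketch only, not the HDFS-1170 change itself) is to record the worker thread's exception and re-throw it from the test thread:

             // Sketch: inside a test method that declares "throws Exception". Capture anything
             // thrown by the worker thread so the JUnit thread can fail on it, instead of the
             // error only reaching stderr. Assumes java.util.concurrent.atomic.AtomicReference.
             final AtomicReference<Throwable> threadError = new AtomicReference<Throwable>();
             Thread worker = new Thread(new Runnable() {
               public void run() {
                 try {
                   // ... work that may throw, e.g. polling getBlockCount() ...
                 } catch (Throwable t) {
                   threadError.set(t);          // remember the failure rather than losing it
                 }
               }
             });
             worker.start();
             // ... drive the rest of the test ...
             worker.join();
             if (threadError.get() != null) {
               throw new RuntimeException("Worker thread failed", threadError.get());
             }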

          Steve Loughran added a comment -

          I don't know the root cause of the original stack; mine is slightly different and is triggered by a null NN, which only happens sometimes and which goes away with even minor changes to the code or system usage. This makes me think it's triggered by the namenode being set to null when the NN shuts down, which would explain why the failure only kicks in after 60-90 seconds: it's not a test failure so much as a race condition in the shutdown logic.

          Steve Loughran added a comment -

          I believe the NPEs seen here and the HDFS-816 failures are all shutdown race conditions: the test case just interrupts the threads and immediately shuts down the cluster.

          1. There's no attempt to wait for the threads to finish.
          2. Interrupts are only acted on by threads in certain operations (like sleep()), even if they were delivered earlier.

          The patches in HDFS-1170 should either fix this or make the failures more visible, by actually catching the problems and reporting them (see the ordering sketch below).
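
          For illustration, the ordering being described would look roughly like this (a sketch under the assumption of a workers list and a cluster field; not the HDFS-1170 patch itself):

             // Sketch: stop the worker threads and wait for them to exit *before*
             // tearing down the MiniDFSCluster, so no thread touches a namenode that
             // is already shutting down. Assumes a method declaring "throws InterruptedException".
             for (Thread t : workers) {
               t.interrupt();                  // ask each worker to stop
             }
             for (Thread t : workers) {
               t.join(30 * 1000);              // bounded wait for it to actually finish
             }
             cluster.shutdown();               // only now tear the cluster down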

          Eli Collins added a comment -

          I'm seeing this test time out and also getting NPEs on trunk (including up to HDFS-1038) on my Linux host, and also on Hudson over on HDFS-1161.

          Konstantin Shvachko added a comment -

          Looks like this causes real pain for people. It should probably be fixed for 0.21, as a release cannot go ahead with failing tests.

          Steve Loughran added a comment -

          HDFS-1170 should correct this, so try it out on your systems.

          Eli Collins added a comment -

          Looped the test after HDFS-1170 and cannot reproduce; closing.


            People

            • Assignee: Unassigned
            • Reporter: Eli Collins
            • Votes: 0
            • Watchers: 9
