[HDFS-2838] HA: NPE in FSNamesystem when in safe mode - ASF JIRA

Details

Type: Sub-task
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: HA branch (HDFS-1623)
Fix Version/s: HA branch (HDFS-1623)
Component/s: ha, namenode
Labels: None
Target Version/s: HA branch (HDFS-1623)
Hadoop Flags: Reviewed

Description

I'm seeing an NPE when running HBase 0.92 unit tests against the HA branch. The test failure is: org.apache.hadoop.hbase.regionserver.wal.TestHLog.testAppendClose.

Here is the backtrace:
java.lang.NullPointerException
at org.apache.hadoop.hdfs.server.blockmanagement.BlocksMap.size(BlocksMap.java:179)
at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.getActiveBlockCount(BlockManager.java:2465)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem$SafeModeInfo.doConsistencyCheck(FSNamesystem.java:3591)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem$SafeModeInfo.isOn(FSNamesystem.java:3285)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem$SafeModeInfo.access$900(FSNamesystem.java:3196)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.isInSafeMode(FSNamesystem.java:3670)
at org.apache.hadoop.hdfs.server.namenode.NameNode.isInSafeMode(NameNode.java:609)
at org.apache.hadoop.hdfs.MiniDFSCluster.isNameNodeUp(MiniDFSCluster.java:1476)
at org.apache.hadoop.hdfs.MiniDFSCluster.isClusterUp(MiniDFSCluster.java:1487)

Here is the relevant section of the test:

    try {
      DistributedFileSystem dfs = (DistributedFileSystem) cluster.getFileSystem();
      dfs.setSafeMode(FSConstants.SafeModeAction.SAFEMODE_ENTER);
      cluster.shutdown();
      try {
        // wal.writer.close() will throw an exception,
        // but still call this since it closes the LogSyncer thread first
        wal.close();
      } catch (IOException e) {
        LOG.info(e);
      }
      fs.close(); // closing FS last so DFSOutputStream can't call close
      LOG.info("STOPPED first instance of the cluster");
    } finally {
      // Restart the cluster
      while (cluster.isClusterUp()){
        LOG.error("Waiting for cluster to go down");
        Thread.sleep(1000);
      }
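
The loop above polls `cluster.isClusterUp()` with no upper bound. As a rough sketch only (the class `BoundedWait` and the method `waitUntilDown` are hypothetical names, not part of HBase or Hadoop), the same wait can be written with a deadline so that a stuck cluster fails the test instead of hanging it:

```java
import java.util.function.BooleanSupplier;

public class BoundedWait {

  /**
   * Polls stillUp until it returns false or timeoutMs elapses.
   * Returns true if the condition cleared in time, false on timeout or interrupt.
   * Hypothetical helper for illustration; not an HBase or Hadoop API.
   */
  static boolean waitUntilDown(BooleanSupplier stillUp, long timeoutMs, long intervalMs) {
    long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
    while (stillUp.getAsBoolean()) {
      if (System.nanoTime() - deadline >= 0) {
        return false; // timed out while the probe still reported "up"
      }
      try {
        Thread.sleep(intervalMs);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    int[] polls = {0};
    // Stand-in for cluster.isClusterUp(): reports "up" for the first two polls.
    boolean down = waitUntilDown(() -> polls[0]++ < 2, 1000, 10);
    System.out.println("down=" + down + " polls=" + polls[0]);
  }
}
```

Note that an exception thrown by the probe itself, as `isClusterUp()` does in this report, still escapes the loop; the deadline only protects against a probe that keeps returning true.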

Fix looks trivial, will include patch shortly.
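
To make the failure mode concrete, here is a minimal, self-contained sketch of what the trace and the test suggest: safe mode is entered, the cluster is shut down, and a later safe-mode query runs a consistency check that asks for a block count from a structure that no longer exists. Everything here is a toy model under stated assumptions: `ToyBlocksMap` and `ToySafeMode` are hypothetical stand-ins, the idea that shutdown nulls the backing map is an assumption, and the null guard is only one possible remedy. It is not the committed patch; see the attachments for the actual change to FSNamesystem.java.

```java
import java.util.HashMap;
import java.util.Map;

public class SafeModeNpeSketch {

  /** Toy stand-in for a block map; not the real BlocksMap. */
  static class ToyBlocksMap {
    private Map<Long, String> blocks = new HashMap<>();

    void put(long id, String name) {
      blocks.put(id, name);
    }

    /** Assumption: shutdown releases the backing map by nulling the reference. */
    void close() {
      blocks = null;
    }

    /** Unguarded size: dereferences the map directly. */
    int size() {
      return blocks.size();
    }

    /** Guarded size: treats a closed map as empty. */
    int sizeOrZero() {
      return blocks == null ? 0 : blocks.size();
    }
  }

  /** Toy stand-in for a safe-mode status query that also checks block counts. */
  static class ToySafeMode {
    private final ToyBlocksMap map;
    private final boolean guarded;

    ToySafeMode(ToyBlocksMap map, boolean guarded) {
      this.map = map;
      this.guarded = guarded;
    }

    boolean isOn() {
      // The consistency check runs as a side effect of the status query.
      int active = guarded ? map.sizeOrZero() : map.size();
      if (active < 0) {
        throw new IllegalStateException("negative block count");
      }
      return true; // manually entered safe mode stays on until explicitly left
    }
  }

  public static void main(String[] args) {
    ToyBlocksMap map = new ToyBlocksMap();
    map.put(1L, "blk_1");
    map.close(); // simulate shutdown while safe mode is still set

    try {
      new ToySafeMode(map, false).isOn();
      System.out.println("unguarded: no exception");
    } catch (NullPointerException e) {
      System.out.println("unguarded: NullPointerException");
    }
    System.out.println("guarded: isOn=" + new ToySafeMode(map, true).isOn());
  }
}
```

The point of the sketch is only that a status query with a side-effecting check can fail after shutdown even though the safe-mode object itself is still present.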

Attachments


HDFS-2838-v2.patch, 26/Jan/12 21:35, 3 kB, Gregory Chanan
HDFS-2838.patch, 25/Jan/12 02:19, 0.9 kB, Gregory Chanan

Activity

Eli Collins added a comment - 25/Jan/12 02:30

+1

Uma Maheswara Rao G added a comment - 25/Jan/12 02:34

Makes sense. I just verified against trunk; it looks like this bug is in the branch only.
BTW, could you please also provide a test that reproduces this issue?

Uma Maheswara Rao G added a comment - 25/Jan/12 02:40

Sorry, I did not notice Eli's review above.

Eli Collins added a comment - 25/Jan/12 03:01

No worries. Greg is going to take a stab at moving the kernel of TestHLog.testAppendClose into an HDFS test.

Todd Lipcon added a comment - 25/Jan/12 03:12

Looks good to me. Getting an HDFS test for this might be tricky, since this is only the case during startup, right?

Uma Maheswara Rao G added a comment - 25/Jan/12 03:52

+1. I just verified his sample test code, and it passes for me. Yes, it would be tricky to create the situation where the safe mode object is not null and the block manager is not completely up. Thanks, Greg, for the patch.

Gregory Chanan added a comment - 26/Jan/12 21:36

Added version 2 of the patch, which contains a test case that fails without the change and passes with it.

Eli Collins added a comment - 26/Jan/12 23:47

+1 nice test.

Eli Collins added a comment - 26/Jan/12 23:48

I've committed this. Thanks Greg!

Uma Maheswara Rao G added a comment - 27/Jan/12 02:01

Thanks, Greg.
Eli, does this test fail reliably for you without the fix? For me, it passes even without the fix.
It may be OK to keep this test, since it can at least reproduce the problem intermittently, which is better than nothing.
@Greg, a small suggestion: next time, you can use HdfsConstants instead of FSConstants.

Hudson added a comment - 27/Jan/12 12:57

Integrated in Hadoop-Hdfs-HAbranch-build #60 (See https://builds.apache.org/job/Hadoop-Hdfs-HAbranch-build/60/)
HDFS-2838. NPE in FSNamesystem when in safe mode. Contributed by Gregory Chanan

eli : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1236450
Files :

/hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/CHANGES.HDFS-1623.txt
/hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java
/hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestMiniDFSCluster.java

People

Assignee: Gregory Chanan
Reporter: Gregory Chanan
Votes: 0
Watchers: 5

Dates

Created: 25/Jan/12 02:11
Updated: 02/Mar/12 06:17
Resolved: 26/Jan/12 23:48
