Hadoop Common / HADOOP-4840

TestNodeCount sometimes fails with NullPointerException

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.18.3
    • Fix Version/s: 0.18.3
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      Testcase: testNodeCount took 9.628 sec
      Caused an ERROR

      java.lang.NullPointerException
      at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.countNodes(FSNamesystem.java:3523)
      at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.countNodes(FSNamesystem.java:3543)
      at org.apache.hadoop.hdfs.server.namenode.TestNodeCount.testNodeCount(TestNodeCount.java:64)

      1. nodeCountNPE1-br18.patch
        0.6 kB
        Hairong Kuang
      2. nodeCountNPE1.patch
        0.6 kB
        Hairong Kuang
      3. nodeCountNPE.patch
        0.6 kB
        Hairong Kuang

          Activity

          Hudson added a comment -

          Integrated in Hadoop-trunk #698 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/698/ )

          Hairong Kuang added a comment -

          I've just committed this.

          Hairong Kuang added a comment -

          patch for branch 0.18.

          Tsz Wo Nicholas Sze added a comment -

          +1 patch looks good.

          Tsz Wo Nicholas Sze added a comment -

          > Can we handle only the unit test failure in this jira and handle the non-synchronized call to countNodes in a different jira?

          Sure. Let's do it in a separate issue.

          Hairong Kuang added a comment -

          I totally agree with Nicholas's observation. Can we handle only the unit test failure in this jira and handle the non-synchronized call to countNodes in a different jira?

          Tsz Wo Nicholas Sze added a comment -

          FSNamesystem.countNodes(..) is called in many places including:

          • FSNamesystem.addStoredBlock(Block, DatanodeDescriptor, DatanodeDescriptor)
          • FSNamesystem.checkReplicationFactor(INodeFile)
          • FSNamesystem.decrementSafeBlockCount(Block)
          • FSNamesystem.getBlockLocationsInternal(String, INodeFile, long, long, int, boolean)
          • FSNamesystem.invalidateBlock(Block, DatanodeInfo)
          • FSNamesystem.isReplicationInProgress(DatanodeDescriptor)
          • FSNamesystem.markBlockAsCorrupt(Block, DatanodeInfo)
          • FSNamesystem.processMisReplicatedBlocks()
          • FSNamesystem.processPendingReplications()
          • FSNamesystem.updateNeededReplications(Block, int, int)

          However, some of them, e.g. getBlockLocationsInternal, call countNodes(..) without owning the FSNamesystem lock. This may cause an NPE at runtime.
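
One way to find call sites like these is to have the guarded method check lock ownership itself. The following is a minimal sketch, not the actual FSNamesystem code: the class `LockCheckSketch`, its `replicas` field, and the two caller methods are hypothetical stand-ins, and only `Thread.holdsLock` is a real JDK call.

```java
/** Hypothetical stand-in for a namesystem whose state is guarded by its own monitor. */
public class LockCheckSketch {
    // Stand-in for state that must only be read while holding the lock on "this".
    private int replicas = 3;

    /** Refuses to run unless the calling thread holds the monitor on this object. */
    public int countNodes() {
        if (!Thread.holdsLock(this)) {
            throw new IllegalStateException("countNodes called without the namesystem lock");
        }
        return replicas;
    }

    /** A caller that owns the lock, as a synchronized method does. */
    public synchronized int lockedCaller() {
        return countNodes();
    }

    /** A caller that forgets the lock; the guard makes the mistake visible. */
    public int unlockedCaller() {
        return countNodes();
    }

    public static void main(String[] args) {
        LockCheckSketch ns = new LockCheckSketch();
        System.out.println("locked caller sees " + ns.lockedCaller() + " replicas");
        try {
            ns.unlockedCaller();
        } catch (IllegalStateException e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```

A check of this kind turns an intermittent NPE into a deterministic failure at the offending call site, which may make the unsynchronized callers easier to list.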

          Hairong Kuang added a comment -

          I ran TestNodeCount 50 times on my local machine without seeing the NPE.

          Hairong Kuang added a comment -

          This patch adds synchronization on FSNamesystem in the failing test.

          Tsz Wo Nicholas Sze added a comment -

          I think we should synchronize fsnamesystem but not blocksMap.

          Hairong Kuang added a comment -

          The error was caused by an unsynchronized access to the list of block locations. The patch fixed the problem.
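
The pattern described here can be illustrated roughly as follows. This is a sketch under assumptions, not the committed patch: `NamesystemSketch`, `Replica`, and the method names are hypothetical, and the real test calls into FSNamesystem rather than a toy class. It shows a reader taking the same lock as the writers before walking the list of block locations.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the namesystem; not the real FSNamesystem. */
public class NamesystemSketch {
    /** Hypothetical replica record: a datanode name plus a corrupt flag. */
    static final class Replica {
        final String node;
        final boolean corrupt;

        Replica(String node, boolean corrupt) {
            this.node = node;
            this.corrupt = corrupt;
        }
    }

    // Guarded by "this": writers and readers must hold the namesystem lock.
    private final List<Replica> locations = new ArrayList<>();

    public synchronized void addReplica(String node, boolean corrupt) {
        locations.add(new Replica(node, corrupt));
    }

    public synchronized void removeReplica(String node) {
        locations.removeIf(r -> r.node.equals(node));
    }

    /** Counts non-corrupt replicas. The caller is expected to hold the lock on this object. */
    public int countLive() {
        int live = 0;
        for (Replica r : locations) {
            if (!r.corrupt) {
                live++;
            }
        }
        return live;
    }

    public static void main(String[] args) {
        NamesystemSketch ns = new NamesystemSketch();
        ns.addReplica("dn1", false);
        ns.addReplica("dn2", true);
        ns.addReplica("dn3", false);

        int live;
        // The test-side fix: take the namesystem lock around the read.
        synchronized (ns) {
            live = ns.countLive();
        }
        System.out.println("live replicas: " + live); // prints "live replicas: 2"
    }
}
```

Because the writers are synchronized on the same object, a reader inside the `synchronized (ns)` block never observes the list in the middle of an update.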


            People

            • Assignee:
              Hairong Kuang
              Reporter:
              Hairong Kuang