Hadoop HDFS / HDFS-4482

ReplicationMonitor thread can exit with NPE due to the race between delete and replication of same file.


Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.0.1-alpha, 3.0.0-alpha1
    • Fix Version/s: 2.0.5-alpha, 0.23.10
    • Component/s: namenode
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      Trace:

      java.lang.NullPointerException
      	at org.apache.hadoop.hdfs.server.namenode.FSDirectory.getFullPathName(FSDirectory.java:1442)
      	at org.apache.hadoop.hdfs.server.namenode.INode.getFullPathName(INode.java:269)
      	at org.apache.hadoop.hdfs.server.namenode.INodeFile.getName(INodeFile.java:163)
      	at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy.chooseTarget(BlockPlacementPolicy.java:131)
      	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1157)
      	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1063)
      	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:3085)
      	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:3047)
      	at java.lang.Thread.run(Thread.java:619)
      
      

      What I am seeing here is:

      1) Create a file and write it with 2 DNs.
      2) Close the file.
      3) Kill one DN.
      4) Let replication start.
      Info:

       // choose replication targets: NOT HOLDING THE GLOBAL LOCK
       // It is costly to extract the filename for which chooseTargets is called,
       // so for now we pass in the block collection itself.
       rw.targets = blockplacement.chooseTarget(rw.bc,
           rw.additionalReplRequired, rw.srcNode, rw.liveReplicaNodes,
           excludedNodes, rw.block.getNumBytes());

      Here we are choosing the targets outside the global lock. Inside chooseTarget we will try to get the src path from the blockCollection (which is nothing but the INodeFile here).

      See the code for FSDirectory#getFullPathName: it first increments a depth counter while walking up the parent chain, and later it iterates again, accessing each parent a second time in the next loop.

      5) If the client deletes the file before the second loop in FSDirectory#getFullPathName runs, the parent pointers will have been set to null. Accessing a parent can then cause the NPE, because this lookup is not done under the lock.
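The two-pass walk and the race window can be illustrated with a minimal, self-contained sketch. These are simplified stand-in classes, not the real org.apache.hadoop.hdfs.server.namenode.INode, and the Runnable hook only models the unsynchronized window between the two passes:

```java
public class GetFullPathNameRace {
    // Simplified stand-in for the namenode INode: a name plus a parent
    // pointer that a delete sets to null.
    static class INode {
        final String name;
        volatile INode parent;
        INode(String name, INode parent) { this.name = name; this.parent = parent; }
    }

    // Mirrors the two-pass structure of FSDirectory#getFullPathName.
    // betweenPasses models the window in which a concurrent delete can
    // run, since this code path is not under the namesystem lock.
    static String getFullPathName(INode inode, Runnable betweenPasses) {
        // Pass 1: count path components by following parent pointers.
        int depth = 0;
        for (INode i = inode; i != null; i = i.parent) {
            depth++;
        }
        betweenPasses.run(); // the race window
        // Pass 2: walk the parent chain again, for a fixed depth.
        String[] names = new String[depth];
        INode i = inode;
        for (int d = depth - 1; d >= 0; d--) {
            names[d] = i.name; // NPE here if a delete nulled a parent
            i = i.parent;
        }
        return String.join("/", names);
    }

    // Builds /a/b/file; the root is named "" so the joined path is absolute.
    static INode[] buildChain() {
        INode root = new INode("", null);
        INode a = new INode("a", root);
        INode b = new INode("b", a);
        INode file = new INode("file", b);
        return new INode[] { root, a, b, file };
    }

    // Without a concurrent delete the full path comes back intact.
    static String normalCase() {
        INode[] chain = buildChain();
        return getFullPathName(chain[3], () -> {});
    }

    // Nulling a parent pointer between the passes, as a delete would,
    // makes the fixed-depth second loop dereference null.
    static boolean raceCase() {
        INode[] chain = buildChain();
        try {
            getFullPathName(chain[3], () -> chain[2].parent = null);
            return false;
        } catch (NullPointerException expected) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(normalCase()); // prints /a/b/file
        System.out.println(raceCase());   // prints true
    }
}
```

Because pass 2 trusts the depth computed in pass 1, any truncation of the parent chain in between produces exactly the NPE in the trace above.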

      brahmareddy reported this issue.
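Independently of fixing the lookup itself, the symptom in the summary, the ReplicationMonitor thread exiting, suggests hardening the monitor loop so one bad iteration cannot kill the daemon. The sketch below shows that pattern only; the names and structure are assumptions, not the actual BlockManager$ReplicationMonitor code, and the attached patches may take a different approach:

```java
public class ResilientMonitor {
    // Runs the given unit of work for a fixed number of iterations,
    // surviving runtime exceptions from any single iteration instead of
    // letting the (hypothetical) monitor thread die. Returns how many
    // iterations completed without throwing.
    static int runIterations(Runnable work, int iterations) {
        int completed = 0;
        for (int n = 0; n < iterations; n++) {
            try {
                work.run();
                completed++;
            } catch (RuntimeException e) {
                // Log and continue: a transient race such as the
                // HDFS-4482 NPE should not terminate the whole loop.
            }
        }
        return completed;
    }

    public static void main(String[] args) {
        final int[] calls = {0};
        int completed = runIterations(() -> {
            calls[0]++;
            if (calls[0] == 2) {
                throw new NullPointerException("simulated HDFS-4482 race");
            }
        }, 3);
        System.out.println(completed); // prints 2: the failing iteration was skipped
    }
}
```

A real daemon would also log the exception and keep its sleep interval; the point is only that the failure stays contained to one pass of the loop.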

      Attachments

        1. HDFS-4482.patch
          2 kB
          Uma Maheswara Rao G
        2. HDFS-4482.patch
          0.8 kB
          Uma Maheswara Rao G
        3. HDFS-4482-1.patch
          2 kB
          Uma Maheswara Rao G

        Issue Links

          Activity

            People

              Assignee: umamaheswararao Uma Maheswara Rao G
              Reporter: umamaheswararao Uma Maheswara Rao G
              Votes: 0
              Watchers: 14

              Dates

                Created:
                Updated:
                Resolved: