Hadoop HDFS
HDFS-8113

Add check for null BlockCollection pointers in BlockInfoContiguous structures


      Description

      The following copy constructor can throw a NullPointerException if bc is null.

        protected BlockInfoContiguous(BlockInfoContiguous from) {
          this(from, from.bc.getBlockReplication());
          this.bc = from.bc;
        }
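
One way to harden the constructor is to fall back to the block's own replication factor when bc is null. The following is a minimal, self-contained sketch using simplified stand-in types, not the real HDFS classes; the guarded fallback shown here is an illustration, not necessarily the fix the patch takes.

```java
// Stand-in types (simplified) illustrating a null-safe copy constructor.
interface BlockCollection {
    short getBlockReplication();
}

class BlockInfoContiguous {
    BlockCollection bc;
    private final short replication;

    BlockInfoContiguous(BlockInfoContiguous from, short replication) {
        this.replication = replication;
    }

    // Guarded copy constructor: when the source block's BlockCollection is
    // null (e.g. the block was just dis-associated from its file during a
    // delete), fall back to the source's own stored replication factor
    // instead of dereferencing bc.
    protected BlockInfoContiguous(BlockInfoContiguous from) {
        this(from, from.bc == null ? from.replication
                                   : from.bc.getBlockReplication());
        this.bc = from.bc;
    }

    short getReplication() { return replication; }
}
```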
      

      We have observed that some DataNodes keep failing their block reports to the NameNode. The stack trace is as follows. Although we are not running the latest version, the problem still exists there.

      2015-03-08 19:28:13,442 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: RemoteException in offerService
      org.apache.hadoop.ipc.RemoteException(java.lang.NullPointerException): java.lang.NullPointerException
      at org.apache.hadoop.hdfs.server.blockmanagement.BlockInfo.<init>(BlockInfo.java:80)
      at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$BlockToMarkCorrupt.<init>(BlockManager.java:1696)
      at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.checkReplicaCorrupt(BlockManager.java:2185)
      at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.processReportedBlock(BlockManager.java:2047)
      at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.reportDiff(BlockManager.java:1950)
      at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.processReport(BlockManager.java:1823)
      at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.processReport(BlockManager.java:1750)
      at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.blockReport(NameNodeRpcServer.java:1069)
      at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolServerSideTranslatorPB.blockReport(DatanodeProtocolServerSideTranslatorPB.java:152)
      at org.apache.hadoop.hdfs.protocol.proto.DatanodeProtocolProtos$DatanodeProtocolService$2.callBlockingMethod(DatanodeProtocolProtos.java:26382)
      at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
      at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
      at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
      at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
      at java.security.AccessController.doPrivileged(Native Method)
      at javax.security.auth.Subject.doAs(Subject.java:415)
      at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1623)
      at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)

      1. HDFS-8113.02.patch
        2 kB
        Chengbing Liu
      2. HDFS-8113.patch
        0.8 kB
        Chengbing Liu

          Activity

          chengbing.liu Chengbing Liu added a comment -

          Uploaded a patch to fix this.

          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12724205/HDFS-8113.patch
          against trunk revision 6495940.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/10230//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/10230//console

          This message is automatically generated.

          vinayrpet Vinayakumar B added a comment -

          +1, patch lgtm

          brahmareddy Brahma Reddy Battula added a comment -

          LGTM, +1 (non-binding)

          atm Aaron T. Myers added a comment -

          Hi Chengbing, can you elaborate a bit on what circumstances in the cluster cause this to happen? Clearly it doesn't always occur, so I'm curious what triggers it. Also, have you been able to identify when this issue was introduced?

          chengbing.liu Chengbing Liu added a comment -

          Hi Aaron, I have no idea where the corrupt replica comes from. From the logs I can see that the problem had been there for at least one month. A few DataNodes have the same problem: their block reports to the NameNode always throw the described exception. After the exception, they wait one second before attempting another block report, and so on.

          However, HDFS as a whole still works. The only bad effect is that the average RPC processing time is around 10 ms, which is high compared to our other clusters; we expect it to be around 1 ms.

          Does that answer your question?

          atm Aaron T. Myers added a comment -

          Hi Chengbing, sorry, I should have been more clear. I meant to ask what exact circumstances cause the BlockCollection to be null in the code, which is presumably what causes this NPE. Is it just whenever a replica becomes corrupted? Or does it take a confluence of other factors as well? I'd expect it to be the latter, since I'd expect our existing tests to have caught this if a single corrupt replica alone would cause it. It'd be nice to write a test for this, but to do that we'd need to understand the precise mechanism by which a BlockCollection can end up being null at the NN.

          As for when this was introduced, I meant when was this bug introduced in the code, i.e. what version introduced this, or ideally what specific code change?

          chengbing.liu Chengbing Liu added a comment -

          Aaron, thanks for the clarification. I agree with you that we should find out what causes the BlockCollection to be null. I will look into this shortly.

          In my opinion, we should divide the issue into two: the problem with BlockInfoContiguous itself and the probable misuse of it.

          For the problem with BlockInfoContiguous itself, the class cannot guarantee that callers have set the BlockCollection before invoking the copy constructor. The code is present in the earliest commit I can see on GitHub, HADOOP-7560 from Aug 25, 2011.

          The second problem, the misuse of BlockInfoContiguous, might have been introduced recently. Should we deal with it in another JIRA?

          chengbing.liu Chengbing Liu added a comment -

          The following code in BlockManager#processReportedBlock returns BlockInfoContiguous with BlockCollection equal to null:

          BlockInfoContiguous storedBlock = blocksMap.getStoredBlock(block);
          

          There are two methods that can add entries to blocksMap:

          • BlocksMap#addBlockCollection(BlockInfoContiguous b, BlockCollection bc), we should check whether bc is null.
          • BlocksMap#replaceBlock(BlockInfoContiguous newBlock), we should check whether newBlock.getBlockCollection() is null.

          Both methods are called from many places. To get more debug information, I think we should at least log it as WARN or ERROR if the BlockCollection happens to be null.
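
A defensive version of the proposed check might look like the sketch below. The helper name, signature, and logging wiring are hypothetical, not the actual BlocksMap code; the point is simply to make orphaned entries visible in the NameNode log.

```java
import java.util.logging.Logger;

// Hypothetical helper: warn whenever a block would enter blocksMap without
// an associated BlockCollection, so orphaned entries show up in the log.
class OrphanCheck {
    private static final Logger LOG = Logger.getLogger("BlocksMap");

    // Returns true if the block would be stored as an orphan (bc == null).
    static boolean checkForOrphan(Object block, Object bc) {
        if (bc == null) {
            LOG.warning("Block " + block
                + " entered blocksMap with a null BlockCollection (orphaned?)");
            return true;
        }
        return false;
    }
}
```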

          cmccabe Colin P. McCabe added a comment -

          It seems like BlockCollection will be null if the block doesn't belong to any file. We should also have a unit test for this. I was thinking:

          1. start a mini dfs cluster with 2 datanodes
          2. create a file with repl=2 and close it
          3. take down one DN
          4. delete the file
          5. wait
          6. bring back up the other DN, which will still have the block from the file which was deleted

          chengbing.liu Chengbing Liu added a comment -

          Hi Aaron T. Myers and Colin P. McCabe, from the stack trace we know that the reportedState is RBW or RWR, and that the condition
          storedBlock.getGenerationStamp() != reported.getGenerationStamp() is satisfied. Since storedBlock is an entry in blocksMap, the file/block should not have been deleted.

          I did some tests using MiniDFSCluster. The result is as follows:

          • If a file is deleted, then BlockInfo is removed from blocksMap.
          • If a file is not deleted, then BlockInfo.bc is the file, which cannot be null.

          I'm wondering how it could happen that a block exists in the blocksMap yet does not belong to any file. Could you explain this? Thanks!

          vinayrpet Vinayakumar B added a comment -

          There is one possibility.
          The actual file delete and the block removal happen under separate write locks in FSNamesystem.
          Under the first lock, the INode is deleted, which sets blockInfo.bc to null, as in INodeFile#destroyAndCollectBlocks():

                for (BlockInfo blk : blks) {
                  collectedBlocks.addDeleteBlock(blk);
                  blk.setBlockCollection(null);
                }

          If a block report arrives before the blocks are actually removed from blocksMap in FSNamesystem#removeBlocks, it will find blockInfo.bc as null.
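
The two-phase delete race described above can be sketched with a toy model. All names here are illustrative stand-ins, not the real FSNamesystem/BlocksMap code:

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the race: phase one dis-associates the block from its file
// (bc = null) under one write lock; the block is only removed from the map
// under a later write lock. A block report arriving in the window between
// the two phases sees a mapped block whose BlockCollection is null.
class TwoPhaseDeleteModel {
    static class Block { Object bc = new Object(); }

    final Map<String, Block> blocksMap = new HashMap<>();

    void phaseOneDeleteInode(String id) {   // under the first write lock
        blocksMap.get(id).bc = null;        // blk.setBlockCollection(null)
    }

    void phaseTwoRemoveBlocks(String id) {  // under the second write lock
        blocksMap.remove(id);
    }

    // What a block report observes: a mapped block with a null bc.
    boolean reportSeesOrphan(String id) {
        Block b = blocksMap.get(id);
        return b != null && b.bc == null;
    }
}
```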

          chengbing.liu Chengbing Liu added a comment -

          Vinayakumar B Actually, whenever I start the problematic DataNode, the NPE happens on every block report. That does not look like the transient window you describe. Is it possible that the file was deleted without its blocks being removed?

          qwertymaniac Harsh J added a comment -

          Stale block copies left over on the DN can cause the condition - it
          indeed goes away if you clear out the RBW directory on the DN.

          Imagine this condition:
          1. File is being written. Has a replica on node X among others.
          2. Replica write to node X in the pipeline fails. The write carries on,
          leaving a stale block copy in the RBW directory of node X.
          3. File gets closed and deleted soon/immediately after (but well
          before a block report from X).
          4. Block report now sends the RBW info, but the NN has no knowledge of
          the block anymore.

          I think modifying Colin's test this way should reproduce the issue:

          1. start a mini dfs cluster with 2 datanodes
          2. create a file with repl=2, but do not close it (flush it to ensure
          on-disk RBW write)
          3. take down one DN
          4. close and delete the file
          5. wait
          6. bring back up the other DN, which will still have the RBW block
          from the file which was deleted



          cmccabe Colin P. McCabe added a comment -

          Thanks for the explanation, guys. I wasn't aware of the invariant that BlockInfoContiguous structures with bc == null were not in the BlocksMap. I think we should remove this invariant, and instead simply have the BlocksMap contain all the blocks. The memory savings from keeping them out are trivial, since the number of blocks without associated inodes should be very small. I think we can just check whether the INode field is null when appropriate. That seems to be the direction that the patch here is taking, and I think it makes sense.

          atm Aaron T. Myers added a comment -

          That all makes sense to me as well.

          Chengbing Liu - would you be up for adding a unit test to this patch as Harsh and Colin have described?

          chengbing.liu Chengbing Liu added a comment -

          Hi Harsh J and Aaron T. Myers, this is one of the test sequences I tried yesterday, but I was still not able to reproduce the issue. The problem is that if you delete the file, the block will not be in blocksMap, so we won't be able to reproduce it.

          To reproduce this, we must make sure that the blockInfo is in blocksMap and blockInfo.bc == null. I tried several test sequences with no luck.

          I just tried cleaning the rbw directory and restarting the DataNode. However, the problem still exists. Do you have any ideas about this?

          And Colin P. McCabe, are you suggesting the patch here is OK as-is, or that we should additionally check for null at each storedBlock.getBlockCollection() call site?

          cmccabe Colin P. McCabe added a comment -

          There are already a bunch of places in the code where we check whether BlockCollection is null before doing something with it. Example:

              if (block instanceof BlockInfoContiguous) {
                BlockCollection bc = ((BlockInfoContiguous) block).getBlockCollection();
                String fileName = (bc == null) ? "[orphaned]" : bc.getName();
                out.print(fileName + ": ");
              }
          

          also:

            private int getReplication(Block block) {
              final BlockCollection bc = blocksMap.getBlockCollection(block);
              return bc == null? 0: bc.getBlockReplication();
            }
          

          I think that the majority of cases already have a check. My suggestion is just that we extend this checking against null to all uses of the BlockInfoContiguous structure's block collection.

          If the problem is too difficult to reproduce with a MiniDFSCluster, perhaps we can just do a unit test of the copy constructor itself.

          As I said earlier, I don't understand the rationale for keeping blocks with no associated INode out of the BlocksMap. It complicates the block report since it requires us to check whether each block has an associated inode or not before adding it to the BlocksMap. But if that change seems too ambitious for this JIRA, we can deal with that later.

          vinayrpet Vinayakumar B added a comment -

          As I said earlier, I don't understand the rationale for keeping blocks with no associated INode out of the BlocksMap. It complicates the block report since it requires us to check whether each block has an associated inode or not before adding it to the BlocksMap. But if that change seems too ambitious for this JIRA, we can deal with that later.

          From what I can see in the trunk code, it is not kept for long. In the case of a deletion, after being dis-associated from the file under the main write lock, blocks stay in the blocksMap until a different write lock is acquired in the same RPC, at which point they are removed.
          This is just to avoid holding the write lock for a long time when deleting a big directory.
          But I don't see any case where a block is kept in the blocksMap for a long time without an associated file.

          chengbing.liu Chengbing Liu added a comment -

          Added a unit test for the copy constructor.

          I suggest dealing with null-checks in another JIRA, since there might be some discussions on how to handle these "null" situations.

          vinayrpet Vinayakumar B added a comment -

          Hi Chengbing Liu,
          If you are constantly getting this NPE, could you re-run with BlockManager DEBUG logging enabled?
          You can enable the debug log via the NameNode web UI without restarting the NN: http://<NNip:port>/logLevel

          You should then see debug output from the code below:

              if(LOG.isDebugEnabled()) {
                LOG.debug("Reported block " + block
                    + " on " + dn + " size " + block.getNumBytes()
                    + " replicaState = " + reportedState);
              }

          After that you can anaylize for which block its throwing NPE, and based on that can analyze the state of the file and operations on that.

          chengbing.liu Chengbing Liu added a comment -

          Thanks Vinayakumar B for your advice! I got the following debug logs.

          2015-04-17 15:38:54,801 DEBUG org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Reported block blk_1143745403_70011665 on 10.153.80.84:1004 size 2631763 replicaState = FINALIZED
          2015-04-17 15:38:54,801 DEBUG org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: In memory blockUCState = COMPLETE
          2015-04-17 15:38:54,801 DEBUG org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Reported block blk_1185006557_111278782 on 10.153.80.84:1004 size 19005434 replicaState = FINALIZED
          2015-04-17 15:38:54,801 DEBUG org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Reported block blk_1189413471_115690616 on 10.153.80.84:1004 size 99678737 replicaState = FINALIZED
          2015-04-17 15:38:54,801 DEBUG org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Reported block blk_1171261663_97530254 on 10.153.80.84:1004 size 13847 replicaState = FINALIZED
          2015-04-17 15:38:54,801 DEBUG org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Reported block blk_1149751102_76017688 on 10.153.80.84:1004 size 6702 replicaState = FINALIZED
          2015-04-17 15:38:54,801 DEBUG org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: In memory blockUCState = COMPLETE
          2015-04-17 15:38:54,801 WARN org.apache.hadoop.ipc.Server: IPC Server handler 109 on 8020, call org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.blockReport from 10.153.80.84:38504 Call#4258262 Retry#0
          java.lang.NullPointerException

          The stack trace is missing due to a default JVM optimization: -XX:+OmitStackTraceInFastThrow is on by default and I did not disable it, so the JIT recompiles a method to throw without a stack trace once it has thrown the same exception many times. The stack trace in the issue description was taken from a DN log a month ago.

          From the above logs, it is a FINALIZED block in a report that caused the NPE. So the stack trace in the description is incorrect. Really sorry for that.

          Then I checked the last block blk_1149751102_76017688 with oiv against the fsimage. The file is OK; I can download it through the FS shell. I also checked all three DNs containing this block, and they all have the same file, genstamp and meta. It seems the active NameNode is holding incorrect information about this block.

          chengbing.liu Chengbing Liu added a comment -

          Thanks Vinayakumar B for your advice! I got the following debug logs:

          2015-04-17 15:38:54,801 DEBUG org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Reported block blk_1143745403_70011665 on 10.153.80.84:1004 size 2631763 replicaState = FINALIZED
          2015-04-17 15:38:54,801 DEBUG org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: In memory blockUCState = COMPLETE
          2015-04-17 15:38:54,801 DEBUG org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Reported block blk_1185006557_111278782 on 10.153.80.84:1004 size 19005434 replicaState = FINALIZED
          2015-04-17 15:38:54,801 DEBUG org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Reported block blk_1189413471_115690616 on 10.153.80.84:1004 size 99678737 replicaState = FINALIZED
          2015-04-17 15:38:54,801 DEBUG org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Reported block blk_1171261663_97530254 on 10.153.80.84:1004 size 13847 replicaState = FINALIZED
          2015-04-17 15:38:54,801 DEBUG org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Reported block blk_1149751102_76017688 on 10.153.80.84:1004 size 6702 replicaState = FINALIZED
          2015-04-17 15:38:54,801 DEBUG org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: In memory blockUCState = COMPLETE
          2015-04-17 15:38:54,801 WARN org.apache.hadoop.ipc.Server: IPC Server handler 109 on 8020, call org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.blockReport from 10.153.80.84:38504 Call#4258262 Retry#0
          java.lang.NullPointerException

          The stacktrace is missing due to a JVM default optimization: OmitStackTraceInFastThrow is on by default, and I didn't unset it. The JVM will recompile a method if it has thrown an exception too many times, omitting the stacktrace. The stacktrace in the issue description was obtained from a DN a month ago. From the above logs, it is a FINALIZED block in a report that caused the NPE, so the stacktrace in the description is incorrect. Really sorry for that.

          Then I checked the last block blk_1149751102_76017688 with oiv against the fsimage. The file is OK; I can download it through the FS shell. I also checked all three DNs containing this block, and they all have the same file, genstamp and meta. It seems the active NameNode is holding incorrect information on this block.
          vinayrpet Vinayakumar B added a comment -

          Then I checked the last block blk_1149751102_76017688 with oiv against fsimage. The file is OK. I can download it through FS shell. I also checked all three DNs containing this block, and they all have the same file, genstamp and meta. It seems the active NameNode's holding incorrect information on this block.

          Is the genstamp on all the other nodes 76017688, or different?
          Because as far as I can see, only a genstamp-mismatch case could lead to the specified stacktrace for a FINALIZED replica of a COMPLETE block.
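The mismatch case being referred to can be illustrated with a simplified, self-contained sketch (this condenses the FINALIZED/COMPLETE branch of BlockManager.checkReplicaCorrupt into one predicate; it is not the actual HDFS code, and the method name is hypothetical):

```java
// Simplified sketch of the branch under discussion: a FINALIZED replica
// reported against a COMPLETE block is marked corrupt when the genstamps differ.
public class GenstampCheck {
    static boolean shouldMarkCorrupt(long reportedGenstamp, long storedGenstamp,
                                     boolean replicaFinalized, boolean blockComplete) {
        return replicaFinalized && blockComplete && reportedGenstamp != storedGenstamp;
    }

    public static void main(String[] args) {
        // Matching genstamps (as observed on all three DNs here): not corrupt.
        System.out.println(shouldMarkCorrupt(76017688L, 76017688L, true, true));
        // A mismatch would trigger the BlockToMarkCorrupt path.
        System.out.println(shouldMarkCorrupt(76017688L, 76017689L, true, true));
    }
}
```

Since the genstamps match on every DN in this case, this branch alone does not explain the NPE.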

          hadoopqa Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12726091/HDFS-8113.02.patch
          against trunk revision 76e7264.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/10296//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/10296//console

          This message is automatically generated.

          chengbing.liu Chengbing Liu added a comment -

          Vinayakumar B Yes, the genstamp on all the other nodes is 76017688.
          The stacktrace I gave in the description was wrong, I believe. The current stacktrace is missing thanks to the JVM's OmitStackTraceInFastThrow optimization.

          vinayrpet Vinayakumar B added a comment -

          Thanks Chengbing Liu.
          According to the code in branch-2.6, a genstamp mismatch was the only possibility. Now I am clueless.

          chengbing.liu Chengbing Liu added a comment -

          Yes, indeed. It is too hard to analyze the issue without the stacktrace.
          Maybe we can fix the copy constructor first and leave further investigation of the root cause for later?
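The kind of guard being proposed can be sketched with simplified, self-contained stand-ins for the HDFS classes (the bodies below are illustrative, not the actual HDFS-8113.02 patch): the copy constructor goes through an accessor that fails fast with a descriptive message when the BlockCollection pointer is null, instead of surfacing a bare NPE deep inside a block report.

```java
// Simplified stand-ins; illustrates the null guard only, not the real HDFS code.
class BlockCollection {
    int getBlockReplication() { return 3; }
}

class BlockInfoContiguous {
    BlockCollection bc;

    BlockInfoContiguous(BlockCollection bc) { this.bc = bc; }

    // Copy constructor: dereference through the checked accessor, not from.bc directly.
    BlockInfoContiguous(BlockInfoContiguous from) {
        this(from.getBlockCollection());
    }

    BlockCollection getBlockCollection() {
        if (bc == null) {
            // Fail fast with context rather than letting an NPE surface later.
            throw new IllegalStateException("BlockCollection is null for " + this);
        }
        return bc;
    }
}

public class NullGuardDemo {
    public static void main(String[] args) {
        BlockInfoContiguous ok = new BlockInfoContiguous(new BlockCollection());
        BlockInfoContiguous copy = new BlockInfoContiguous(ok);
        System.out.println("copy replication = "
            + copy.getBlockCollection().getBlockReplication());

        BlockInfoContiguous orphan = new BlockInfoContiguous((BlockCollection) null);
        try {
            new BlockInfoContiguous(orphan);  // guard fires here
        } catch (IllegalStateException e) {
            System.out.println("guard fired: " + e.getMessage());
        }
    }
}
```

The point of the guard is diagnostic: the failure happens at construction time with an identifiable block, rather than as an anonymous NPE in the block-report path.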

          cmccabe Colin P. McCabe added a comment -

          +1 for HDFS-8113.02.patch. I think it's a good robustness improvement to the code.

          It would be nice to continue the investigation into why you hit this issue in another JIRA, as Chengbing Liu suggested.

          chengbing.liu Chengbing Liu added a comment -

          Created HDFS-8330 for further tracking.

          Colin P. McCabe Would you mind committing this?

          walter.k.su Walter Su added a comment -

          The patch is good.

          Hi, Chengbing Liu. Have you tried restarting the NN? The fsimage saves files and the block IDs belonging to them.
          When the fsimage is loaded, before the first block report and while the NN is still in safe mode, each block should belong to some file, because every stored BlockInfo is created from an INodeFile proto. It should be impossible for the NN to have orphan blocks at that point. Should we try to find a null bc here?
          Once the first block reports have finished, I think we should try looking for a null bc again.
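The scan being suggested (look for blocks whose BlockCollection back-pointer is null, once after fsimage load and again after the first round of block reports) could be sketched as follows. The types below are simplified stand-ins, not the real BlocksMap/BlockInfo APIs:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-ins; the real BlocksMap and BlockInfo APIs differ.
class Block {
    final long id;
    final Object bc;  // stand-in for the BlockCollection back-pointer
    Block(long id, Object bc) { this.id = id; this.bc = bc; }
}

public class OrphanScan {
    // Return the ids of blocks whose BlockCollection pointer is null.
    static List<Long> findOrphans(Iterable<Block> blocksMap) {
        List<Long> orphans = new ArrayList<>();
        for (Block b : blocksMap) {
            if (b.bc == null) {
                orphans.add(b.id);
            }
        }
        return orphans;
    }

    public static void main(String[] args) {
        List<Block> blocks = new ArrayList<>();
        blocks.add(new Block(1143745403L, new Object()));
        blocks.add(new Block(1149751102L, null));  // an orphan block
        System.out.println("orphans: " + findOrphans(blocks));
    }
}
```

Running such a scan at both points would distinguish a corrupted on-disk fsimage from in-memory state that drifted after loading.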

          chengbing.liu Chengbing Liu added a comment -

          Hi Walter Su, I haven't tried restarting or failing over the NN yet.

          I have analyzed the fsimage with the oiv tool, and there are no orphan blocks, so the fsimage looks fine. The only possibility I can think of is that the active NN has a problem with its in-memory data structures. I will do a NN failover shortly and see if the problem vanishes.

          cmccabe Colin P. McCabe added a comment -

          Colin Patrick McCabe Would you mind committing this?

          Sure. Will commit now. It is a good robustness improvement.

          If we find more information about why the BlockInfoContiguous was added to the BlocksMap without a BlockCollection, we can file a separate JIRA for that. Thanks, guys.

          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-trunk-Commit #7779 (See https://builds.apache.org/job/Hadoop-trunk-Commit/7779/)
          HDFS-8113. Add check for null BlockCollection pointers in BlockInfoContiguous structures (Chengbing Liu via Colin P. McCabe) (cmccabe: rev f523e963e4d88e4e459352387c6efeab59e7a809)

          • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockInfoContiguous.java
          • hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/TestBlockInfo.java
          • hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          chengbing.liu Chengbing Liu added a comment -

          Thanks Colin.

          hudson Hudson added a comment -

          SUCCESS: Integrated in Hadoop-Yarn-trunk-Java8 #191 (See https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/191/)
          HDFS-8113. Add check for null BlockCollection pointers in BlockInfoContiguous structures (Chengbing Liu via Colin P. McCabe) (cmccabe: rev f523e963e4d88e4e459352387c6efeab59e7a809)

          • hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/TestBlockInfo.java
          • hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockInfoContiguous.java
          hudson Hudson added a comment -

          SUCCESS: Integrated in Hadoop-Yarn-trunk #922 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/922/)
          HDFS-8113. Add check for null BlockCollection pointers in BlockInfoContiguous structures (Chengbing Liu via Colin P. McCabe) (cmccabe: rev f523e963e4d88e4e459352387c6efeab59e7a809)

          • hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/TestBlockInfo.java
          • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockInfoContiguous.java
          • hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Hdfs-trunk #2120 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/2120/)
          HDFS-8113. Add check for null BlockCollection pointers in BlockInfoContiguous structures (Chengbing Liu via Colin P. McCabe) (cmccabe: rev f523e963e4d88e4e459352387c6efeab59e7a809)

          • hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockInfoContiguous.java
          • hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/TestBlockInfo.java
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #180 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/180/)
          HDFS-8113. Add check for null BlockCollection pointers in BlockInfoContiguous structures (Chengbing Liu via Colin P. McCabe) (cmccabe: rev f523e963e4d88e4e459352387c6efeab59e7a809)

          • hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/TestBlockInfo.java
          • hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockInfoContiguous.java
          hudson Hudson added a comment -

          SUCCESS: Integrated in Hadoop-Mapreduce-trunk-Java8 #190 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/190/)
          HDFS-8113. Add check for null BlockCollection pointers in BlockInfoContiguous structures (Chengbing Liu via Colin P. McCabe) (cmccabe: rev f523e963e4d88e4e459352387c6efeab59e7a809)

          • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockInfoContiguous.java
          • hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/TestBlockInfo.java
          • hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          hudson Hudson added a comment -

          SUCCESS: Integrated in Hadoop-Mapreduce-trunk #2138 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2138/)
          HDFS-8113. Add check for null BlockCollection pointers in BlockInfoContiguous structures (Chengbing Liu via Colin P. McCabe) (cmccabe: rev f523e963e4d88e4e459352387c6efeab59e7a809)

          • hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/TestBlockInfo.java
          • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockInfoContiguous.java
          • hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          chengbing.liu Chengbing Liu added a comment -

          Just an update: I have done a NN failover and the NPE has not appeared again. So I think it was an issue with the active NN's in-memory data structures. The fsimage is OK.


            People

            • Assignee: chengbing.liu Chengbing Liu
            • Reporter: chengbing.liu Chengbing Liu
            • Votes: 0
            • Watchers: 16