Details
-
Bug
-
Status: Open
-
Major
-
Resolution: Unresolved
-
3.1.0, 2.10.0, 2.9.1, 3.2.0, 3.0.3
-
None
-
None
Description
While investigating a data corruption bug caused by concurrent recoverLease() and close(), I found HDFS-12886 may cause close() to throw AssertionError under a corner case, because the block has zero live replica, and client calls recoverLease() immediately followed by close().
org.apache.hadoop.ipc.RemoteException(java.lang.AssertionError): Negative replicas! at org.apache.hadoop.hdfs.server.blockmanagement.LowRedundancyBlocks.getPriority(LowRedundancyBlocks.java:197) at org.apache.hadoop.hdfs.server.blockmanagement.LowRedundancyBlocks.update(LowRedundancyBlocks.java:422) at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.updateNeededReconstructions(BlockManager.java:4274) at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.commitOrCompleteLastBlock(BlockManager.java:1001) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.commitOrCompleteLastBlock(FSNamesystem.java:3471) at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.completeFileInternal(FSDirWriteFileOp.java:713) at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.completeFile(FSDirWriteFileOp.java:671) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:2854) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:928) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:607) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1689) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
I have a test case to reproduce it.
lukmajercak elgoiri would you please take a look at it? I think we should add a check to reject completeFile() if the block is under recovery, similar to what's proposed in HDFS-10240.
Attachments
Attachments
Issue Links
- is caused by
-
HDFS-12886 Ignore minReplication for block recovery
- Resolved