[HDFS-13758] DatanodeManager should throw exception if it has BlockRecoveryCommand but the block is not under construction - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 3.0.0-alpha1
Fix Version/s: 2.10.0, 3.2.0, 2.9.2, 3.0.4, 3.1.2
Component/s: namenode
Labels: None

Description

In Hadoop 3, ~~HDFS-8909~~ added an assertion that assumes a block is under construction whenever a BlockRecoveryCommand exists for it.

DatanodeManager#getBlockRecoveryCommand()


  BlockRecoveryCommand brCommand = new BlockRecoveryCommand(blocks.length);
  for (BlockInfo b : blocks) {
    BlockUnderConstructionFeature uc = b.getUnderConstructionFeature();
    assert uc != null;
...

This assertion accidentally fixed one of the possible ~~HDFS-10240~~ data corruption scenarios: a recoverLease() call immediately followed by a close(), before the DataNodes have had a chance to heartbeat.

In a unit test you'll get:

2018-07-19 09:43:41,331 [IPC Server handler 9 on 57890] WARN  ipc.Server (Server.java:logException(2724)) - IPC Server handler 9 on 57890, call Call#41 Retry#0 org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 127.0.0.1:57903
java.lang.AssertionError
	at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.getBlockRecoveryCommand(DatanodeManager.java:1551)
	at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleHeartbeat(DatanodeManager.java:1661)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleHeartbeat(FSNamesystem.java:3865)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendHeartbeat(NameNodeRpcServer.java:1504)
	at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolServerSideTranslatorPB.sendHeartbeat(DatanodeProtocolServerSideTranslatorPB.java:119)
	at org.apache.hadoop.hdfs.protocol.proto.DatanodeProtocolProtos$DatanodeProtocolService$2.callBlockingMethod(DatanodeProtocolProtos.java:31660)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1689)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)

I propose changing this assertion even though it addresses the data corruption, because:

- We should throw a more meaningful exception than an NPE.
- On a production cluster the assert is ignored (Java assertions are disabled by default), and you get a more noticeable NPE instead. Future HDFS developers might fix this NPE, causing a regression. An NPE is typically not caught and handled, so it may leave internal state inconsistent.
- It doesn't address all possible scenarios of ~~HDFS-10240~~. A proper fix should reject close() if the block is being recovered.
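The difference between the assert and an explicit check can be sketched as follows. This is a minimal illustration, not the attached patch: `Block` and `UcFeature` are hypothetical stand-ins for the real HDFS types, and the choice of `IOException` and its message text are assumptions made for the example.

```java
import java.io.IOException;

public class RecoveryCheckSketch {

  /** Hypothetical stand-in for the under-construction feature of a block. */
  static final class UcFeature {
    final long recoveryId;

    UcFeature(long recoveryId) {
      this.recoveryId = recoveryId;
    }
  }

  /** Hypothetical stand-in for a block; uc is null once the block is complete. */
  static final class Block {
    final String name;
    final UcFeature uc;

    Block(String name, UcFeature uc) {
      this.name = name;
      this.uc = uc;
    }
  }

  /**
   * Assert-based variant. With assertions enabled this fails with an
   * AssertionError; with assertions disabled (the JVM default) the assert
   * is skipped and the dereference below throws a NullPointerException.
   */
  static long recoveryIdWithAssert(Block b) {
    UcFeature uc = b.uc;
    assert uc != null;
    return uc.recoveryId;
  }

  /**
   * Explicit-check variant. Behaves the same whether or not assertions are
   * enabled, and the exception names the offending block.
   */
  static long recoveryIdChecked(Block b) throws IOException {
    UcFeature uc = b.uc;
    if (uc == null) {
      throw new IOException("Recovery requested for block " + b.name
          + ", but the block is not under construction");
    }
    return uc.recoveryId;
  }

  public static void main(String[] args) {
    Block closed = new Block("blk_1", null);
    try {
      recoveryIdChecked(closed);
    } catch (IOException e) {
      System.out.println("Rejected: " + e.getMessage());
    }
  }
}
```

A checked exception forces callers to decide how to handle the condition, which is what makes it harder for a later change to silently remove the guard.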

Attachments

- HDFS-10240 scenarios.jpg (20/Jul/18 21:42, 48 kB, Wei-Chiu Chuang)
- HDFS-13758.001.patch (03/Aug/18 07:49, 1 kB, chencan)
- HDFS-13758.branch-2.patch (09/Aug/18 08:14, 1 kB, chencan)

People

Assignee: chencan

Reporter: Wei-Chiu Chuang

Votes: 0

Watchers: 5

Dates

Created: 20/Jul/18 21:41

Updated: 02/Oct/19 17:15

Resolved: 14/Aug/18 19:00