[HDFS-14429] Block remain in COMMITTED but not COMPLETE caused by Decommission - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.9.2
Fix Version/s: 2.10.0, 3.3.0, 3.2.1, 2.9.3, 3.1.3
Component/s: None
Labels: None
Target Version/s: 2.10.0, 3.3.0, 2.9.3

Description

In the following scenario, a block remains in the COMMITTED state, never reaches COMPLETE, and the file cannot be closed properly:

The client writes block bk1 to three DataNodes (dn1/dn2/dn3).
bk1 is fully written to all three DataNodes; each DataNode finalizes the block successfully (FinalizeBlock), queues it in its incremental block report (IBR), and waits to report to the NameNode.
The client commits bk1 after receiving the ACK.
Before the DataNodes have sent the IBR, all three nodes dn1/dn2/dn3 enter Decommissioning.
The DataNodes send the IBR, but the NameNode cannot complete the block.

The NameNode then logs the following and rejects the client's addBlock call with NotReplicatedYetException:

Exception

2019-04-02 13:40:31,882 INFO namenode.FSNamesystem (FSNamesystem.java:checkBlocksComplete(2790)) - BLOCK* blk_4313483521_3245321090 is COMMITTED but not COMPLETE(numNodes= 3 >= minimum = 1) in file xxx
2019-04-02 13:40:31,882 INFO ipc.Server (Server.java:logException(2650)) - IPC Server handler 499 on 8020, call Call#122552 Retry#0 org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from xxx:47615
org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not replicated yet: xxx
    at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.validateAddBlock(FSDirWriteFileOp.java:171)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2579)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:846)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:510)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:503)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:871)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:817)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2606)

This in turn causes the scenario described in HDFS-12747 (the lease monitor may loop infinitely on the same lease).

The root cause is that addStoredBlock does not account for replicas that sit on decommissioning DataNodes.
The fix should follow the approach taken in ~~HDFS-11499~~.
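
The scenario and root cause above can be illustrated with a small toy model. This is a sketch under stated assumptions, not Hadoop code: the class, enum, and method names are all hypothetical, and the idea that the completion check counts only replicas on in-service nodes is inferred from the description rather than taken from the NameNode source or the attached patches. The second method shows one plausible reading of the fix, where finalized replicas on decommissioning nodes also count toward the minimum.

```java
import java.util.Arrays;
import java.util.List;

/**
 * Toy model of the COMMITTED -> COMPLETE decision described in this issue.
 * Not Hadoop code; every name in this class is hypothetical.
 */
public class CommittedBlockSketch {
    /** Admin state of the DataNode that reported a finalized replica. */
    enum NodeState { IN_SERVICE, DECOMMISSIONING }

    enum BlockState { COMMITTED, COMPLETE }

    /** Mirrors "minimum = 1" in the log line quoted in the report. */
    static final int MIN_REPLICATION = 1;

    /** Counts only replicas whose DataNode is in service. */
    static int liveReplicas(List<NodeState> reported) {
        int n = 0;
        for (NodeState s : reported) {
            if (s == NodeState.IN_SERVICE) {
                n++;
            }
        }
        return n;
    }

    /** Counts replicas on in-service and on decommissioning DataNodes. */
    static int usableReplicas(List<NodeState> reported) {
        int n = 0;
        for (NodeState s : reported) {
            if (s == NodeState.IN_SERVICE || s == NodeState.DECOMMISSIONING) {
                n++;
            }
        }
        return n;
    }

    /** Assumed buggy behaviour: decommissioning replicas are ignored. */
    static BlockState completeIfLiveOnly(List<NodeState> reported) {
        return liveReplicas(reported) >= MIN_REPLICATION
                ? BlockState.COMPLETE : BlockState.COMMITTED;
    }

    /** Assumed fixed behaviour: decommissioning replicas also count. */
    static BlockState completeIfUsable(List<NodeState> reported) {
        return usableReplicas(reported) >= MIN_REPLICATION
                ? BlockState.COMPLETE : BlockState.COMMITTED;
    }

    public static void main(String[] args) {
        // dn1/dn2/dn3 all entered Decommissioning before their IBR arrived.
        List<NodeState> reported = Arrays.asList(
                NodeState.DECOMMISSIONING,
                NodeState.DECOMMISSIONING,
                NodeState.DECOMMISSIONING);
        System.out.println("live-only check: " + completeIfLiveOnly(reported));
        System.out.println("usable check: " + completeIfUsable(reported));
    }
}
```

With all three replicas on decommissioning nodes, the live-only check leaves the block COMMITTED even though three replicas were reported, which is consistent with the "numNodes= 3 >= minimum = 1" log line; the usable check completes it.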

Attachments

- HDFS-14429.03.patch, 24/Jun/19 10:11, 7 kB, Yicong Cai
- HDFS-14429.branch-2.02.patch, 24/Jun/19 10:10, 7 kB, Yicong Cai
- HDFS-14429.branch-2.01.patch, 23/Jun/19 15:33, 6 kB, Yicong Cai
- HDFS-14429.02.patch, 23/Jun/19 13:25, 7 kB, Yicong Cai
- HDFS-14429.01.patch, 07/May/19 07:53, 1 kB, Yicong Cai

Issue Links

causes: HDFS-12747 Lease monitor may infinitely loop on the same lease (Open)

relates to: HDFS-11499 Decommissioning stuck because of failing recovery (Resolved)

People

Assignee: Yicong Cai

Reporter: Yicong Cai

Votes: 1

Watchers: 9

Dates

Created: 15/Apr/19 11:45

Updated: 24/Mar/20 07:31

Resolved: 29/Jul/19 21:35