Details
- Type: Bug
- Status: Open
- Priority: Critical
- Resolution: Unresolved
- Affects Version/s: 3.4.0, 3.3.5
- Fix Version/s: None
Description
Recently, we changed the replication factor of a directory holding 1.6 PB of data from 2 to 3; the directory contains 76 million blocks. After we executed the setrep command (a short API sketch of this step is included after the log excerpts below), the active NameNode printed lots of logs like the following:
PendingReconstructionMonitor timed out blk_xxxx_260285131
and many DataNodes printed lots of duplicated logs like the following:
2023-04-21 13:58:17,627 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(1.1.1.1:50010, datanodeUuid=f3081eac-983f-4c3f-99c8-e4830640ee90, infoPort=50075, infoSecurePort=0, ipcPort=8010, storageInfo=lv=-57;cid=yj-hdfs2;nsid=1882889931;c=1667291826362) Starting thread to transfer BP-578784987-x.x.x.x-1667291826362:blk_1333463885_260285131 to 2.2.2.2:50010
2023-04-21 14:21:21,296 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DataTransfer, at 1.1.1.1:50010: Transmitted BP-578784987-x.x.x.x-1667291826362:blk_1333463885_260285131 (numBytes=524384907) to /2.2.2.2:50010
2023-04-21 14:34:19,134 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(1.1.1.1:50010, datanodeUuid=f3081eac-983f-4c3f-99c8-e4830640ee90, infoPort=50075, infoSecurePort=0, ipcPort=8010, storageInfo=lv=-57;cid=yj-hdfs2;nsid=1882889931;c=1667291826362) Starting thread to transfer BP-578784987-x.x.x.x-1667291826362:blk_1333463885_260285131 to 2.2.2.2:50010
2023-04-21 14:37:58,207 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(1.1.1.1:50010, datanodeUuid=f3081eac-983f-4c3f-99c8-e4830640ee90, infoPort=50075, infoSecurePort=0, ipcPort=8010, storageInfo=lv=-57;cid=yj-hdfs2;nsid=1882889931;c=1667291826362) Starting thread to transfer BP-578784987-x.x.x.x-1667291826362:blk_1333463885_260285131 to 2.2.2.2:50010
2023-04-21 14:40:46,817 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(1.1.1.1:50010, datanodeUuid=f3081eac-983f-4c3f-99c8-e4830640ee90, infoPort=50075, infoSecurePort=0, ipcPort=8010, storageInfo=lv=-57;cid=yj-hdfs2;nsid=1882889931;c=1667291826362) Starting thread to transfer BP-578784987-x.x.x.x-1667291826362:blk_1333463885_260285131 to 2.2.2.2:50010
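For reference, the replication change described above can also be issued through the Hadoop FileSystem API. This is only a minimal sketch of that step, not part of the original report; the directory path and class name are made up for illustration:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class SetRepSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    try (FileSystem fs = FileSystem.get(conf)) {
      // Equivalent of `hdfs dfs -setrep 3 /data/large-dir` applied recursively:
      // raise every file under the (hypothetical) directory from 2 to 3 replicas.
      RemoteIterator<LocatedFileStatus> files =
          fs.listFiles(new Path("/data/large-dir"), true /* recursive */);
      while (files.hasNext()) {
        LocatedFileStatus status = files.next();
        if (status.isFile()) {
          fs.setReplication(status.getPath(), (short) 3);
        }
      }
    }
  }
}
{code}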
Strangely, the same block blk_1333463885_260285131 was transferred multiple times even though it had already been transmitted successfully. The excess transfer requests trigger lots of ReplicaAlreadyExistsExceptions on the target DataNode, because the replica has already been transmitted and is in the FINALIZED state.
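To illustrate why the repeated transfers fail, here is a deliberately simplified toy model of the target-side check; it is not the actual DataNode/FsDataset code, and all names in it are invented for illustration:

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Toy model, not the real HDFS DataNode code. */
class TargetDataNodeModel {
  enum ReplicaState { TEMPORARY, FINALIZED }

  static class ReplicaAlreadyExistsException extends RuntimeException {
    ReplicaAlreadyExistsException(String msg) { super(msg); }
  }

  private final Map<Long, ReplicaState> replicas = new ConcurrentHashMap<>();

  /** Called for every incoming transfer of blockId. */
  void receiveBlock(long blockId) {
    if (replicas.get(blockId) == ReplicaState.FINALIZED) {
      // The first transfer already finalized the replica, so every repeated
      // transfer triggered by a timed-out pending item is rejected here.
      throw new ReplicaAlreadyExistsException("blk_" + blockId + " is already FINALIZED");
    }
    replicas.put(blockId, ReplicaState.TEMPORARY);
    // ... receive the block data, then finalize ...
    replicas.put(blockId, ReplicaState.FINALIZED);
  }
}
{code}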
The root cause lies in RedundancyMonitor#processPendingReconstructions and BlockManager#validateReconstructionWork.
1. RedundancyMonitor#computeDatanodeWork() generates transfer tasks from neededReconstruction via addTaskToDatanode and puts the tasks into pendingReconstruction.
2. We set dfs.namenode.replication.work.multiplier.per.iteration = 200 and this cluster has 400 DataNodes, so RedundancyMonitor may generate up to 80K block transfer tasks per iteration. After dfs.namenode.reconstruction.pending.timeout-sec (5 minutes), the requests in pendingReconstruction time out, and the PendingReconstructionMonitor thread moves the timed-out requests into timedOutItems.
3. RedundancyMonitor#processPendingReconstructions() puts the requests from timedOutItems back into neededReconstruction.
4. The RedundancyMonitor thread then sleeps: TimeUnit.MILLISECONDS.sleep(redundancyRecheckIntervalMs);
5. In the next iteration of the while loop, RedundancyMonitor#computeDatanodeWork() generates transfer tasks from neededReconstruction again, producing repeated tasks, possibly with different target nodes (due to the chooseTarget method). A simplified sketch of this loop follows the list.
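To make the feedback loop concrete, below is a deliberately simplified, self-contained model of steps 1-5. The class, field, and method names are illustrative only and do not match the real RedundancyMonitor/BlockManager code; the constants reflect the configuration described above (work multiplier 200, 400 DataNodes, 5-minute pending timeout):

{code:java}
import java.util.ArrayDeque;
import java.util.Queue;

/** Simplified model of the scheduling loop; names are illustrative only. */
class RedundancyLoopModel {
  // dfs.namenode.replication.work.multiplier.per.iteration in the reported cluster
  static final int WORK_MULTIPLIER = 200;
  static final int DATANODE_COUNT = 400;
  // dfs.namenode.reconstruction.pending.timeout-sec (5 minutes)
  static final long PENDING_TIMEOUT_MS = 5 * 60 * 1000L;

  static final class PendingTask {
    final long blockId;
    final long scheduledAtMs;
    PendingTask(long blockId, long scheduledAtMs) {
      this.blockId = blockId;
      this.scheduledAtMs = scheduledAtMs;
    }
  }

  final Queue<Long> neededReconstruction = new ArrayDeque<>();
  final Queue<PendingTask> pendingReconstruction = new ArrayDeque<>();
  final Queue<Long> timedOutItems = new ArrayDeque<>();

  void oneIteration(long nowMs) {
    // Step 3: every timed-out item goes back into neededReconstruction,
    // even if the corresponding transfer has already succeeded on a DataNode.
    while (!timedOutItems.isEmpty()) {
      neededReconstruction.add(timedOutItems.poll());
    }

    // Step 1: schedule up to 200 * 400 = 80,000 transfer tasks in one pass.
    int budget = WORK_MULTIPLIER * DATANODE_COUNT;
    while (budget-- > 0 && !neededReconstruction.isEmpty()) {
      pendingReconstruction.add(new PendingTask(neededReconstruction.poll(), nowMs));
    }

    // Step 2: anything still pending after 5 minutes is declared timed out
    // (in HDFS this check runs in the separate PendingReconstructionMonitor
    // thread), regardless of whether the block was actually transferred.
    pendingReconstruction.removeIf(task -> {
      if (nowMs - task.scheduledAtMs > PENDING_TIMEOUT_MS) {
        timedOutItems.add(task.blockId);
        return true;
      }
      return false;
    });
    // Step 4: the real monitor then sleeps redundancyRecheckIntervalMs, and
    // step 5 repeats this method, re-scheduling the same blocks (possibly to
    // different targets chosen by chooseTarget), which produces the duplicate
    // transfers seen in the DataNode logs.
  }
}
{code}

Because the per-iteration scheduling budget (80K tasks) can exceed what the cluster drains within the 5-minute pending timeout, already-completed transfers get re-queued and re-scheduled, which matches the repeated "Starting thread to transfer" lines in the DataNode logs above.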