Details
- Type: Improvement
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Affects Version/s: 3.4.0
- Fix Version/s: None
Description
Consider the following scenario:
(1) The HDFS cluster is sufficiently large.
Namenode config (see the configuration sketch after this list):
DFS_NAMENODE_REPLICATION_WORK_MULTIPLIER_PER_ITERATION -> 200
DFS_NAMENODE_REPLICATION_MAX_STREAMS_KEY -> 100
DFS_NAMENODE_REPLICATION_STREAMS_HARD_LIMIT_KEY -> 200
Datanode:
Some datanodes have very large disks, so a single datanode may host millions or even tens of millions of blocks, possibly more in an archive cluster.
(2) The cluster administrator performs one of the following actions:
2.1 Increases the replication factor on a large directory, e.g. one holding 2 PB of data
2.2 Decommissions some nodes from the cluster
(3) The cluster's datanodes are under heavy read and write load.
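For reference, here is a minimal sketch of applying the namenode settings from (1) through a Hadoop Configuration; the dotted property names are what I believe those DFSConfigKeys constants resolve to, so please verify them against your release.
{code:java}
import org.apache.hadoop.conf.Configuration;

public class ReplicationTuningSketch {
  /** Builds a Configuration carrying the three replication settings from scenario (1). */
  public static Configuration buildConf() {
    Configuration conf = new Configuration();
    // DFS_NAMENODE_REPLICATION_WORK_MULTIPLIER_PER_ITERATION -> 200
    conf.setInt("dfs.namenode.replication.work.multiplier.per.iteration", 200);
    // DFS_NAMENODE_REPLICATION_MAX_STREAMS_KEY -> 100
    conf.setInt("dfs.namenode.replication.max-streams", 100);
    // DFS_NAMENODE_REPLICATION_STREAMS_HARD_LIMIT_KEY -> 200
    conf.setInt("dfs.namenode.replication.max-streams-hard-limit", 200);
    return conf;
  }
}
{code}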
In this situation, datanodes receive transfer commands in the heartbeat responses from the namenode. When a datanode is heavily loaded (reads and writes), the SumOfActorCommandQueueLength metric can grow to 1k+ because commands are consumed slowly due to lock contention. See BPServiceActor.CommandProcessingThread for the details.
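To make the queue buildup concrete, here is a simplified, self-contained model (not the actual BPServiceActor code) of a single consumer thread draining a command queue that heartbeats keep filling; queueLength() plays the role of SumOfActorCommandQueueLength and grows whenever processing one command takes longer than the arrival interval.
{code:java}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class CommandQueueModel {
  private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();

  // Called for every heartbeat response that carries transfer commands.
  void onHeartbeat(int commands) {
    for (int i = 0; i < commands; i++) {
      queue.offer("DNA_TRANSFER #" + i);
    }
  }

  // Analogue of the SumOfActorCommandQueueLength metric.
  int queueLength() {
    return queue.size();
  }

  // Single consumer, like CommandProcessingThread: if handling one command is
  // slow (lock contention, busy disks), the queue length grows without bound.
  void startConsumer(long millisPerCommand) {
    Thread t = new Thread(() -> {
      try {
        while (!Thread.currentThread().isInterrupted()) {
          queue.take();                    // dequeue one command
          Thread.sleep(millisPerCommand);  // stand-in for the slow transfer work
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    t.setDaemon(true);
    t.start();
  }
}
{code}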
Meanwhile, the namenode distributes all replication work to datanodes through heartbeats. All low-redundancy blocks are placed in neededReconstruction, and each round hands out up to liveDatanodes * DFS_NAMENODE_REPLICATION_WORK_MULTIPLIER_PER_ITERATION reconstruction tasks to the related datanodes. But, as mentioned above, those datanodes are blocked executing the queued transfer commands. After 5 minutes (the default timeout), the blocks in pendingReconstruction time out, re-enter neededReconstruction, and are then placed into pendingReconstruction again. As a result, datanodes receive multiple transfer commands for the same block, which leads to excess replicas.
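To put rough numbers on it (the figures are assumptions for illustration, not measurements): with 1,000 live datanodes and a work multiplier of 200, one round can hand out up to 200,000 reconstruction tasks. If a loaded datanode cannot work through its share within the 5-minute pendingReconstruction timeout, those blocks are rescheduled, so a second (and later a third) transfer command for the same block is queued behind the first one that is still waiting, and every duplicate that eventually completes produces an excess replica.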
As this cycle repeats, the consequences become serious: datanodes accumulate ever more transfer commands in their queues, waste more disk space, and end up with more excess blocks.
So I think we should limit the maximum size of pendingReconstruction.
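Below is a minimal sketch of what such a limit could look like; the class and method names (PendingCapSketch, schedule, maxPendingReconstruction) are hypothetical and are not the actual BlockManager/PendingReconstructionBlocks API or a concrete patch. The idea is simply that blocks beyond the cap stay in neededReconstruction instead of piling up as duplicate transfer commands on datanodes.
{code:java}
import java.util.ArrayDeque;
import java.util.Queue;

public class PendingCapSketch {
  // Stand-in for pendingReconstruction: blocks whose transfer has been
  // scheduled but not yet confirmed by a datanode.
  private final Queue<Long> pending = new ArrayDeque<>();
  // Hypothetical upper bound, e.g. derived from liveDatanodes * max-streams.
  private final int maxPendingReconstruction;

  PendingCapSketch(int maxPendingReconstruction) {
    this.maxPendingReconstruction = maxPendingReconstruction;
  }

  // Move blocks from neededReconstruction into pending, but stop once the cap
  // is reached; the rest stays queued on the namenode side.
  int schedule(Queue<Long> neededReconstruction, int requested) {
    int scheduled = 0;
    while (scheduled < requested
        && !neededReconstruction.isEmpty()
        && pending.size() < maxPendingReconstruction) {
      pending.add(neededReconstruction.poll());
      scheduled++;
    }
    return scheduled;
  }

  // Called when a datanode reports the new replica (or the block times out).
  void remove(long blockId) {
    pending.remove(blockId);
  }
}
{code}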
Attachments
Issue Links