Details
- Type: Sub-task
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Fix Version: 1.4.0
Description
UnhealthyReplicationProcessor#processAll requeues any failed task, and the requeued tasks are re-attempted within the same processAll call. Until the underlying cause of the failure is resolved, this can flood the SCM logs.
Example steps to reproduce:
- Start cluster with 5 datanodes
- Create EC(3,2) key
- Stop two datanodes
- Wait until SCM starts repeatedly emitting errors for the same container
scm_1 | 2023-02-17 18:08:51,091 [Under Replicated Processor] WARN replication.ECUnderReplicationHandler: Exception while processing for creating the EC reconstruction container commands for #5.
scm_1 | org.apache.hadoop.hdds.scm.exceptions.SCMException: No enough datanodes to choose. TotalNode = 3 AvailableNode = 0 RequiredNode = 2 ExcludedNode = 3
scm_1 |         at org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackScatter.chooseDatanodesInternal(SCMContainerPlacementRackScatter.java:238)
scm_1 |         at org.apache.hadoop.hdds.scm.SCMCommonPlacementPolicy.chooseDatanodes(SCMCommonPlacementPolicy.java:185)
scm_1 |         at org.apache.hadoop.hdds.scm.SCMCommonPlacementPolicy.chooseDatanodes(SCMCommonPlacementPolicy.java:127)
scm_1 |         at org.apache.hadoop.hdds.scm.container.replication.ECUnderReplicationHandler.getTargetDatanodes(ECUnderReplicationHandler.java:266)
scm_1 |         at org.apache.hadoop.hdds.scm.container.replication.ECUnderReplicationHandler.processMissingIndexes(ECUnderReplicationHandler.java:295)
scm_1 |         at org.apache.hadoop.hdds.scm.container.replication.ECUnderReplicationHandler.processAndCreateCommands(ECUnderReplicationHandler.java:174)
scm_1 |         at org.apache.hadoop.hdds.scm.container.replication.ReplicationManager.processUnderReplicatedContainer(ReplicationManager.java:608)
scm_1 |         at org.apache.hadoop.hdds.scm.container.replication.UnderReplicatedProcessor.getDatanodeCommands(UnderReplicatedProcessor.java:58)
scm_1 |         at org.apache.hadoop.hdds.scm.container.replication.UnderReplicatedProcessor.getDatanodeCommands(UnderReplicatedProcessor.java:32)
scm_1 |         at org.apache.hadoop.hdds.scm.container.replication.UnhealthyReplicationProcessor.processContainer(UnhealthyReplicationProcessor.java:119)
scm_1 |         at org.apache.hadoop.hdds.scm.container.replication.UnhealthyReplicationProcessor.processAll(UnhealthyReplicationProcessor.java:93)
scm_1 |         at org.apache.hadoop.hdds.scm.container.replication.UnhealthyReplicationProcessor.run(UnhealthyReplicationProcessor.java:132)
scm_1 |         at java.base/java.lang.Thread.run(Thread.java:829)
scm_1 | 2023-02-17 18:08:51,091 [Under Replicated Processor] ERROR replication.UnhealthyReplicationProcessor: Error processing Health result of class: class org.apache.hadoop.hdds.scm.container.replication.ContainerHealthResult$UnderReplicatedHealthResult for container ContainerInfo{id=#5, state=CLOSED, pipelineID=PipelineID=0ccdaf17-dc73-4974-a660-c2bb51a3402e, stateEnterTime=2023-02-17T17:59:05.707Z, owner=om1}
scm_1 | org.apache.hadoop.hdds.scm.exceptions.SCMException: No enough datanodes to choose. TotalNode = 3 AvailableNode = 0 RequiredNode = 2 ExcludedNode = 3
scm_1 |         at org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackScatter.chooseDatanodesInternal(SCMContainerPlacementRackScatter.java:238)
scm_1 |         at org.apache.hadoop.hdds.scm.SCMCommonPlacementPolicy.chooseDatanodes(SCMCommonPlacementPolicy.java:185)
scm_1 |         at org.apache.hadoop.hdds.scm.SCMCommonPlacementPolicy.chooseDatanodes(SCMCommonPlacementPolicy.java:127)
scm_1 |         at org.apache.hadoop.hdds.scm.container.replication.ECUnderReplicationHandler.getTargetDatanodes(ECUnderReplicationHandler.java:266)
scm_1 |         at org.apache.hadoop.hdds.scm.container.replication.ECUnderReplicationHandler.processMissingIndexes(ECUnderReplicationHandler.java:295)
scm_1 |         at org.apache.hadoop.hdds.scm.container.replication.ECUnderReplicationHandler.processAndCreateCommands(ECUnderReplicationHandler.java:174)
scm_1 |         at org.apache.hadoop.hdds.scm.container.replication.ReplicationManager.processUnderReplicatedContainer(ReplicationManager.java:608)
scm_1 |         at org.apache.hadoop.hdds.scm.container.replication.UnderReplicatedProcessor.getDatanodeCommands(UnderReplicatedProcessor.java:58)
scm_1 |         at org.apache.hadoop.hdds.scm.container.replication.UnderReplicatedProcessor.getDatanodeCommands(UnderReplicatedProcessor.java:32)
scm_1 |         at org.apache.hadoop.hdds.scm.container.replication.UnhealthyReplicationProcessor.processContainer(UnhealthyReplicationProcessor.java:119)
scm_1 |         at org.apache.hadoop.hdds.scm.container.replication.UnhealthyReplicationProcessor.processAll(UnhealthyReplicationProcessor.java:93)
scm_1 |         at org.apache.hadoop.hdds.scm.container.replication.UnhealthyReplicationProcessor.run(UnhealthyReplicationProcessor.java:132)
scm_1 |         at java.base/java.lang.Thread.run(Thread.java:829)
... No space left on device
The same messages are repeated without any delay. I think failed tasks should be collected during the loop and requeued only after the loop completes, so each task is attempted at most once per processAll call and retried on the next iteration.
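A minimal sketch of that idea (hypothetical names, not the actual Ozone classes): instead of `queue.add(task)` inside the drain loop, failures go into a local list that is requeued after the loop, so a persistently failing container produces one log entry per processAll call rather than an unbounded stream.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical sketch of the proposed fix: collect failed tasks during the
// drain loop and requeue them only after processAll has emptied the queue,
// so each task is attempted at most once per call.
public class RequeueSketch {
  static int attempts = 0;

  static void processAll(Queue<String> queue) {
    List<String> failed = new ArrayList<>();
    while (!queue.isEmpty()) {
      String task = queue.poll();
      attempts++;
      try {
        process(task);
      } catch (Exception e) {
        // Collect instead of queue.add(task); adding here would make the
        // same task loop forever within this call, flooding the logs.
        failed.add(task);
      }
    }
    // Requeued tasks are retried on the next scheduled run, after the
    // processor's normal interval, not immediately.
    queue.addAll(failed);
  }

  // Stand-in for the real handler; simulates a persistent placement failure.
  static void process(String task) throws Exception {
    throw new Exception("No enough datanodes to choose");
  }

  public static void main(String[] args) {
    Queue<String> queue = new ArrayDeque<>();
    queue.add("container#5");
    processAll(queue);
    // The failing task was attempted exactly once and remains queued.
    System.out.println(attempts + " " + queue.size());
  }
}
```

With requeueing inside the loop, `attempts` would grow without bound in a single call; with the collect-then-requeue pattern it equals the queue size per call.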