HDDS-7989 (Apache Ozone; sub-task of HDDS-7759, Improve Ozone Replication Manager)

UnhealthyReplicationProcessor retries failure without delay

Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.4.0
    • Fix Version/s: 1.4.0
    • Component/s: SCM

    Description

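      UnhealthyReplicationProcessor#processAll requeues any task that fails, and the requeued task is picked up again within the same processAll call. A task that keeps failing is therefore retried immediately, with no delay, which can flood the SCM logs until the cause of the error is resolved.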

      Example steps:

      1. Start cluster with 5 datanodes
      2. Create EC(3,2) key
      3. Stop two datanodes
      4. Wait until SCM starts repeatedly emitting the same error for one container:
      scm_1       | 2023-02-17 18:08:51,091 [Under Replicated Processor] WARN replication.ECUnderReplicationHandler: Exception while processing for creating the EC reconstruction container commands for #5.
      scm_1       | org.apache.hadoop.hdds.scm.exceptions.SCMException: No enough datanodes to choose. TotalNode = 3 AvailableNode = 0 RequiredNode = 2 ExcludedNode = 3
      scm_1       | 	at org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackScatter.chooseDatanodesInternal(SCMContainerPlacementRackScatter.java:238)
      scm_1       | 	at org.apache.hadoop.hdds.scm.SCMCommonPlacementPolicy.chooseDatanodes(SCMCommonPlacementPolicy.java:185)
      scm_1       | 	at org.apache.hadoop.hdds.scm.SCMCommonPlacementPolicy.chooseDatanodes(SCMCommonPlacementPolicy.java:127)
      scm_1       | 	at org.apache.hadoop.hdds.scm.container.replication.ECUnderReplicationHandler.getTargetDatanodes(ECUnderReplicationHandler.java:266)
      scm_1       | 	at org.apache.hadoop.hdds.scm.container.replication.ECUnderReplicationHandler.processMissingIndexes(ECUnderReplicationHandler.java:295)
      scm_1       | 	at org.apache.hadoop.hdds.scm.container.replication.ECUnderReplicationHandler.processAndCreateCommands(ECUnderReplicationHandler.java:174)
      scm_1       | 	at org.apache.hadoop.hdds.scm.container.replication.ReplicationManager.processUnderReplicatedContainer(ReplicationManager.java:608)
      scm_1       | 	at org.apache.hadoop.hdds.scm.container.replication.UnderReplicatedProcessor.getDatanodeCommands(UnderReplicatedProcessor.java:58)
      scm_1       | 	at org.apache.hadoop.hdds.scm.container.replication.UnderReplicatedProcessor.getDatanodeCommands(UnderReplicatedProcessor.java:32)
      scm_1       | 	at org.apache.hadoop.hdds.scm.container.replication.UnhealthyReplicationProcessor.processContainer(UnhealthyReplicationProcessor.java:119)
      scm_1       | 	at org.apache.hadoop.hdds.scm.container.replication.UnhealthyReplicationProcessor.processAll(UnhealthyReplicationProcessor.java:93)
      scm_1       | 	at org.apache.hadoop.hdds.scm.container.replication.UnhealthyReplicationProcessor.run(UnhealthyReplicationProcessor.java:132)
      scm_1       | 	at java.base/java.lang.Thread.run(Thread.java:829)
      scm_1       | 2023-02-17 18:08:51,091 [Under Replicated Processor] ERROR replication.UnhealthyReplicationProcessor: Error processing Health result of class: class org.apache.hadoop.hdds.scm.container.replication.ContainerHealthResult$UnderReplicatedHealthResult for container ContainerInfo{id=#5, state=CLOSED, pipelineID=PipelineID=0ccdaf17-dc73-4974-a660-c2bb51a3402e, stateEnterTime=2023-02-17T17:59:05.707Z, owner=om1}
      scm_1       | org.apache.hadoop.hdds.scm.exceptions.SCMException: No enough datanodes to choose. TotalNode = 3 AvailableNode = 0 RequiredNode = 2 ExcludedNode = 3
      scm_1       | 	at org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackScatter.chooseDatanodesInternal(SCMContainerPlacementRackScatter.java:238)
      scm_1       | 	at org.apache.hadoop.hdds.scm.SCMCommonPlacementPolicy.chooseDatanodes(SCMCommonPlacementPolicy.java:185)
      scm_1       | 	at org.apache.hadoop.hdds.scm.SCMCommonPlacementPolicy.chooseDatanodes(SCMCommonPlacementPolicy.java:127)
      scm_1       | 	at org.apache.hadoop.hdds.scm.container.replication.ECUnderReplicationHandler.getTargetDatanodes(ECUnderReplicationHandler.java:266)
      scm_1       | 	at org.apache.hadoop.hdds.scm.container.replication.ECUnderReplicationHandler.processMissingIndexes(ECUnderReplicationHandler.java:295)
      scm_1       | 	at org.apache.hadoop.hdds.scm.container.replication.ECUnderReplicationHandler.processAndCreateCommands(ECUnderReplicationHandler.java:174)
      scm_1       | 	at org.apache.hadoop.hdds.scm.container.replication.ReplicationManager.processUnderReplicatedContainer(ReplicationManager.java:608)
      scm_1       | 	at org.apache.hadoop.hdds.scm.container.replication.UnderReplicatedProcessor.getDatanodeCommands(UnderReplicatedProcessor.java:58)
      scm_1       | 	at org.apache.hadoop.hdds.scm.container.replication.UnderReplicatedProcessor.getDatanodeCommands(UnderReplicatedProcessor.java:32)
      scm_1       | 	at org.apache.hadoop.hdds.scm.container.replication.UnhealthyReplicationProcessor.processContainer(UnhealthyReplicationProcessor.java:119)
      scm_1       | 	at org.apache.hadoop.hdds.scm.container.replication.UnhealthyReplicationProcessor.processAll(UnhealthyReplicationProcessor.java:93)
      scm_1       | 	at org.apache.hadoop.hdds.scm.container.replication.UnhealthyReplicationProcessor.run(UnhealthyReplicationProcessor.java:132)
      scm_1       | 	at java.base/java.lang.Thread.run(Thread.java:829)
      
      ...
      
      No space left on device
      

      The same messages are repeated without any delay.

      I think tasks should be collected and requeued outside of the processing loop.
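
      A minimal sketch of that idea, not the actual Ozone code: RequeueSketch, its processAll signature, and the tryProcess predicate are hypothetical stand-ins for the real processor and its queue. Failed tasks go into a local list and are put back only after the loop ends, so each task is attempted at most once per pass and the next retry waits for the caller's normal interval.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.function.Predicate;

/** Hypothetical stand-in for the processor; not the actual Ozone class. */
final class RequeueSketch {

  /**
   * Runs one pass over the queue. Failed tasks are collected in a local
   * list and requeued only after the loop, so a task that keeps failing
   * is attempted once per pass instead of being retried immediately.
   *
   * @return the number of attempts made in this pass
   */
  public static <T> int processAll(Queue<T> queue, Predicate<T> tryProcess) {
    List<T> failed = new ArrayList<>();
    int attempts = 0;
    T task;
    while ((task = queue.poll()) != null) {
      attempts++;
      if (!tryProcess.test(task)) {
        failed.add(task); // defer: do not requeue inside the loop
      }
    }
    queue.addAll(failed); // picked up again on the next pass
    return attempts;
  }

  public static void main(String[] args) {
    Queue<String> queue = new ArrayDeque<>(List.of("#4", "#5", "#6"));
    // Assume container #5 cannot be repaired until datanodes come back.
    int attempts = processAll(queue, id -> !id.equals("#5"));
    System.out.println("attempts=" + attempts + ", remaining=" + queue);
  }
}
```

      With requeueing inside the loop, the same pass would keep polling #5 and logging the failure for as long as it failed. Here the pass ends after three attempts and #5 stays queued for the next run.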

            People

              Assignee: Attila Doroszlai (adoroszlai)
              Reporter: Attila Doroszlai (adoroszlai)