  1. Apache Ozone
  2. HDDS-7759 Improve Ozone Replication Manager
  3. HDDS-8459

Ratis under-replication handling in a rack-aware environment doesn't work


Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.4.0
    • Component/s: SCM

    Description

      This is the rack-aware environment defined in dist/target/ozone-1.4.0-SNAPSHOT/compose/ozone-topology. I additionally added the following configurations to enable the new ReplicationManager and ContainerScanner. The ContainerBalancer configurations shouldn't be relevant here.

      OZONE-SITE.XML_hdds.scm.replication.enable.legacy=false
      OZONE-SITE.XML_hdds.container.balancer.balancing.iteration.interval=5m
      OZONE-SITE.XML_hdds.container.balancer.move.timeout=295s
      OZONE-SITE.XML_hdds.container.balancer.move.replication.timeout=200s
      OZONE-SITE.XML_hdds.scm.replication.thread.interval=100s
      OZONE-SITE.XML_hdds.container.scrub.enabled=true
      OZONE-SITE.XML_hdds.container.scrub.metadata.scan.interval=20s
      OZONE-SITE.XML_hdds.container.scrub.data.scan.interval=20s
      

      When I manually change the checksum of a container replica in a DN, the container scanner detects this and marks the replica UNHEALTHY. But RM is not able to handle this under-replicated container.
      EDIT: The stack trace looks slightly different on the latest Apache master and is more helpful:
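For reference, the corruption step can be simulated locally by flipping a byte in a block file, which invalidates its stored checksum. The sketch below is a minimal, self-contained demonstration (the temp file and CRC32 are stand-ins for illustration, not the datanode's actual on-disk layout or checksum algorithm):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.CRC32;

// Minimal sketch: flip one bit in a file and show the checksum no longer
// matches -- the kind of mismatch the ContainerScanner detects on a replica.
public class CorruptReplicaDemo {
    static long crc32(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        return crc.getValue();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a container block file inside a DN's data directory.
        Path block = Files.createTempFile("replica", ".block");
        Files.write(block, "some container data".getBytes());

        long before = crc32(Files.readAllBytes(block));

        // Flip one bit of the first byte to simulate on-disk corruption.
        byte[] data = Files.readAllBytes(block);
        data[0] ^= 0x01;
        Files.write(block, data);

        long after = crc32(Files.readAllBytes(block));
        // CRC32 is guaranteed to detect any single-bit error.
        System.out.println(before != after); // prints true
    }
}
```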

      scm_1         | 2023-04-19 12:00:09,485 [Under Replicated Processor] ERROR replication.UnhealthyReplicationProcessor: Error processing Health result of class: class org.apache.hadoop.hdds.scm.container.replication.ContainerHealthResult$UnderReplicatedHealthResult for container ContainerInfo{id=#2, state=CLOSED, pipelineID=PipelineID=c273b63f-0d6d-4701-b333-c8bcf3e85ba6, stateEnterTime=2023-04-19T11:55:13.697Z, owner=om1}
      scm_1         | org.apache.hadoop.hdds.scm.exceptions.SCMException: Placement Policy: class org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackAware did not return any nodes. Number of required Nodes 0, Datasize Required: 998244352
      scm_1         | 	at org.apache.hadoop.hdds.scm.container.replication.ReplicationManagerUtil.getTargetDatanodes(ReplicationManagerUtil.java:87)
      scm_1         | 	at org.apache.hadoop.hdds.scm.container.replication.RatisUnderReplicationHandler.getTargets(RatisUnderReplicationHandler.java:243)
      scm_1         | 	at org.apache.hadoop.hdds.scm.container.replication.RatisUnderReplicationHandler.processAndSendCommands(RatisUnderReplicationHandler.java:111)
      scm_1         | 	at org.apache.hadoop.hdds.scm.container.replication.ReplicationManager.processUnderReplicatedContainer(ReplicationManager.java:819)
      scm_1         | 	at org.apache.hadoop.hdds.scm.container.replication.UnderReplicatedProcessor.sendDatanodeCommands(UnderReplicatedProcessor.java:53)
      scm_1         | 	at org.apache.hadoop.hdds.scm.container.replication.UnderReplicatedProcessor.sendDatanodeCommands(UnderReplicatedProcessor.java:27)
      scm_1         | 	at org.apache.hadoop.hdds.scm.container.replication.UnhealthyReplicationProcessor.processContainer(UnhealthyReplicationProcessor.java:127)
      scm_1         | 	at org.apache.hadoop.hdds.scm.container.replication.UnhealthyReplicationProcessor.processAll(UnhealthyReplicationProcessor.java:93)
      scm_1         | 	at org.apache.hadoop.hdds.scm.container.replication.UnhealthyReplicationProcessor.run(UnhealthyReplicationProcessor.java:136)
      scm_1         | 	at java.base/java.lang.Thread.run(Thread.java:829)
      scm_1         | 2023-04-19 12:00:09,485 [Under Replicated Processor] INFO replication.UnhealthyReplicationProcessor: Processed 0 containers with health state counts {}, failed processing 1
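The "Number of required Nodes 0" in the exception suggests the placement policy was invoked even though zero additional targets were computed: if the UNHEALTHY replica still counts against the replication factor, the handler derives zero new targets yet still asks the rack-aware policy for nodes, which then fails with "did not return any nodes". A simplified, hypothetical sketch of that arithmetic (the class and method names below are illustrative, not Ozone's actual implementation):

```java
import java.util.List;

// Hypothetical sketch of the under-replication target arithmetic. If an
// UNHEALTHY replica is still counted as occupying a slot, the number of
// additional nodes needed comes out as 0, and a placement policy asked for
// 0 nodes returns nothing, which the caller then treats as a failure.
public class UnderReplicationSketch {
    // Buggy variant: every existing replica counts, healthy or not.
    static int additionalNodesNeeded(int replicationFactor,
                                     List<String> replicaStates) {
        return Math.max(0, replicationFactor - replicaStates.size());
    }

    // Corrected variant: only healthy (CLOSED) replicas count, so an
    // UNHEALTHY replica leaves a slot that must be re-replicated.
    static int additionalNodesNeededFixed(int replicationFactor,
                                          List<String> replicaStates) {
        long healthy = replicaStates.stream()
                .filter("CLOSED"::equals)
                .count();
        return Math.max(0, replicationFactor - (int) healthy);
    }

    public static void main(String[] args) {
        List<String> replicas = List.of("CLOSED", "CLOSED", "UNHEALTHY");
        System.out.println(additionalNodesNeeded(3, replicas));      // 0 -> policy asked for no nodes
        System.out.println(additionalNodesNeededFixed(3, replicas)); // 1 -> one new target requested
    }
}
```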
      

People

    Assignee: Siddhant Sangwan
    Reporter: Siddhant Sangwan
