Apache Ozone / HDDS-7759 Improve Ozone Replication Manager / HDDS-8494

Adjust replication queue limits for decommissioning nodes


Details

    Description

      When a node hosting a Ratis container is decommissioned, there are generally 3 sources available for the container replicas: one on the decommissioning host and two on other, effectively random nodes across the cluster. The decommissioning load is therefore shared across many more nodes, which speeds up decommission.

      For an EC container, the decommissioning host is likely the only source of the replica that needs to be copied, so decommission will be slower.

      A decommissioning host is generally not used for Ratis reads unless no other nodes are available, but it is still used for EC reads to avoid online reconstruction. As decommission progresses and new replicas are created, the read load on the node declines over time. Furthermore, decommissioning nodes are not used for writes, so they should be under less load than other cluster nodes.

      Due to this reduced load, it is possible to queue more commands on a decommissioning host and to increase the size of the executor thread pool that processes them.

      When a datanode switches to the decommissioning state, it will increase the size of the replication supervisor thread pool; if the node returns to the In Service state, it will revert to the lower thread pool limit.
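The resizing described above could be sketched as follows. This is only an illustration of the idea, not the actual Ozone implementation: the class name, the `NodeOperationalState` enum, and the thread counts are all hypothetical, and the real replication supervisor has its own configuration.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/**
 * Sketch of resizing a replication worker pool when the datanode's
 * operational state changes. All names and limits are illustrative.
 */
public class ReplicationPoolResizer {
  enum NodeOperationalState { IN_SERVICE, DECOMMISSIONING }

  // Hypothetical limits: a larger pool while decommissioning.
  private static final int IN_SERVICE_THREADS = 10;
  private static final int DECOMMISSION_THREADS = 20;

  private final ThreadPoolExecutor pool = new ThreadPoolExecutor(
      IN_SERVICE_THREADS, IN_SERVICE_THREADS,
      60, TimeUnit.SECONDS, new LinkedBlockingQueue<>());

  /** Adjust the pool size when the node's operational state changes. */
  public void onStateChange(NodeOperationalState newState) {
    int target = (newState == NodeOperationalState.DECOMMISSIONING)
        ? DECOMMISSION_THREADS : IN_SERVICE_THREADS;
    // Keep core <= max at every step: raise max first when growing,
    // lower core first when shrinking.
    if (target > pool.getMaximumPoolSize()) {
      pool.setMaximumPoolSize(target);
      pool.setCorePoolSize(target);
    } else {
      pool.setCorePoolSize(target);
      pool.setMaximumPoolSize(target);
    }
  }

  public int currentLimit() {
    return pool.getMaximumPoolSize();
  }
}
```

`ThreadPoolExecutor` allows both pool bounds to be changed at runtime, so no queued work is lost when the node moves between states; the order of the two setters just keeps the core size from exceeding the maximum at any point.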

      Similarly, when scheduling commands, SCM can allocate more commands to the decommissioning host, as the host should process them more quickly given its lower load and larger thread pool.
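On the SCM side, the adjustment amounts to raising the per-datanode command queue limit for decommissioning nodes. A minimal sketch, with a hypothetical base limit and multiplier rather than Ozone's real configuration values:

```java
/**
 * Sketch of how SCM might scale the replication command limit for a
 * decommissioning datanode. The base limit and factor are hypothetical.
 */
public class CommandQueueLimits {
  private static final int BASE_COMMAND_LIMIT = 20;
  private static final int DECOMMISSION_FACTOR = 2;

  /** How many replication commands SCM may queue on the node. */
  public static int limitFor(boolean decommissioning) {
    return decommissioning
        ? BASE_COMMAND_LIMIT * DECOMMISSION_FACTOR
        : BASE_COMMAND_LIMIT;
  }

  /** Commands SCM can still schedule, given the node's queue depth. */
  public static int available(boolean decommissioning, int queuedCommands) {
    return Math.max(0, limitFor(decommissioning) - queuedCommands);
  }
}
```

With these example values, an in-service node with 25 queued commands receives no further work, while a decommissioning node with the same backlog can still accept 15 more.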

People

              adoroszlai Attila Doroszlai
              sodonnell Stephen O'Donnell