  1. Apache Ozone
  2. HDDS-7759 Improve Ozone Replication Manager
  3. HDDS-8535

ReplicationManager: Unhealthy containers could block EC recovery in small clusters

Details

    • Sub-task
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • None
    • 1.4.0
    • SCM

    Description

      With EC containers in a small cluster of, say, 6 nodes using EC-3-2, each container requires 5 nodes, one per replica. If 2 replicas become unhealthy, both must be reconstructed, but the unhealthy replicas still occupy their nodes, so there is only 1 spare node.

      This means only one replica gets recovered, leaving 4 "good" replicas and 2 "unhealthy" ones. The container then remains stuck in this state, because unhealthy replicas are only removed once the container has no over- or under-replication.

      A similar problem was resolved previously: an EC container with both over- and under-replication could get stuck in the same way, because under-replication handling could not proceed without enough spare nodes. In that case, the solution was to detect the situation and call the over-replication handler to clear the excess replicas. A similar solution is required here: remove some unhealthy replicas so that nodes are freed and reconstruction can make progress.
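
The deadlock and the proposed way out can be illustrated with a small, self-contained model. This is only a sketch under simplifying assumptions and is not the Replication Manager code: the class `EcRecoverySketch`, the `Action` enum, and the methods `nextAction` and `simulate` are hypothetical names, each node is assumed to hold at most one replica of the container, and one replica is handled per step.

```java
// Hypothetical toy model of the scenario described above; not Ozone code.
final class EcRecoverySketch {

    enum Action { NONE, RECONSTRUCT, DELETE_UNHEALTHY }

    /**
     * Picks the next step for a single EC container.
     * removeUnhealthyWhenBlocked switches on the behaviour proposed in this issue.
     */
    static Action nextAction(int clusterNodes, int requiredReplicas,
                             int healthy, int unhealthy,
                             boolean removeUnhealthyWhenBlocked) {
        int missing = requiredReplicas - healthy;
        if (missing <= 0) {
            // No under-replication left: unhealthy replicas can be cleaned up.
            return unhealthy > 0 ? Action.DELETE_UNHEALTHY : Action.NONE;
        }
        // Unhealthy replicas still occupy their nodes, so they reduce the spares.
        int spareNodes = clusterNodes - healthy - unhealthy;
        if (spareNodes > 0) {
            return Action.RECONSTRUCT;
        }
        if (removeUnhealthyWhenBlocked && unhealthy > 0) {
            // Blocked on spare nodes: free one by deleting an unhealthy replica.
            return Action.DELETE_UNHEALTHY;
        }
        return Action.NONE; // stuck: under-replicated with no spare node
    }

    /** Runs steps until nothing more can be done; returns {healthy, unhealthy}. */
    static int[] simulate(int clusterNodes, int data, int parity,
                          int unhealthy, boolean removeUnhealthyWhenBlocked) {
        int required = data + parity;
        int healthy = required - unhealthy;
        while (true) {
            Action action = nextAction(clusterNodes, required, healthy,
                    unhealthy, removeUnhealthyWhenBlocked);
            if (action == Action.NONE) {
                break;
            } else if (action == Action.RECONSTRUCT) {
                healthy++;
            } else {
                unhealthy--;
            }
        }
        return new int[] {healthy, unhealthy};
    }

    public static void main(String[] args) {
        // 6 nodes, EC-3-2, 2 unhealthy replicas, as in the description.
        int[] stuck = simulate(6, 3, 2, 2, false);
        int[] fixed = simulate(6, 3, 2, 2, true);
        System.out.println("without fix: healthy=" + stuck[0] + " unhealthy=" + stuck[1]);
        System.out.println("with fix:    healthy=" + fixed[0] + " unhealthy=" + fixed[1]);
    }
}
```

In this model, the run without the extra check stops at 4 healthy and 2 unhealthy replicas, which is the stuck state described above. With the check enabled, one unhealthy replica is deleted to free a node, the last missing replica is reconstructed, and the remaining unhealthy replica is then removed.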

            People

              siddhant Siddhant Sangwan
              sodonnell Stephen O'Donnell