  1. Apache Ozone
  2. HDDS-7759 Improve Ozone Replication Manager
  3. HDDS-8535

ReplicationManager: Unhealthy containers could block EC recovery in small clusters


Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.4.0
    • Component/s: SCM

    Description

      With EC containers in a small cluster of, say, 6 nodes using EC-3-2, each container requires 5 nodes, one per replica. If 2 replicas become unhealthy, reconstruction is required to recover both of them, but there is only 1 spare node.

      This means only one replica gets recovered, leaving 4 "good" replicas and 2 "unhealthy" ones. The container then remains stuck in this state, because unhealthy replicas are only removed once the container has no over- or under-replication.
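      The node arithmetic behind this stuck state can be sketched as follows. This is an illustrative model only, not Ozone code: the class and method names are hypothetical, and it assumes that a node holding any replica of the container (healthy or unhealthy) cannot host another replica of the same container.

```java
public class EcSpareNodeExample {

    /** Nodes that hold no replica of this container, healthy or unhealthy. */
    static int spareNodes(int clusterNodes, int healthy, int unhealthy) {
        return Math.max(0, clusterNodes - healthy - unhealthy);
    }

    /** Number of missing replicas that can be reconstructed right now. */
    static int recoverable(int clusterNodes, int required, int healthy, int unhealthy) {
        int missing = Math.max(0, required - healthy);
        return Math.min(missing, spareNodes(clusterNodes, healthy, unhealthy));
    }

    public static void main(String[] args) {
        int clusterNodes = 6;
        int required = 5;   // EC-3-2: 3 data + 2 parity replicas
        int healthy = 3;
        int unhealthy = 2;

        // First pass: 2 replicas are missing, but only 1 node is free.
        int recovered = recoverable(clusterNodes, required, healthy, unhealthy);
        healthy += recovered;
        System.out.println("recovered=" + recovered + " healthy=" + healthy
            + " unhealthy=" + unhealthy);

        // Second pass: still 1 replica short, and no free node remains.
        System.out.println("recoverable now="
            + recoverable(clusterNodes, required, healthy, unhealthy));
    }
}
```

      In this model the second pass can recover nothing: the container is still one replica short, yet both unhealthy replicas keep occupying nodes.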

      A similar problem was resolved previously: an EC container with both over- and under-replication could get stuck in the same way, because under-replication handling could not proceed due to insufficient spare nodes. In that case, the solution was to detect the situation and call the over-replication handler to clear the excess replicas. A similar solution is required here: remove some of the unhealthy replicas to free up nodes so that progress can be made.
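      A minimal sketch of the proposed fallback, under the same simplified model: when the container is under-replicated and no spare node exists, delete one unhealthy replica to free a node, then retry reconstruction. All names here are hypothetical and this is not the actual ReplicationManager implementation. It also assumes that enough healthy replicas exist to reconstruct the missing ones, so the unhealthy replicas are safe to delete.

```java
public class UnhealthyReplicaFallbackExample {

    enum Action { NONE, RECONSTRUCT, DELETE_UNHEALTHY, BLOCKED }

    /** Decide the next step for one container in this simplified model. */
    static Action nextAction(int clusterNodes, int required, int healthy, int unhealthy) {
        if (healthy >= required) {
            return Action.NONE;             // not under-replicated; normal cleanup is not modelled
        }
        int spare = clusterNodes - healthy - unhealthy;
        if (spare > 0) {
            return Action.RECONSTRUCT;      // a free node exists, so recover one replica
        }
        if (unhealthy > 0) {
            return Action.DELETE_UNHEALTHY; // free a node by removing an unhealthy replica
        }
        return Action.BLOCKED;              // the cluster is simply too small
    }

    /** Apply actions until nothing more can be done; returns {healthy, unhealthy}. */
    static int[] simulate(int clusterNodes, int required, int healthy, int unhealthy) {
        while (true) {
            Action action = nextAction(clusterNodes, required, healthy, unhealthy);
            if (action == Action.RECONSTRUCT) {
                healthy++;
            } else if (action == Action.DELETE_UNHEALTHY) {
                unhealthy--;
            } else {
                return new int[] {healthy, unhealthy};
            }
        }
    }

    public static void main(String[] args) {
        // 6 nodes, EC-3-2 (5 replicas required), 3 healthy and 2 unhealthy replicas.
        int[] result = simulate(6, 5, 3, 2);
        System.out.println("healthy=" + result[0] + " unhealthy=" + result[1]);
    }
}
```

      Each iteration either adds a healthy replica or removes an unhealthy one, so the loop always terminates. For the 6-node example, the sequence is reconstruct, delete one unhealthy replica, reconstruct again, which leaves the container fully replicated.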


          People

            siddhant Siddhant Sangwan
            sodonnell Stephen O'Donnell
