Uploaded image for project: 'Apache Ozone'
  1. Apache Ozone
  2. HDDS-6462 Phase II : Erasure Coding Offline Recovery & Read/Write Improvements
  3. HDDS-7192

EC: ReplicationManager - create handlers to perform various container checks



    • Sub-task
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • None
    • 1.3.0
    • None


      One of the goals I had in mind when developing the new Replication Manager was to make the code cleaner, easier to test and easier to follow. The Legacy Replication Manager has a lot of logic in a single class, to handle all sorts of container states and conditions and that makes it hard to test and difficult to see what is even tested.

      If we define “business logic” as the rules and conditions which indicate if a container is healthy or not, and also the rules and logic to fix any conditions the container may have, such as bad replica states, over / under replication etc.

      Then the Replication Manager “infrastructure” or “plumbing” is what stitches the business logic together, provides access to queued containers, iterates over the containers etc. This logic is relatively simple in general.

      We have also started using the ReplicationManager class as a kind of proxy object, to

      My goal is to try to separate all the business logic into separate logical classes that can each be tested in isolation. Then the Replication Manager class itself can focus on iterating over the container containers and applying the business logic, dispatching commands and queuing containers for remediation (under / over replication processing).

      Already I can see some areas where we are starting to copy the legacy replication manager and putting some business logic items into the Replication Manager processing flow, so I would like to stop and have a discussion on the best way forward.

      Looking at the logic we have so far, the replication manager checks look like:

      • Open Container Healthy check - If the container is open and should be closed for some reason. Currently in ReplicationManager class
      • Unhealthy Replica Handing - If the container is closed and some replicas are not in the correct closed state. Currently in Replication Manager class.
      • Empty Container Handling - If the container is empty, remove its replicas. Not yet implemented, but there is a PR adding this into the RM class - https://github.com/apache/ozone/pull/3660
      • ECHealthCheck - under / over / mis replication (ECContainerHealthCheck)
      • RatisHealthCheck - under / over/ mis replication (not yet implemented, but will be in RatisContainerHealthCheck)
      • The legacy Manager also has logic for CLOSING and QUASI_CLOSED containers we need to get in somewhere.

      What I would like to propose is that we should make each of these above checks into their own class. Some of the classes will be very small, and others will be medium sized.

      We can then use a variation of the "Chain of Responsibility" pattern to bring all the checks together into a chain.




            sodonnell Stephen O'Donnell
            sodonnell Stephen O'Donnell
            0 Vote for this issue
            1 Start watching this issue