  1. Apache Ozone
  2. HDDS-8699 Further Replication Manager Improvements
  3. HDDS-8660

ReplicationManager: Notify when dead nodes or nodes go out of service

Details

    • Sub-task
    • Status: Open
    • Major
    • Resolution: Unresolved
    • None
    • None
    • SCM
    • None

    Description

      If someone triggers decommission / maintenance, there can be up to a 5-minute lag between the decommission process starting and RM noticing that containers need replication, because RM runs on a 5-minute interval. Similarly, if a node goes dead, it has already been gone for 10 minutes, and it can take up to another 5 minutes for RM to notice and process the containers.

      It would be good to notify the RM thread to wake it up when these events happen to reduce the time it takes to start to repair the problem.

      One thing to keep in mind for any solution is that RM operates by:

      1. Getting a list of all containers.
      2. Processing the list.
      3. Sleeping for 5 minutes.
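
The three steps above, together with the proposed wake-up, could be sketched roughly as follows. This is a minimal illustration only: `ReplicationLoopSketch`, `wakeUp()` and `stop()` are hypothetical names, not the actual ReplicationManager code, and containers are represented as plain strings.

```java
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Supplier;

/** Hypothetical sketch of the RM loop; names are illustrative, not Ozone's. */
class ReplicationLoopSketch implements Runnable {
  private final Supplier<List<String>> containerSource; // step 1: list containers
  private final Consumer<String> processor;             // step 2: process each one
  private final long intervalMs;                        // step 3: sleep interval (> 0)
  private volatile boolean running = true;

  ReplicationLoopSketch(Supplier<List<String>> containerSource,
      Consumer<String> processor, long intervalMs) {
    this.containerSource = containerSource;
    this.processor = processor;
    this.intervalMs = intervalMs;
  }

  @Override
  public void run() {
    try {
      while (running) {
        List<String> containers = containerSource.get();
        containers.forEach(processor);
        synchronized (this) {
          if (running) {
            // Returns early if wakeUp() or stop() is called while waiting.
            wait(intervalMs);
          }
        }
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }

  /**
   * Assumed hook for "node went dead / out of service" events.
   * Has no effect unless the loop is currently inside wait().
   */
  synchronized void wakeUp() {
    notifyAll();
  }

  synchronized void stop() {
    running = false;
    notifyAll();
  }
}
```

With a plain `wait` / `notifyAll` like this, a notification that arrives while step 2 is still running is simply dropped, and the next pass starts after the normal interval.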

      If a node goes dead during step 2 and we notify the thread, it will already be running, so the notify will not do anything. Some of the containers from the node in question may already have been processed, or they may still be waiting to be processed; we don't really know. Perhaps this is OK, rather than complicating the solution, as in general fixing decommission or under-replication will take a long time.
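
If dropping a notification that arrives mid-pass turned out not to be acceptable, one possible variation is to remember it in a flag and run one extra pass straight after the current one. The sketch below assumes a hypothetical helper, `PendingWakeUp`, that the loop would call instead of a plain sleep; nothing here is existing Ozone code.

```java
/** Hypothetical helper: remembers a wake-up that arrives while a pass is running. */
class PendingWakeUp {
  private boolean pending = false;

  /** Called from the event handler; safe to call at any time. */
  synchronized void request() {
    pending = true;
    notifyAll();
  }

  /**
   * Called by the loop in place of a plain sleep. Returns true if a wake-up
   * was requested before or during the wait, false if the interval elapsed
   * (or the thread was interrupted) without one.
   */
  synchronized boolean sleepOrWake(long intervalMs) {
    long deadline = System.nanoTime() + intervalMs * 1_000_000L;
    try {
      while (!pending) {
        long remainingMs = (deadline - System.nanoTime()) / 1_000_000L;
        if (remainingMs <= 0) {
          break;
        }
        wait(remainingMs);
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // keep the interrupt status for the caller
    }
    boolean woken = pending;
    pending = false; // any number of requests collapses into one extra run
    return woken;
  }
}
```

Because the flag is cleared once per sleep, several requests made during a single pass lead to only one follow-up run. Whether that extra complexity is worth it is the open question raised above.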

      It is also possible that several nodes go dead in quick succession, or several nodes go out of service quickly, resulting in several notify calls. We don't want to wake up the thread too frequently if this happens, as it would result in a new replication queue being created over and over. Perhaps if the queue is not empty, there is still replication work to do, and we should not run again.
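
The "skip if the queue is not empty" idea could look something like the guard below. `WakeUpGuard` and its methods are hypothetical names used for illustration, and the queue is a plain in-memory queue of container IDs rather than RM's real replication queue.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Hypothetical guard; names are illustrative, not Ozone APIs. */
class WakeUpGuard {
  private final Queue<String> replicationQueue = new ArrayDeque<>();
  private int wakeUps = 0;

  synchronized void enqueue(String containerId) {
    replicationQueue.add(containerId);
  }

  synchronized String poll() {
    return replicationQueue.poll();
  }

  /**
   * Called when a node goes dead or out of service.
   * Returns true only if a wake-up was actually issued.
   */
  synchronized boolean onNodeEvent() {
    if (!replicationQueue.isEmpty()) {
      // Replication work is still outstanding, so do not rebuild the queue.
      return false;
    }
    wakeUps++; // a real implementation would notify the RM thread here
    return true;
  }

  synchronized int wakeUps() {
    return wakeUps;
  }
}
```

For example, an event that arrives while `c1` is still queued is ignored, and the next event after the queue drains triggers a wake-up. This check alone does not coalesce events that arrive while the queue is empty but before the next pass has refilled it.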

      Finally, we might want to consider notifying when a node comes back into service, as that could cause over-replication. However, over-replication is less of a problem than under-replication if it is not addressed quickly.

          People

            Assignee: Unassigned
            Reporter: Stephen O'Donnell (sodonnell)

            Dates

              Created:
              Updated: