Apache Ozone / HDDS-7759 Improve Ozone Replication Manager / HDDS-8335

ReplicationManager: EC Mis and Under replication handler should handle overloaded exceptions


Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Implemented
    • Affects Version/s: None
    • Fix Version/s: 1.4.0
    • Component/s: SCM

    Description

      In RatisOverReplicationHandler and ECOverReplicationHandler, a container can be over replicated by several replicas, and the deletes are done in two stages:

      1. First, unhealthy replicas are removed.
      2. Then, healthy replicas are removed.

      While removing any replica, the handler could get a CommandTargetOverloadedException, but rather than throwing that exception immediately, it continues trying other replicas. At the end, if it has not deleted enough replicas, it re-throws the first CommandTargetOverloadedException so the over replication is re-queued on the over replication queue.
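      The two-stage delete with a deferred exception described above can be sketched roughly as follows. This is a minimal illustration, not the Ozone implementation: OverReplicationSketch, Replica, DeleteSender and OverloadedException are hypothetical stand-ins (the last one for CommandTargetOverloadedException), and the real handlers work with container replicas and datanode details rather than plain strings.

```java
import java.util.ArrayList;
import java.util.List;

final class OverReplicationSketch {

  /** Hypothetical stand-in for CommandTargetOverloadedException. */
  static final class OverloadedException extends Exception {
    OverloadedException(String message) {
      super(message);
    }
  }

  /** Hypothetical replica: a datanode name plus a health flag. */
  static final class Replica {
    final String node;
    final boolean healthy;

    Replica(String node, boolean healthy) {
      this.node = node;
      this.healthy = healthy;
    }
  }

  /** Hypothetical command sender; throws if the target node is overloaded. */
  interface DeleteSender {
    void sendDelete(Replica replica) throws OverloadedException;
  }

  /**
   * Tries to delete {@code excess} replicas: unhealthy ones first, then
   * healthy ones. An overloaded target is skipped rather than aborting the
   * run. If too few deletes were sent, the first overloaded exception is
   * re-thrown so the caller can re-queue the container.
   */
  static List<Replica> removeExcess(List<Replica> replicas, int excess,
      DeleteSender sender) throws OverloadedException {
    List<Replica> deleted = new ArrayList<>();
    OverloadedException firstOverload = null;
    // Stage 1 handles unhealthy replicas, stage 2 handles healthy ones.
    for (boolean healthyStage : new boolean[] {false, true}) {
      for (Replica replica : replicas) {
        if (deleted.size() >= excess) {
          break;
        }
        if (replica.healthy != healthyStage) {
          continue;
        }
        try {
          sender.sendDelete(replica);
          deleted.add(replica);
        } catch (OverloadedException e) {
          // Remember only the first failure and keep trying other replicas.
          if (firstOverload == null) {
            firstOverload = e;
          }
        }
      }
    }
    if (deleted.size() < excess && firstOverload != null) {
      throw firstOverload;
    }
    return deleted;
  }
}
```

      The point of the design is that one overloaded datanode does not block deletes that could be sent to other nodes, while the re-thrown exception still signals that the work is incomplete.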

      Other handlers also have multiple stages, but in the event of an error like CommandTargetOverloadedException, they give up immediately.

      RatisOverReplicationHandler works as expected. So does ECOverReplicationHandler.

      For RatisUnderReplicationHandler, the command target is the source, and RM.sendThrottledReplicationCommand() picks the least loaded source. It is possible to send one command and then fail to send the second, but there is no point in retrying, because that failure means all the sources are overloaded. As things stand, the handler sends what it can and then throws an exception, so that is fine.
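      To illustrate why a retry is pointless once a throttled send fails, here is a minimal sketch of least-loaded source selection. It assumes a simple per-node count of queued commands and a fixed limit; ThrottledSourceSketch, its limit parameter and OverloadedException are hypothetical names for illustration, not the ReplicationManager API.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ThrottledSourceSketch {

  /** Hypothetical stand-in for CommandTargetOverloadedException. */
  static final class OverloadedException extends Exception {
    OverloadedException(String message) {
      super(message);
    }
  }

  // Assumed bookkeeping: number of commands queued on each source node.
  private final Map<String, Integer> queued = new HashMap<>();
  // Assumed per-node limit on queued commands.
  private final int limit;

  ThrottledSourceSketch(int limit) {
    this.limit = limit;
  }

  /**
   * Picks the source with the fewest queued commands and records one more
   * command against it. If even the least loaded source is at the limit,
   * every source is overloaded, so trying another source cannot succeed.
   */
  String sendFromLeastLoaded(List<String> sources) throws OverloadedException {
    String best = null;
    int bestLoad = Integer.MAX_VALUE;
    for (String source : sources) {
      int load = queued.getOrDefault(source, 0);
      if (load < bestLoad) {
        best = source;
        bestLoad = load;
      }
    }
    if (best == null || bestLoad >= limit) {
      throw new OverloadedException("All sources are overloaded: " + sources);
    }
    queued.put(best, bestLoad + 1);
    return best;
  }
}
```

      With a limit of 1 and two sources, the first two calls succeed on different nodes and the third throws, which matches the "send what it can, then throw" behaviour described above.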

      For MisReplicationHandler, which is currently shared between EC and Ratis (HDDS-8109 may change this), I believe it could run into this problem with EC: it may need to make new copies of 2 EC indexes when 1 of the nodes is overloaded and the other is not. It would be better not to fail completely if the first is overloaded.

      For Ratis mis-replication, as we can copy any replica after HDDS-8109, it should then behave like RatisUnderReplicationHandler.

      For ECUnderReplicationHandler, there are multiple stages for processing and potential for partial success.

      We should review both ECUnderReplicationHandler and the EC mis-replication handling (after HDDS-8109) so that they handle overloaded exceptions and throw an exception on partial success.
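      One possible shape for the proposed EC behaviour is sketched below: attempt a copy for every EC index, skip nodes that are overloaded, and throw at the end if anything was skipped so the container is re-queued. All names here (EcPartialSuccessSketch, CopySender, OverloadedException) are hypothetical, and whether the overloaded node is the copy source or the target is left open, as it is in the description.

```java
import java.util.Map;

final class EcPartialSuccessSketch {

  /** Hypothetical stand-in for CommandTargetOverloadedException. */
  static final class OverloadedException extends Exception {
    OverloadedException(String message) {
      super(message);
    }
  }

  /** Hypothetical sender for a command that copies one EC index via a node. */
  interface CopySender {
    void sendCopy(int ecIndex, String node) throws OverloadedException;
  }

  /**
   * Sends a copy command for each EC index. An overloaded node does not stop
   * the remaining indexes from being processed. If any index was skipped,
   * the first overloaded exception is re-thrown after all indexes have been
   * attempted, signalling partial success to the caller.
   *
   * @return the number of commands sent when every index succeeded
   */
  static int copyIndexes(Map<Integer, String> nodeByIndex, CopySender sender)
      throws OverloadedException {
    int sent = 0;
    OverloadedException firstOverload = null;
    for (Map.Entry<Integer, String> entry : nodeByIndex.entrySet()) {
      try {
        sender.sendCopy(entry.getKey(), entry.getValue());
        sent++;
      } catch (OverloadedException e) {
        // Keep going; only the first failure is reported at the end.
        if (firstOverload == null) {
          firstOverload = e;
        }
      }
    }
    if (firstOverload != null) {
      throw firstOverload;
    }
    return sent;
  }
}
```

      In the 2-index example from the description, the copy for the index whose node is not overloaded is still sent, and the exception ensures the remaining index is retried later.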


            People

              Assignee: Stephen O'Donnell (sodonnell)
              Reporter: Stephen O'Donnell (sodonnell)