Apache Ozone / HDDS-7759

Improve Ozone Replication Manager

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Implemented
    • Affects Version/s: None
    • Fix Version/s: 1.4.0
    • Component/s: None
    • Labels: None

    Description

      Parent Jira to capture tasks related to migrating from Ozone's legacy replication manager to the new refactored modular replication manager.

      Attachments

        1. Replication Manager V2.pdf
          149 kB
          Siddhant Sangwan

        Issue Links

          1. Refine SCM handling of unhealthy container replicas (Sub-task, Resolved; assignee: Ethan Rose)
          2. EC: ReplicationManager - add move manager for container move (Sub-task, Resolved; assignee: Stephen O'Donnell)
          3. Improve Handling of Unhealthy Container Replicas in the new RM (Sub-task, Resolved; assignee: Siddhant Sangwan)
          4. Ratis OverReplicationHandler should exclude stale replicas (Sub-task, Resolved; assignee: Stephen O'Donnell)
          5. UNHEALTHY replicas will not contribute to sufficient replication in RatisContainerReplicaCount (Sub-task, Resolved; assignee: Siddhant Sangwan)
          6. Intermittent failure in TestReplicationManager#testUnderReplicationQueuePopulated (Sub-task, Resolved; assignee: Attila Doroszlai)
          7. Handle Mismatched Replicas (OPEN or CLOSING) of QUASI-CLOSED containers in RM (Sub-task, Resolved; assignee: Siddhant Sangwan)
          8. Modify Ratis Replication Handling in the new RM (Sub-task, Resolved; assignee: Siddhant Sangwan)
          9. Handle Replication of Unhealthy Replicas in RM (Sub-task, Resolved; assignee: Siddhant Sangwan)
          10. Force close QUASI_CLOSED replicas only when the container is CLOSED in Legacy RM (Sub-task, Resolved; assignee: Siddhant Sangwan)
          11. Add configuration flag to enable LegacyReplicationManager for RATIS containers (Sub-task, Resolved; assignee: Siddhant Sangwan)
          12. Force Close QUASI_CLOSED replicas of CLOSED containers in RM (Sub-task, Resolved; assignee: Siddhant Sangwan)
          13. UnhealthyReplicationProcessor retries failure without delay (Sub-task, Resolved; assignee: Attila Doroszlai)
          14. Synchronize on containerInfo in ReplicationManager and MoveManager (Sub-task, Resolved; assignee: Stephen O'Donnell)
          15. Let RatisMisReplicationHandler use the new RatisContainerReplicaCount constructor (Sub-task, Resolved; assignee: Siddhant Sangwan)
          16. Move pendingOps into ContainerStateManagerImpl to ensure consistent state (Sub-task, Resolved; assignee: Stephen O'Donnell)
          17. ReplicationManager: Count a container once for missing, under, mis or over replicated (Sub-task, Resolved; assignee: Stephen O'Donnell)
          18. Check container replication health before scheduling move in MoveManager (Sub-task, Resolved; assignee: Siddhant Sangwan)
          19. Replace Usages of LegacyReplicationManager.MoveResult with MoveManager.MoveResult (Sub-task, Resolved; assignee: Siddhant Sangwan)
          20. Improve synchronization around command queue updates in Node Manager (Sub-task, Resolved; assignee: Stephen O'Donnell)
          21. ECReconstructionCoordinatorTask.runTask should catch Exception (Sub-task, Resolved; assignee: Stephen O'Donnell)
          22. ReplicationManager: Introduce basic limits on ReplicateContainer commands (Sub-task, Resolved; assignee: Stephen O'Donnell)
          23. Clean up replication logs (Sub-task, Resolved; assignee: Attila Doroszlai)
          24. Integrate ContainerBalancer with MoveManager (Sub-task, Resolved; assignee: Siddhant Sangwan)
          25. Replication Manager: Make all handlers send commands immediately instead of returning commands (Sub-task, Resolved; assignee: Stephen O'Donnell)
          26. Inject MoveManager into ContainerBalancer (Sub-task, Resolved; assignee: Siddhant Sangwan)
          27. Replicate commands can be sent to dead maintenance modes if the same index is being decommissioned (Sub-task, Resolved; assignee: Stephen O'Donnell)
          28. ReplicationManager: Add RatisMisReplicationHandler into rm.processUnderReplicatedContainer (Sub-task, Resolved; assignee: Stephen O'Donnell)
          29. ReplicationManager: Datanode commands should be sent to nodeManager directly (Sub-task, Resolved; assignee: Stephen O'Donnell)
          30. Make deadlines inside MoveManager for move commands configurable (Sub-task, Resolved; assignee: Siddhant Sangwan)
          31. ECUnderReplicationHandler should consider commands already sent when processing the container (Sub-task, Resolved; assignee: Stephen O'Donnell)
          32. ReplicationManager: Throttle delete container commands from over replication handlers (Sub-task, Resolved; assignee: Stephen O'Donnell)
          33. Let ReplicationManager decide the timeout for commands in Datanodes (Sub-task, Resolved; assignee: Stephen O'Donnell)
          34. Delay Starting ContainerBalancer after SCM failover (Sub-task, Resolved; assignee: Siddhant Sangwan)
          35. ReplicationManager: Basic Throttling of EC Reconstruction commands (Sub-task, Resolved; assignee: Stephen O'Donnell)
          36. ReplicationManager: Add nodes to exclude list if they are overloaded (Sub-task, Resolved; assignee: Stephen O'Donnell)
          37. Let ContainerBalancer consider EC containers for balancing (Sub-task, Resolved; assignee: Siddhant Sangwan)
          38. Fix the space usage comparator in ContainerBalancerSelectionCriteria (Sub-task, Resolved; assignee: Siddhant Sangwan)
          39. ReplicationManager: Fix getContainerReplicationHealth() so that it builds ContainerCheckRequest correctly (Sub-task, Resolved; assignee: Siddhant Sangwan)
          40. ReplicationManager: Create ContainerReplicaOp with correct target Datanode (Sub-task, Resolved; assignee: Siddhant Sangwan)
          41. ReplicationManager: Use RM exclude list when getting target nodes for reconstruction (Sub-task, Resolved; assignee: Attila Doroszlai)
          42. ReplicationManager: MisReplicationHandler should throw an exception if partially successful (Sub-task, Resolved; assignee: Attila Doroszlai)
          43. ReplicationManager: RatisUnderReplicationHandler should partially recover the container if not enough nodes (Sub-task, Resolved; assignee: Stephen O'Donnell)
          44. ContainerBalancer should move only CLOSED replicas (Sub-task, Resolved; assignee: Siddhant Sangwan)
          45. Consider separating Ratis and EC MisReplication Handling (Sub-task, Resolved; assignee: Attila Doroszlai)
          46. ReplicationManager: Allow partial EC reconstruction if insufficient nodes available (Sub-task, Resolved; assignee: Stephen O'Donnell)
          47. ReplicationManager: EC Mis and Under replication handler should handle overloaded exceptions (Sub-task, Resolved; assignee: Stephen O'Donnell)
          48. Disable LegacyReplicationManager by default (Sub-task, Resolved; assignee: Attila Doroszlai)
          49. ReplicationManager: RatisUnderReplication handler should not sort sources by BCSID (Sub-task, Resolved; assignee: Stephen O'Donnell)
          50. Ratis under replication handling in a rack aware environment doesn't work (Sub-task, Resolved; assignee: Siddhant Sangwan)
          51. Ensure replication processors use a single queue for each iteration (Sub-task, Resolved; assignee: Attila Doroszlai)
          52. ReplicationManager: Add configurable global replication limit (Sub-task, Resolved; assignee: Stephen O'Donnell)
          53. Investigate possible race conditions on ContainerInfo in ContainerBalancer (Sub-task, Resolved; assignee: Siddhant Sangwan)
          54. Adjust replication queue limits for decommissioning nodes (Sub-task, Resolved; assignee: Attila Doroszlai)
          55. ReplicationManager: Clear ContainerReplicaPendingOps when RM goes to running state (Sub-task, Resolved; assignee: Stephen O'Donnell)
          56. Provide more info in assertions (Sub-task, Resolved; assignee: Attila Doroszlai)
          57. Datanode decommissioning blocked due to non-empty replica of deleting container (Sub-task, Resolved; assignee: Siddhant Sangwan)
          58. Add config for factor of scaling up replication queue/threads in decommissioning nodes (Sub-task, Resolved; assignee: Attila Doroszlai)
          59. ReplicationManager should handle CLOSING containers that are empty (Sub-task, Resolved; assignee: Siddhant Sangwan)
          60. ReplicationManager: Pass used and excluded node separately for Under and Mis-Replication (Sub-task, Resolved; assignee: Stephen O'Donnell)
          61. Underreplication not fixed if all replicas start decommissioning (Sub-task, Resolved; assignee: Attila Doroszlai)
          62. Thread pool size needs to be decreased in different order in ReplicationSupervisor (Sub-task, Resolved; assignee: Attila Doroszlai)
          63. ReplicationManager: Change default command timeout to 10 minutes (Sub-task, Resolved; assignee: Stephen O'Donnell)
          64. Ratis underreplication due to maintenance is not deprioritised (Sub-task, Resolved; assignee: Attila Doroszlai)
          65. EC: ReplicationManager - consider deprecating maintenance.replica.minimum (Sub-task, Resolved; assignee: Attila Doroszlai)
          66. ReplicationManager: Use EC config scheme to adjust the weighting of reconstruction tasks (Sub-task, Resolved; assignee: Unassigned)
          67. ReplicationManager: Unhealthy containers could block EC recovery in small clusters (Sub-task, Resolved; assignee: Siddhant Sangwan)
          68. Incorrect expectedNodes passed to InsufficientNodesException (Sub-task, Resolved; assignee: Stephen O'Donnell)
          69. Fix expectation in testUnderRepSentToOverRepHandlerIfNoNewNodes (Sub-task, Resolved; assignee: Siddhant Sangwan)
          70. Add metrics to ReplicationSupervisor for task count and max stream (Sub-task, Resolved; assignee: Stephen O'Donnell)
          71. Delete empty containers that are stuck in CLOSING state (Sub-task, Resolved; assignee: Nandakumar)
          72. Clean up replication code (Sub-task, Resolved; assignee: Attila Doroszlai)
          73. ReplicationManager: Log overloaded commands at debug rather than info level (Sub-task, Resolved; assignee: Stephen O'Donnell)
          74. ReplicationManager: Add metric to count how often replication is throttled (Sub-task, Resolved; assignee: Stephen O'Donnell)
          75. Prepare for dynamic config in ReplicationManager (Sub-task, Resolved; assignee: Attila Doroszlai)
          76. ReplicationManager: Fix metrics to work with new RM (Sub-task, Resolved; assignee: Stephen O'Donnell)
          77. ReplicationManager: Add metrics for partial replication / reconstruction and cluster limit (Sub-task, Resolved; assignee: Stephen O'Donnell)
          78. Orphan blocks can leave empty container stuck deleting forever (Sub-task, Resolved; assignee: Ashish Kumar)
          79. Replication limit should not be less than reconstruction weight (Sub-task, Resolved; assignee: Attila Doroszlai)
          80. Add metrics to Container Balancer (Sub-task, Resolved; assignee: Siddhant Sangwan)
          81. NPE in SCMCommonPlacementPolicy#validateContainerPlacement (Sub-task, Resolved; assignee: Attila Doroszlai)

            People

              Assignee: Unassigned
              Reporter: Ethan Rose (erose)
              Votes: 0
              Watchers: 6
