Apache Ozone / HDDS-2823 SCM HA Support / HDDS-4599

Handle inflight delete/add actions in ReplicationManager properly.



    • Type: Sub-task
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.1.0
    • Fix Version/s: None
    • Component/s: SCM HA
    • Labels:


      ReplicationManager keeps its in-flight replication and deletion state in memory only; that state is not replicated through Ratis. After a failover, the new leader SCM therefore has no record of commands that are still in flight. If ReplicationManager starts immediately after the failover, it may schedule further deletes or replications on top of those outstanding commands, which in theory can lead to data loss or over-replication.

      HDDS-4589 provides a quick fix for the potential data loss, but a thorough solution is still needed for both in-flight adds and in-flight deletes.

      We have two proposals from Stephen O'Donnell:

      1. Have the DNs include a list of pending_delete blocks in their container report or heartbeat, so that SCM can take them into account.
      2. If the DNs detect a new master SCM or a restarted SCM, have them purge their pending delete list and wait for instructions from the new or restarted SCM.
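
The two proposals above can be sketched roughly as follows. This is only an illustration under stated assumptions, not Ozone code: the classes `Datanode` and `Scm`, every method name, the use of a numeric `scmTerm` to detect a new or restarted SCM, and the use of a plain `long` id for a pending-delete entry are all hypothetical.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch; none of these types or methods exist in Ozone. */
class InflightDeleteSketch {

  /** Simplified datanode that tracks deletes it has been told to perform. */
  static class Datanode {
    private final Set<Long> pendingDeletes = new HashSet<>();
    private long lastSeenTerm = -1;

    /**
     * Proposal 2: if the SCM term differs from the last one seen (assumed
     * signal for a new or restarted SCM), drop all queued deletes and wait
     * for fresh instructions.
     */
    void onScmContact(long scmTerm) {
      if (scmTerm != lastSeenTerm) {
        pendingDeletes.clear();
        lastSeenTerm = scmTerm;
      }
    }

    void queueDelete(long id) {
      pendingDeletes.add(id);
    }

    /** Proposal 1: expose the pending deletes so a report can carry them. */
    Set<Long> reportPendingDeletes() {
      return new HashSet<>(pendingDeletes);
    }
  }

  /** Simplified SCM that rebuilds in-flight delete state from reports. */
  static class Scm {
    // id -> datanodes that still have a delete pending for that id
    private final Map<Long, Set<String>> inflightDeletion = new HashMap<>();

    /** Proposal 1: merge one datanode's reported pending deletes. */
    void processReport(String datanodeId, Set<Long> pendingDeletes) {
      for (long id : pendingDeletes) {
        inflightDeletion
            .computeIfAbsent(id, k -> new HashSet<>())
            .add(datanodeId);
      }
    }

    int inflightDeleteCount(long id) {
      return inflightDeletion
          .getOrDefault(id, Collections.emptySet())
          .size();
    }
  }
}
```

With proposal 1 the new leader learns about outstanding deletes from the datanodes; with proposal 2 there is nothing outstanding to learn about, because each datanode discards its queue when it sees the new SCM. The two approaches are alternatives and are combined here only to keep the sketch short.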

      This Jira is filed to track the problem.





