Details
Type: Sub-task
Status: Open
Priority: Major
Resolution: Unresolved
1.1.0
None
None
Description
ReplicationManager keeps its in-flight replication and in-flight deletion state in memory only; this state is not replicated through Ratis. After a failover, the new leader SCM therefore has no record of the commands the previous leader already sent. So, in theory, if we start ReplicationManager immediately after a failover, we could run into data loss (replica deletes are issued again) or over-replication (replications are issued again).
There is a quick fix for the potential data loss issue (HDDS-4589), but we still need a thorough solution for both in-flight add and in-flight delete.
We have two proposals from sodonnell:
- Have the DNs include a list of pending_delete blocks in their container report / heartbeat, so that SCM can use it to rebuild its view of in-flight deletes.
- If a DN detects a new leader SCM or a restarted SCM, have it purge its pending delete list and wait for new instructions from the new/restarted SCM.
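The two proposals can be sketched roughly as follows. This is a minimal illustrative model only, not Ozone code: the classes `SketchDatanode` and `SketchScm`, their methods, and the use of plain `long` ids and string SCM ids are all hypothetical names invented for this sketch.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical datanode model: holds deletes it was told to do but has not done yet. */
class SketchDatanode {
    final String id;
    private final Set<Long> pendingDeleteIds = new HashSet<>();
    private String knownScmId; // id of the SCM this DN last heard from (assumption)

    SketchDatanode(String id, String scmId) {
        this.id = id;
        this.knownScmId = scmId;
    }

    void queueDelete(long idToDelete) {
        pendingDeleteIds.add(idToDelete);
    }

    /** Proposal 1: expose the pending delete list so it can ride along with a report. */
    Set<Long> reportPendingDeletes() {
        return new HashSet<>(pendingDeleteIds);
    }

    /** Proposal 2: on seeing a different SCM id, purge pending deletes. Returns true if purged. */
    boolean onScmContact(String scmId) {
        if (!scmId.equals(knownScmId)) {
            knownScmId = scmId;
            pendingDeleteIds.clear();
            return true;
        }
        return false;
    }
}

/** Hypothetical SCM model: in-flight deletes live in memory only, so a new leader starts empty. */
class SketchScm {
    private final Map<Long, Set<String>> inflightDeletes = new HashMap<>();

    /** Proposal 1, SCM side: rebuild this DN's in-flight delete entries from its report. */
    void processReport(String dnId, Set<Long> pendingDeleteIds) {
        // Drop stale entries for this DN, then re-add what it currently reports.
        for (Set<String> dns : inflightDeletes.values()) {
            dns.remove(dnId);
        }
        inflightDeletes.values().removeIf(Set::isEmpty);
        for (long pendingId : pendingDeleteIds) {
            inflightDeletes.computeIfAbsent(pendingId, k -> new HashSet<>()).add(dnId);
        }
    }

    int inflightDeleteCount(long pendingId) {
        return inflightDeletes.getOrDefault(pendingId, Collections.emptySet()).size();
    }
}

class PendingDeleteSketch {
    public static void main(String[] args) {
        SketchDatanode dn = new SketchDatanode("dn1", "scm1");
        dn.queueDelete(7L);

        // A freshly elected leader knows nothing until the DN reports (proposal 1).
        SketchScm newLeader = new SketchScm();
        newLeader.processReport(dn.id, dn.reportPendingDeletes());
        System.out.println("in-flight deletes for 7: " + newLeader.inflightDeleteCount(7L));

        // Alternatively, the DN purges when it sees a different SCM (proposal 2).
        System.out.println("purged: " + dn.onScmContact("scm2"));
    }
}
```

In this sketch the first proposal lets the new leader reconstruct in-flight deletes from what the DNs actually hold, while the second makes the new leader's empty in-memory state correct by discarding the DN-side queue. Neither variant addresses in-flight adds, which the description says also need a solution.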
This Jira is filed to track the problem.