Details
Sub-task
Status: Resolved
Major
Resolution: Fixed
1.4.0
None
Description
The situation is documented under the heading 'Situation 4' in https://docs.google.com/document/d/1ebuSwJZkw4wMWWCHinDvRCfNbeFD4kcHMyIN6Q6wD9g/edit?usp=sharing. It is caused by limitations in the rack scatter policy combined with the replication manager flow. One possible solution is to implement a "fallback" in the rack scatter policy. In addition to the doc, this PR is related: https://github.com/apache/ozone/pull/5097.
An example (summary) of this situation:
Suppose there are 5 racks and 6 DNs, so that exactly one rack has 2 DNs and each of the other four racks has 1 DN. The 5 replicas of an EC container are scattered across the 5 racks, one replica per rack. Now, if the Datanode on any of the single-DN racks is decommissioned, under-replication handling will be blocked.
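The example above can be sketched as a small toy model. This is not Ozone's rack scatter policy or replication manager code; the topology, the node names, and the choose_target helper are hypothetical and only illustrate why a strict "one replica per rack" rule finds no target, while a fallback would.

```python
from typing import Iterable, Optional, Set

# Hypothetical topology from the example: 5 racks, 6 DNs, only rack1 has 2 DNs.
TOPOLOGY = {
    "rack1": ["dn1a", "dn1b"],
    "rack2": ["dn2"],
    "rack3": ["dn3"],
    "rack4": ["dn4"],
    "rack5": ["dn5"],
}


def rack_of(dn: str) -> str:
    """Return the rack that hosts the given datanode."""
    for rack, dns in TOPOLOGY.items():
        if dn in dns:
            return rack
    raise KeyError(dn)


def choose_target(replica_nodes: Iterable[str], excluded: Set[str],
                  fallback: bool = False) -> Optional[str]:
    """Pick a node for one new replica (simplified assumption, not Ozone's logic).

    Strict mode only accepts a node on a rack that holds no healthy replica yet.
    With fallback=True, any free in-service node is accepted when no such rack exists.
    """
    replica_nodes = list(replica_nodes)
    used_racks = {rack_of(dn) for dn in replica_nodes}
    # Free nodes: not already holding a replica and not decommissioning.
    candidates = [dn for dns in TOPOLOGY.values() for dn in dns
                  if dn not in replica_nodes and dn not in excluded]
    for dn in candidates:
        if rack_of(dn) not in used_racks:
            return dn
    if fallback and candidates:
        return candidates[0]
    return None


# dn2 (the only DN on rack2) is decommissioning; 4 healthy replicas remain.
healthy = ["dn1a", "dn3", "dn4", "dn5"]
strict_choice = choose_target(healthy, excluded={"dn2"})
fallback_choice = choose_target(healthy, excluded={"dn2"}, fallback=True)
print(strict_choice)    # None: the only free node, dn1b, is on an already used rack
print(fallback_choice)  # dn1b
```

In this model, decommissioning dn1a instead is not a problem, because dn1b sits on the same rack and can take the replica without violating the one-replica-per-rack rule.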