[HDDS-9125] Decommissioning blocked because of under replicated EC containers - ASF JIRA

Details

Type: Sub-task
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 1.4.0
Fix Version/s: 1.4.0
Component/s: ECOfflineRecovery, SCM
Labels: None
Target Version/s: 1.4.0

Description

This situation is documented under the heading 'Situation 4' in https://docs.google.com/document/d/1ebuSwJZkw4wMWWCHinDvRCfNbeFD4kcHMyIN6Q6wD9g/edit?usp=sharing. It arises from limitations in the rack scatter policy combined with the replication manager flow. One possible solution is to implement a "fallback" in the rack scatter policy. In addition to the doc, this PR is related: https://github.com/apache/ozone/pull/5097.

An example (summary) of this situation:
Suppose there are 5 racks and 6 DNs, so that exactly one rack has 2 DNs and each of the other four racks has 1 DN. The 5 replicas of an EC container are spread across the 5 racks, one replica per rack. The only DN that does not already hold a replica is therefore the second DN on the two-DN rack, and that rack already holds a replica. Now, if a Datanode that is the only DN on its rack is decommissioned, under-replication handling is blocked.
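The example above can be sketched as a small simulation. This is only an illustrative model, not Ozone's placement code: the node and rack names, the `choose_target` helper, and the `fallback` flag are all hypothetical, and the "strict" rule (only accept a rack with no healthy replica) is an assumed simplification of the rack scatter behaviour described in this issue.

```python
from typing import Dict, Optional, Set

# Hypothetical topology: 5 racks, 6 datanodes; only rack "r1" has two nodes.
NODE_TO_RACK: Dict[str, str] = {
    "dn1": "r1", "dn2": "r1",
    "dn3": "r2", "dn4": "r3", "dn5": "r4", "dn6": "r5",
}

# One EC replica per rack; dn2 is the only node without a replica.
REPLICAS: Set[str] = {"dn1", "dn3", "dn4", "dn5", "dn6"}


def choose_target(replica_nodes: Set[str], decommissioning: str,
                  fallback: bool) -> Optional[str]:
    """Pick a node to receive a copy of the replica on `decommissioning`.

    Assumed simplification: the strict rule only accepts a node on a rack
    that holds no healthy replica. With `fallback`, any node that does not
    already hold a replica is accepted when the strict rule finds nothing.
    """
    # Replicas that will remain once decommissioning completes.
    healthy = replica_nodes - {decommissioning}
    used_racks = {NODE_TO_RACK[n] for n in healthy}
    # Nodes that already hold a replica (including the decommissioning
    # node itself) are never valid targets.
    candidates = sorted(n for n in NODE_TO_RACK if n not in replica_nodes)

    for node in candidates:
        if NODE_TO_RACK[node] not in used_racks:
            return node  # strict rack scatter is satisfied
    if fallback and candidates:
        return candidates[0]  # relax the rack constraint
    return None  # no target found: replication stays blocked


# Decommission dn3, the only node on rack r2.
print(choose_target(REPLICAS, "dn3", fallback=False))
print(choose_target(REPLICAS, "dn3", fallback=True))
```

In this model the strict call finds no target, because the only free node (`dn2`) shares a rack with `dn1`, while the fallback call accepts `dn2`. Decommissioning `dn1` instead is not blocked even under the strict rule, since `dn2` then sits on a rack with no remaining healthy replica.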

Issue Links

links to: PR 5246

People

Assignee: Stephen O'Donnell
Reporter: Siddhant Sangwan
Votes: 0
Watchers: 1

Dates

Created: 07/Aug/23 06:01
Updated: 22/Jan/24 12:58
Resolved: 07/Sep/23 11:28