[HDDS-9383] ReplicationManager: Unhealthy replicas of a sufficiently replicated container can block decommissioning - ASF JIRA

Details

Type: Sub-task
Status: Resolved
Priority: Critical
Resolution: Done
Affects Version/s: None
Fix Version/s: 1.4.0, 2.0.0
Component/s: SCM
Labels: None
Target Version/s: 1.4.0, 2.0.0

Description

A container with a mix of quasi-closed and unhealthy replicas blocks decommissioning even if it is sufficiently replicated.
a. This happens when only some of the replicas hit an error during a write.
b. It can be fixed by removing this check:
if (!replicaSet.isHealthy()) {
  if (LOG.isDebugEnabled()) {
    unhealthyIDs.add(cid);
  }
  if (unhealthy < CONTAINER_DETAILS_LOGGING_LIMIT

However, simply removing that check is not a complete solution. We also need to preserve any UNHEALTHY replicas that have the greatest Sequence ID. https://issues.apache.org/jira/browse/HDDS-9321 handles this for the Legacy Replication Manager: it introduces an API, getVulnerableUnhealthyReplicas, in RatisContainerReplicaCount to preserve such UNHEALTHY replicas. In the new RM, we need to see whether this API can be reused. Some changes on the decommissioning side will also be required, similar to https://issues.apache.org/jira/browse/HDDS-9354.
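To make the "greatest Sequence ID" rule concrete, here is a minimal sketch of the selection logic. It is not the getVulnerableUnhealthyReplicas implementation from HDDS-9321: the Replica class, the State enum, and the method name unhealthyReplicasToPreserve are hypothetical stand-ins, and the rule is assumed to be "an UNHEALTHY replica is worth preserving only if it holds the greatest Sequence ID and no non-UNHEALTHY replica holds the same one".

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical, simplified model. These are NOT the Ozone ContainerReplica or
// RatisContainerReplicaCount classes; all names here are illustrative only.
final class VulnerableReplicaSketch {

  enum State { QUASI_CLOSED, UNHEALTHY }

  static final class Replica {
    final String datanode;
    final State state;
    final long sequenceId;

    Replica(String datanode, State state, long sequenceId) {
      this.datanode = datanode;
      this.state = state;
      this.sequenceId = sequenceId;
    }
  }

  // Assumed rule: return the UNHEALTHY replicas that hold the greatest
  // sequence ID, but only when no non-UNHEALTHY replica holds that same ID.
  static List<Replica> unhealthyReplicasToPreserve(List<Replica> replicas) {
    long maxSeq = replicas.stream()
        .mapToLong(r -> r.sequenceId)
        .max()
        .orElse(Long.MIN_VALUE);

    boolean healthyHasMax = replicas.stream()
        .anyMatch(r -> r.state != State.UNHEALTHY && r.sequenceId == maxSeq);
    if (healthyHasMax) {
      // The newest data already exists on a non-UNHEALTHY replica.
      return List.of();
    }

    return replicas.stream()
        .filter(r -> r.state == State.UNHEALTHY && r.sequenceId == maxSeq)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<Replica> replicas = List.of(
        new Replica("dn1", State.QUASI_CLOSED, 8L),
        new Replica("dn2", State.QUASI_CLOSED, 8L),
        new Replica("dn3", State.UNHEALTHY, 10L));
    for (Replica r : unhealthyReplicasToPreserve(replicas)) {
      System.out.println(r.datanode + " seq=" + r.sequenceId);
    }
  }
}
```

Under this assumed rule, an UNHEALTHY replica whose Sequence ID is also present on a QUASI_CLOSED replica is not selected, so it would not hold up decommissioning on its own.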

The approach described above addresses this issue only indirectly, by moving replicas around. A more complete, long-term fix would be a reconciliation job that repairs these UNHEALTHY replicas on the datanode, possibly by merging blocks from different replicas to produce a healthy replica.

We should also investigate how a quasi-closed container ends up with unhealthy replicas, and fix the root cause.

Issue Links

is fixed by

HDDS-9592 Replication Manager: Save UNHEALTHY replicas with highest BCSID for a QUASI_CLOSED container

Resolved

requires

HDDS-9592 Replication Manager: Save UNHEALTHY replicas with highest BCSID for a QUASI_CLOSED container

Resolved

People

Assignee: Siddhant Sangwan

Reporter: Siddhant Sangwan

Votes: 0

Watchers: 1

Dates

Created: 04/Oct/23 13:22

Updated: 21/Dec/23 10:24

Resolved: 20/Dec/23 07:24