[HDDS-6447] Refine SCM handling of unhealthy container replicas - ASF JIRA

Details

Type: Sub-task
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.4.0
Component/s: SCM
Labels:
- pull-request-available

Description

Currently, containers are marked UNHEALTHY for one of the following reasons:

If an operation fails on an open or closing container, the container is marked unhealthy so that subsequent write transactions also fail.
If Container Scrubber is enabled and ContainerMetadataScanner detects an error during KeyValueContainerCheck#fastCheck().
- Metadata path or Chunks path is not accessible as a directory
- Container checksum verification fails
- On-disk Container Yaml data does not match in-memory container data (ContainerType, ContainerID, Container DBType, Metadata Path)
If Container Scrubber is enabled and ContainerDataScanner (runs only on closed and quasi-closed containers) detects any block with a missing or corrupted chunk file.
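
To make the metadata checks listed above concrete, here is a rough Java sketch of what such a fast check could look like. It is not the actual KeyValueContainerCheck code: the class FastCheckSketch, the ContainerInfo holder, and the method signature are all hypothetical, and checksums are passed in as plain strings for simplicity.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only; names and signatures are not Ozone APIs.
class FastCheckSketch {

    // Hypothetical holder for the fields the description says are compared.
    static final class ContainerInfo {
        final String containerType;
        final long containerId;
        final String dbType;
        final String metadataPath;

        ContainerInfo(String containerType, long containerId,
                      String dbType, String metadataPath) {
            this.containerType = containerType;
            this.containerId = containerId;
            this.dbType = dbType;
            this.metadataPath = metadataPath;
        }

        boolean sameAs(ContainerInfo other) {
            return containerType.equals(other.containerType)
                && containerId == other.containerId
                && dbType.equals(other.dbType)
                && metadataPath.equals(other.metadataPath);
        }
    }

    // Returns the problems found; an empty list means the replica passed.
    static List<String> fastCheck(Path metadataDir, Path chunksDir,
                                  String storedChecksum, String computedChecksum,
                                  ContainerInfo onDisk, ContainerInfo inMemory) {
        List<String> errors = new ArrayList<>();
        if (!Files.isDirectory(metadataDir)) {
            errors.add("metadata path is not accessible as a directory");
        }
        if (!Files.isDirectory(chunksDir)) {
            errors.add("chunks path is not accessible as a directory");
        }
        if (!storedChecksum.equals(computedChecksum)) {
            errors.add("container checksum verification failed");
        }
        if (!onDisk.sameAs(inMemory)) {
            errors.add("on-disk YAML data does not match in-memory data");
        }
        return errors;
    }
}
```

In this sketch, any non-empty result would correspond to the replica being marked UNHEALTHY.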

If a container in “open” state in SCM is marked unhealthy (in the container report), SCM asks the DNs to close the container. But for a “closing” container with an “unhealthy” replica, SCM leaves the container replica as is.
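
The current behavior described above can be summarized as a small decision function. This is only an illustration: the class, enum, and method names are made up for this sketch and do not correspond to SCM code, and container states other than open and closing are deliberately left out because the description does not cover them.

```java
// Hypothetical sketch only; names are not SCM APIs.
class UnhealthyReportSketch {
    enum ContainerState { OPEN, CLOSING, QUASI_CLOSED, CLOSED }
    enum Action { ASK_DATANODES_TO_CLOSE, LEAVE_REPLICA_AS_IS }

    // What SCM currently does when a container report shows an unhealthy replica,
    // keyed on the container's state as tracked in SCM.
    static Action onUnhealthyReplica(ContainerState scmState) {
        switch (scmState) {
            case OPEN:
                return Action.ASK_DATANODES_TO_CLOSE;
            case CLOSING:
                return Action.LEAVE_REPLICA_AS_IS;
            default:
                // Not described in this issue, so the sketch refuses to guess.
                throw new IllegalArgumentException("state not covered: " + scmState);
        }
    }
}
```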

Some of the issues with how unhealthy containers are handled:

If ReplicationManager does not find a healthy replica for a container, it does not replicate that container. So if there is only 1 replica of a container and it is unhealthy, SCM will never replicate it and there is potential for data loss if that single replica is lost for any reason (for example: disk failure).
If there is a quasi-closed replica and an unhealthy replica, SCM will delete the unhealthy replica. In this scenario, SCM should not delete the unhealthy replica if it can be recovered, as it is possible that the unhealthy replica is ahead of the quasi-closed replica.
SCM should be more conservative with deleting unhealthy containers as they could possibly be recovered. This Jira proposes to let SCM replicate an unhealthy container if there is no other replica. Also, if there is only a quasi-closed replica and an unhealthy replica, SCM should not delete the unhealthy replica.
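
A minimal sketch of the proposed rule, assuming replicas are represented only by their state. The class, enums, and decide method are hypothetical names for illustration, not ReplicationManager code. Cases the proposal does not address (for example, a closed replica alongside an unhealthy one) are reported as not covered rather than guessed.

```java
import java.util.List;

// Hypothetical sketch only; names are not ReplicationManager APIs.
class UnhealthyReplicaPolicySketch {
    enum ReplicaState { CLOSED, QUASI_CLOSED, UNHEALTHY }
    enum Decision {
        REPLICATE_UNHEALTHY,      // no other replica exists
        KEEP_UNHEALTHY,           // only quasi-closed peers; unhealthy one may be ahead
        NO_UNHEALTHY_REPLICA,
        NOT_COVERED_BY_PROPOSAL
    }

    static Decision decide(List<ReplicaState> replicas) {
        int unhealthy = 0;
        int quasiClosed = 0;
        int closed = 0;
        for (ReplicaState state : replicas) {
            switch (state) {
                case UNHEALTHY: unhealthy++; break;
                case QUASI_CLOSED: quasiClosed++; break;
                case CLOSED: closed++; break;
            }
        }
        if (unhealthy == 0) {
            return Decision.NO_UNHEALTHY_REPLICA;
        }
        if (closed == 0 && quasiClosed == 0) {
            // Every replica is unhealthy: replicate rather than risk losing the data.
            return Decision.REPLICATE_UNHEALTHY;
        }
        if (closed == 0) {
            // Only quasi-closed and unhealthy replicas: do not delete the unhealthy one.
            return Decision.KEEP_UNHEALTHY;
        }
        return Decision.NOT_COVERED_BY_PROPOSAL;
    }
}
```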
Let’s say there are 3 quasi-closed replicas of a closed container, all of them with bcsId < container bcsId (the closed replica was lost and a quasi-closed replica was replicated). ReplicationManager will delete one of these quasi-closed replicas (handleUnstableContainer) and then, in the next cycle, replicate it again because the container is now under-replicated (handleUnderreplicatedContainer). This becomes a loop of replicating and deleting the container replica.
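
The delete/replicate loop can be seen in a toy simulation. This is a simplified model under stated assumptions, not ReplicationManager logic: a replication factor of 3, every replica quasi-closed with a stale bcsId, and one action per cycle. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy model of the loop; not ReplicationManager code.
class ReplicationLoopSketch {
    static final int REPLICATION_FACTOR = 3; // assumed for this sketch

    // Runs the given number of cycles and returns the replica count after each one.
    static List<Integer> simulate(int cycles) {
        // Start with 3 quasi-closed replicas, all with bcsId < container bcsId.
        int replicas = REPLICATION_FACTOR;
        List<Integer> trace = new ArrayList<>();
        for (int i = 0; i < cycles; i++) {
            if (replicas < REPLICATION_FACTOR) {
                // Under-replicated: copy an existing replica, which is just as stale.
                replicas++;
            } else {
                // Fully replicated, but every replica is stale: one gets deleted.
                replicas--;
            }
            trace.add(replicas);
        }
        return trace;
    }
}
```

Because the new copy inherits the stale bcsId of its source, the replica count in this model never settles and keeps alternating between 2 and 3.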

Issue Links

causes
- HDDS-9254 Legacy replication manager uses mismatched replicas as replication sources (Resolved)

is required by
- HDDS-7093 Container Scanner should be enabled by default (Resolved)
- HDDS-7094 Enable Datanode side CRC checks by default (Reopened)

links to
- GitHub Pull Request #3258
- GitHub Pull Request #3920

People

Assignee: Ethan Rose
Reporter: Hanisha Koneru
Votes: 0
Watchers: 4

Dates

Created: 14/Mar/22 18:21
Updated: 07/Sep/23 20:36
Resolved: 11/Jan/23 10:34