Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Description
The bug described in HDDS-8129 caused containers to have block counts lower than their correct values. In versions of the code before that issue was fixed, this could make the block count reach zero too early, so SCM would move the containers to DELETING state, issue delete commands to datanodes, and move the containers to DELETED once the replicas were gone. However, it is possible that, between the datanodes sending a heartbeat with a zero block count and SCM sending back the delete commands, the block deleting service ran and drove the container's block count negative on the datanode. In that case, when the datanode receives the delete command it rejects it, even in the old version before the fixes, because the counter is not equal to zero (the code link is to a version before the deletion path was fixed).
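To make the rejection path concrete, here is a minimal, self-contained sketch of the guard described above. The class and method names are illustrative only, not the actual Ozone datanode code; the point is that a negative block count fails the "must be zero" check just like a positive one.

{code:java}
/**
 * Illustrative sketch of the datanode-side guard that rejects a delete
 * command when the recorded block count is non-zero. Names are hypothetical,
 * not the real Ozone classes.
 */
public class DeleteGuardSketch {

  /** Simplified stand-in for a datanode's per-container metadata. */
  static class ContainerData {
    long blockCount;
    ContainerData(long blockCount) { this.blockCount = blockCount; }
  }

  /**
   * Handles a delete command from SCM. The replica is only removed when its
   * recorded block count is exactly zero; a negative count (the state left
   * by HDDS-8129 plus a concurrent block deleting service run) fails the
   * check the same way a positive one does, so the delete is rejected.
   */
  static boolean handleDeleteCommand(ContainerData data) {
    if (data.blockCount != 0) {
      System.out.println("Rejecting delete: block count = " + data.blockCount);
      return false;
    }
    System.out.println("Deleting container replica");
    return true;
  }

  public static void main(String[] args) {
    handleDeleteCommand(new ContainerData(-3)); // rejected: count went negative
    handleDeleteCommand(new ContainerData(0));  // deleted: count is exactly zero
  }
}
{code}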
These containers are stuck: SCM has them in DELETING state and keeps resending delete commands, but the datanodes reject the deletion and the replicas may still hold valid data. Containers that entered this state on old versions have remained stuck indefinitely, even after the fixes, because the delete commands are driven by SCM's DELETING state for the container, not by the block counts that datanodes report after the fixes. The fixes prevent containers from incorrectly moving from CLOSED to DELETING, but they do nothing for containers already in that state.
Since DELETING containers are not processed by the replication manager, we need a way for SCM to move their state back to CLOSED when a datanode rejects the deletion, so that the effects of HDDS-8129 can be fully recovered from.
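One possible shape of that recovery is sketched below, using illustrative names only (this is not the actual SCM container state machine code): when a datanode reports that it rejected the delete because its replica is non-empty, SCM would transition the container from DELETING back to CLOSED so the replication manager can process it again.

{code:java}
/**
 * Hypothetical sketch of the proposed SCM-side recovery. The enum values
 * mirror the SCM container lifecycle states named in the description, but
 * the classes and methods are illustrative, not the real SCM implementation.
 */
public class DeletingRecoverySketch {

  enum LifeCycleState { CLOSED, DELETING, DELETED }

  static class ContainerInfo {
    LifeCycleState state = LifeCycleState.DELETING;
  }

  /**
   * Invoked when a datanode reports that it rejected a delete command
   * because the replica still holds blocks. Moving the container back to
   * CLOSED lets the replication manager handle it again instead of SCM
   * endlessly resending delete commands.
   */
  static void onDeleteRejected(ContainerInfo container) {
    if (container.state == LifeCycleState.DELETING) {
      container.state = LifeCycleState.CLOSED;
    }
  }

  public static void main(String[] args) {
    ContainerInfo stuck = new ContainerInfo();
    onDeleteRejected(stuck);
    System.out.println("State after recovery: " + stuck.state); // CLOSED
  }
}
{code}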