Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Description
The bug described in HDDS-8129 caused containers to have block counts lower than their correct values. In versions of the code before that issue was fixed, this could make the block count reach zero too early, so SCM would move the containers to DELETING state, issue delete commands to datanodes, and move the containers to DELETED once the replicas were gone. However, it is possible that, between the datanodes sending a heartbeat with a zero block count and SCM sending back the delete commands, the block deleting service ran and drove the container's block count negative on the datanode. In that case, when the datanode receives the delete command it rejects it, even in the old version before the fixes, because the counter is not equal to zero (the code link is to a version before the deletion path was fixed).
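To make the rejection path concrete, here is a minimal, self-contained sketch of the guard described above. The class and method names are illustrative only, not the actual Ozone datanode code; the point is that a negative block count fails the "must be zero" check just like a positive one.

{code:java}
/**
 * Illustrative sketch of the datanode-side guard that rejects a delete
 * command when the recorded block count is non-zero. Names are hypothetical,
 * not the real Ozone classes.
 */
public class DeleteGuardSketch {

  /** Simplified stand-in for a datanode's per-container metadata. */
  static class ContainerData {
    long blockCount;
    ContainerData(long blockCount) { this.blockCount = blockCount; }
  }

  /**
   * Handles a delete command from SCM. The replica is only removed when its
   * recorded block count is exactly zero; a negative count (the state left
   * by HDDS-8129 plus a concurrent block deleting service run) fails the
   * check the same way a positive one does, so the delete is rejected.
   */
  static boolean handleDeleteCommand(ContainerData data) {
    if (data.blockCount != 0) {
      System.out.println("Rejecting delete: block count = " + data.blockCount);
      return false;
    }
    System.out.println("Deleting container replica");
    return true;
  }

  public static void main(String[] args) {
    handleDeleteCommand(new ContainerData(-3)); // rejected: count went negative
    handleDeleteCommand(new ContainerData(0));  // deleted: count is exactly zero
  }
}
{code}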
These containers are stuck: SCM has them in DELETING state and keeps resending delete commands, but the datanodes reject the deletion and the replicas may still hold valid data. Containers that entered this state on old versions have remained stuck indefinitely, even after the fixes, because the delete commands are driven by SCM's DELETING state for the container, not by the block counts that datanodes report after the fixes. The fixes prevent containers from incorrectly moving from CLOSED to DELETING, but they do nothing for containers already in that state.
Since DELETING containers are not processed by the replication manager, we need a way for SCM to move their state back to CLOSED when a datanode rejects the deletion, so that the effects of HDDS-8129 can be fully recovered from.
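One possible shape of that recovery is sketched below, using illustrative names only (this is not the actual SCM container state machine code): when a datanode reports that it rejected the delete because its replica is non-empty, SCM would transition the container from DELETING back to CLOSED so the replication manager can process it again.

{code:java}
/**
 * Hypothetical sketch of the proposed SCM-side recovery. The enum values
 * mirror the SCM container lifecycle states named in the description, but
 * the classes and methods are illustrative, not the real SCM implementation.
 */
public class DeletingRecoverySketch {

  enum LifeCycleState { CLOSED, DELETING, DELETED }

  static class ContainerInfo {
    LifeCycleState state = LifeCycleState.DELETING;
  }

  /**
   * Invoked when a datanode reports that it rejected a delete command
   * because the replica still holds blocks. Moving the container back to
   * CLOSED lets the replication manager handle it again instead of SCM
   * endlessly resending delete commands.
   */
  static void onDeleteRejected(ContainerInfo container) {
    if (container.state == LifeCycleState.DELETING) {
      container.state = LifeCycleState.CLOSED;
    }
  }

  public static void main(String[] args) {
    ContainerInfo stuck = new ContainerInfo();
    onDeleteRejected(stuck);
    System.out.println("State after recovery: " + stuck.state); // CLOSED
  }
}
{code}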