Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
Description
HDDS-9550 documented a case where containers can be created on SCM but their replicas are never created on datanodes, so the containers are then tracked as missing in the system even though there is no data in them. Since it is tricky to determine whether or not these containers are actually empty from SCM's point of view, pull request 5523 implemented a solution that keeps tracking the containers in SCM, but reports them as empty instead of missing.
In this Jira, I propose a solution that is a bit more involved, but should provide a path for these containers to be cleared from the system safely:
- When SCM first creates the container, it knows the datanode replicas that are supposed to have the container. It should track this information until it gets reports that the container is created, even after the pipeline is closed.
- When the pipeline is either closed gracefully by SCM or fails on the datanode, SCM should send close commands for all affected containers, including these empty ones.
- When a datanode gets a close container command for a container it does not have, it can ack back to the SCM that the container is closed with BCSID=0, block count=0, empty, etc. If the container has data then the normal container flow still applies.
- If the container was never created, SCM will now see it as empty and can then move this container through the regular close and delete flow. A datanode getting a delete command for a container it does not have should be ok.
With this approach, we can re-use the normal delete flow to clean the containers out of the system. This is safe because it requires one round of back and forth between SCM and the datanodes before anything is deleted.
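The proposed flow can be sketched as a small in-memory simulation. This is an illustration only: the `Scm`, `Datanode`, `ContainerInfo`, and `CloseAck` classes and all of their methods are hypothetical names, not Ozone APIs, and direct method calls stand in for the commands and reports that SCM and datanodes would actually exchange.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    OPEN = "OPEN"
    CLOSED = "CLOSED"
    DELETED = "DELETED"


@dataclass
class CloseAck:
    """Hypothetical reply a datanode sends for a close command."""
    container_id: int
    bcsid: int
    block_count: int


class Datanode:
    def __init__(self, name):
        self.name = name
        self.containers = {}  # container_id -> (bcsid, block_count)

    def handle_close(self, container_id):
        # A container this datanode never created is acked as closed
        # and empty (BCSID=0, block count=0) instead of being ignored.
        bcsid, block_count = self.containers.get(container_id, (0, 0))
        return CloseAck(container_id, bcsid, block_count)

    def handle_delete(self, container_id):
        # Deleting a container the datanode does not have is a no-op.
        self.containers.pop(container_id, None)


@dataclass
class ContainerInfo:
    expected: list  # datanodes chosen when SCM created the container
    state: State = State.OPEN
    acks: dict = field(default_factory=dict)


class Scm:
    def __init__(self):
        self.containers = {}

    def create_container(self, container_id, datanodes):
        # Remember the intended replicas, even after the pipeline closes.
        self.containers[container_id] = ContainerInfo(list(datanodes))

    def close_pipeline(self, container_ids):
        # Send close commands for every affected container, empty or not.
        for cid in container_ids:
            info = self.containers[cid]
            for dn in info.expected:
                info.acks[dn.name] = dn.handle_close(cid)
            info.state = State.CLOSED

    def delete_if_empty(self, container_id):
        info = self.containers[container_id]
        all_acked = len(info.acks) == len(info.expected)
        empty = all(a.block_count == 0 for a in info.acks.values())
        if info.state is not State.CLOSED or not (all_acked and empty):
            return False
        for dn in info.expected:
            dn.handle_delete(container_id)
        info.state = State.DELETED
        return True


dns = [Datanode(f"dn{i}") for i in range(3)]
scm = Scm()
scm.create_container(1, dns)  # never materialized on any datanode
scm.create_container(2, dns)
for dn in dns:
    dn.containers[2] = (7, 3)  # container 2 holds data on every replica

scm.close_pipeline([1, 2])
deleted_1 = scm.delete_if_empty(1)
deleted_2 = scm.delete_if_empty(2)
```

In this sketch container 1 is deleted only after every expected datanode has acked it as empty, while container 2 stays closed because its replicas report blocks. That single round of acks is what lets SCM tell a never-created container from one that holds data.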
Issue Links
- relates to: HDDS-9550 Container report shows missing containers when they actually appear empty (Resolved)