Details
- Type: Sub-task
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Versions: 1.0.0, 1.1.0, 1.2.0
Description
When SCM issues a delete command to a datanode, the datanode performs the following steps:
writeLock()
1. The container is removed from the in-memory container set.
writeUnlock()
2. The container metadata directory is recursively deleted.
3. The container chunks directory is recursively deleted.
4. The datanode sets the container's in-memory state to DELETED.
- This is purely for the incremental container report (ICR), since the container is no longer present in the container set.
5. The datanode sends an incremental container report to SCM with the new state.
- The container has been removed from the in-memory set at this point, so once the ICR is sent the container is unreachable.
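The steps above can be sketched as follows. This is a simplified illustration of the ordering problem, not the actual Ozone implementation; the class, field, and method names here are assumptions made for the sketch:

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Simplified sketch of the datanode container delete sequence described
// above. All names are illustrative; this is not the real Ozone code.
public class ContainerDeleteSketch {
    public enum State { OPEN, DELETED }

    // In-memory container set, guarded by a read/write lock.
    public static final Set<Long> containerSet = new HashSet<>();
    public static final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    // The state carried by the last incremental container report (ICR).
    public static State lastReportedState;

    public static void deleteContainer(long containerId) {
        // Step 1: remove from the in-memory set under the write lock.
        lock.writeLock().lock();
        try {
            containerSet.remove(containerId);
        } finally {
            lock.writeLock().unlock();
        }
        // Steps 2-3: recursively delete the metadata and chunks directories.
        // If an IOException is thrown here, the remaining steps never run,
        // yet the container is already gone from the set, so the delete
        // cannot be retried and partial state lingers on disk.
        deleteDirectory("metadata/" + containerId);
        deleteDirectory("chunks/" + containerId);
        // Step 4: mark DELETED purely for the ICR.
        // Step 5: send the ICR; afterwards the container is unreachable.
        lastReportedState = State.DELETED;
    }

    // Placeholder for a recursive filesystem delete.
    static void deleteDirectory(String path) {
    }
}
```

The sketch makes the ordering hazard explicit: the only recoverable bookkeeping (the container set entry) is destroyed before any fallible disk I/O happens.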
In HDDS-6441, a failure in step 2 removed the .container file and the (unused) db.checkpoints directory from the metadata directory, and the remaining steps were not performed after the IOException was thrown during the delete. The resulting partial state caused an error to be logged when it was read on datanode restart.
The current method of deleting containers provides no way to recover from or retry a failed delete, because the container is removed from the in-memory set as the first step. This Jira aims to change the datanode delete steps so that, if a delete fails, the existing SCM container delete retry logic or the datanode itself can eventually remove the lingering state from disk.
Proposed solution v1:
A shareable Google doc outlines a potential solution "to resolve the datanode artifact issue by using a background failedContainerDelete thread that is run on each datanode to cleanup failed container delete transactions":
https://docs.google.com/document/d/1ngRCbA_HxoNOof1kaiDuw0XYjJ2Z7t64ATF-V0TsJ-4/edit?usp=sharing
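The v1 idea of a per-datanode background thread retrying failed container delete transactions might be sketched like this. The class and method names (`FailedContainerDeleteService`, `runOnePass`) are assumptions for illustration only; the real design lives in the linked doc:

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch of v1: a background service on each datanode that periodically
// retries failed container delete transactions. Illustrative names only.
public class FailedContainerDeleteService {
    // Container IDs whose on-disk delete failed and must be retried.
    private final Queue<Long> failedDeletes = new ConcurrentLinkedQueue<>();
    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor();

    // Called when steps 2-3 of a container delete throw an IOException.
    public void recordFailure(long containerId) {
        failedDeletes.add(containerId);
    }

    // One cleanup pass: retry every recorded failed delete. Returns the
    // number of transactions cleaned up in this pass.
    public int runOnePass() {
        int cleaned = 0;
        Long id;
        while ((id = failedDeletes.poll()) != null) {
            // In a real datanode this would re-attempt removal of the
            // container's leftover metadata and chunk directories.
            cleaned++;
        }
        return cleaned;
    }

    // Run a cleanup pass at a fixed interval in the background.
    public void start(long intervalSeconds) {
        scheduler.scheduleWithFixedDelay(
            this::runOnePass, intervalSeconds, intervalSeconds, TimeUnit.SECONDS);
    }

    public void stop() {
        scheduler.shutdown();
    }
}
```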
Proposed solution v2:
Following discussions with Ethan, Ritesh, Sid, and Nanda, an updated solution was proposed based on an atomic rename of containers on container delete. The rename moves the container to a common cleanup path on each disk; the Scrubber service is then modified to delete all files found in the cleanup path. A draft design doc is in the shared Google doc:
https://docs.google.com/document/d/1Xt_x1Uhs4e1vJ6cJgokdlMxI0tRSxNBEkZlI9MXMzMg/edit?usp=sharing
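The v2 approach can be sketched with `java.nio.file` primitives: a single atomic rename stages the container directory under the cleanup path, so a crash leaves the container either fully in place or fully staged for the scrubber. The method names and path layout below are assumptions for illustration; the actual design is in the linked doc:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Comparator;
import java.util.stream.Stream;

// Sketch of v2: atomically rename a container directory into a common
// per-disk cleanup path, then let the scrubber delete whatever it finds
// there. Illustrative names only.
public class AtomicContainerRename {
    // Stage containerDir under cleanupDir with one atomic rename (both
    // paths must be on the same filesystem for ATOMIC_MOVE to succeed).
    public static Path stageForCleanup(Path containerDir, Path cleanupDir)
            throws IOException {
        Files.createDirectories(cleanupDir);
        Path target = cleanupDir.resolve(containerDir.getFileName());
        return Files.move(containerDir, target, StandardCopyOption.ATOMIC_MOVE);
    }

    // Scrubber pass: recursively delete every entry staged in the cleanup path.
    public static void scrub(Path cleanupDir) throws IOException {
        if (!Files.exists(cleanupDir)) {
            return;
        }
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(cleanupDir)) {
            for (Path entry : entries) {
                deleteRecursively(entry);
            }
        }
    }

    // Delete a tree bottom-up: children before their parent directory.
    static void deleteRecursively(Path path) throws IOException {
        try (Stream<Path> walk = Files.walk(path)) {
            walk.sorted(Comparator.reverseOrder())
                .forEach(p -> p.toFile().delete());
        }
    }
}
```

Because the rename is the only step that mutates the live container location, a delete that fails after the rename leaves nothing behind except an entry in the cleanup path, which the scrubber will retry on its next pass.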
Revised solution v2+:
A revised v2+ solution came out of our latest discussions - thanks erose, kerneltime, nanda, swagle:
https://docs.google.com/document/d/1PVgESfYk-V9Jb7yo-qrTmDZ38NezAtADY5RWXmXGACQ/edit?usp=sharing
Issue Links
- causes
  - HDDS-8447 Datanodes should not process container deletes for failed volumes (Resolved)
- is a child of
  - HDDS-8161 Atomic container filesystem operations (Open)
- is a parent of
  - HDDS-6910 Hook background thread into container service (Resolved)
  - HDDS-6915 Stopping the background service (Resolved)
  - HDDS-6917 Perform artifact deletion (Resolved)
  - HDDS-6919 Cleanup manager enhancements (Resolved)
  - HDDS-6911 Start background thread when cleanup service is started (Resolved)
  - HDDS-6912 Cleanup thread is invoked (Resolved)
  - HDDS-6913 Background service should invoke the cleanup at N intervals (Resolved)
  - HDDS-6914 Error handling when the background service is started (Resolved)
  - HDDS-6916 Stopping the background cleanup service (Resolved)
  - HDDS-6918 Check Ozone version on startup of background thread (Resolved)
  - HDDS-6920 Rename artifacts to be deleted to target volume (Resolved)
  - HDDS-6921 Cleanup Manager provides a list of objects in the deletion volume (Resolved)
  - HDDS-6922 Deleting objects from the cleanup volume (Resolved)
- is related to
  - HDDS-3943 Cleanup empty container directory (Open)
- split from
  - HDDS-6441 Ozone metadata does not align with underlying blocks when there are many incomplete uploads happens (Resolved)
- links to