Now that container schema v3 has been implemented, container-level updates such as delete and import require both moving the container directory and editing the container's entries in RocksDB.
Originally in commit bf5b6f5 the container delete steps were:
1. Remove entries from RocksDB
2. Delete container directory
In this implementation, it is possible that the RocksDB update succeeds but the directory delete fails, leaving behind a container directory on disk that is discovered at startup. The datanode would load the container and recalculate only the metadata values (KeyValueContainerUtil#verifyAndFixupContainerData). Delete transaction and block data would be lost, leaving the container corrupted but reported as healthy to SCM until the scanner identifies it.
In HDDS-6449, the steps were changed so that failed directory deletes would not leave broken container directories that the datanode discovers on startup. The deletion steps became:
1. Move container directory to tmp deleted containers directory on the same file system (atomic).
2. Delete DB entries
3. Delete container from tmp directory.
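The three steps above can be sketched roughly as follows. This is a minimal illustration, not the Ozone implementation: the class ContainerDeleteSketch, its directory names, and the "<containerID>|<blockKey>" key layout are hypothetical, and an in-memory map stands in for the per-volume RocksDB instance. The only real API relied on is java.nio.file.Files.move with ATOMIC_MOVE, which fails rather than falling back to a copy if the rename cannot be done atomically.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical sketch of the move-then-delete ordering; not Ozone code. */
public class ContainerDeleteSketch {
  // Stand-in for the per-volume RocksDB; keys are "<containerID>|<blockKey>" (assumed layout).
  private final Map<String, String> db = new TreeMap<>();
  private final Path containerRoot;
  private final Path tmpDeleteDir;

  public ContainerDeleteSketch(Path volumeRoot) throws IOException {
    // Both directories live under one volume root so the rename stays on one file system.
    containerRoot = Files.createDirectories(volumeRoot.resolve("containers"));
    tmpDeleteDir = Files.createDirectories(volumeRoot.resolve("tmp-deleted-containers"));
  }

  public void createContainer(long id, String blockKey, String value) throws IOException {
    Files.createDirectories(containerRoot.resolve(Long.toString(id)).resolve("chunks"));
    db.put(id + "|" + blockKey, value);
  }

  public void deleteContainer(long id) throws IOException {
    Path src = containerRoot.resolve(Long.toString(id));
    Path dst = tmpDeleteDir.resolve(Long.toString(id));
    // Step 1: atomic rename into the tmp deleted containers directory.
    Files.move(src, dst, StandardCopyOption.ATOMIC_MOVE);
    // Step 2: remove the container's DB entries. A failure here leaves the
    // directory in tmp while its entries are still present in the DB.
    db.keySet().removeIf(key -> key.startsWith(id + "|"));
    // Step 3: remove the directory from tmp.
    deleteRecursively(dst);
  }

  public boolean hasDbEntries(long id) {
    return db.keySet().stream().anyMatch(key -> key.startsWith(id + "|"));
  }

  public boolean containerDirExists(long id) {
    return Files.exists(containerRoot.resolve(Long.toString(id)));
  }

  public boolean tmpDirExists(long id) {
    return Files.exists(tmpDeleteDir.resolve(Long.toString(id)));
  }

  private static void deleteRecursively(Path dir) throws IOException {
    List<Path> paths;
    try (Stream<Path> walk = Files.walk(dir)) {
      // Reverse order puts children before their parent directories.
      paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
    }
    for (Path p : paths) {
      Files.delete(p);
    }
  }
}
```

Because step 1 is a single rename, a crash at any point leaves the container either fully in place or fully under the tmp directory, never half-deleted in its original location.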
The tmp deleted containers directory is cleared on datanode startup and shutdown, and this process also clears the corresponding RocksDB entries, which may not have been removed if an error occurred after step 1. This can cause RocksDB data for an active container replica to be deleted incorrectly in the following case:
1. Container 1 is deleted. Rename of the container directory to the delete directory succeeds but DB update fails.
2. Container 1 is re-imported to the same datanode on the same volume. The imported SST files overwrite the old ones in the DB.
3. Datanode is restarted, triggering cleanup of the deleted container directory and RocksDB entries for any containers there.
- This deletes the DB data stored under container ID 1, which now belongs to the active, re-imported container.
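The failure sequence above can be reproduced in a small in-memory model. This is only an illustrative sketch: StaleCleanupScenario, its fields, and the "<containerID>|block1" key are hypothetical stand-ins for the container directories and RocksDB entries, and the failDbUpdate flag simulates the DB error in step 1 of the scenario.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical in-memory model of the scenario; not Ozone code. */
public class StaleCleanupScenario {
  final Set<Long> activeDirs = new HashSet<>();      // container directories in use
  final Set<Long> tmpDeletedDirs = new HashSet<>();  // directories moved to tmp
  final Map<String, String> db = new TreeMap<>();    // stand-in for RocksDB

  void importContainer(long id, String blockData) {
    activeDirs.add(id);
    // Imported data overwrites whatever is stored under the same keys.
    db.put(id + "|block1", blockData);
  }

  void deleteContainer(long id, boolean failDbUpdate) {
    // Rename to the tmp deleted containers directory succeeds.
    activeDirs.remove(id);
    tmpDeletedDirs.add(id);
    if (failDbUpdate) {
      return; // simulated DB failure: entries and tmp directory are left behind
    }
    removeDbEntries(id);
    tmpDeletedDirs.remove(id);
  }

  void cleanupOnStartup() {
    // Cleanup trusts the tmp directory contents and removes matching DB
    // entries, without checking whether the ID is active again.
    for (long id : tmpDeletedDirs) {
      removeDbEntries(id);
    }
    tmpDeletedDirs.clear();
  }

  long dbEntryCount(long id) {
    return db.keySet().stream().filter(k -> k.startsWith(id + "|")).count();
  }

  private void removeDbEntries(long id) {
    db.keySet().removeIf(k -> k.startsWith(id + "|"));
  }

  public static void main(String[] args) {
    StaleCleanupScenario dn = new StaleCleanupScenario();
    dn.importContainer(1, "old data");
    dn.deleteContainer(1, true);       // 1. rename succeeds, DB update fails
    dn.importContainer(1, "new data"); // 2. re-import on the same volume
    dn.cleanupOnStartup();             // 3. restart triggers cleanup
    System.out.println("active directory present: " + dn.activeDirs.contains(1L)); // true
    System.out.println("DB entries for container 1: " + dn.dbEntryCount(1));       // 0
  }
}
```

The model ends with an active container directory whose DB entries are gone, which is the inconsistency described in the last step.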
Container import can have similar issues. We need a standardized process that keeps DB and directory updates consistent and recovers from failures between the two operations.