Details
Bug
Status: Resolved
Major
Resolution: Fixed
Description
We enabled the full data scan and found that a container was marked as unhealthy because of a conflict between the full data scan and block deletion.
The block deleting service first deletes the block and then updates the DB, while the data scan first scans the DB and then checks the existence of the blocks.
If the scanner reads the DB record and then finds that the block does not exist on the FS, a `Missing chunk file` exception is thrown and the container is marked as unhealthy.
The block deleting service holds a write lock during the process, but the data scan takes no read lock, so nothing prevents the conflict.
Double-checking does not solve the problem: even if the scanner, on first failing to find the block on the FS, checks again whether the block is still in the block-data table, the false positive still occurs. The flush time of a DB batch operation is not predictable, so an immediate second retrieval is not a reliable solution, because no fixed delay can guarantee that every batch has been flushed by the time the delay expires.
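The ordering problem described above can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions: `ScanDeleteRace` and all of its methods are hypothetical names, the two in-memory sets stand in for the block-data table and the chunk files, and the DB update is assumed to complete before the write lock is released. It is not the Ozone implementation and not necessarily the fix that was applied.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical model of the scan/delete race; these are not Ozone classes. */
public class ScanDeleteRace {
    // Stand-ins for the block-data table (DB) and the chunk files (FS).
    private final Set<Long> db = ConcurrentHashMap.newKeySet();
    private final Set<Long> fs = ConcurrentHashMap.newKeySet();
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    public void putBlock(long id) {
        db.add(id);
        fs.add(id);
    }

    // The two deletion steps, in the order given in the description:
    // the block file is removed first, the DB record second.
    public void deleteFileOnly(long id) {
        fs.remove(id);
    }

    public void deleteDbRecord(long id) {
        db.remove(id);
    }

    /** Deletion as the description states it: both steps under the write lock. */
    public void deleteBlock(long id) {
        lock.writeLock().lock();
        try {
            deleteFileOnly(id);
            deleteDbRecord(id); // assumption: the DB update is visible before unlock
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** Scan without a lock: returns false ("unhealthy") if a DB record has no file. */
    public boolean scanUnlocked() {
        for (long id : db) {
            if (!fs.contains(id)) {
                return false; // corresponds to the "Missing chunk file" case
            }
        }
        return true;
    }

    /** Scan under the read lock: it cannot observe a half-finished deleteBlock. */
    public boolean scanWithReadLock() {
        lock.readLock().lock();
        try {
            return scanUnlocked();
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

In this model, an unlocked scan that runs between `deleteFileOnly` and `deleteDbRecord` reports the container as unhealthy even though nothing is corrupt, while a scan that takes the read lock waits until `deleteBlock` has finished both steps. Whether a read lock is sufficient in practice depends on the DB batch being flushed before the write lock is released, which the description suggests is not guaranteed.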
The log trace:
- 2022-09-30 16:07:38,535 BlockDeletingService#5 INFO org.apache.hadoop.ozone.container.keyvalue.impl.FilePerBlockStrategy: Deleted block file: /data11/ozone-ec/data/storage/hdds/CID-9090d68f-eb34-44f0-b54f-10df5e42a347/current/containerDir12/6595/chunks/109611004723333878.block
- 2022-09-30 16:07:39,244 [ContainerDataScanner(/data11/ozone-ec/data/storage/hdds)] ERROR org.apache.hadoop.ozone.container.keyvalue.KeyValueContainerCheck: Corruption detected in container: [6595] Exception: [Missing chunk file /data11/ozone-ec/data/storage/hdds/CID-9090d68f-eb34-44f0-b54f-10df5e42a347/current/containerDir12/6595/chunks/109611004723333878.block]
- 2022-09-30 16:07:39,545 [ContainerDataScanner(/data11/ozone-ec/data/storage/hdds)] WARN org.apache.hadoop.ozone.container.keyvalue.KeyValueContainer: Moving container /data11/ozone-ec/data/storage/hdds/CID-9090d68f-eb34-44f0-b54f-10df5e42a347/current/containerDir12/6595 to state UNHEALTHY from state:UNHEALTHY Trace:java.lang.Thread.getStackTrace(Thread.java:1559) org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1060) org.apache.hadoop.ozone.container.keyvalue.KeyValueContainer.markContainerUnhealthy(KeyValueContainer.java:340) org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.markContainerUnhealthy(KeyValueHandler.java:1017) org.apache.hadoop.ozone.container.ozoneimpl.ContainerController.markContainerUnhealthy(ContainerController.java:116) org.apache.hadoop.ozone.container.ozoneimpl.ContainerDataScanner.scanContainer(ContainerDataScanner.java:72) org.apache.hadoop.ozone.container.ozoneimpl.AbstractContainerScanner.scanContainers(AbstractContainerScanner.java:99) org.apache.hadoop.ozone.container.ozoneimpl.AbstractContainerScanner.runIteration(AbstractContainerScanner.java:74)