Apache Ozone / HDDS-7300

Race condition between full data scan and block deletion


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • None
    • 1.3.0
    • None

    Description

      We enabled the full data scan and found that a container was marked as unhealthy because of a race between the full data scan and block deletion.

      The block deleting service first deletes the block file and then updates the DB, while the data scan first reads the DB and then checks that the block files exist.

      If the scanner reads a DB record and then finds that the block no longer exists on the FS, a `Missing chunk file` exception is thrown and the container is marked as unhealthy.
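
The interleaving described above can be reproduced with a small single-threaded model. This is only a sketch: the class, its two sets (standing in for the block-data table and the chunks directory), and all method names are hypothetical and are not Ozone APIs.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical model of the race; none of these names are Ozone APIs. */
final class ScanDeleteRaceSketch {
    // Stand-ins for the container's block-data table and its chunk files.
    private final Set<String> dbRecords = new HashSet<>();
    private final Set<String> filesOnDisk = new HashSet<>();

    void putBlock(String id) {
        dbRecords.add(id);
        filesOnDisk.add(id);
    }

    // Deleter, step 1: remove the block file.
    void deleteFile(String id) {
        filesOnDisk.remove(id);
    }

    // Deleter, step 2: remove the DB record (happens later).
    void deleteRecord(String id) {
        dbRecords.remove(id);
    }

    // Scanner: read the DB record first, then check that the file exists.
    boolean scanFindsMissingFile(String id) {
        return dbRecords.contains(id) && !filesOnDisk.contains(id);
    }

    public static void main(String[] args) {
        ScanDeleteRaceSketch c = new ScanDeleteRaceSketch();
        c.putBlock("b1");
        c.deleteFile("b1");                              // deleter, step 1
        boolean flagged = c.scanFindsMissingFile("b1");  // scanner runs inside the window
        c.deleteRecord("b1");                            // deleter, step 2
        System.out.println("flagged during deletion: " + flagged);
        System.out.println("flagged after deletion: " + c.scanFindsMissingFile("b1"));
    }
}
```

The scanner only reports a missing file while the deletion is half done; once both steps have completed, the same check passes.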

       

      The block deleting service holds a write lock during the deletion, but the data scan takes no read lock, so nothing prevents the conflict.
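
One possible way to close the window is for the scanner to take the matching read lock around its DB lookup and file check. The sketch below assumes a per-container `ReentrantReadWriteLock`; the class and method names are hypothetical, and this is not necessarily how the issue was fixed in Ozone.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical model; the lock stands in for a per-container read-write lock. */
final class LockedScanSketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final Set<String> dbRecords = ConcurrentHashMap.newKeySet();
    private final Set<String> filesOnDisk = ConcurrentHashMap.newKeySet();

    void putBlock(String id) {
        dbRecords.add(id);
        filesOnDisk.add(id);
    }

    // Deleter: both steps happen under the write lock.
    void deleteBlock(String id, CountDownLatch fileDeleted) {
        lock.writeLock().lock();
        try {
            filesOnDisk.remove(id);
            fileDeleted.countDown(); // the scanner is released in the middle of the deletion
            dbRecords.remove(id);
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Scanner: the DB lookup and the file check happen under the read lock.
    boolean scanFindsMissingFile(String id) {
        lock.readLock().lock();
        try {
            return dbRecords.contains(id) && !filesOnDisk.contains(id);
        } finally {
            lock.readLock().unlock();
        }
    }

    // Starts a deleter thread and scans as soon as the file is gone.
    static boolean runOnce() {
        LockedScanSketch c = new LockedScanSketch();
        c.putBlock("b1");
        CountDownLatch fileDeleted = new CountDownLatch(1);
        Thread deleter = new Thread(() -> c.deleteBlock("b1", fileDeleted));
        deleter.start();
        try {
            fileDeleted.await();
            boolean flagged = c.scanFindsMissingFile("b1");
            deleter.join();
            return flagged;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("flagged with read lock: " + runOnce());
    }
}
```

Because the read lock cannot be acquired until the write lock is released, the scanner never observes the state in which the file is gone but the record remains. The trade-off is that a scan blocks deletions on the same container while it holds the read lock.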

      Double-checking does not help either: even if the scanner, on first failing to find the block on the FS, checks again whether the block is still in the block-data table, the problem still occurs. The flush time of a DB batch operation is not predictable, so an immediate second lookup is not a reliable solution, because we cannot choose a fixed delay that guarantees every batch has been flushed by the time it elapses.
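
The failure of the double check can be sketched by modelling the DB update as a batch that becomes visible only when it is flushed. Everything here is hypothetical (the class, the pending-delete list, and `flushBatch` are illustrative stand-ins, not Ozone or RocksDB APIs).

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical model of a batched DB update that has not been flushed yet. */
final class DoubleCheckSketch {
    private final Set<String> committedRecords = new HashSet<>();
    private final List<String> pendingDeletes = new ArrayList<>();
    private final Set<String> filesOnDisk = new HashSet<>();

    void putBlock(String id) {
        committedRecords.add(id);
        filesOnDisk.add(id);
    }

    // Deleter: the file goes away now, the record delete is only queued in a batch.
    void deleteBlock(String id) {
        filesOnDisk.remove(id);
        pendingDeletes.add(id);
    }

    // The batch becomes visible to readers at some later, unpredictable time.
    void flushBatch() {
        committedRecords.removeAll(pendingDeletes);
        pendingDeletes.clear();
    }

    // Scanner with a second DB lookup after the file is found missing.
    boolean scanWithDoubleCheck(String id) {
        if (!committedRecords.contains(id)) {
            return false; // no record, nothing to verify
        }
        if (filesOnDisk.contains(id)) {
            return false; // file present, block is fine
        }
        // Second lookup: still sees the record while the batch is unflushed.
        return committedRecords.contains(id);
    }

    public static void main(String[] args) {
        DoubleCheckSketch c = new DoubleCheckSketch();
        c.putBlock("b1");
        c.deleteBlock("b1");
        System.out.println("flagged before flush: " + c.scanWithDoubleCheck("b1"));
        c.flushBatch();
        System.out.println("flagged after flush: " + c.scanWithDoubleCheck("b1"));
    }
}
```

In this model the second lookup only helps if it happens after `flushBatch()`, and the scanner has no way of knowing when that is.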

       

      The log trace:

      • 2022-09-30 16:07:38,535 BlockDeletingService#5 INFO org.apache.hadoop.ozone.container.keyvalue.impl.FilePerBlockStrategy: Deleted block file: /data11/ozone-ec/data/storage/hdds/CID-9090d68f-eb34-44f0-b54f-10df5e42a347/current/containerDir12/6595/chunks/109611004723333878.block

       

      • 2022-09-30 16:07:39,244 [ContainerDataScanner(/data11/ozone-ec/data/storage/hdds)] ERROR org.apache.hadoop.ozone.container.keyvalue.KeyValueContainerCheck: Corruption detected in container: [6595] Exception: [Missing chunk file /data11/ozone-ec/data/storage/hdds/CID-9090d68f-eb34-44f0-b54f-10df5e42a347/current/containerDir12/6595/chunks/109611004723333878.block]

       

      • 2022-09-30 16:07:39,545 [ContainerDataScanner(/data11/ozone-ec/data/storage/hdds)] WARN org.apache.hadoop.ozone.container.keyvalue.KeyValueContainer: Moving container /data11/ozone-ec/data/storage/hdds/CID-9090d68f-eb34-44f0-b54f-10df5e42a347/current/containerDir12/6595 to state UNHEALTHY from state:UNHEALTHY Trace:
        java.lang.Thread.getStackTrace(Thread.java:1559)
        org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1060)
        org.apache.hadoop.ozone.container.keyvalue.KeyValueContainer.markContainerUnhealthy(KeyValueContainer.java:340)
        org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.markContainerUnhealthy(KeyValueHandler.java:1017)
        org.apache.hadoop.ozone.container.ozoneimpl.ContainerController.markContainerUnhealthy(ContainerController.java:116)
        org.apache.hadoop.ozone.container.ozoneimpl.ContainerDataScanner.scanContainer(ContainerDataScanner.java:72)
        org.apache.hadoop.ozone.container.ozoneimpl.AbstractContainerScanner.scanContainers(AbstractContainerScanner.java:99)
        org.apache.hadoop.ozone.container.ozoneimpl.AbstractContainerScanner.runIteration(AbstractContainerScanner.java:74)

       

            People

              Nibiruxu Xu Shao Hong