  1. Apache Ozone
  2. HDDS-6235

Empty KeyValueContainers are replicated without chunks directory


Details

    Description

      An empty KeyValueContainer has an empty chunks directory. TarContainerPacker#pack recurses into directories and adds the files it finds to the tar archive, but if the chunks directory is empty, the directory itself is not included in the tar. The receiver unpacks the tar successfully, but the resulting container has no chunks directory. After this, the container cannot be replicated further, because the tar packing step requires all container pieces to be present on disk. This issue is more likely to occur due to HDDS-5359, which causes many empty containers to be tracked by SCM indefinitely.
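To illustrate the packing behavior described above, here is a minimal sketch of a directory walk that emits an entry for every directory as well as every file, so an empty directory still appears in the entry list. It is not the TarContainerPacker code: the class name `ArchiveEntryLister` and the method `listEntries` are hypothetical, and the sketch only computes entry names with the JDK rather than writing a real tar.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper, not part of Ozone: lists the entries an archive should contain. */
public class ArchiveEntryLister {

  /**
   * Returns entry names relative to root. Directories are listed explicitly
   * with a trailing '/', so a directory with no files still gets an entry.
   */
  public static List<String> listEntries(Path root) throws IOException {
    List<Path> paths;
    try (Stream<Path> walk = Files.walk(root)) {
      paths = walk.sorted().collect(Collectors.toList());
    }
    List<String> entries = new ArrayList<>();
    for (Path p : paths) {
      if (p.equals(root)) {
        continue; // the root itself is not an entry
      }
      String name = root.relativize(p).toString().replace('\\', '/');
      // A packer that adds only files would skip this branch for directories
      // and silently drop the empty ones.
      entries.add(Files.isDirectory(p) ? name + "/" : name);
    }
    return entries;
  }
}
```

A packer that builds its entry list this way would write a directory entry for chunks even when the directory holds no files, so the receiver would recreate it on unpack.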

      Since the issue only affects empty containers, there does not appear to be any data loss risk, even though the container scanner may detect it as "corruption". The issue may manifest as the container being marked unhealthy by the background container scanner (if it is enabled), or as a container whose replication is attempted repeatedly and fails each time. In the latter case, logs like this may be observed on the receiver of an import:

      2020-06-23 14:11:20,504 [grpc-default-executor-111] INFO org.apache.hadoop.ozone.container.replication.GrpcReplicationClient: Container 206 is downloaded to /tmp/container-copy/container-206.tar.gz
      2020-06-23 14:11:20,505 [ContainerReplicationThread-0] INFO org.apache.hadoop.ozone.container.replication.DownloadAndImportReplicator: Container 206 is downloaded, starting to import.
      2020-06-23 14:11:20,616 [ContainerReplicationThread-0] ERROR org.apache.hadoop.ozone.container.replication.DownloadAndImportReplicator: Can't import the downloaded container data id=206
      java.io.IOException: Container descriptor is missing from the container archive.
      

      This happens because the sender stopped packing contents into the container archive when it found the chunks directory missing, so it never added the .container file. The send happens anyway, but the receiver tries to unpack the .container file first, and aborts when it sees that the file is not there.

      This Jira will fix the issue with the tar packer, and also add a repair step on datanode startup to create the chunks directory for containers that do not have one. This step should be a quick addition to datanode startup that already iterates all the containers, and should not impact startup time.
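The startup repair step could look roughly like the sketch below. This is an assumption-laden illustration, not the actual patch: the class `ChunksDirRepair`, the method `ensureChunksDir`, and the directory name passed to `resolve` are hypothetical stand-ins for whatever the datanode uses internally.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper, not part of Ozone: recreates a missing chunks directory. */
public class ChunksDirRepair {

  /**
   * Creates the chunks directory under the given container directory if it
   * is absent. Returns true when a repair was made, false when the
   * directory was already present.
   */
  public static boolean ensureChunksDir(Path containerDir) throws IOException {
    // "chunks" is an assumed directory name for this sketch.
    Path chunks = containerDir.resolve("chunks");
    if (Files.isDirectory(chunks)) {
      return false; // nothing to repair
    }
    Files.createDirectories(chunks);
    return true;
  }
}
```

Calling such a method once per container while the datanode is already iterating its containers adds one existence check per container, which matches the expectation that startup time is not noticeably affected.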

            People

              erose Ethan Rose