Details
- Bug
- Status: Resolved
- Major
- Resolution: Fixed
- 1.0.0, 1.1.0, 1.2.0
Description
An empty KeyValueContainer has an empty chunks directory. TarContainerPacker#pack recurses into directories and adds the files it finds to the tar, but an empty chunks directory contains no files, so it is not included in the tar. The receiver unpacks the tar successfully, but the resulting container has no chunks directory. After this, the container cannot be replicated further, because the tar packing step requires all container pieces to be present on disk. This issue is more likely to occur due to HDDS-5359, which causes many empty containers to be tracked by SCM indefinitely.
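The effect of a files-only recursion can be sketched without any tar library. The class below is a hypothetical illustration, not the TarContainerPacker code: `ArchiveEntrySketch`, `listEntries`, and the directory layout used with it are assumptions made for the example. It only lists the entry names a packer would emit, with and without explicit directory entries.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch: models which entry names a recursive packer would write.
class ArchiveEntrySketch {

    // Lists entry names relative to root. With includeDirs == false only
    // regular files are emitted, so an empty directory leaves no entry at all.
    static List<String> listEntries(Path root, boolean includeDirs) throws IOException {
        List<String> entries = new ArrayList<>();
        collect(root, root, includeDirs, entries);
        return entries;
    }

    private static void collect(Path root, Path dir, boolean includeDirs,
                                List<String> out) throws IOException {
        List<Path> children;
        try (Stream<Path> stream = Files.list(dir)) {
            // Sort for a deterministic entry order.
            children = stream.sorted().collect(Collectors.toList());
        }
        for (Path child : children) {
            String name = root.relativize(child).toString().replace('\\', '/');
            if (Files.isDirectory(child)) {
                if (includeDirs) {
                    // Emit the directory itself, so it survives even when empty.
                    out.add(name + "/");
                }
                collect(root, child, includeDirs, out);
            } else {
                out.add(name);
            }
        }
    }
}
```

For a container directory holding an empty `chunks` directory and a single file under `metadata`, `listEntries(root, false)` returns only the metadata file, while `listEntries(root, true)` also returns `chunks/` and `metadata/`. A real packer would presumably write a directory entry to the tar at the point where this sketch adds the name.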
Since the issue only affects empty containers, there does not appear to be any data loss risk, even though the container scanner may detect it as "corruption". The issue may manifest as the container being marked unhealthy by the background container scanner (if it is enabled), or as a container that continuously attempts to replicate and fails. In the latter case, logs like this may be observed on the receiver of an import:
2020-06-23 14:11:20,504 [grpc-default-executor-111] INFO org.apache.hadoop.ozone.container.replication.GrpcReplicationClient: Container 206 is downloaded to /tmp/container-copy/container-206.tar.gz
2020-06-23 14:11:20,505 [ContainerReplicationThread-0] INFO org.apache.hadoop.ozone.container.replication.DownloadAndImportReplicator: Container 206 is downloaded, starting to import.
2020-06-23 14:11:20,616 [ContainerReplicationThread-0] ERROR org.apache.hadoop.ozone.container.replication.DownloadAndImportReplicator: Can't import the downloaded container data id=206
java.io.IOException: Container descriptor is missing from the container archive.
This happens because the sender stopped packing contents into the container archive when it found the chunks directory missing, so it never added the .container file. The send happens anyway, but the receiver tries to unpack the .container file first, and aborts when it sees that the file is not there.
This Jira will fix the issue with the tar packer, and also add a repair step on datanode startup to create the chunks directory for containers that do not have one. Datanode startup already iterates over all the containers, so this step should be a quick addition and should not impact startup time.
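A minimal sketch of such a repair pass follows, assuming each container lives in its own directory under a common root and that the missing piece is a subdirectory named `chunks`. The class and method names are hypothetical, and this is not the datanode's actual startup code.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch of a startup repair pass over container directories.
class ChunksDirRepairSketch {

    // Creates a "chunks" directory in every container directory that lacks one.
    // Returns the number of containers that were repaired.
    static int repairMissingChunksDirs(Path containersRoot) throws IOException {
        List<Path> containerDirs;
        try (Stream<Path> stream = Files.list(containersRoot)) {
            containerDirs = stream.filter(Files::isDirectory).collect(Collectors.toList());
        }
        int repaired = 0;
        for (Path containerDir : containerDirs) {
            Path chunks = containerDir.resolve("chunks");
            if (Files.notExists(chunks)) {
                Files.createDirectories(chunks);
                repaired++;
            }
        }
        return repaired;
    }
}
```

The pass is idempotent: a second run over the same root finds every chunks directory present and repairs nothing, which is why it can run on every startup at the cost of one existence check per container.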
Issue Links
- causes
  - HDDS-6301 containerDir left over after volume/bucket/key deletion (Resolved)
- is duplicated by
  - HDDS-1493 Download and Import Container replicator fails. (Resolved)
  - HDDS-3852 Failed to import replicated container (Resolved)
  - HDDS-5149 when source datanode download container tar from target datanode,but the target datanode container file missing,import error (Resolved)
  - HDDS-5150 if container chunks is missing,the datanode restart the container reader will not verfiy the chunks directory missing (Resolved)
- is related to
  - HDDS-5359 Incorrect BLOCKCOUNT and BYTESUSED in container DB (Resolved)
  - HDDS-5548 Keep downloaded container .gz.tar file for debug purpose (Open)
- links to