Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Duplicate
Description
We use distcp with the -update option to copy a directory from HDFS to S3. When we run the distcp job again, it overwrites the S3 directory instead of skipping files that have not changed.
Test Case:
Run the following command twice; the modification time of the S3 files changes on every run.
hadoop distcp -update /test/ s3a://${s3_bucket}/test/
Checking the code in CopyMapper.java and S3AFileSystem.java:
(1) On the first run, the distcp job creates the files in S3, but the blockSize is never used.
(2) On the second run, the distcp job compares the fileSize and blockSize of each HDFS file with the corresponding S3 file.
(3) Because the blockSize was never stored, getting the blockSize of an S3 file just returns a default value.
In S3AFileSystem.java, we find that the default value of fs.s3a.block.size is 32 * 1024 * 1024.
The HDFS blockSize has no real meaning in an object store such as S3, so I think there is no need to compare blockSize when running distcp with the -update option.
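To illustrate the effect, here is a minimal, self-contained sketch (the canSkip name and the concrete values are assumptions for illustration, not the actual CopyMapper code): when the skip decision compares both fileSize and blockSize, the S3A target always reports the fs.s3a.block.size default, so the check fails and the file is copied again even though its content is unchanged.

public class UpdateSkipSketch {
  // Hypothetical skip rule: skip the copy only if both length and block size match.
  static boolean canSkip(long srcLen, long srcBlockSize, long tgtLen, long tgtBlockSize) {
    return srcLen == tgtLen && srcBlockSize == tgtBlockSize;
  }

  public static void main(String[] args) {
    long fileLen = 1_000_000L;                // identical content on both sides
    long hdfsBlockSize = 128L * 1024 * 1024;  // typical dfs.blocksize on the source (assumed)
    long s3aBlockSize = 32L * 1024 * 1024;    // fs.s3a.block.size default noted above

    // Lengths match, block sizes never do, so distcp -update re-copies the file.
    System.out.println("can skip? " + canSkip(fileLen, hdfsBlockSize, fileLen, s3aBlockSize));
    // prints: can skip? false
  }
}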
Attachments
Issue Links
- is caused by: HADOOP-8143 Change distcp to have -pb on by default (Resolved)
- is duplicated by: HADOOP-16756 distcp -update to S3A; abfs, etc always overwrites due to block size mismatch (Resolved)