Details
- Sub-task
- Status: Resolved
- Major
- Resolution: Fixed
- 3.3.0
- None
Description
DistCp over S3A always copies all source files, whether or not the files have changed. This contradicts the following statement in the documentation:
http://hadoop.apache.org/docs/current/hadoop-distcp/DistCp.html
"And to use -update to only copy changed files."
CopyMapper compares the file length as well as the block size before copying. The file lengths should match, but the block sizes do not, apparently because the block size returned by S3A is always 32MB.
I suppose we should either update the documentation or change the code.
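To make the mismatch described above concrete, here is a minimal sketch of a skip check that compares length and block size. It is not the actual CopyMapper code: the class `DistCpSkipSketch`, the `FileMeta` type, the `canSkip` method, and the `compareBlockSize` switch are all hypothetical names used for illustration, and the 128MB source block size is an assumed HDFS setting. Only the 32MB value comes from the report.

```java
// Illustrative sketch only: NOT the actual CopyMapper source.
final class DistCpSkipSketch {

    static final long MB = 1024L * 1024L;

    /** Hypothetical stand-in for the metadata a filesystem reports for a file. */
    static final class FileMeta {
        final long length;
        final long blockSize;

        FileMeta(long length, long blockSize) {
            this.length = length;
            this.blockSize = blockSize;
        }
    }

    /**
     * Returns true when the target looks identical to the source and the copy
     * can be skipped. When compareBlockSize is true, a block size mismatch
     * alone is enough to force a copy.
     */
    static boolean canSkip(FileMeta source, FileMeta target, boolean compareBlockSize) {
        boolean sameLength = source.length == target.length;
        boolean sameBlockSize = !compareBlockSize || source.blockSize == target.blockSize;
        return sameLength && sameBlockSize;
    }

    public static void main(String[] args) {
        // Assumption: the source lives on HDFS with a 128MB block size.
        FileMeta onHdfs = new FileMeta(10 * MB, 128 * MB);
        // The report says S3A always returns 32MB as the block size.
        FileMeta onS3a = new FileMeta(10 * MB, 32 * MB);

        // Same length, different block size: the file is copied again.
        System.out.println(canSkip(onHdfs, onS3a, true));  // false
        // Ignoring block size, the unchanged file would be skipped.
        System.out.println(canSkip(onHdfs, onS3a, false)); // true
    }
}
```

Under these assumptions, an unchanged file is never skipped as long as the block size takes part in the comparison, which matches the behaviour reported here.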
Issue Links
- duplicates: HADOOP-17256 DistCp -update option will be invalid when distcp files from hdfs to S3 (Resolved)
- is broken by: HADOOP-8143 Change distcp to have -pb on by default (Resolved)
- is duplicated by: HADOOP-15300 distcp -update to WASB and ADL copies up all the files, always (Resolved)
- is related to: HADOOP-16932 distcp copy calls getFileStatus() needlessly and can fail against S3 (Resolved)