Details
- Bug
- Status: Closed
- Major
- Resolution: Fixed
- 2.0.3-alpha, 0.23.5
- None
- Reviewed
Description
When DistCp copies files between two clusters with different default block-sizes, the copy fails with a checksum-mismatch, even though the files have identical contents.
The reason is that on HDFS, a file's checksum is unfortunately a function of the file's block-size. So two files with identical contents but different block-sizes can have different checksums. (Thus, DistCp can also fail to copy files within the same file-system, if the source-file's block-size differs from the HDFS default and -pb isn't used.)
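To illustrate why the block-size leaks into the checksum, here is a minimal sketch of a composite checksum built in three layers: a CRC per fixed-size chunk, an MD5 over the CRCs of each block, and an MD5 over the per-block MD5s. This is a simplified model written for this ticket, not HDFS's actual implementation; the class name, the method name, and the 512-byte chunk size are assumptions.

```java
import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.zip.CRC32;

// Simplified model of a block-structured file checksum (hypothetical, not Hadoop code).
final class BlockChecksumSketch {
    // Assumed number of bytes covered by each CRC.
    static final int BYTES_PER_CRC = 512;

    // MD5 over per-block MD5s, where each block MD5 covers that block's chunk CRCs.
    static byte[] fileChecksum(byte[] data, int blockSize) {
        MessageDigest fileMd5 = newMd5();
        for (int blockStart = 0; blockStart < data.length; blockStart += blockSize) {
            int blockEnd = Math.min(blockStart + blockSize, data.length);
            MessageDigest blockMd5 = newMd5();
            for (int off = blockStart; off < blockEnd; off += BYTES_PER_CRC) {
                int len = Math.min(BYTES_PER_CRC, blockEnd - off);
                CRC32 crc = new CRC32();
                crc.update(data, off, len);
                // Feed the 32-bit CRC of this chunk into the block digest.
                blockMd5.update(ByteBuffer.allocate(4).putInt((int) crc.getValue()).array());
            }
            // Block boundaries decide how the CRCs are grouped, so they affect the result.
            fileMd5.update(blockMd5.digest());
        }
        return fileMd5.digest();
    }

    private static MessageDigest newMd5() {
        try {
            return MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = new byte[4096];
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) i;
        }
        byte[] small = fileChecksum(data, 1024);
        byte[] large = fileChecksum(data, 2048);
        System.out.println("Checksums equal: " + Arrays.equals(small, large));
    }
}
```

In this model the same bytes are grouped into four blocks in one case and two in the other, so the outer digest is computed over different inputs even though the file contents are identical.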
I propose that we skip checksum comparisons under the following conditions:
1. -skipCrc is specified.
2. File-size is 0 (in which case the call to the checksum-servlet is moot).
3. source.getBlockSize() != target.getBlockSize(), since the checksums are guaranteed to differ in this case.
I have a patch for #3.
Edit: I've modified the fix to warn the user (instead of skipping the checksum-check). Skipping parity-checks is unsafe. The code now fails the copy, and suggests that the user either use -pb to preserve block-size, or consider -skipCrc (and forgo copy validation entirely).
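The behaviour described in the edit can be sketched roughly as follows. This is only an illustration of the decision logic, not the patch itself: `FileInfo`, `mismatchMessage`, `verifyCopy`, and the wording of the error message are all hypothetical, and checksums are modelled as plain strings.

```java
import java.io.IOException;

// Hypothetical sketch of the post-copy validation described above (not the actual patch).
final class ChecksumCheckSketch {
    // Minimal stand-in for the file metadata the check needs.
    static final class FileInfo {
        final long length;
        final long blockSize;
        final String checksum;

        FileInfo(long length, long blockSize, String checksum) {
            this.length = length;
            this.blockSize = blockSize;
            this.checksum = checksum;
        }
    }

    // Returns null when the copy passes validation, otherwise a failure message.
    static String mismatchMessage(FileInfo source, FileInfo target, boolean skipCrc) {
        // The comparison is skipped only when the user opts out or the file is empty.
        if (skipCrc || source.length == 0) {
            return null;
        }
        if (source.checksum.equals(target.checksum)) {
            return null;
        }
        StringBuilder msg = new StringBuilder("Checksum mismatch between source and target.");
        if (source.blockSize != target.blockSize) {
            // Differing block-sizes are a likely cause, so point the user at the options.
            msg.append(" Source and target differ in block-size.")
               .append(" Use -pb to preserve block-sizes during the copy,")
               .append(" or -skipCrc to forgo copy validation entirely.");
        }
        return msg.toString();
    }

    // Fails the copy instead of silently skipping the check.
    static void verifyCopy(FileInfo source, FileInfo target, boolean skipCrc) throws IOException {
        String failure = mismatchMessage(source, target, skipCrc);
        if (failure != null) {
            throw new IOException(failure);
        }
    }
}
```

Keeping the comparison in place when block-sizes differ means a real corruption is still reported as a failure; the block-size hint only explains the most likely benign cause.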