Hadoop Map/Reduce
MAPREDUCE-5065

DistCp should skip checksum comparisons if block-sizes are different on source/target.



    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.3-alpha, 0.23.5
    • Fix Version/s: 2.1.0-beta, 0.23.8
    • Component/s: distcp
    • Labels: None
    • Hadoop Flags: Reviewed


      When copying files between two clusters with different default block-sizes, the copy fails with a checksum mismatch, even though the files have identical contents.

      The reason is that on HDFS, a file's checksum is unfortunately a function of the file's block-size: HDFS aggregates per-block digests into a composite file checksum, so two files with identical contents but different block-sizes have different checksums. (Thus, it's also possible for DistCp to fail to copy files on the same file-system, if the source file's block-size differs from the HDFS default and -pb isn't used.)
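To illustrate why a composite checksum depends on block boundaries, here is a minimal sketch (not HDFS code; `BlockChecksumDemo` and `compositeDigest` are hypothetical names) that hashes each block and then hashes the concatenation of the per-block hashes, analogous to HDFS's per-block aggregation:

```java
import java.security.MessageDigest;
import java.util.Arrays;

// Toy model of a block-aware composite checksum: identical bytes,
// different block boundaries => different aggregate digest.
public class BlockChecksumDemo {
    static byte[] compositeDigest(byte[] data, int blockSize) throws Exception {
        MessageDigest outer = MessageDigest.getInstance("MD5");
        for (int off = 0; off < data.length; off += blockSize) {
            int end = Math.min(off + blockSize, data.length);
            // Digest one block, then feed the block-hash into the outer digest.
            byte[] blockHash = MessageDigest.getInstance("MD5")
                    .digest(Arrays.copyOfRange(data, off, end));
            outer.update(blockHash);
        }
        return outer.digest();
    }

    public static void main(String[] args) throws Exception {
        byte[] data = new byte[1000];  // identical contents in both cases
        boolean same = Arrays.equals(compositeDigest(data, 128),
                                     compositeDigest(data, 256));
        System.out.println("Same composite checksum? " + same);  // prints false
    }
}
```

With block-size 128 the data is split into 8 blocks, with 256 into 4, so the sequences of block-hashes (and hence the outer digests) differ even though the bytes are identical.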

      I propose that we skip checksum comparisons under the following conditions:
      1. -skipCrc is specified.
      2. File-size is 0 (in which case the call to the checksum-servlet is moot).
      3. source.getBlockSize() != target.getBlockSize(), since the checksums are guaranteed to differ in this case.

      I have a patch for #3.

      Edit: I've modified the fix to fail the copy with a warning, rather than silently skipping the checksum-check, since skipping parity-checks is unsafe. The error message now suggests that the user either use -pb to preserve block-size, or use -skipCrc (and forgo copy validation entirely).
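The decision logic described above (conditions 1 to 3 plus the amended fail-instead-of-skip behaviour) can be sketched as follows. This is a hypothetical standalone model, not the actual DistCp patch; the class, method, and parameter names are invented for illustration:

```java
// Sketch of the checksum-comparison decision described in this issue.
public class ChecksumCheckSketch {

    enum Outcome { SKIP, COMPARE, FAIL }

    /**
     * Decide what to do before comparing source/target checksums.
     * @param srcLen       source file length in bytes
     * @param srcBlockSize source file's block-size
     * @param tgtBlockSize target file's block-size
     * @param skipCrc      user passed -skipCrc
     * @param preserveBlk  user passed -pb (block-size preserved on copy)
     */
    static Outcome decide(long srcLen, long srcBlockSize, long tgtBlockSize,
                          boolean skipCrc, boolean preserveBlk) {
        if (skipCrc) return Outcome.SKIP;       // condition 1: explicit opt-out
        if (srcLen == 0) return Outcome.SKIP;   // condition 2: nothing to checksum
        if (!preserveBlk && srcBlockSize != tgtBlockSize) {
            // Condition 3: checksums are guaranteed to differ. Per the
            // amended fix, fail the copy and advise -pb or -skipCrc.
            return Outcome.FAIL;
        }
        return Outcome.COMPARE;
    }

    public static void main(String[] args) {
        System.out.println(decide(1024, 128, 256, false, false)); // FAIL
        System.out.println(decide(0, 128, 256, false, false));    // SKIP
        System.out.println(decide(1024, 128, 128, false, false)); // COMPARE
    }
}
```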


        1. MAPREDUCE-5065.branch-0.23.patch (11 kB), Mithun Radhakrishnan
        2. MAPREDUCE-5065.branch-2.patch (9 kB), Mithun Radhakrishnan



            Assignee: Mithun Radhakrishnan
            Reporter: Mithun Radhakrishnan