Hadoop Map/Reduce: MAPREDUCE-5065

DistCp should skip checksum comparisons if block-sizes are different on source/target.


    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.3-alpha, 0.23.5
    • Fix Version/s: 2.1.0-beta, 0.23.8
    • Component/s: distcp


      When copying files between two clusters with different default block-sizes, the copy fails with a checksum mismatch, even though the files have identical contents.

      The reason is that on HDFS, a file's checksum is unfortunately a function of the file's block-size. Two files with identical contents but different block-sizes can therefore have different checksums. (Thus, it's also possible for DistCp to fail to copy files on the same file-system, if the source file's block-size differs from the HDFS default and -pb isn't used.)
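To see why the block-size leaks into the checksum: HDFS computes a file checksum by aggregating per-block digests, so where the block boundaries fall changes the final value. The following is a minimal standalone sketch of that "digest of per-block digests" idea (it is a simplification for illustration, not the actual HDFS MD5-of-CRC algorithm); identical bytes hashed with different block sizes produce different file-level checksums.

```java
import java.security.MessageDigest;
import java.util.Arrays;

public class BlockChecksumDemo {
    // Simplified model of a composite file checksum: digest each block
    // separately, then digest the concatenation of the block digests.
    // Block boundaries (i.e. blockSize) therefore affect the result.
    static byte[] fileChecksum(byte[] data, int blockSize) throws Exception {
        MessageDigest outer = MessageDigest.getInstance("MD5");
        for (int off = 0; off < data.length; off += blockSize) {
            MessageDigest inner = MessageDigest.getInstance("MD5");
            inner.update(data, off, Math.min(blockSize, data.length - off));
            outer.update(inner.digest());
        }
        return outer.digest();
    }

    public static void main(String[] args) throws Exception {
        byte[] contents = new byte[1024];
        Arrays.fill(contents, (byte) 7);            // identical file contents
        byte[] a = fileChecksum(contents, 256);     // "block size" 256
        byte[] b = fileChecksum(contents, 512);     // same bytes, bigger blocks
        System.out.println(Arrays.equals(a, b));    // false: checksums differ
    }
}
```

This is why a byte-identical copy can still fail DistCp's checksum comparison when the target cluster's default block-size differs from the source's.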

      I propose that we skip checksum comparisons under the following conditions:
      1. -skipCrc is specified.
      2. File-size is 0 (in which case the call to the checksum-servlet is moot).
      3. source.getBlockSize() != target.getBlockSize(), since the checksums are guaranteed to differ in this case.

      I have a patch for #3.

      Edit: I've modified the fix to warn the user instead of silently skipping the checksum-check, since skipping validation is unsafe. The code now fails the copy, and suggests that the user either use -pb to preserve block-size, or consider -skipCrc (and forgo copy validation entirely).
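The decision logic described above (conditions 1 and 2 skip the comparison; a block-size mismatch now fails fast with advice) can be sketched as follows. This is an illustrative standalone sketch, not the actual DistCp code; the class and method names are hypothetical.

```java
// Hypothetical sketch of the checksum-comparison policy described above.
public class ChecksumPolicy {
    /**
     * Returns true if checksums should be compared, false if the comparison
     * can be safely skipped, and throws if a mismatch is guaranteed.
     */
    static boolean shouldCompareChecksums(long fileLen, long srcBlockSize,
                                          long dstBlockSize, boolean skipCrc) {
        if (skipCrc) {
            return false;  // 1. user explicitly opted out of validation
        }
        if (fileLen == 0) {
            return false;  // 2. empty file: a checksum query would be moot
        }
        if (srcBlockSize != dstBlockSize) {
            // 3. checksums are guaranteed to differ; fail fast with advice
            // rather than silently skipping validation.
            throw new IllegalStateException(
                "Block sizes differ between source and target; use -pb to "
                + "preserve block-size, or -skipCrc to forgo validation.");
        }
        return true;
    }

    public static void main(String[] args) {
        // Same block size on both sides: comparison proceeds.
        System.out.println(shouldCompareChecksums(10, 128, 128, false));
    }
}
```

With -pb the target file is created with the source's block-size, so the sizes match and the comparison proceeds normally.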


      Attachments:
        1. MAPREDUCE-5065.branch-0.23.patch (11 kB, Mithun Radhakrishnan)
        2. MAPREDUCE-5065.branch-2.patch (9 kB, Mithun Radhakrishnan)



            • Assignee: Mithun Radhakrishnan

