Hadoop Common / HADOOP-2725

Distcp truncates some files when copying


    Details

      Description

      We used distcp to copy ~100 TB of data between two clusters of ~1400 nodes each.

      Command used (it was run on the src cluster):
      hadoop distcp -log /logdir/logfile hdfs://src-namenode:8600//src-dir-1 hdfs://src-namenode:8600//src-dir-2 ... hdfs://src-namenode:8600//src-dir-n hdfs://tgt-namenode:8600//dst-dir

      Distcp completed without errors, but when we checked the file sizes on the src and tgt clusters, we noticed differences in file sizes for 9 files (~6 GB).

      src-file-1 666762714 bytes -> tgt-file-1 134217728 bytes
      src-file-2 673791814 bytes -> tgt-file-2 536870912 bytes
      src-file-3 692172075 bytes -> tgt-file-3 0 bytes

      All target files are truncated at block boundaries (some have 0 size).
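The block-boundary observation can be checked with a little arithmetic. The sketch below is illustrative only: the sizes are the three examples listed above, but the report does not state the HDFS block size, so the two candidate values (64 MiB and 128 MiB) and the helper name `truncation_summary` are assumptions.

```python
# (source size, target size) in bytes, copied from the three examples in this report.
SIZES = {
    "file-1": (666762714, 134217728),
    "file-2": (673791814, 536870912),
    "file-3": (692172075, 0),
}

# Assumption: the block size is not given in the report; try two common values.
CANDIDATE_BLOCK_SIZES = (64 * 1024 * 1024, 128 * 1024 * 1024)


def truncation_summary(src_size, tgt_size, block_size):
    """Describe how a truncated target relates to the assumed block size."""
    return {
        "missing_bytes": src_size - tgt_size,
        "blocks_copied": tgt_size // block_size,
        "on_block_boundary": tgt_size % block_size == 0,
    }


if __name__ == "__main__":
    for name, (src, tgt) in SIZES.items():
        for block_size in CANDIDATE_BLOCK_SIZES:
            print(name, block_size, truncation_summary(src, tgt, block_size))
```

Under either assumed block size, every target size is an exact multiple of the block size while no source size is, which is consistent with whole blocks being copied and the rest of each file being dropped.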

      I looked at the log files, and noticed a few things:

      1. There are 31059 log files (the same as the number of map tasks in the job).

      2. 246 of the log files are non-empty.

      3. All non-empty log files are of the form:

      SKIP: hdfs://src-namenode/src-dir-a/src-file-x
      SKIP: hdfs://src-namenode/src-dir-b/src-file-y
      SKIP: hdfs://src-namenode/src-dir-c/src-file-z

      4. All 9 truncated files appear in the log files as skipped files.

      5. All 9 files were the last entry in their respective log files.

      e.g.
      Non-empty logfile 1:

      SKIP: hdfs://src-namenode/src-dir-a/src-file-x
      SKIP: hdfs://src-namenode/src-dir-b/src-file-y
      SKIP: hdfs://src-namenode/src-dir-c/src-file-z <-- Truncated file

      Non-empty logfile 2:
      SKIP: hdfs://src-namenode/src-dir-p/src-file-m
      SKIP: hdfs://src-namenode/src-dir-q/src-file-n <-- Truncated file
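Observations 4 and 5 could be re-checked mechanically along these lines. This is a minimal sketch, assuming each log's contents have already been read into a string and that every skipped file is recorded as a line beginning with `SKIP:` as shown above. The function names and the sample data are hypothetical; the paths are the placeholder names used in this report, not real files.

```python
def last_skipped_paths(log_texts):
    """Map each non-empty log name to the path on its final SKIP line."""
    result = {}
    for name, text in log_texts.items():
        skipped = []
        for line in text.splitlines():
            line = line.strip()
            if not line.startswith("SKIP:"):
                continue
            fields = line[len("SKIP:"):].split()
            if fields:
                # The first field is the path; any trailing annotation is ignored.
                skipped.append(fields[0])
        if skipped:
            result[name] = skipped[-1]
    return result


def truncated_not_last(truncated_paths, log_texts):
    """Return truncated paths that are NOT the last SKIP entry of any log."""
    last_entries = set(last_skipped_paths(log_texts).values())
    return sorted(set(truncated_paths) - last_entries)


# Hypothetical sample mirroring the two example logs above.
SAMPLE_LOGS = {
    "logfile-1": (
        "SKIP: hdfs://src-namenode/src-dir-a/src-file-x\n"
        "SKIP: hdfs://src-namenode/src-dir-b/src-file-y\n"
        "SKIP: hdfs://src-namenode/src-dir-c/src-file-z\n"
    ),
    "logfile-2": (
        "SKIP: hdfs://src-namenode/src-dir-p/src-file-m\n"
        "SKIP: hdfs://src-namenode/src-dir-q/src-file-n\n"
    ),
    "logfile-3": "",  # empty log, as most of the log files were
}
SAMPLE_TRUNCATED = [
    "hdfs://src-namenode/src-dir-c/src-file-z",
    "hdfs://src-namenode/src-dir-q/src-file-n",
]

if __name__ == "__main__":
    print(last_skipped_paths(SAMPLE_LOGS))
    print(truncated_not_last(SAMPLE_TRUNCATED, SAMPLE_LOGS))
```

An empty result from `truncated_not_last` would mean every truncated file is the last skipped entry of some log, matching observation 5.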

        Attachments

        1. 2725_20080206.patch
          30 kB
          Tsz-wo Sze
        2. 2725_20080208.patch
          28 kB
          Tsz-wo Sze
        3. 2725_20080212.patch
          38 kB
          Tsz-wo Sze

            People

            • Assignee: Tsz-wo Sze (szetszwo)
            • Reporter: Murtaza A. Basrai (mabasrai)
