  1. Hadoop Common
  2. HADOOP-13114

DistCp should have option to compress data on write


    Details

    • Type: Improvement
    • Status: Patch Available
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 2.8.0, 2.7.3, 3.0.0-alpha1
    • Fix Version/s: None
    • Component/s: tools/distcp
    • Labels:

      Description

      The DistCp utility should be able to store data in a user-specified compression format. This avoids the extra step of compressing the data after the transfer. Backup strategies that copy to a different cluster also benefit, because they save one I/O round trip to and from HDFS, which saves resources, time, and effort.

      • Create an option -compressOutput that defaults to org.apache.hadoop.io.compress.BZip2Codec.
      • Users will be able to change the codec with -D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec
      • If DistCp compression is enabled, suffix the file names with the codec's default extension to indicate that the file is compressed. This tells users which codec was used to compress the data.
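
The behavior proposed in the bullets above can be illustrated with a minimal, self-contained sketch. It is not the code from the attached patches. The class name CompressOnWriteSketch and all of its helpers are hypothetical, the extension table is a hand-written stand-in for what a Hadoop codec would report as its default extension, and the JDK's GZIPOutputStream stands in for a Hadoop codec so the example runs without a Hadoop dependency.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical illustration of "compress on write" plus codec-based file suffixes. */
public class CompressOnWriteSketch {

    // Assumption: a hand-written table standing in for each codec's default extension.
    static final Map<String, String> EXTENSIONS = Map.of(
        "org.apache.hadoop.io.compress.BZip2Codec", ".bz2",
        "org.apache.hadoop.io.compress.GzipCodec", ".gz");

    // The proposal names BZip2Codec as the default when -compressOutput is given.
    static final String DEFAULT_CODEC = "org.apache.hadoop.io.compress.BZip2Codec";

    /** Returns the target file name, suffixed only when compression is enabled. */
    static String targetName(String sourceName, boolean compressOutput, String codecClass) {
        if (!compressOutput) {
            return sourceName; // compression disabled: name is unchanged
        }
        String codec = (codecClass == null) ? DEFAULT_CODEC : codecClass;
        String ext = EXTENSIONS.get(codec);
        if (ext == null) {
            throw new IllegalArgumentException("Unknown codec: " + codec);
        }
        return sourceName + ext;
    }

    /** Compresses bytes while writing them; gzip is used here as a stand-in codec. */
    static byte[] compressOnWrite(byte[] data) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (OutputStream out = new GZIPOutputStream(sink)) {
            out.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    /** Reads the compressed bytes back, to check that the round trip is lossless. */
    static byte[] decompress(byte[] compressed) {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            return in.readAllBytes();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String gzip = "org.apache.hadoop.io.compress.GzipCodec";
        byte[] original = "sample record\n".repeat(100).getBytes(StandardCharsets.UTF_8);
        byte[] written = compressOnWrite(original);

        System.out.println("target name: " + targetName("part-00000", true, gzip));
        System.out.println("bytes before: " + original.length + ", after: " + written.length);
        System.out.println("round trip ok: "
            + java.util.Arrays.equals(original, decompress(written)));
    }
}
```

The suffix is derived from the selected codec rather than hard-coded, so overriding the codec with -D also changes the file extension, which is what lets a reader of the target directory tell how each file was compressed.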

        Attachments

        1. HADOOP-13114.05.patch
          23 kB
          Ravi Prakash
        2. HADOOP-13114.06.patch
          26 kB
          Ravi Prakash
        3. HADOOP-13114-trunk_2016-05-07-1.patch
          23 kB
          Suraj Nayak
        4. HADOOP-13114-trunk_2016-05-08-1.patch
          24 kB
          Suraj Nayak
        5. HADOOP-13114-trunk_2016-05-10-1.patch
          24 kB
          Suraj Nayak
        6. HADOOP-13114-trunk_2016-05-12-1.patch
          23 kB
          Suraj Nayak

            People

            • Assignee:
              snayakm Suraj Nayak
              Reporter:
              snayakm Suraj Nayak

              Dates

              • Created:
                Updated:

                Time Tracking

                Estimated: 48h
                Remaining: 48h
                Logged: Not Specified
