The DistCp utility should be able to store data in a user-specified compression format. This avoids the extra hop of compressing the data after the transfer. Backups to a different cluster also benefit, because one round of I/O to and from HDFS is saved, which saves resources, time, and effort.
- Add an option, -compressOutput, that enables compression of the copied data; the codec defaults to org.apache.hadoop.io.compress.BZip2Codec.
- Users can change the codec with -D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec
- If DistCp compression is enabled, append the selected codec's default extension to each output filename to indicate that the file is compressed. Users can then tell from the filename which codec was used.
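
The naming rule proposed above could look roughly like the following sketch. It is not DistCp code: the class CompressedTargetName, its methods, and the plain Map standing in for the job configuration are hypothetical, and -compressOutput is the option proposed in this issue rather than an existing flag. The extension table is hard-coded so that the example runs without Hadoop on the classpath; a real patch would more likely instantiate the codec and call CompressionCodec.getDefaultExtension().

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of the proposed target-file naming; not part of DistCp. */
final class CompressedTargetName {
    // Existing MapReduce property that selects the output codec.
    static final String CODEC_KEY = "mapreduce.output.fileoutputformat.compress.codec";
    // Default codec proposed in this issue.
    static final String DEFAULT_CODEC = "org.apache.hadoop.io.compress.BZip2Codec";

    // Default extensions of a few stock Hadoop codecs (hard-coded stand-in for
    // CompressionCodec.getDefaultExtension()).
    private static final Map<String, String> EXTENSIONS = new HashMap<>();
    static {
        EXTENSIONS.put("org.apache.hadoop.io.compress.BZip2Codec", ".bz2");
        EXTENSIONS.put("org.apache.hadoop.io.compress.GzipCodec", ".gz");
        EXTENSIONS.put("org.apache.hadoop.io.compress.DefaultCodec", ".deflate");
    }

    /** Returns the codec class name from the configuration, or the proposed default. */
    static String resolveCodec(Map<String, String> conf) {
        return conf.getOrDefault(CODEC_KEY, DEFAULT_CODEC);
    }

    /** Appends the codec's default extension when compression is enabled. */
    static String targetName(String sourceName, boolean compressOutput,
                             Map<String, String> conf) {
        if (!compressOutput) {
            return sourceName; // plain copy: the name is left unchanged
        }
        String codec = resolveCodec(conf);
        String extension = EXTENSIONS.get(codec);
        if (extension == null) {
            throw new IllegalArgumentException("Unknown codec: " + codec);
        }
        return sourceName + extension;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        System.out.println(targetName("part-00000", true, conf));  // default codec
        conf.put(CODEC_KEY, "org.apache.hadoop.io.compress.GzipCodec");
        System.out.println(targetName("part-00000", true, conf));  // overridden codec
        System.out.println(targetName("part-00000", false, conf)); // compression off
    }
}
```

Under these assumptions, a file named part-00000 would be written as part-00000.bz2 by default, as part-00000.gz when the codec property is set to GzipCodec, and unchanged when -compressOutput is not given.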