  1. Hadoop Map/Reduce
  2. MAPREDUCE-6713

Distcp doesn't provide any option to override the default staging directory



    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.5.1
    • Fix Version/s: None
    • Component/s: distcp
    • Labels: None


      Current state and shortcoming
      By default, distcp writes temporary files to $TARGET_PATH/.distcp.tmp/$taskattemptid (see RetriableFileCopyCommand#getTmpFile). Users have no way to override this staging/tmp directory. The problem is most visible on S3 with versioning enabled. For example, a user may want S3 versioning for the target directory only, not for the staging/tmp directory. Today distcp's staging directory is versioned as well, and it can hold a large number of temporary files. Letting the user point staging at a non-versioned S3 path would keep the versioned target clean.
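
      To make the default layout concrete, here is a minimal sketch of how the temporary location described above is composed. It is not the code of RetriableFileCopyCommand#getTmpFile: the class and method names are hypothetical, plain strings stand in for Hadoop Path objects, and the bucket name and task attempt ID in the usage note are made-up examples.

```java
// Illustrative sketch only; not the actual Hadoop implementation.
final class DefaultTmpLocation {
    // Directory name as given in the issue description.
    static final String TMP_DIR = ".distcp.tmp";

    /** Builds $TARGET_PATH/.distcp.tmp/$taskattemptid for one task attempt. */
    static String tmpPath(String targetPath, String taskAttemptId) {
        // Drop a trailing slash so the result never contains "//".
        String base = targetPath.endsWith("/")
                ? targetPath.substring(0, targetPath.length() - 1)
                : targetPath;
        return base + "/" + TMP_DIR + "/" + taskAttemptId;
    }
}
```

      For instance, DefaultTmpLocation.tmpPath("s3a://versioned-bucket/data", "attempt_x") yields s3a://versioned-bucket/data/.distcp.tmp/attempt_x, which lies inside the target path and therefore shares its versioning.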

      Proposed solution
      Provide a new option (-stage) that lets the user optionally specify a path on the target FS. Distcp mapper tasks will write their temporary files into that directory.
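
      One way the proposed behaviour could look is sketched below, under stated assumptions: the -stage value reaches the mapper as an optional string, each task attempt still gets its own subdirectory, and the current default applies when no override is given. The class and method names are hypothetical, plain strings stand in for Hadoop Path objects, and checking that the override belongs to the target FS is left out.

```java
// Illustrative sketch of the proposal; no patch exists yet, so every name here is an assumption.
final class StagingDirResolver {
    // Default directory name as given in the issue description.
    static final String DEFAULT_TMP_DIR = ".distcp.tmp";

    /**
     * Returns the directory a mapper task would write temporary files to.
     * stageOverride models the value of the proposed -stage option (null or empty if unset).
     */
    static String resolve(String targetPath, String stageOverride, String taskAttemptId) {
        boolean overridden = stageOverride != null && !stageOverride.isEmpty();
        // Without an override, keep today's behaviour: stage under the target path.
        String base = overridden
                ? stripTrailingSlash(stageOverride)
                : stripTrailingSlash(targetPath) + "/" + DEFAULT_TMP_DIR;
        return base + "/" + taskAttemptId;
    }

    private static String stripTrailingSlash(String path) {
        return path.endsWith("/") ? path.substring(0, path.length() - 1) : path;
    }
}
```

      With a made-up override such as s3a://scratch-bucket/stage, temporary files for attempt_x would land in s3a://scratch-bucket/stage/attempt_x and no longer under the versioned target path.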

      Possible Confusions
      Distcp already has a -tmp option that could be mistaken for the same feature. However, -tmp takes effect only together with -atomic, where it names the temporary location whose contents are moved to the final target in one atomic commit. That is a different notion of temporary files from the per-task files discussed here.
      The staging directory used by the MapReduce framework is another possible source of confusion. The directory proposed here is specific to distcp.

      I am working on a patch and will upload it.




            Assignee: kamrul Mohammad Islam
            Reporter: kamrul Mohammad Islam