  1. Hadoop Map/Reduce
  2. MAPREDUCE-6713

Distcp doesn't provide any option to override the default staging directory



    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.5.1
    • Fix Version/s: None
    • Component/s: distcp
    • Labels:


      Current state and shortcoming
      By default, distcp writes temporary files into $TARGET_PATH/.distcp.tmp/$taskattemptid (see RetriableFileCopyCommand#getTmpFile). Users have no way to override this staging/tmp directory. The problem shows up clearly on S3 with versioning enabled. For example, a user may want to turn on S3 versioning only for the target directory, not for the staging/tmp directory. Today, distcp's temporary files are versioned as well, and the staging directory can contain a lot of them. If the user could point the staging directory at a non-versioned S3 path, things would be cleaner.

      Proposed solution
      Provide a new option (-stage) that lets the user optionally specify a path on the target FS. Distcp mapper tasks will then write their temporary files into that directory.
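
      The intended behavior can be sketched as follows. This is only an illustration under stated assumptions, not the patch: it uses plain strings instead of Hadoop's Path and Mapper.Context, and the class name, method names, bucket names, and task attempt id are all hypothetical. The default layout follows the description in this issue; the actual getTmpFile logic may differ in detail.

```java
/**
 * Illustrative sketch only: plain-string path handling, no Hadoop classes.
 * All names here (StagingPathSketch, tmpFile, stageDir) are hypothetical.
 */
public class StagingPathSketch {

    /** Temporary directory name under the target path, as described in the issue. */
    public static final String TMP_DIR_NAME = ".distcp.tmp";

    /**
     * Returns where a mapper task would write its temporary file.
     * If stageDir is null (option not given), the default layout from the
     * issue description is used; otherwise the user-supplied directory wins.
     */
    public static String tmpFile(String targetPath, String stageDir, String taskAttemptId) {
        String base = (stageDir != null)
                ? stageDir
                : stripTrailingSlash(targetPath) + "/" + TMP_DIR_NAME;
        return stripTrailingSlash(base) + "/" + taskAttemptId;
    }

    /** Removes a single trailing slash so joined paths do not contain "//". */
    public static String stripTrailingSlash(String path) {
        if (path.length() > 1 && path.endsWith("/")) {
            return path.substring(0, path.length() - 1);
        }
        return path;
    }

    public static void main(String[] args) {
        String attempt = "attempt_0001_m_000000_0"; // made-up id for illustration

        // No staging directory given: temp file lands under the target.
        System.out.println(tmpFile("s3a://bucket/target", null, attempt));

        // Staging directory given: temp file lands outside the versioned target.
        System.out.println(tmpFile("s3a://bucket/target", "s3a://bucket/staging", attempt));
    }
}
```

      With a staging directory supplied, nothing temporary is written under the target path, which is the property the S3 versioning use case needs.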

      Possible Confusions
      Another distcp option (-tmp) could be assumed to serve the same purpose. However, that option works only together with the "-atomic" option, which uses temporary files in a different sense.
      Another possible source of confusion is the staging directory used by the MapReduce framework. The proposed temp directory is specific to distcp.

      Working on a patch to upload.




            • Assignee:
              kamrul Mohammad Islam
            • Votes:
              0
            • Watchers:
              5


              • Created: