Hadoop Common / HADOOP-14766

Cloudup: an object store high performance dfs put command


    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Minor
    • Resolution: Won't Fix
    • Affects Version/s: 2.8.1
    • Fix Version/s: None
    • Component/s: fs, fs/azure, fs/s3
    • Labels: None
    • Target Version/s:

      Description

      hdfs put local s3a://path is suboptimal: it treewalks down the source tree and then, sequentially, copies each file up by reading the source (opened as a stream) into a buffer, writing that buffer to the destination file, and repeating.

      For S3A that hurts because

      • it's doing the upload inefficiently: the file could be uploaded just by handing the pathname over to the AWS transfer manager (see the sketch after this list)
      • it is doing it sequentially, when a parallelised upload would work better
      • as the ordering of the files to upload is a recursive treewalk, it doesn't spread the upload across multiple shards
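
      A minimal sketch of the difference for a single file, assuming a hypothetical local path and bucket and using only the public FileSystem API (this is illustrative, not the attached patch):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

import java.net.URI;

public class SingleFileUpload {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path src = new Path("file:///data/part-0000");          // hypothetical local file
    Path dest = new Path("s3a://bucket/backup/part-0000");  // hypothetical destination
    FileSystem s3 = FileSystem.get(new URI("s3a://bucket/"), conf);

    // What put effectively does today: open the source as a stream and copy
    // it buffer-by-buffer into an output stream on the object store.
    FileSystem local = FileSystem.getLocal(conf);
    IOUtils.copyBytes(local.open(src), s3.create(dest), conf, true);

    // What handing over the pathname allows: the s3a implementation can
    // delegate the whole file to the AWS transfer manager, which handles the
    // multipart upload itself. (Re-uploads the same object, purely for contrast.)
    s3.copyFromLocalFile(src, dest);
  }
}
{code}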

      Better:

      • build the list of files to upload
      • upload in parallel, picking entries from the list at random and spreading them across a pool of uploaders (sketched after this list)
      • upload straight from the local file (copyFromLocalFile())
      • track IO load (files created/second) to estimate the risk of throttling
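
      A rough sketch of that flow, assuming a fixed-size uploader pool, illustrative source and destination paths, and a flattened destination layout; the class name and pool size are made up, and error handling is minimal:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

import java.net.URI;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicLong;

public class ParallelPut {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path srcDir = new Path("file:///data/to-upload");   // hypothetical source tree
    Path destDir = new Path("s3a://bucket/backup");     // hypothetical destination
    FileSystem localFs = srcDir.getFileSystem(conf);
    FileSystem destFs = FileSystem.get(new URI("s3a://bucket/"), conf);

    // 1. Build the full list of files before any upload starts.
    List<Path> files = new ArrayList<>();
    RemoteIterator<LocatedFileStatus> it = localFs.listFiles(srcDir, true);
    while (it.hasNext()) {
      files.add(it.next().getPath());
    }

    // 2. Shuffle so uploads are spread across destination prefixes/shards
    //    instead of following the treewalk order.
    Collections.shuffle(files);

    // 3. Upload in parallel from a small pool, handing each local path
    //    straight to copyFromLocalFile().
    long start = System.currentTimeMillis();
    AtomicLong count = new AtomicLong();
    ExecutorService pool = Executors.newFixedThreadPool(8);   // illustrative pool size
    List<Future<?>> results = new ArrayList<>();
    for (Path src : files) {
      Path dest = new Path(destDir, src.getName());   // simplified: flattens the tree
      results.add(pool.submit(() -> {
        destFs.copyFromLocalFile(src, dest);
        count.incrementAndGet();
        return null;
      }));
    }
    for (Future<?> f : results) {
      f.get();   // surface any upload failure
    }
    pool.shutdown();

    // 4. Report files created per second as a crude signal of throttling risk.
    double seconds = (System.currentTimeMillis() - start) / 1000.0;
    System.out.printf("Uploaded %d files at %.1f files/second%n",
        count.get(), count.get() / seconds);
  }
}
{code}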

        Attachments

        1. HADOOP-14766-002.patch (32 kB, Steve Loughran)
        2. HADOOP-14766-001.patch (36 kB, Steve Loughran)


              People

              • Assignee: Steve Loughran (stevel@apache.org)
              • Reporter: Steve Loughran (stevel@apache.org)
              • Votes: 0
              • Watchers: 9
