Hadoop Common / HADOOP-14766

Cloudup: an object store high performance dfs put command


Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Minor
    • Resolution: Won't Fix
    • Affects Version/s: 2.8.1
    • Fix Version/s: None
    • Component/s: fs, fs/azure, fs/s3
    • Labels: None

    Description

      hdfs put local s3a://path is suboptimal: it treewalks down the source tree and then, sequentially, copies each file up by opening it as a stream, reading its contents into a buffer, writing that buffer to the destination file, and repeating.

      For S3A that hurts because

      • it's doing the upload inefficiently: the file can be uploaded just by handing the pathname to the AWS transfer manager
      • it is doing it sequentially, when a parallelised upload would work.
      • as the ordering of the files to upload is a recursive treewalk, it doesn't spread the upload across multiple shards.

      Better (a sketch follows this list):

      • build the list of files to upload
      • upload in parallel, picking entries from the list at random and spreading them across a pool of uploaders
      • upload straight from the local file (copyFromLocalFile())
      • track IO load (files created/second) to estimate the risk of throttling.
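      A minimal sketch of that approach (not the attached patch), assuming Java 8+, hadoop-common and the S3A client on the classpath, and illustrative names (ParallelPut, POOL_SIZE) chosen here: it builds the file list, shuffles it, uploads from a small thread pool via FileSystem.copyFromLocalFile(), and reports files created per second.

      import java.io.IOException;
      import java.nio.file.Files;
      import java.nio.file.Paths;
      import java.util.Collections;
      import java.util.List;
      import java.util.concurrent.ExecutorService;
      import java.util.concurrent.Executors;
      import java.util.concurrent.TimeUnit;
      import java.util.concurrent.atomic.AtomicLong;
      import java.util.stream.Collectors;
      import java.util.stream.Stream;

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class ParallelPut {
        private static final int POOL_SIZE = 8;   // hypothetical uploader pool size

        public static void main(String[] args) throws Exception {
          java.nio.file.Path srcRoot = Paths.get(args[0]);   // local source directory
          Path destRoot = new Path(args[1]);                 // e.g. s3a://bucket/prefix

          Configuration conf = new Configuration();
          FileSystem destFs = destRoot.getFileSystem(conf);

          // 1. build the list of files to upload
          List<java.nio.file.Path> files;
          try (Stream<java.nio.file.Path> walk = Files.walk(srcRoot)) {
            files = walk.filter(Files::isRegularFile).collect(Collectors.toList());
          }

          // 2. randomise the ordering so uploads are not issued in treewalk order
          Collections.shuffle(files);

          // 3. upload in parallel, handing each local pathname straight to
          //    copyFromLocalFile() rather than streaming through a buffer
          AtomicLong uploaded = new AtomicLong();
          long start = System.currentTimeMillis();
          ExecutorService pool = Executors.newFixedThreadPool(POOL_SIZE);
          for (java.nio.file.Path f : files) {
            pool.submit(() -> {
              Path src = new Path(f.toUri());
              Path dest = new Path(destRoot, srcRoot.relativize(f).toString());
              try {
                destFs.copyFromLocalFile(false, true, src, dest);
                uploaded.incrementAndGet();
              } catch (IOException e) {
                e.printStackTrace();
              }
            });
          }
          pool.shutdown();
          pool.awaitTermination(1, TimeUnit.HOURS);

          // 4. crude IO-load figure: files created per second
          double seconds = (System.currentTimeMillis() - start) / 1000.0;
          System.out.printf("uploaded %d files at %.1f files/second%n",
              uploaded.get(), uploaded.get() / seconds);
        }
      }

      The shuffle is what spreads consecutive uploads across different key prefixes instead of hammering one directory's shard at a time; the pool size and error handling above are placeholders, not tuned values.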

      Attachments

        1. HADOOP-14766-001.patch
          36 kB
          Steve Loughran
        2. HADOOP-14766-002.patch
          32 kB
          Steve Loughran


            People

              Assignee: Steve Loughran (stevel@apache.org)
              Reporter: Steve Loughran (stevel@apache.org)
              Votes: 0
              Watchers: 9

              Dates

                Created:
                Updated:
                Resolved: