hdfs put local s3a://path is suboptimal: it treewalks down the source tree and then, sequentially, copies each file up by opening it as a stream, reading the contents into a buffer, writing that buffer to the dest file, and repeating.
For S3A that hurts because:
- it's doing the upload inefficiently: the file can be uploaded just by handing the pathname to the AWS transfer manager
- it is doing it sequentially, when some parallelised upload would work.
- as the ordering of the files to upload is a recursive treewalk, consecutive uploads land under the same path prefix, so it doesn't spread the upload across multiple shards.
A better approach:
- build the list of files to upload
- upload in parallel, picking entries from the list at random and spreading them across a pool of uploaders
- upload straight from the local file (copyFromLocalFile())
- track IO load (files created/second) to estimate the risk of throttling.
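
A rough sketch of those four steps, using only the JDK. This is not the S3A code: ParallelPut, Uploader, Stats and uploadAll are hypothetical names invented for illustration, and the Uploader callback stands in for whatever does the real transfer (in Hadoop that could plausibly be a call to FileSystem.copyFromLocalFile(), which is an assumption about how it would be wired up).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical sketch of a parallel, randomized put; not the S3A implementation. */
class ParallelPut {

  /** Hypothetical callback: uploads one local file directly to a destination key. */
  interface Uploader {
    void upload(Path localFile, String destKey) throws IOException;
  }

  /** Simple load figures that a caller could use to judge throttling risk. */
  static final class Stats {
    final int files;
    final double filesPerSecond;

    Stats(int files, double filesPerSecond) {
      this.files = files;
      this.filesPerSecond = filesPerSecond;
    }
  }

  /** Step 1: build the complete list of files before uploading anything. */
  static List<Path> listFiles(Path root) throws IOException {
    try (Stream<Path> walk = Files.walk(root)) {
      return walk.filter(Files::isRegularFile).collect(Collectors.toList());
    }
  }

  /** Steps 2 to 4: shuffle the list, upload through a thread pool, measure the rate. */
  static Stats uploadAll(Path root, int threads, Uploader uploader)
      throws IOException, InterruptedException {
    List<Path> files = new ArrayList<>(listFiles(root));
    // Random order, so consecutive uploads are not clustered under one prefix.
    Collections.shuffle(files);

    ExecutorService pool = Executors.newFixedThreadPool(threads);
    long start = System.nanoTime();
    try {
      List<Future<?>> pending = new ArrayList<>();
      for (Path file : files) {
        // Destination key is the path relative to the source root.
        String key = root.relativize(file).toString().replace('\\', '/');
        pending.add(pool.submit(() -> {
          uploader.upload(file, key);
          return null;
        }));
      }
      for (Future<?> f : pending) {
        f.get(); // wait for completion and surface any upload failure
      }
    } catch (ExecutionException e) {
      throw new IOException("upload failed", e.getCause());
    } finally {
      pool.shutdownNow();
    }

    double seconds = Math.max((System.nanoTime() - start) / 1e9, 1e-9);
    return new Stats(files.size(), files.size() / seconds);
  }
}
```

A caller would pass the source directory, a pool size and an uploader, then compare Stats.filesPerSecond against whatever request rate the store is expected to tolerate. The pool size and any throttling threshold are left to the caller here; neither is specified above.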