Hadoop Common / HADOOP-18028

High performance S3A input stream with prefetching & caching

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • None
    • 3.3.9
    • Component/s: fs/s3

    Description

      I work at Pinterest, where I developed a technique that vastly improves read throughput from the S3 file system. It helps the sequential read case (such as reading a SequenceFile) and also significantly improves throughput in the random access case (such as reading Parquet). The technique has significantly improved the efficiency of data processing jobs at Pinterest.
       
      I would like to contribute this feature to Apache Hadoop. More details on the technique are available in a blog post I wrote recently:
      https://medium.com/pinterest-engineering/improving-efficiency-and-reducing-runtime-using-s3-read-optimization-b31da4b60fa0
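
For readers unfamiliar with the general idea named in the issue title, the following is a minimal sketch of block prefetching with an LRU cache. It is not the implementation proposed in this issue and does not use any Hadoop or S3 API. The class name `PrefetchingReader`, the `fetch_range(start, end)` callable (standing in for a ranged read against an object store), and all parameter names and defaults are hypothetical, chosen only for illustration.

```python
from collections import OrderedDict
from concurrent.futures import ThreadPoolExecutor


class PrefetchingReader:
    """Reads an object in fixed-size blocks, fetching a few blocks ahead in
    the background and keeping recently used blocks in a small LRU cache.
    Illustrative only; intended for use from a single reading thread."""

    def __init__(self, fetch_range, size, block_size=8 * 1024 * 1024,
                 prefetch_count=2, cache_blocks=8):
        self._fetch_range = fetch_range   # hypothetical: (start, end) -> bytes
        self._size = size                 # total object length in bytes
        self._block_size = block_size
        self._prefetch_count = prefetch_count
        self._cache_blocks = cache_blocks
        self._cache = OrderedDict()       # block index -> Future of bytes
        self._pool = ThreadPoolExecutor(max_workers=prefetch_count + 1)

    def _schedule(self, index):
        """Return the future for a block, starting a fetch on a cache miss."""
        future = self._cache.get(index)
        if future is not None:
            self._cache.move_to_end(index)        # mark as recently used
            return future
        start = index * self._block_size
        end = min(start + self._block_size, self._size)
        future = self._pool.submit(self._fetch_range, start, end)
        self._cache[index] = future
        while len(self._cache) > self._cache_blocks:
            self._cache.popitem(last=False)       # evict least recently used
        return future

    def read(self, offset, length):
        """Return up to `length` bytes starting at `offset`."""
        if offset < 0 or length < 0:
            raise ValueError("offset and length must be non-negative")
        end = min(offset + length, self._size)
        if offset >= end:
            return b""
        first = offset // self._block_size
        last = (end - 1) // self._block_size
        last_block = (self._size - 1) // self._block_size
        # Schedule the blocks this read needs plus a few blocks beyond it.
        horizon = min(last + self._prefetch_count, last_block)
        futures = [self._schedule(i) for i in range(first, horizon + 1)]
        # Wait only for the blocks needed now; the rest load in the background.
        data = b"".join(f.result() for f in futures[: last - first + 1])
        skip = offset - first * self._block_size
        return data[skip: skip + (end - offset)]

    def close(self):
        """Wait for outstanding fetches and release the worker threads."""
        self._pool.shutdown(wait=True)
```

In this sketch a sequential reader finds most blocks already in flight or cached by the time it asks for them, and a reader that jumps around benefits when it returns to a block that is still cached. A real input stream would also need error handling for failed fetches, memory limits, and thread safety, all of which are omitted here.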
       

            People

              Assignee: Bhalchandra Pandit (bhalchandrap)
              Reporter: Bhalchandra Pandit (bhalchandrap)
              Votes: 0
              Watchers: 25

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 44h 20m