Hadoop Common / HADOOP-18028

High performance S3A input stream with prefetching & caching

Details

    • Improvement
    • Status: Open
    • Major
    • Resolution: Unresolved
    • None
    • 3.3.9
    • fs/s3

    Description

      I work for Pinterest. I developed a technique that vastly improves read throughput from the S3 file system. It helps not only the sequential read case (such as reading a SequenceFile) but also the random access case (such as reading Parquet). This technique has significantly improved the efficiency of data processing jobs at Pinterest.

      I would like to contribute this feature to Apache Hadoop. More details on the technique are available in this blog post I wrote recently:
      https://medium.com/pinterest-engineering/improving-efficiency-and-reducing-runtime-using-s3-read-optimization-b31da4b60fa0
       

        Sub-Tasks

          1. test failures with prefetching s3a input stream [Sub-task | Resolved | Monthon Klongklaew | Time Spent: 2h 50m]
          2. s3a prefetching stream to move off twitter FuturePool [Sub-task | Resolved | Unassigned]
          3. document use and architecture design of prefetching s3a input stream [Sub-task | Resolved | Ahmar Suhail | Time Spent: 2h 40m]
          4. Remove use of scala jar twitter util-core with java futures in S3A prefetching stream [Sub-task | Resolved | PJ Fanning | Time Spent: 9h]
          5. move org.apache.hadoop.fs.common package into hadoop-common module [Sub-task | Resolved | Steve Loughran | Time Spent: 1h 20m]
          6. S3File to store reference to active S3Object in a field. [Sub-task | Resolved | Bhalchandra Pandit]
          7. s3a prefetching stream to support unbuffer() [Sub-task | In Progress | Steve Loughran]
          8. tune logging of prefetch problems [Sub-task | Open | Unassigned]
          9. s3a prefetching to use SemaphoredDelegatingExecutor for submitting work [Sub-task | Resolved | Viraj Jasani]
          10. Convert s3a prefetching to use JavaDoc for fields and enums [Sub-task | Resolved | Steve Loughran]
          11. S3PrefetchingInputStream to support status probes when closed [Sub-task | Resolved | Viraj Jasani]
          12. Collect IOStatistics during S3A prefetching [Sub-task | Resolved | Ahmar Suhail | Time Spent: 4h 10m]
          13. Ensure S3A prefetching stream memory consumption scales [Sub-task | Open | Unassigned]
          14. stream warns Not all bytes were read from the S3ObjectInputStream when closed [Sub-task | Resolved | Ahmar Suhail | Time Spent: 1h 40m]
          15. Use async drain threshold to decide b/w async and sync draining [Sub-task | Resolved | Ahmar Suhail]
          16. tests in ITestS3AInputStreamPerformance are failing [Sub-task | Resolved | Ahmar Suhail | Time Spent: 6h]
          17. Rebase s3a prefetching feature branch on top of trunk [Sub-task | Resolved | Ahmar Suhail]
          18. Remove lower limit on s3a prefetching/caching block size [Sub-task | Resolved | Ankit Saurabh]
          19. Tests in ITestS3AOpenCost are failing [Sub-task | Resolved | Ahmar Suhail]
          20. Add in configuration option to enable prefetching [Sub-task | Resolved | Ahmar Suhail | Time Spent: 1h 10m]
          21. Review s3a prefetching input stream retry code; synchronization [Sub-task | Open | Unassigned]
          22. S3A prefetch - Implement LRU cache for SingleFilePerBlockCache [Sub-task | Resolved | Viraj Jasani]
          23. Update class names to be clear they belong to S3A prefetching [Sub-task | Resolved | Unassigned]
          24. S3A prefetching: Error logging during reads [Sub-task | Resolved | Ankit Saurabh]
          25. hadoop-aws maven build to add a prefetch profile to run all tests with prefetching [Sub-task | Resolved | Viraj Jasani]
          26. Implement readFully(long position, byte[] buffer, int offset, int length) [Sub-task | Resolved | Alessandro Passaro]
          27. rebase feature/HADOOP-18028-s3a-prefetch to trunk [Sub-task | Resolved | Steve Loughran | Time Spent: 1h]
          28. fs.s3a.prefetch.block.size to be read through longBytesOption [Sub-task | Resolved | Viraj Jasani]
          29. ITestS3AFileSystemStatistic failure in prefetch feature branch [Sub-task | Open | Samrat Deb]
          30. ITestS3ACannedACLs failure; not in a span [Sub-task | Resolved | Ashutosh Gupta]
          31. S3A Prefetch - SingleFilePerBlockCache to use LocalDirAllocator [Sub-task | Resolved | Viraj Jasani]
          32. s3a prefetching Executor should be closed [Sub-task | Resolved | Viraj Jasani]
          33. assertion failure in ITestS3APrefetchingInputStream [Sub-task | Resolved | Ashutosh Gupta]
          34. Fix transient failure of ITestS3APrefetchingInputStream#testRandomReadLargeFile [Sub-task | Resolved | Viraj Jasani]
          35. Backport S3A prefetching stream to branch-3.3 [Sub-task | Resolved | Steve Loughran]
          36. s3a prefetch cache blocks should be accessed by RW locks [Sub-task | Resolved | Viraj Jasani]
          37. CachingBlockManager to use AtomicBoolean for closed flag [Sub-task | Resolved | Viraj Jasani]
          38. S3A prefetching: switch to prefetching for chosen read policies [Sub-task | Open | Unassigned]
          39. s3a prefetching to use split start/end options to limit prefetch range [Sub-task | In Progress | Steve Loughran]
          40. s3a large file prefetch tests are too slow, don't validate data [Sub-task | Resolved | Viraj Jasani]
          41. s3a prefetch read/write file operations should guard channel close [Sub-task | Resolved | Viraj Jasani]
          42. s3a prefetch LRU cache eviction metric [Sub-task | Resolved | Viraj Jasani]
          43. S3ACachingInputStream.ensureCurrentBuffer(): lazy seek means all reads look like random IO [Sub-task | Open | Unassigned]
          44. ITestS3APrefetchingCacheFiles teardown failure if setup() fails [Sub-task | Open | Unassigned]
          45. Use builder for prefetch CachingBlockManager [Sub-task | Resolved | Viraj Jasani]
          46. S3A prefetching to support Vector IO [Sub-task | Open | Unassigned]
          47. TestS3ACachingBlockManager fails intermittently in Yetus [Sub-task | Open | Unassigned]
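
          Several of the sub-tasks above refer to a prefetch block size, an executor for submitting work, and an LRU block cache. As a rough illustration of how those pieces could fit together, here is a minimal Java sketch of block-wise read-ahead. It is not the S3A implementation: the class PrefetchSketch, its methods, and its parameters are all hypothetical, an in-memory byte array stands in for the remote object, and the cache holds blocks in memory rather than in local files.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical sketch: block-wise read-ahead with an LRU block cache. */
final class PrefetchSketch implements AutoCloseable {
    private final byte[] remote;       // in-memory stand-in for a remote object
    private final int blockSize;       // bytes per block
    private final int prefetchCount;   // blocks requested ahead of the reader
    private final ExecutorService pool = Executors.newFixedThreadPool(2);
    private final Map<Integer, CompletableFuture<byte[]>> cache;
    private int fetches;               // number of simulated remote block reads

    PrefetchSketch(byte[] remote, int blockSize, int prefetchCount, final int maxCached) {
        this.remote = remote;
        this.blockSize = blockSize;
        this.prefetchCount = prefetchCount;
        // An access-ordered LinkedHashMap drops the least recently used block
        // once more than maxCached blocks are held.
        this.cache = new LinkedHashMap<Integer, CompletableFuture<byte[]>>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Integer, CompletableFuture<byte[]>> eldest) {
                return size() > maxCached;
            }
        };
    }

    private int blockCount() {
        return (remote.length + blockSize - 1) / blockSize;
    }

    /** Simulates one ranged read of a single block from the remote object. */
    private byte[] fetchBlock(int block) {
        int start = block * blockSize;
        return Arrays.copyOfRange(remote, start, Math.min(remote.length, start + blockSize));
    }

    /** Returns the cached future for a block, starting an async fetch on a miss. */
    private synchronized CompletableFuture<byte[]> request(int block) {
        CompletableFuture<byte[]> future = cache.get(block);
        if (future == null) {
            fetches++;
            future = CompletableFuture.supplyAsync(() -> fetchBlock(block), pool);
            cache.put(block, future);
        }
        return future;
    }

    /** Reads one byte at pos, or returns -1 if pos is outside the object. */
    int read(long pos) {
        if (pos < 0 || pos >= remote.length) {
            return -1;
        }
        int block = (int) (pos / blockSize);
        CompletableFuture<byte[]> current = request(block);
        // Queue the following blocks so they load while the caller consumes this one.
        for (int i = 1; i <= prefetchCount && block + i < blockCount(); i++) {
            request(block + i);
        }
        return current.join()[(int) (pos % blockSize)] & 0xFF;
    }

    synchronized int fetchCount() {
        return fetches;
    }

    @Override
    public void close() {
        pool.shutdown();
    }

    public static void main(String[] args) {
        byte[] data = new byte[16];
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) i;
        }
        try (PrefetchSketch in = new PrefetchSketch(data, 4, 2, 4)) {
            for (int pos = 0; pos < data.length; pos++) {
                in.read(pos);
            }
            System.out.println("block fetches: " + in.fetchCount());
        }
    }
}
```

          In this sketch a sequential scan triggers one fetch per block rather than one per read, and a random read only pays for a fetch when its block has been evicted or never loaded. A real stream would also need error handling, retries, and bounded memory use, which several of the open sub-tasks address.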

            People

              Assignee: Bhalchandra Pandit (bhalchandrap)
              Reporter: Bhalchandra Pandit (bhalchandrap)
              Votes: 0
              Watchers: 25

              Dates

                Created:
                Updated:

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 44h 20m