Details
Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
None
Description
I work for Pinterest, where I developed a technique that vastly improves read throughput when reading from the S3 file system. It not only helps the sequential read case (such as reading a SequenceFile) but also significantly improves read throughput in the random-access case (such as reading Parquet). This technique has significantly improved the efficiency of data processing jobs at Pinterest.
I would like to contribute this feature to Apache Hadoop. More details on the technique are available in a blog post I wrote recently:
https://medium.com/pinterest-engineering/improving-efficiency-and-reducing-runtime-using-s3-read-optimization-b31da4b60fa0
Issue Links
- is depended upon by
  - HADOOP-18179 Boost S3A Stream Read Performance (Open)
  - HADOOP-18477 Über-jira: S3A Hadoop 3.3.9 features (Open)
- is related to
  - HIVE-25827 Parquet file footer is read multiple times, when multiple splits are created in same file (Closed)
- links to
Sub-Tasks
1. s3a prefetching stream to support unbuffer() | In Progress | Steve Loughran
2. tune logging of prefetch problems | Open | Unassigned
3. Ensure S3A prefetching stream memory consumption scales | Open | Unassigned
4. Review s3a prefetching input stream retry code; synchronization | Open | Unassigned
5. ITestS3AFileSystemStatistic failure in prefetch feature branch | Open | Samrat Deb
6. S3A prefetching: switch to prefetching for chosen read policies | Open | Unassigned
7. s3a prefetching to use split start/end options to limit prefetch range | In Progress | Steve Loughran
8. S3ACachingInputStream.ensureCurrentBuffer(): lazy seek means all reads look like random IO | Open | Unassigned
9. ITestS3APrefetchingCacheFiles teardown failure if setup() fails | Open | Unassigned
10. S3A prefetching to support Vector IO | Open | Unassigned
11. TestS3ACachingBlockManager fails intermittently in Yetus | Open | Unassigned