Details
- Type: Improvement
- Status: Open
- Priority: Major
- Resolution: Unresolved
Description
I work for Pinterest. I developed a technique that vastly improves read throughput when reading from the S3 file system. It not only helps the sequential read case (such as reading a SequenceFile) but also significantly improves throughput in the random access case (such as reading Parquet). This technique has proved very useful in improving the efficiency of data processing jobs at Pinterest.
I would like to contribute that feature to Apache Hadoop. More details on this technique are available in this blog I wrote recently:
https://medium.com/pinterest-engineering/improving-efficiency-and-reducing-runtime-using-s3-read-optimization-b31da4b60fa0
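As a rough illustration of the idea described above (not the actual Pinterest or S3A implementation), the sketch below shows a generic block-oriented reader that fetches upcoming blocks asynchronously and serves both sequential scans and repeated random reads (e.g. a Parquet footer followed by column chunks) from a cache of already-fetched blocks. The RangedStore interface, the BlockPrefetchingReader class, and all parameter choices are hypothetical; cache eviction, retries, and other production concerns are elided.

```java
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.*;

/**
 * Hypothetical sketch of block-based prefetching over an object store.
 * Sequential reads trigger asynchronous prefetch of upcoming blocks;
 * random reads are served from already-fetched blocks when possible.
 */
public class BlockPrefetchingReader implements AutoCloseable {

  /** Abstraction over a ranged GET against the underlying store (assumption). */
  public interface RangedStore {
    byte[] readRange(long offset, int length) throws IOException;
    long length() throws IOException;
  }

  private final RangedStore store;
  private final long fileLength;
  private final int blockSize;
  private final int prefetchCount;
  private final ExecutorService pool;
  private final Map<Long, Future<byte[]>> blocks = new ConcurrentHashMap<>();

  public BlockPrefetchingReader(RangedStore store, int blockSize, int prefetchCount)
      throws IOException {
    this.store = store;
    this.fileLength = store.length();
    this.blockSize = blockSize;
    this.prefetchCount = prefetchCount;
    this.pool = Executors.newFixedThreadPool(prefetchCount);
  }

  /** Read up to len bytes at the given absolute position; returns bytes copied or -1 at EOF. */
  public int read(long position, byte[] dest, int destOffset, int len) throws IOException {
    if (position >= fileLength) {
      return -1;
    }
    long blockIndex = position / blockSize;
    byte[] block = blockFor(blockIndex);
    // Kick off asynchronous fetches of the next few blocks to speed up sequential scans.
    for (long i = blockIndex + 1; i <= blockIndex + prefetchCount; i++) {
      if (i * (long) blockSize < fileLength) {
        prefetch(i);
      }
    }
    int offsetInBlock = (int) (position - blockIndex * (long) blockSize);
    int toCopy = Math.min(len, block.length - offsetInBlock);
    System.arraycopy(block, offsetInBlock, dest, destOffset, toCopy);
    return toCopy;
  }

  /** Return the requested block, fetching it on demand if it was never prefetched. */
  private byte[] blockFor(long blockIndex) throws IOException {
    try {
      return prefetch(blockIndex).get();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IOException("interrupted while reading block " + blockIndex, e);
    } catch (ExecutionException e) {
      throw new IOException("failed to read block " + blockIndex, e.getCause());
    }
  }

  /** Start (or reuse) an asynchronous fetch of one block. */
  private Future<byte[]> prefetch(long blockIndex) {
    return blocks.computeIfAbsent(blockIndex, idx -> pool.submit(() -> {
      long offset = idx * (long) blockSize;
      int length = (int) Math.min(blockSize, fileLength - offset);
      return store.readRange(offset, length);
    }));
  }

  @Override
  public void close() {
    pool.shutdownNow();
  }
}
```

In this kind of design the block size and prefetch depth are the main tuning knobs: larger blocks amortize per-request latency for sequential scans, while the cached blocks avoid re-reading the same ranges during random access.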
Attachments
Issue Links
- is depended upon by
  - HADOOP-18179 Boost S3A Stream Read Performance (Open)
  - HADOOP-18477 Über-jira: S3A Hadoop 3.3.9 features (Open)
- is related to
  - HIVE-25827 Parquet file footer is read multiple times, when multiple splits are created in same file (Closed)
- links to