[HADOOP-18028] High performance S3A input stream with prefetching & caching - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: 3.3.9
Component/s: fs/s3
Labels:
- pull-request-available

Language:
- java

Description

I work for Pinterest, where I developed a technique that substantially improves read throughput from the S3 file system. It helps not only sequential reads (such as reading a SequenceFile) but also random-access reads (such as reading Parquet). The technique has significantly improved the efficiency of data processing jobs at Pinterest.

I would like to contribute this feature to Apache Hadoop. More details on the technique are available in a blog post I wrote recently:
https://medium.com/pinterest-engineering/improving-efficiency-and-reducing-runtime-using-s3-read-optimization-b31da4b60fa0
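
To make the prefetching and caching idea in the title concrete, here is a minimal sketch in Java. It is an illustration under stated assumptions, not the code in the linked pull requests: the class `PrefetchingReader`, its constructor parameters, and the `blocksRequested()` counter are all hypothetical, and an in-memory byte array stands in for the remote S3 object. The sketch splits the object into fixed-size blocks, schedules the next few blocks asynchronously on every read, and keeps fetched blocks in a cache so that repeated or nearby reads do not trigger another fetch.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical illustration of block prefetching with a cache; not the S3A implementation. */
public class PrefetchingReader implements AutoCloseable {
    private final byte[] remote;      // assumption: stands in for the remote object
    private final int blockSize;      // bytes per block
    private final int prefetchCount;  // how many blocks to request ahead of the current one
    private final ExecutorService pool = Executors.newFixedThreadPool(2, r -> {
        Thread t = new Thread(r);
        t.setDaemon(true);            // do not keep the JVM alive for background fetches
        return t;
    });
    // Block index -> pending or completed fetch; doubles as the block cache.
    private final Map<Integer, CompletableFuture<byte[]>> cache = new ConcurrentHashMap<>();
    private final AtomicInteger requested = new AtomicInteger();

    public PrefetchingReader(byte[] remote, int blockSize, int prefetchCount) {
        this.remote = remote;
        this.blockSize = blockSize;
        this.prefetchCount = prefetchCount;
    }

    private int blockCount() {
        return (remote.length + blockSize - 1) / blockSize;
    }

    /** Returns the fetch for a block, scheduling it at most once per block index. */
    private CompletableFuture<byte[]> block(int index) {
        return cache.computeIfAbsent(index, i -> {
            requested.incrementAndGet();
            return CompletableFuture.supplyAsync(() -> {
                int start = i * blockSize;
                int end = Math.min(start + blockSize, remote.length);
                return Arrays.copyOfRange(remote, start, end); // simulated remote GET
            }, pool);
        });
    }

    /** Reads one byte at the given position, or returns -1 if it is out of range. */
    public int read(long position) {
        if (position < 0 || position >= remote.length) {
            return -1;
        }
        int index = (int) (position / blockSize);
        // Schedule the following blocks in the background before waiting on the current one.
        for (int i = index + 1; i <= index + prefetchCount && i < blockCount(); i++) {
            block(i);
        }
        byte[] data = block(index).join();
        return data[(int) (position % blockSize)] & 0xFF;
    }

    /** Number of distinct blocks requested so far (scheduled, not necessarily completed). */
    public int blocksRequested() {
        return requested.get();
    }

    @Override
    public void close() {
        pool.shutdown();
    }
}
```

In this sketch a sequential scan mostly finds its next block already requested, while a random read pays for one block and reuses it for nearby positions. A real implementation would also need to bound the cache size and handle fetch failures, which this sketch leaves out.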

Issue Links

is depended upon by

HADOOP-18179 Boost S3A Stream Read Performance

Open

HADOOP-18477 Über-jira: S3A Hadoop 3.3.9 features

Open

is related to

HIVE-25827 Parquet file footer is read multiple times, when multiple splits are created in same file

Closed

links to

GitHub Pull Request #3736

GitHub Pull Request #4109

GitHub Pull Request #4654

GitHub Pull Request #4675

GitHub Pull Request #4752

GitHub Pull Request #5559

GitHub Pull Request #5605

Sub-Tasks

1.	s3a prefetching stream to support unbuffer()	In Progress	Steve Loughran
2.	tune logging of prefetch problems	Open	Unassigned
3.	Ensure S3A prefetching stream memory consumption scales	Open	Unassigned
4.	Review s3a prefetching input stream retry code; synchronization	Open	Unassigned
5.	ITestS3AFileSystemStatistic failure in prefetch feature branch	Open	Samrat Deb
6.	S3A prefetching: switch to prefetching for chosen read policies	Open	Unassigned
7.	s3a prefetching to use split start/end options to limit prefetch range	In Progress	Steve Loughran
8.	S3ACachingInputStream.ensureCurrentBuffer(): lazy seek means all reads look like random IO	Open	Unassigned
9.	ITestS3APrefetchingCacheFiles teardown failure if setup() fails	Open	Unassigned
10.	S3A prefetching to support Vector IO	Open	Unassigned
11.	TestS3ACachingBlockManager fails intermittently in Yetus	Open	Unassigned

People

Assignee: Bhalchandra Pandit

Reporter: Bhalchandra Pandit

Votes: 0

Watchers: 25

Dates

Created: 29/Nov/21 16:12

Updated: 16/Jan/24 08:37

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 44h 20m
