[HADOOP-18028] High performance S3A input stream with prefetching & caching - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: 3.3.9
Component/s: fs/s3
Labels:
- pull-request-available

Language:
- java

Description

I work for Pinterest. I developed a technique that vastly improves read throughput when reading from the S3 file system. It not only helps the sequential read case (such as reading a SequenceFile) but also significantly improves read throughput in the random access case (such as reading Parquet). This technique has significantly improved the efficiency of data processing jobs at Pinterest.

I would like to contribute this feature to Apache Hadoop. More details on the technique are available in a blog post I wrote recently:
https://medium.com/pinterest-engineering/improving-efficiency-and-reducing-runtime-using-s3-read-optimization-b31da4b60fa0

Attachments

Issue Links

is depended upon by
- HADOOP-18179 Boost S3A Stream Read Performance (Open)
- HADOOP-18477 Über-jira: S3A Hadoop 3.3.9 features (Open)

is related to
- HIVE-25827 Parquet file footer is read multiple times, when multiple splits are created in same file (Closed)

links to
- GitHub Pull Request #3736
- GitHub Pull Request #4109
- GitHub Pull Request #4654
- GitHub Pull Request #4675
- GitHub Pull Request #4752
- GitHub Pull Request #5559
- GitHub Pull Request #5605

(5 links to)

Sub-Tasks

1. test failures with prefetching s3a input stream | Resolved | Monthon Klongklaew | Progress: 100% | Original Estimate: Not Specified | Time Spent: 2h 50m
2. s3a prefetching stream to move off twitter FuturePool | Resolved | Unassigned
3. document use and architecture design of prefetching s3a input stream | Resolved | Ahmar Suhail | Progress: 100% | Original Estimate: Not Specified | Time Spent: 2h 40m
4. Remove use of scala jar twitter util-core with java futures in S3A prefetching stream | Resolved | PJ Fanning | Progress: 100% | Original Estimate: Not Specified | Time Spent: 9h
5. move org.apache.hadoop.fs.common package into hadoop-common module | Resolved | Steve Loughran | Progress: 100% | Original Estimate: Not Specified | Time Spent: 1h 20m
6. S3File to store reference to active S3Object in a field. | Resolved | Bhalchandra Pandit
7. s3a prefetching stream to support unbuffer() | In Progress | Steve Loughran
8. tune logging of prefetch problems | Open | Unassigned
9. s3a prefetching to use SemaphoredDelegatingExecutor for submitting work | Resolved | Viraj Jasani
10. Convert s3a prefetching to use JavaDoc for fields and enums | Resolved | Steve Loughran
11. S3PrefetchingInputStream to support status probes when closed | Resolved | Viraj Jasani
12. Collect IOStatistics during S3A prefetching | Resolved | Ahmar Suhail | Progress: 100% | Original Estimate: Not Specified | Time Spent: 4h 10m
13. Ensure S3A prefetching stream memory consumption scales | Open | Unassigned
14. stream warns Not all bytes were read from the S3ObjectInputStream when closed | Resolved | Ahmar Suhail | Progress: 100% | Original Estimate: Not Specified | Time Spent: 1h 40m
15. Use async drain threshold to decide b/w async and sync draining | Resolved | Ahmar Suhail
16. tests in ITestS3AInputStreamPerformance are failing | Resolved | Ahmar Suhail | Progress: 100% | Original Estimate: Not Specified | Time Spent: 6h
17. Rebase s3a prefetching feature branch on top of trunk | Resolved | Ahmar Suhail
18. Remove lower limit on s3a prefetching/caching block size | Resolved | Ankit Saurabh
19. Tests in ITestS3AOpenCost are failing | Resolved | Ahmar Suhail
20. Add in configuration option to enable prefetching | Resolved | Ahmar Suhail | Progress: 100% | Original Estimate: Not Specified | Time Spent: 1h 10m
21. Review s3a prefetching input stream retry code; synchronization | Open | Unassigned
22. S3A prefetch - Implement LRU cache for SingleFilePerBlockCache | Resolved | Viraj Jasani
23. Update class names to be clear they belong to S3A prefetching | Resolved | Unassigned
24. S3A prefetching: Error logging during reads | Resolved | Ankit Saurabh
25. hadoop-aws maven build to add a prefetch profile to run all tests with prefetching | Resolved | Viraj Jasani
26. Implement readFully(long position, byte[] buffer, int offset, int length) | Resolved | Alessandro Passaro
27. rebase feature/HADOOP-18028-s3a-prefetch to trunk | Resolved | Steve Loughran | Progress: 100% | Original Estimate: Not Specified | Time Spent: 1h
28. fs.s3a.prefetch.block.size to be read through longBytesOption | Resolved | Viraj Jasani
29. ITestS3AFileSystemStatistic failure in prefetch feature branch | Open | Samrat Deb
30. ITestS3ACannedACLs failure; not in a span | Resolved | Ashutosh Gupta
31. S3A Prefetch - SingleFilePerBlockCache to use LocalDirAllocator | Resolved | Viraj Jasani
32. s3a prefetching Executor should be closed | Resolved | Viraj Jasani
33. assertion failure in ITestS3APrefetchingInputStream | Resolved | Ashutosh Gupta
34. Fix transient failure of ITestS3APrefetchingInputStream#testRandomReadLargeFile | Resolved | Viraj Jasani
35. Backport S3A prefetching stream to branch-3.3 | Resolved | Steve Loughran
36. s3a prefetch cache blocks should be accessed by RW locks | Resolved | Viraj Jasani
37. CachingBlockManager to use AtomicBoolean for closed flag | Resolved | Viraj Jasani
38. S3A prefetching: switch to prefetching for chosen read policies | Open | Unassigned
39. s3a prefetching to use split start/end options to limit prefetch range | In Progress | Steve Loughran
40. s3a large file prefetch tests are too slow, don't validate data | Resolved | Viraj Jasani
41. s3a prefetch read/write file operations should guard channel close | Resolved | Viraj Jasani
42. s3a prefetch LRU cache eviction metric | Resolved | Viraj Jasani
43. S3ACachingInputStream.ensureCurrentBuffer(): lazy seek means all reads look like random IO | Open | Unassigned
44. ITestS3APrefetchingCacheFiles teardown failure if setup() fails | Open | Unassigned
45. Use builder for prefetch CachingBlockManager | Resolved | Viraj Jasani
46. S3A prefetching to support Vector IO | Open | Unassigned
47. TestS3ACachingBlockManager fails intermittently in Yetus | Open | Unassigned

Activity

People

Assignee: Bhalchandra Pandit
Reporter: Bhalchandra Pandit
Votes: 0
Watchers: 25

Dates

Created: 29/Nov/21 16:12
Updated: 16/Jan/24 08:37

Time Tracking

Estimated: Not Specified
Remaining: 0h
Logged: 44h 20m
