[HADOOP-18179] Boost S3A Stream Read Performance - ASF JIRA

Calibrate S3A input stream performance against recent applications and data formats, and improve it where necessary.

HADOOP-18028 is a key part of this, but there are other issues/opportunities:

we could add machine-parsable trace-level logging in FSDataInputStream to record how the stream APIs are invoked, then collect data from real apps and analyze it
implement those APIs which some apps use (ByteBufferPositionedReadable), not so much for the direct implementation as to get better information from the app about its read plan
the `normal` mode doesn't switch from sequential on forward seeks. Is that always appropriate?
choose different buffering options when doing whole-file IO vs sequential vs random
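
As a rough illustration of the first and third items together, here is a minimal standalone sketch, not the S3A code. It assumes, as the item above implies, that only backward seeks currently move a stream off sequential mode, and it adds a hypothetical forward-seek threshold as one possible alternative. The class `SeekPolicyTracker`, the `forwardSeekThreshold` parameter and the `key=value` trace format are all invented for this example.

```java
import java.util.Locale;

/** Hypothetical sketch: tracks read position and picks a read policy. */
final class SeekPolicyTracker {
    enum Policy { SEQUENTIAL, RANDOM }

    // Assumed tunable: forward seeks longer than this many bytes switch to RANDOM.
    private final long forwardSeekThreshold;
    private Policy policy = Policy.SEQUENTIAL;
    private long position = 0;

    SeekPolicyTracker(long forwardSeekThreshold) {
        this.forwardSeekThreshold = forwardSeekThreshold;
    }

    /** Record a seek; returns a machine-parsable trace line. */
    String seek(long target) {
        long from = position;
        long distance = target - from;
        // A backward seek, or a long forward seek, switches to RANDOM.
        // The policy never switches back in this sketch.
        if (distance < 0 || distance > forwardSeekThreshold) {
            policy = Policy.RANDOM;
        }
        position = target;
        return String.format(Locale.ROOT,
            "op=seek from=%d to=%d distance=%d policy=%s",
            from, target, distance, policy);
    }

    /** Record a read of the given length; returns a trace line. */
    String read(int length) {
        long from = position;
        position += length;
        return String.format(Locale.ROOT,
            "op=read pos=%d len=%d policy=%s", from, length, policy);
    }

    Policy policy() {
        return policy;
    }
}
```

Trace lines in this shape could be gathered from real applications and parsed offline to see how often long forward seeks occur, which is the data needed to decide whether such a threshold is worthwhile.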

depends upon

HADOOP-18028 High performance S3A input stream with prefetching & caching

HADOOP-16202 Enhance openFile() for better read performance against object stores

is depended upon by

HADOOP-18477 Über-jira: S3A Hadoop 3.3.9 features

relates to

HADOOP-17842 S3a parquet reads slow with Spark on Kubernetes (EKS)

Resolved

Ankit Saurabh