Details
- Improvement
- Status: Open
- Major
- Resolution: Unresolved
Description
When opening files via libhdfs, we call hdfsOpen, which ultimately calls FileSystem#open(Path f, int bufferSize). As of HADOOP-15229, the HDFS client exposes a new API for opening files called openFile. The new API has two advantages: (1) it can specify file-specific configuration values in a builder-based manner (see o.a.h.fs.FSBuilder for details), and (2) it can open files asynchronously (see o.a.h.fs.FutureDataInputStreamBuilder for details).
The async file opens are similar to IMPALA-7738 (Implement timeouts for HDFS open calls). To avoid overlap between IMPALA-7738 and the async file opens in openFile, HADOOP-15691 can be used to check which filesystems open files asynchronously and which ones don't (currently only S3A opens files asynchronously).
The main use case for the new openFile API is Impala-S3 performance. Performance benchmarks have shown that setting fs.s3a.experimental.input.fadvise to RANDOM for Parquet files can significantly improve performance; however, this setting also adversely affects scans of non-splittable file formats such as gzipped files (see HADOOP-13203). One solution is simply to document that setting fs.s3a.experimental.input.fadvise to RANDOM improves performance for Parquet. A better solution would be to use the new openFile API to specify different values of fadvise depending on the file type.
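As a rough illustration of the per-file-type idea, the sketch below picks an fadvise value from a file path. The helper names (EndsWith, FadviseForPath) and the extension-based detection are assumptions made for this example only; Impala would more likely decide from the table's file format metadata, and the chosen value would still have to be passed as a per-file option through openFile once libhdfs exposes it (HDFS-14478). The values "random", "sequential", and "normal" are the policies accepted by fs.s3a.experimental.input.fadvise.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper (not part of Impala or libhdfs): true if `s` ends with `suffix`.
inline bool EndsWith(const std::string& s, const std::string& suffix) {
  return s.size() >= suffix.size() &&
         s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Hypothetical helper: chooses a value for fs.s3a.experimental.input.fadvise
// based on the file name. The extension checks are illustrative assumptions.
inline std::string FadviseForPath(const std::string& path) {
  // Lower-case a copy so the extension match is case-insensitive.
  std::string lower(path);
  std::transform(lower.begin(), lower.end(), lower.begin(),
                 [](unsigned char c) { return static_cast<char>(std::tolower(c)); });

  // Parquet scans seek between the footer and column chunks, so random IO fits.
  if (EndsWith(lower, ".parquet")) return "random";

  // Gzipped files are not splittable and are read front to back.
  if (EndsWith(lower, ".gz")) return "sequential";

  // Anything else keeps the default policy.
  return "normal";
}
```

With this kind of mapping, a Parquet file would be opened with random-access hints while a gzipped text file in the same bucket keeps sequential reads, which avoids the regression described in HADOOP-13203.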
This work is dependent on exposing the new openFile API via libhdfs (HDFS-14478).
Issue Links
- depends upon: HDFS-14478 Add libhdfs APIs for openFile (Resolved)