[PARQUET-2493] HadoopInputFile to pass down FileStatus when opening file. - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 1.14.0
Fix Version/s: None
Component/s: parquet-hadoop
Labels: None

Description

In the current version of the HadoopInputFile implementation:

https://github.com/apache/parquet-java/blob/apache-parquet-1.14.0/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopInputFile.java

When newStream() is called, the FileStatus that was already retrieved in order to create the HadoopInputFile is not passed on. As a result, each FileSystem implementation will most likely have to request it again when opening the file, because it needs to know whether the file exists, how large it is, and any other information required to open it.
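To make the cost concrete, here is a minimal sketch in plain Java. It is not Parquet or Hadoop code: `CountingStore` and `Status` are hypothetical stand-ins for a FileSystem and a FileStatus, and the counter stands in for what would be a remote metadata request on an object store.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a FileStatus: only the file length is modelled.
final class Status {
    final long length;

    Status(long length) {
        this.length = length;
    }
}

// Hypothetical in-memory store that counts metadata lookups.
// On a real object store each lookup would be a remote request.
final class CountingStore {
    private final Map<String, byte[]> files = new HashMap<>();
    private int statusLookups = 0;

    void put(String path, byte[] data) {
        files.put(path, data);
    }

    // Metadata lookup: the call we would like to avoid repeating.
    Status getStatus(String path) {
        statusLookups++;
        byte[] data = files.get(path);
        if (data == null) {
            throw new IllegalArgumentException("No such file: " + path);
        }
        return new Status(data.length);
    }

    // Open without a status: the store has to look the file up again.
    byte[] open(String path) {
        return open(path, getStatus(path));
    }

    // Open with a caller-supplied status: no further lookup is performed.
    byte[] open(String path, Status status) {
        byte[] data = files.get(path);
        if (data == null || data.length != status.length) {
            throw new IllegalStateException("Status does not match file: " + path);
        }
        return data;
    }

    int statusLookups() {
        return statusLookups;
    }
}

final class StatusReuseDemo {
    public static void main(String[] args) {
        CountingStore dropped = new CountingStore();
        dropped.put("a.parquet", new byte[] {1, 2, 3});
        dropped.getStatus("a.parquet");   // status fetched when the input file is created
        dropped.open("a.parquet");        // status dropped, so it is fetched again
        System.out.println("status dropped: " + dropped.statusLookups() + " lookups");

        CountingStore reused = new CountingStore();
        reused.put("a.parquet", new byte[] {1, 2, 3});
        Status status = reused.getStatus("a.parquet");
        reused.open("a.parquet", status); // status passed down, no second lookup
        System.out.println("status reused: " + reused.statusLookups() + " lookups");
    }
}
```

In this model the first path records two lookups and the second records one, which is the saving the issue is asking for on every file opened.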

Hadoop's openFile() builder API does support passing the FileStatus, but it is not available in older releases, so Parquet cannot use the API directly until it requires Hadoop 3.2.0+ as its minimum version. And because it is a complex and extensible design, it is very hard to call through reflection.

HADOOP-19131 adds reflection-friendly entry points for this and other operations, so for releases with the new class, Parquet can pick up the speedup.
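The general pattern of picking up an optional API through reflection can be sketched as below. This is plain JDK code, not the HADOOP-19131 implementation: `OptionalApi`, `findStatic` and `callStatic` are hypothetical helpers, and `example.hypothetical.FastOpen` is a placeholder class name that is assumed not to exist. A real integration would probe for the class added by HADOOP-19131 and fall back to the existing open path when it is absent.

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.Optional;

// Hypothetical helper: resolve a public static method by name at runtime,
// so that the caller compiles and runs even when the class is not on the classpath.
final class OptionalApi {
    private OptionalApi() {
    }

    // Returns the method if the class exists and exposes a matching public
    // static method; returns empty otherwise.
    static Optional<Method> findStatic(String className, String methodName, Class<?>... paramTypes) {
        try {
            Class<?> cls = Class.forName(className);
            Method method = cls.getMethod(methodName, paramTypes);
            return Modifier.isStatic(method.getModifiers()) ? Optional.of(method) : Optional.empty();
        } catch (ClassNotFoundException | NoSuchMethodException | LinkageError e) {
            return Optional.empty();
        }
    }

    // Invokes a static method, converting checked reflection failures
    // into an unchecked exception.
    static Object callStatic(Method method, Object... args) {
        try {
            return method.invoke(null, args);
        } catch (IllegalAccessException | InvocationTargetException e) {
            throw new IllegalStateException("Reflective call failed: " + method, e);
        }
    }
}

final class ProbeDemo {
    public static void main(String[] args) {
        // A JDK method stands in for an entry point that is present.
        Optional<Method> present = OptionalApi.findStatic("java.lang.Math", "abs", int.class);
        // Placeholder name standing in for an entry point that is missing.
        Optional<Method> absent =
            OptionalApi.findStatic("example.hypothetical.FastOpen", "open", String.class);

        if (present.isPresent()) {
            System.out.println("fast path result: " + OptionalApi.callStatic(present.get(), -5));
        }
        if (!absent.isPresent()) {
            System.out.println("entry point missing, using the existing open path");
        }
    }
}
```

Resolving the method once and caching the `Optional` keeps the reflective lookup out of the per-file path, which matters when the goal is to reduce the cost of each open.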

Issue Links

blocks
PARQUET-2486 Improve Parquet IO Performance within cloud datalakes (In Progress)

depends upon
HADOOP-19131 WrappedIO to export modern filesystem/statistics APIs in a reflection friendly form (In Progress)
SPARK-48571 Reduce the number of accesses to S3 object storage (Open)

is depended upon by
HADOOP-19199 Include FileStatus when opening a file from FileSystem (Resolved)

People

Assignee: Unassigned

Reporter: Oliver Caballero Alvarez

Votes: 0

Watchers: 2

Dates

Created: 08/Jun/24 11:21

Updated: 23/Jun/24 03:33