  Parquet / PARQUET-2493

HadoopInputFile to pass down FileStatus when opening file.


Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.14.0
    • Fix Version/s: None
    • Component/s: parquet-hadoop
    • Labels: None

    Description

      In the current version of the HadoopInputFile implementation:

      https://github.com/apache/parquet-java/blob/apache-parquet-1.14.0/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopInputFile.java

       

      When newStream() is called, the reference to the FileStatus is lost, even though it was already fetched in order to create the HadoopInputFile. As a result, each FileSystem implementation will most likely have to request it again when opening the file, because it needs to know whether the file exists, how large it is, and other relevant details.
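
      The cost of dropping the status can be illustrated with a minimal, self-contained sketch. It does not use the Hadoop or Parquet APIs: Status, CountingFs, and both open() overloads are hypothetical stand-ins invented for this example, and the lookup counter is assumed to model something like a metadata request against an object store.

```java
import java.util.HashMap;
import java.util.Map;

public class FileStatusReuseSketch {

    /** Hypothetical stand-in for a file status: just a path and a length. */
    static final class Status {
        final String path;
        final long length;

        Status(String path, long length) {
            this.path = path;
            this.length = length;
        }
    }

    /** Hypothetical file system that counts how many metadata lookups it performs. */
    static final class CountingFs {
        private final Map<String, Long> lengths = new HashMap<>();
        int statusLookups = 0;

        void put(String path, long length) {
            lengths.put(path, length);
        }

        Status getStatus(String path) {
            statusLookups++;
            Long len = lengths.get(path);
            if (len == null) {
                throw new IllegalArgumentException("No such file: " + path);
            }
            return new Status(path, len);
        }

        /** Open by path only: the file system has to look the metadata up again. */
        long open(String path) {
            return open(getStatus(path));
        }

        /** Open with a caller-supplied status: no extra lookup is needed. */
        long open(Status status) {
            return status.length; // stands in for "a stream of this length"
        }
    }

    public static void main(String[] args) {
        CountingFs fs = new CountingFs();
        fs.put("/data/part-0.parquet", 1024L);

        // Lookup made when the input file object is created.
        Status status = fs.getStatus("/data/part-0.parquet");

        // Status dropped: opening by path triggers a second lookup.
        fs.open(status.path);
        System.out.println("lookups after open(path): " + fs.statusLookups);

        // Status passed down: the count does not change.
        fs.open(status);
        System.out.println("lookups after open(status): " + fs.statusLookups);
    }
}
```

      In this model, opening by path doubles the number of metadata lookups, while opening with the already-known status adds none. That is the saving this issue asks for.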

      Hadoop's openFile() builder API does support passing the FileStatus down, but it is not available in older releases, so Parquet cannot use the API directly until it supports only Hadoop 3.2.0+. And because it's a complex and extensible design, it is very hard to call through reflection.

      HADOOP-19131 adds reflection-friendly entry points for this and other operations, so for releases with the new class, Parquet can pick up the speedup.

            People

              Assignee: Unassigned
              Reporter: Oliver Caballero Alvarez (ocaballero)