IMPALA-8523

Migrate hdfsOpen to builder-based openFile API

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Backend
    • Labels: None

    Description

      When opening files via libhdfs, we call hdfsOpen, which ultimately calls FileSystem#open(Path f, int bufferSize). As of HADOOP-15229, the HDFS client exposes a new API for opening files called openFile. The new API has two advantages: (1) it can specify file-specific configuration values in a builder-based manner (see o.a.h.fs.FSBuilder for details), and (2) it can open files asynchronously (e.g. see o.a.h.fs.FutureDataInputStreamBuilder for details).

      The async file opens are similar to IMPALA-7738 (Implement timeouts for HDFS open calls). To avoid overlap between IMPALA-7738 and the async file opens in openFile, HADOOP-15691 can be used to check which filesystems open files asynchronously and which ones don't (currently only S3A opens files asynchronously).

      The main use case for the new openFile API is Impala-S3 performance. Performance benchmarks have shown that setting fs.s3a.experimental.input.fadvise to RANDOM for Parquet files can significantly improve performance. However, this setting also adversely affects scans of non-splittable file formats such as gzipped files (see HADOOP-13203). One solution is simply to document that setting fs.s3a.experimental.input.fadvise to RANDOM improves performance for Parquet. A better solution would be to use the new openFile API to specify different fadvise values depending on the file type.
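
      As a rough illustration of the per-file-type idea, the selection logic could look something like the sketch below. The FileFormat enum, OpenOption struct, and function names are hypothetical and are not Impala or libhdfs APIs. Only the Parquet-to-random mapping comes from the description above; using "sequential" for gzipped files and "normal" for everything else is an assumption. How the resulting option would be passed to openFile depends on the libhdfs bindings from HDFS-14478 and is not shown.

```cpp
#include <string>

// Hypothetical file-format enum for illustration only; Impala's real
// file-format types are named differently.
enum class FileFormat { kParquet, kGzipText, kOther };

// A single key/value option that a builder-based open call could carry.
// Hypothetical type, not part of libhdfs.
struct OpenOption {
  std::string key;
  std::string value;
};

// S3A configuration key named in the issue description.
constexpr const char* kFadviseKey = "fs.s3a.experimental.input.fadvise";

// Picks an S3A input policy for one file based on its format.
// Parquet -> "random" follows the issue description. The other two
// mappings are assumptions made for this sketch.
std::string ChooseFadvise(FileFormat format) {
  switch (format) {
    case FileFormat::kParquet:
      return "random";      // columnar reads seek around the file
    case FileFormat::kGzipText:
      return "sequential";  // non-splittable, read front to back (assumed)
    default:
      return "normal";      // assumed fallback policy
  }
}

// Builds the per-file option that would be attached to the open request.
OpenOption FadviseOptionFor(FileFormat format) {
  return OpenOption{kFadviseKey, ChooseFadvise(format)};
}
```

      With this shape, a scanner opening a Parquet file would request FadviseOptionFor(FileFormat::kParquet), while a gzipped text scan would request a different policy, so neither workload has to live with a cluster-wide setting.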

      This work is dependent on exposing the new openFile API via libhdfs (HDFS-14478).

            People

              Assignee: Sahil Takiar
              Reporter: Sahil Takiar