IMPALA-9606

ABFS reads should use hdfsPreadFully

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: Impala 4.0.0
    • Component/s: Backend
    • Labels: None

    Description

      In IMPALA-8525, HDFS positional reads (preads) were enabled by default when reading data from S3. That change deferred enabling preads for ABFS because they didn't significantly improve performance there. After further investigation of the ABFS input streams, I think it is safe to use hdfsPreadFully for ABFS reads.

      The ABFS client uses a different model for fetching data than S3A. The details are beyond the scope of this JIRA, but the difference comes from an ABFS feature called "read-aheads": ABFS has logic to pre-fetch data it expects the client to need. By default, it pre-fetches (number of cores) * 4 MB of data. If the requested data is already in the client cache, it is read from the cache.
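
      As a rough illustration of the default pre-fetch volume described above, the arithmetic can be sketched as follows. The helper name DefaultPrefetchBytes is hypothetical, and treating "4 MB" as 4 MiB is an assumption; the actual ABFS settings may be configured differently.

```cpp
#include <cassert>
#include <cstdint>

// 4 MB per core, as stated in the description; interpreted here as 4 MiB
// (an assumption made for this sketch).
constexpr int64_t kBytesPerCore = 4LL * 1024 * 1024;

// Hypothetical helper: default amount of data ABFS would pre-fetch on a
// machine with `num_cores` cores, following the "# cores * 4 MB" rule.
inline int64_t DefaultPrefetchBytes(int64_t num_cores) {
  assert(num_cores > 0);
  return num_cores * kBytesPerCore;
}
```

      Under these assumptions, a 16-core machine would pre-fetch 64 MiB by default.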

      Even so, there is no real drawback to using hdfsPreadFully for ABFS reads. It is also safer: the current ABFS implementation always returns the requested amount of data, but only the hdfsPreadFully API guarantees it.
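
      The difference in guarantees can be sketched without libhdfs. In the sketch below, PreadFn is a hypothetical stand-in for a positional read that may return fewer bytes than requested, and ReadFully shows the loop a caller would otherwise need in order to get "all or nothing" semantics. This is not Impala's or libhdfs's actual code; MakeShortReader is a toy in-memory file used only for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <functional>
#include <string>

// Hypothetical stand-in for a positional read: reads up to `length` bytes at
// `offset` into `buf`. Returns the number of bytes read, 0 at end of file, or
// -1 on error. It is allowed to return fewer bytes than requested.
using PreadFn = std::function<int64_t(int64_t offset, char* buf, int64_t length)>;

// Calls `pread` repeatedly until exactly `length` bytes have been read.
// Returns true on success, false on error or premature end of file.
bool ReadFully(const PreadFn& pread, int64_t offset, char* buf, int64_t length) {
  assert(length >= 0);
  int64_t done = 0;
  while (done < length) {
    int64_t n = pread(offset + done, buf + done, length - done);
    if (n <= 0) return false;  // error (-1) or EOF (0) before the buffer is full
    done += n;
  }
  return true;
}

// Toy in-memory "file" whose reads return at most `max_chunk` bytes per call,
// mimicking a stream that produces short reads.
PreadFn MakeShortReader(const std::string& data, int64_t max_chunk) {
  return [data, max_chunk](int64_t offset, char* buf, int64_t length) -> int64_t {
    const int64_t size = static_cast<int64_t>(data.size());
    if (offset < 0 || offset >= size) return 0;  // treat out-of-range as EOF
    const int64_t n = std::min({length, max_chunk, size - offset});
    std::memcpy(buf, data.data() + offset, static_cast<std::size_t>(n));
    return n;
  };
}
```

      A caller that relies on a single plain positional read only works as long as the stream happens to return everything in one call. With a "fully" variant, the loop (or the failure) is part of the API contract.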

            People

              Assignee: Sahil Takiar (stakiar)
              Reporter: Sahil Takiar (stakiar)