Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
Description
In IMPALA-8525, HDFS preads were enabled by default when reading data from S3. IMPALA-8525 deferred enabling preads for ABFS because they didn't significantly improve performance there. After further investigation into the ABFS input streams, I think it is safe to use hdfsPreadFully for ABFS reads.
The ABFS client uses a different model for fetching data than S3A. The details are beyond the scope of this JIRA, but the difference is related to an ABFS feature called "read-aheads": ABFS has logic to pre-fetch data it thinks the client will require. By default, it pre-fetches (number of cores) * 4 MB of data. If the requested data already exists in the client cache, it is read from the cache.
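As a rough illustration of the default read-ahead size described above, the arithmetic can be sketched as follows. The function and constant names are hypothetical and are not part of the ABFS client; the only inputs taken from this JIRA are the 4 MB figure and the multiplication by the number of cores.

```python
# Hypothetical helper: estimates the default ABFS read-ahead volume
# using the "(number of cores) * 4 MB" rule stated in this JIRA.
READAHEAD_BLOCK_BYTES = 4 * 1024 * 1024  # 4 MB per core, per the description


def default_readahead_bytes(num_cores):
    """Return the estimated number of bytes pre-fetched by default."""
    if num_cores < 1:
        raise ValueError("num_cores must be at least 1")
    return num_cores * READAHEAD_BLOCK_BYTES


# Example: a machine with 8 cores.
print(default_readahead_bytes(8) // (1024 * 1024), "MB")
```

On a host with many cores, this estimate suggests a sizeable amount of data may already be in the client cache before a read is issued.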
However, there is no real drawback to using hdfsPreadFully for ABFS reads. It's definitely safer: the current ABFS implementation always returns the requested amount of data, but only the hdfsPreadFully API guarantees that.
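The difference between the two guarantees can be sketched with a small simulation. This is not the libhdfs API: `short_pread` is a hypothetical stand-in for a positional read that may return fewer bytes than requested, and `pread_fully` is an illustrative loop showing the kind of contract a "read fully" call provides (either all requested bytes or an error).

```python
DATA = b"0123456789abcdef"


def short_pread(length, offset, max_chunk=3):
    """Hypothetical positional read that may return fewer bytes than requested."""
    return DATA[offset:offset + min(length, max_chunk)]


def pread_fully(pread, length, offset):
    """Call `pread` repeatedly until exactly `length` bytes have been read.

    Raises EOFError if the source ends before `length` bytes are available.
    """
    chunks = []
    remaining = length
    while remaining > 0:
        chunk = pread(remaining, offset)
        if not chunk:
            raise EOFError("source ended with %d bytes still unread" % remaining)
        chunks.append(chunk)
        offset += len(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


# A single positional read may come back short...
print(short_pread(8, 2))
# ...while the "fully" variant returns all 8 requested bytes or fails.
print(pread_fully(short_pread, 8, 2))
```

With a plain positional read, the caller has to check the returned length and retry; with the "fully" variant, that loop lives behind the API, so the caller does not depend on an implementation detail such as ABFS currently returning the full amount.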
Issue Links
- is related to IMPALA-8525 preads should use hdfsPreadFully rather than hdfsPread (Resolved)