Details
-
Sub-task
-
Status: Closed
-
Major
-
Resolution: Fixed
-
None
Description
https://impala.apache.org/docs/build/html/topics/impala_scalability.html states:
Because this feature only involves HDFS data files, it does not apply to non-HDFS tables, such as Kudu or HBase tables, or tables that store their data on cloud services such as S3 or ADLS.
This section should be updated because the file handle cache now supports S3 files.
We should add a section to the docs similar to what we added when support for remote HDFS files was added to the file handle cache:
In Impala 3.2 and higher, file handle caching also applies to remote HDFS file handles. This is controlled by the cache_remote_file_handles flag for an impalad. It is recommended that you use the default value of true as this caching prevents your NameNode from overloading when your cluster has many remote HDFS reads.
Like cache_remote_file_handles, the flag cache_s3_file_handles has been added as an impalad startup option (the flag is enabled by default).
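For illustration only, the two flags could be spelled out explicitly among the impalad startup options as sketched here. This assumes the usual `--flag=value` startup syntax; both values shown are the defaults described in this issue, so setting them is only needed to make the configuration explicit.

```
--cache_remote_file_handles=true
--cache_s3_file_handles=true
```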
Unlike HDFS, S3 has no NameNode. The benefit for S3 is that caching eliminates a call to getFileStatus() on the target S3 file. So "prevents your NameNode from overloading when your cluster has many remote HDFS reads" should be changed to something like "avoids an unnecessary call to S3AFileSystem#getFileStatus(), which reduces the number of API calls made to S3."