Description
The WebHDFS protocol supports a LISTSTATUS_BATCH operation, which retrieves the file listing of a large directory in chunks.
This works as expected against the WebHDFS service embedded in the NameNode. With HTTPFS, however, any call to LISTSTATUS_BATCH returns the entire listing in a single response rather than in batches, so it behaves effectively like LISTSTATUS.
This seems to happen because HTTPFS falls back to org.apache.hadoop.fs.FileSystem#listStatusBatch. That method is intended to be overridden, but the FileSystem implementation that HTTPFS uses does not override it, which leads to this limitation.
This feature (LISTSTATUS_BATCH) was added to HTTPFS by HDFS-10823, but based on my testing it does not work as intended. I suspect the cause is that the above Jira added the listStatusBatch operation to WebHdfsFileSystem and HttpFSFileSystem, while behind the scenes HTTPFS seems to use DistributedFileSystem. It therefore falls back to the default implementation, org.apache.hadoop.fs.FileSystem#listStatusBatch, which returns all entries in a single batch.
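To make the behaviour above easier to reproduce, here is a minimal client-side sketch that pages through a directory with LISTSTATUS_BATCH and records how many entries each batch contained. It is not taken from the Hadoop code base. The helper names (`fetch_batch`, `batch_sizes`) are hypothetical, and it assumes an unsecured endpoint that needs no authentication parameters. The request and response shape (`startAfter`, `DirectoryListing`, `partialListing`, `remainingEntries`, `pathSuffix`) follows the WebHDFS REST API documentation as I understand it.

```python
import json
import urllib.parse
import urllib.request


def fetch_batch(base_url, path, start_after=None):
    """GET one LISTSTATUS_BATCH page and return its DirectoryListing object.

    Assumes an unsecured endpoint; `path` must start with "/".
    """
    query = {"op": "LISTSTATUS_BATCH"}
    if start_after:
        # Resume the listing after the last child seen in the previous batch.
        query["startAfter"] = start_after
    url = f"{base_url}/webhdfs/v1{path}?{urllib.parse.urlencode(query)}"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["DirectoryListing"]


def batch_sizes(base_url, path, fetch=fetch_batch):
    """Page through a directory and return the entry count of each batch.

    `fetch` is injectable so the paging loop can be exercised without a cluster.
    """
    sizes = []
    start_after = None
    while True:
        listing = fetch(base_url, path, start_after)
        entries = listing["partialListing"]["FileStatuses"]["FileStatus"]
        sizes.append(len(entries))
        # Stop when the server reports nothing left (or returns an empty page).
        if not entries or listing["remainingEntries"] == 0:
            return sizes
        start_after = entries[-1]["pathSuffix"]
```

Running `batch_sizes` against the NameNode endpoint and then against the HTTPFS endpoint for the same large directory should show the difference: several batches from the NameNode (assuming the directory exceeds the server-side listing limit), versus one batch holding every entry from HTTPFS if the behaviour reported here is present.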
Issue Links
- is caused by
  - HDFS-10823 Implement HttpFSFileSystem#listStatusIterator (Resolved)