Hadoop HDFS / HDFS-14663

HttpFS: LISTSTATUS_BATCH does not return batches

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.3.0
    • Fix Version/s: None
    • Component/s: httpfs
    • Labels: None

    Description

      The WebHDFS protocol supports a LISTSTATUS_BATCH operation, which retrieves the file listing for a large directory in chunks.
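For reference, a paged listing request has roughly the following shape. This is a minimal sketch: the helper class and its method are hypothetical, the host and port are placeholders, and the directory path is assumed to need no escaping. Only the op=LISTSTATUS_BATCH and startAfter query parameters come from the WebHDFS REST API.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Hadoop: builds LISTSTATUS_BATCH request URLs.
class ListStatusBatchUrl {

    // baseUrl    e.g. "http://httpfs-host:14000" (placeholder host and port)
    // dirPath    absolute HDFS path, assumed to contain no characters that need escaping
    // startAfter name of the last child already received, or null for the first batch
    static String build(String baseUrl, String dirPath, String startAfter) {
        StringBuilder url = new StringBuilder(baseUrl)
                .append("/webhdfs/v1")
                .append(dirPath)
                .append("?op=LISTSTATUS_BATCH");
        if (startAfter != null) {
            // Form-encode the child name so it is safe inside a query string.
            url.append("&startAfter=")
               .append(URLEncoder.encode(startAfter, StandardCharsets.UTF_8));
        }
        return url.toString();
    }

    public static void main(String[] args) {
        // First batch, then the batch that follows the child "part-00999".
        System.out.println(build("http://httpfs-host:14000", "/data/big", null));
        System.out.println(build("http://httpfs-host:14000", "/data/big", "part-00999"));
    }
}
```

A client would normally keep issuing such requests, passing the last child name it received as startAfter, until the response reports no remaining entries.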

      When using the WebHDFS service embedded in the NameNode, this works as expected. When using HttpFS, however, any call to LISTSTATUS_BATCH returns the entire listing in a single response rather than in batches, so it behaves effectively like LISTSTATUS.

      This seems to be because HttpFS falls back to the method org.apache.hadoop.fs.FileSystem#listStatusBatch, which is intended to be overridden. The FileSystem implementation that HttpFS uses does not override it, which leads to this limitation.

      This feature (LISTSTATUS_BATCH) was added to HttpFS by HDFS-10823, but based on my testing it does not work as intended. I suspect this is because that Jira added the listStatusBatch operation to WebHdfsFileSystem and HttpFSFileSystem, while behind the scenes HttpFS seems to use DistributedFileSystem. It therefore falls back to the default implementation, org.apache.hadoop.fs.FileSystem#listStatusBatch, which returns all entries in a single batch.
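The suspected fallback can be illustrated with a small self-contained model. Everything in it is a hypothetical stand-in rather than Hadoop code: BaseFs plays the role of a file system whose default listStatusBatch returns the whole listing with no "more" flag, and BatchingFs plays the role of an implementation that overrides it. The real Hadoop types and signatures differ.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model only: these classes are hypothetical stand-ins, not Hadoop types.
class BatchFallbackModel {

    // One page of a listing plus a flag saying whether more entries remain.
    static final class Batch {
        final List<String> entries;
        final boolean hasMore;
        Batch(List<String> entries, boolean hasMore) {
            this.entries = entries;
            this.hasMore = hasMore;
        }
    }

    // Base class: the default implementation returns everything at once.
    static class BaseFs {
        final List<String> children; // child names, assumed sorted
        BaseFs(List<String> children) { this.children = children; }

        Batch listStatusBatch(String startAfter) {
            // Default behaviour in this model: ignore startAfter, report nothing remaining.
            return new Batch(new ArrayList<>(children), false);
        }
    }

    // Subclass that overrides the default and really pages through the listing.
    static class BatchingFs extends BaseFs {
        final int batchSize;
        BatchingFs(List<String> children, int batchSize) {
            super(children);
            this.batchSize = batchSize;
        }

        @Override
        Batch listStatusBatch(String startAfter) {
            int from = 0;
            if (startAfter != null) {
                // Skip every child up to and including startAfter.
                while (from < children.size()
                        && children.get(from).compareTo(startAfter) <= 0) {
                    from++;
                }
            }
            int to = Math.min(from + batchSize, children.size());
            return new Batch(new ArrayList<>(children.subList(from, to)),
                             to < children.size());
        }
    }

    // Pages through a listing and counts how many calls were needed.
    static int countBatches(BaseFs fs) {
        int calls = 0;
        String startAfter = null;
        Batch batch;
        do {
            batch = fs.listStatusBatch(startAfter);
            calls++;
            if (!batch.entries.isEmpty()) {
                startAfter = batch.entries.get(batch.entries.size() - 1);
            }
        } while (batch.hasMore);
        return calls;
    }

    public static void main(String[] args) {
        List<String> names = List.of("a", "b", "c", "d", "e");
        System.out.println(countBatches(new BaseFs(names)));        // default: one call
        System.out.println(countBatches(new BatchingFs(names, 2))); // override: several calls
    }
}
```

In this model the caller cannot tell the two apart by interface alone; only the number of round trips differs, which matches the symptom described above, where LISTSTATUS_BATCH through HttpFS returns the full listing in one response.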

            People

              Assignee: Siyao Meng (smeng)
              Reporter: Stephen O'Donnell (sodonnell)
              Votes: 0
              Watchers: 6
