  1. Hive
  2. HIVE-14269 Performance optimizations for data on S3
  3. HIVE-16003

Blobstores should use fs.listFiles(path, recursive=true) rather than FileUtils.listStatusRecursively

Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate

    Description

      FileUtils.listStatusRecursively can be slow on blobstores because it issues a separate listStatus call for every directory under the given path. This is especially costly for tables with multiple levels of partitioning, where every partition directory at every level is listed individually.

      The FileSystem API provides listFiles(path, recursive), which a FileSystem implementation can back with an optimized recursive directory listing.

      The problem is that the listFiles(path, recursive) API doesn't provide an option to pass in a PathFilter, while FileUtils.listStatusRecursively uses a custom HIDDEN_FILES_PATH_FILTER.

      To fix this we could either:

      1: Modify the FileSystem API to provide a listFiles(path, recursive, PathFilter) method (probably the cleanest solution)
      2: Add conditional logic so that blobstores invoke listFiles(path, recursive) and the rest of the code uses the current implementation of FileUtils.listStatusRecursively
      3: Replace the implementation of FileUtils.listStatusRecursively with listFiles(path, recursive) and apply the PathFilter on the results (not sure what optimizations can be made if PathFilter objects are passed into FileSystem methods - maybe PathFilter objects are pushed to the NameNode?)
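
A minimal sketch of the filtering step in option 3, written against plain strings so it runs without Hadoop on the classpath. The class name FlatListingFilter, the method filterFlatListing, and the sample s3a:// paths are hypothetical. It assumes the hidden-file rule is "name starts with `_` or `.`". It also assumes that a per-directory filter skips everything beneath a hidden directory, so a filter applied to a flat listing has to check every path component below the root, not only the file name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class FlatListingFilter {

    // Assumed hidden-file rule: names beginning with '_' or '.' are hidden.
    public static final Predicate<String> VISIBLE_NAME =
        name -> !name.startsWith("_") && !name.startsWith(".");

    /**
     * Filters the result of a flat recursive listing. A file is kept only if
     * every path component between root and the file passes nameFilter, which
     * reproduces the pruning that a per-directory filter would have done.
     */
    public static List<String> filterFlatListing(
            String root, List<String> files, Predicate<String> nameFilter) {
        String prefix = root.endsWith("/") ? root : root + "/";
        List<String> kept = new ArrayList<>();
        for (String file : files) {
            if (!file.startsWith(prefix) || file.length() == prefix.length()) {
                continue; // not a file below root
            }
            String relative = file.substring(prefix.length());
            boolean visible = true;
            for (String component : relative.split("/")) {
                if (component.isEmpty()) {
                    continue; // tolerate doubled separators
                }
                if (!nameFilter.test(component)) {
                    visible = false; // hidden file or hidden ancestor directory
                    break;
                }
            }
            if (visible) {
                kept.add(file);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical output of a single recursive listing of a partitioned table.
        List<String> listing = List.of(
            "s3a://bucket/tbl/year=2017/month=01/000000_0",
            "s3a://bucket/tbl/year=2017/month=01/_SUCCESS",
            "s3a://bucket/tbl/.hive-staging/tmp/000000_0",
            "s3a://bucket/tbl/year=2017/month=02/000000_0");
        for (String f : filterFlatListing("s3a://bucket/tbl", listing, VISIBLE_NAME)) {
            System.out.println(f);
        }
    }
}
```

With Hadoop, the input would instead come from fs.listFiles(path, true), which returns a RemoteIterator<LocatedFileStatus>, and the name check would be delegated to the PathFilter. Checking ancestors costs a little CPU per file but requires no additional listing calls.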

      People

        Assignee: Unassigned
        Reporter: Sahil Takiar (stakiar)
        Votes: 0
        Watchers: 6