Details
Type: Sub-task
Status: Resolved
Priority: Major
Resolution: Duplicate
Description
FileUtils.listStatusRecursively can be slow on blobstores because it issues a separate listStatus call for every directory under the given path. This can be especially bad on tables with multiple levels of partitioning.
The FileSystem API provides listFiles(path, recursive), which can be used to perform an optimized recursive directory listing.
The problem is that listFiles(path, recursive) doesn't provide an option to pass in a PathFilter, while FileUtils.listStatusRecursively uses a custom HIDDEN_FILES_PATH_FILTER.
To fix this we could do one of the following:
1: Modify the FileSystem API to provide a listFiles(path, recursive, PathFilter) method (probably the cleanest solution)
2: Add conditional logic so that blobstores invoke listFiles(path, recursive) and the rest of the code uses the current implementation of FileUtils.listStatusRecursively
3: Replace the implementation of FileUtils.listStatusRecursively with listFiles(path, recursive) and apply the PathFilter to the results. It is unclear what optimizations are possible when PathFilter objects are passed into FileSystem methods (maybe PathFilter objects are pushed to the NameNode?)
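
One detail of option 3 can be sketched in plain Java. This assumes the current recursive implementation applies the filter at every directory level, so files under a hidden directory are never listed. A flat listing such as FileSystem.listFiles(path, true) returns only files, so a post-filter would have to check every path component below the root rather than just the file name. The sketch does not call Hadoop: it models the listing as a list of path strings. The class and method names are hypothetical, and the hidden rule (names starting with "_" or ".") is an assumption about what HIDDEN_FILES_PATH_FILTER rejects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class HiddenPathPostFilter {

    // Assumed rule: a name is hidden if it starts with "_" or ".".
    static boolean isHidden(String name) {
        return name.startsWith("_") || name.startsWith(".");
    }

    // Keeps only the files that have no hidden component between root and the file.
    // 'files' stands in for the result of a flat recursive listing.
    static List<String> filterVisible(String root, List<String> files) {
        String prefix = root.endsWith("/") ? root : root + "/";
        List<String> visible = new ArrayList<>();
        for (String file : files) {
            if (!file.startsWith(prefix)) {
                continue; // not under root, ignore
            }
            String relative = file.substring(prefix.length());
            boolean hidden = false;
            // Check directories and the file name alike.
            for (String component : relative.split("/")) {
                if (isHidden(component)) {
                    hidden = true;
                    break;
                }
            }
            if (!hidden) {
                visible.add(file);
            }
        }
        return visible;
    }

    public static void main(String[] args) {
        List<String> listing = Arrays.asList(
            "/warehouse/t/year=2020/month=01/part-00000",
            "/warehouse/t/year=2020/_temporary/0/part-00001",
            "/warehouse/t/year=2020/month=01/.part-00000.crc",
            "/warehouse/t/_SUCCESS");
        // Only the first entry has no hidden component.
        System.out.println(filterVisible("/warehouse/t", listing));
    }
}
```

Filtering on the file name alone would let /warehouse/t/year=2020/_temporary/0/part-00001 through, which the recursive implementation would presumably have skipped. Components of the root itself are deliberately not checked, so a table located under a hidden parent directory is still listed.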