Details
-
Improvement
-
Status: Resolved
-
Minor
-
Resolution: Fixed
-
2.0.0
-
None
Description
The Hadoop FileSystem.exists() and FileSystem.isDirectory() calls are wrappers around FileSystem.getFileStatus(). Every getFileStatus() call puts load on an HDFS NN and is very, very slow against object stores.
- If these calls are followed by any getFileStatus() calls, then they can be eliminated by careful merging: the catching of FileNotFoundException is pulled out of the exists() call and into the Spark code.
- Any sequence of exists + delete can be optimised by removing the exists check, relying on FileSystem.delete() to be a no-op if the destination path is not present. That's a tested requirement of all Hadoop compatible FS and object stores.
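The two rewrites described above can be sketched as follows. This is a minimal illustration, not the patch itself: to stay runnable without a Hadoop dependency, `CountingFs` is a hypothetical in-memory stand-in that mimics only the behaviours the description relies on (a status probe that throws FileNotFoundException, an exists() that wraps that probe, and a delete() that returns false for a missing path). Its method names mirror the Hadoop ones, but its signatures are simplified assumptions, and the helper names are invented for this example.

```java
import java.io.FileNotFoundException;
import java.util.HashMap;
import java.util.Map;

public class FsCallPatterns {

    /** Hypothetical in-memory stand-in for a filesystem; counts status probes. */
    static class CountingFs {
        private final Map<String, Boolean> entries = new HashMap<>(); // path -> isDirectory
        int statusCalls = 0;

        void put(String path, boolean isDirectory) {
            entries.put(path, isDirectory);
        }

        /** One probe per call; throws if the path is absent, else returns true for a directory. */
        boolean getFileStatus(String path) throws FileNotFoundException {
            statusCalls++;
            Boolean isDir = entries.get(path);
            if (isDir == null) {
                throw new FileNotFoundException(path);
            }
            return isDir;
        }

        /** Wrapper around getFileStatus() that swallows FileNotFoundException. */
        boolean exists(String path) {
            try {
                getFileStatus(path);
                return true;
            } catch (FileNotFoundException e) {
                return false;
            }
        }

        /** Returns false, without throwing, when the path is absent. */
        boolean delete(String path) {
            return entries.remove(path) != null;
        }
    }

    // Before: exists() followed by getFileStatus() costs two probes on an existing path.
    static boolean isDirectoryNaive(CountingFs fs, String path) {
        if (!fs.exists(path)) {
            return false;
        }
        try {
            return fs.getFileStatus(path);
        } catch (FileNotFoundException e) {
            return false; // path vanished between the two probes
        }
    }

    // After: a single probe; the caller catches FileNotFoundException itself.
    static boolean isDirectoryMerged(CountingFs fs, String path) {
        try {
            return fs.getFileStatus(path);
        } catch (FileNotFoundException e) {
            return false;
        }
    }

    // Before: exists() + delete() spends a probe that delete() makes redundant.
    static boolean deleteNaive(CountingFs fs, String path) {
        return fs.exists(path) && fs.delete(path);
    }

    // After: delete() alone; a missing path is simply reported as false.
    static boolean deleteDirect(CountingFs fs, String path) {
        return fs.delete(path);
    }

    public static void main(String[] args) {
        CountingFs fs = new CountingFs();
        fs.put("/data", true);

        isDirectoryNaive(fs, "/data");
        System.out.println("naive probes:  " + fs.statusCalls);

        fs.statusCalls = 0;
        isDirectoryMerged(fs, "/data");
        System.out.println("merged probes: " + fs.statusCalls);

        fs.statusCalls = 0;
        deleteDirect(fs, "/missing");
        System.out.println("delete probes: " + fs.statusCalls);
    }
}
```

Against a real store each probe is a remote call, so the merged form halves the status requests for an existing path, and the direct delete removes the extra probe entirely.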
Issue Links
- is depended upon by
-
HADOOP-13525 Optimize uses of FS operations in the ASF analysis frameworks and libraries
- Resolved
- is related to
-
HADOOP-13427 Eliminate needless uses of FileSystem#{exists(), isFile(), isDirectory()}
- Resolved
-
HADOOP-15192 S3A listStatus excessively slow -hurts Spark job partitioning
- Resolved
- relates to
-
HADOOP-13321 Deprecate FileSystem APIs that promote inefficient call patterns.
- Resolved
-
HIVE-14323 Reduce number of FS permissions and redundant FS operations
- Closed
- links to