[SPARK-16736] remove redundant FileSystem status checks calls from Spark codebase - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 2.0.0
Fix Version/s: 2.1.0
Component/s: Spark Core
Labels: None

Description

The Hadoop FileSystem.exists() and FileSystem.isDirectory() calls are wrappers around FileSystem.getFileStatus(). Each such call puts load on an HDFS NameNode and is very slow against object stores.

If these calls are followed by a getFileStatus() call on the same path, they can be eliminated by merging the calls carefully and moving the handling of FileNotFoundException out of exists() and into the Spark code.
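A minimal sketch of the merge, under loud assumptions: CountingFs is a hypothetical in-memory stand-in written for this example, not the Hadoop FileSystem class, and its getFileStatus() returns a bare length rather than a FileStatus object. It only counts metadata probes so that the two call patterns can be compared.

```java
import java.io.FileNotFoundException;
import java.util.HashMap;
import java.util.Map;

final class MergedStatusCheck {

    /** Hypothetical stand-in for a Hadoop-style filesystem; counts status probes. */
    static final class CountingFs {
        private final Map<String, Long> lengths = new HashMap<>();
        int statusCalls = 0;

        void put(String path, long length) {
            lengths.put(path, length);
        }

        /** Models getFileStatus(): one metadata probe; throws if the path is absent. */
        long getFileStatus(String path) throws FileNotFoundException {
            statusCalls++;
            Long length = lengths.get(path);
            if (length == null) {
                throw new FileNotFoundException(path);
            }
            return length;
        }

        /** Models exists(): a wrapper that swallows FileNotFoundException. */
        boolean exists(String path) {
            try {
                getFileStatus(path);
                return true;
            } catch (FileNotFoundException e) {
                return false;
            }
        }
    }

    /** Before: exists() followed by getFileStatus() probes a present path twice. */
    static long naiveLength(CountingFs fs, String path) {
        if (!fs.exists(path)) {
            return -1L;
        }
        try {
            return fs.getFileStatus(path);
        } catch (FileNotFoundException e) {
            return -1L; // path vanished between the two calls
        }
    }

    /** After: a single probe; the caller handles FileNotFoundException itself. */
    static long mergedLength(CountingFs fs, String path) {
        try {
            return fs.getFileStatus(path);
        } catch (FileNotFoundException e) {
            return -1L;
        }
    }

    public static void main(String[] args) {
        CountingFs fs = new CountingFs();
        fs.put("/data/part-00000", 42L);

        naiveLength(fs, "/data/part-00000");
        int naiveCalls = fs.statusCalls;

        fs.statusCalls = 0;
        mergedLength(fs, "/data/part-00000");
        int mergedCalls = fs.statusCalls;

        System.out.println("naive probes=" + naiveCalls + ", merged probes=" + mergedCalls);
    }
}
```

For a path that exists, the naive variant issues two probes and the merged variant one. The merged variant also avoids the window in which the path can disappear between the check and the status call.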

Any sequence of exists() + delete() can be optimised by removing the exists() check and relying on FileSystem.delete() to be a no-op if the destination path is not present. That is a tested requirement of all Hadoop-compatible filesystems and object stores.
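The exists + delete simplification can be sketched the same way. Again, CountingFs here is a hypothetical in-memory stand-in written for this example, and its one-argument delete() is a simplification; the real Hadoop call is FileSystem.delete(Path, boolean). The stub assumes the behaviour described above: deleting an absent path changes nothing and returns false.

```java
import java.util.HashSet;
import java.util.Set;

final class DeleteWithoutExists {

    /** Hypothetical stand-in for a Hadoop-style filesystem; counts every call made. */
    static final class CountingFs {
        private final Set<String> paths = new HashSet<>();
        int calls = 0;

        void create(String path) {
            paths.add(path);
        }

        boolean exists(String path) {
            calls++;
            return paths.contains(path);
        }

        /** Models delete(): returns false and changes nothing if the path is absent. */
        boolean delete(String path) {
            calls++;
            return paths.remove(path);
        }
    }

    /** Before: an exists() probe guards every delete(). */
    static void naiveCleanup(CountingFs fs, String path) {
        if (fs.exists(path)) {
            fs.delete(path);
        }
    }

    /** After: delete() alone; an absent path is simply a no-op. */
    static void directCleanup(CountingFs fs, String path) {
        fs.delete(path);
    }

    public static void main(String[] args) {
        CountingFs fs = new CountingFs();
        fs.create("/tmp/staging");
        naiveCleanup(fs, "/tmp/staging");
        int naiveCalls = fs.calls;

        fs.calls = 0;
        fs.create("/tmp/staging");
        directCleanup(fs, "/tmp/staging");
        int directCalls = fs.calls;

        System.out.println("naive calls=" + naiveCalls + ", direct calls=" + directCalls);
    }
}
```

When the path is present, the guarded variant costs two calls and the direct variant one. When the path is absent, both cost one call, so removing the guard never adds work.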

Issue Links

is depended upon by

HADOOP-13525 Optimize uses of FS operations in the ASF analysis frameworks and libraries (Resolved)

is related to

HADOOP-13427 Eliminate needless uses of FileSystem#{exists(), isFile(), isDirectory()} (Resolved)

HADOOP-15192 S3A listStatus excessively slow -hurts Spark job partitioning (Resolved)

relates to

HADOOP-13321 Deprecate FileSystem APIs that promote inefficient call patterns. (Resolved)

HIVE-14323 Reduce number of FS permissions and redundant FS operations (Closed)

links to

[Github] Pull Request #14371 (steveloughran)

People

Assignee: Steve Loughran

Reporter: Steve Loughran

Votes: 0

Watchers: 3

Dates

Created: 26/Jul/16 11:42

Updated: 25/Jan/18 21:35

Resolved: 17/Aug/16 18:43