[HADOOP-15192] S3A listStatus excessively slow -hurts Spark job partitioning - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Duplicate
Affects Version/s: 2.7.3
Fix Version/s: 2.8.0
Component/s: fs/s3
Labels:
None
Environment:

Amazon EMR

Description

Symptoms:

CloudWatch metrics for S3 show an unexpectedly large number of 4xx errors in our bucket.
Recursive file listing is abysmally slow: 15 minutes on our bucket, compared to less than 2 minutes using the CLI (`aws s3 ls`).

Analysis:

In the CloudTrail logs for this bucket, we found that the recursive listing generates one 404 (NoSuchKey) error per folder listed.
Spark recursively calls FileSystem::listStatus (the S3AFileSystem implementation from hadoop-aws:2.7.3), which in turn calls getFileStatus to determine whether the path is a directory.
It turns out that this call to getFileStatus yields a 404 when the path is a directory but does not end with a slash. It then retries with the slash appended, incurring one extra, unnecessary call to S3 per directory.
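
The request pattern described above can be reproduced with a small in-memory model. This is only a sketch: `FakeS3`, `get_file_status`, and `walk` are hypothetical names, not Hadoop or AWS APIs, and the probe order (HEAD on the bare key, then HEAD on the key with a slash appended) is assumed from the analysis above rather than copied from the S3A source. It also assumes every directory has a marker object whose key ends with a slash.

```python
from collections import Counter


class FakeS3:
    """Hypothetical in-memory stand-in for a bucket; counts requests by outcome."""

    def __init__(self, keys):
        self.keys = set(keys)
        self.calls = Counter()

    def head(self, key):
        """Simulate a HEAD request: True on 200, False on 404."""
        found = key in self.keys
        self.calls["head_200" if found else "head_404"] += 1
        return found

    def list_children(self, prefix):
        """Simulate one delimiter-based LIST: immediate children of prefix."""
        self.calls["list"] += 1
        children = set()
        for key in self.keys:
            if key.startswith(prefix) and key != prefix:
                name, sep, _ = key[len(prefix):].partition("/")
                children.add(prefix + name + sep)
        return sorted(children)


def get_file_status(s3, path):
    """Classify a path whose trailing slash has already been stripped."""
    if s3.head(path):        # first probe: exact key, a 404 for directories
        return "file"
    if s3.head(path + "/"):  # second probe: same key with the slash restored
        return "dir"
    raise FileNotFoundError(path)


def walk(s3, path):
    """Recursively list files, probing every path the way the analysis describes."""
    if get_file_status(s3, path) == "file":
        return [path]
    files = []
    for child in s3.list_children(path + "/"):
        # Stripping the slash mimics the Path normalization mentioned below.
        files.extend(walk(s3, child.rstrip("/")))
    return files


s3 = FakeS3([
    "data/",
    "data/year=2017/",
    "data/year=2017/a.parquet",
    "data/year=2018/",
    "data/year=2018/b.parquet",
])
files = walk(s3, "data")
```

In this model `s3.calls["head_404"]` ends up equal to the number of directories visited, which matches the one-404-per-folder pattern seen in CloudTrail.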

Questions:

Why is the trailing slash removed in the first place? (The Hadoop Path class normalizes paths by removing trailing slashes on construction.)
S3AFileSystem::listStatus needs to know whether the path is a directory. However, a common usage pattern is to already have that FileStatus object in hand when listing files recursively, so the check incurs an unnecessary performance penalty. The base FileSystem class could offer an optimized API that relies on this assumption (or the unoptimized call to listStatus in listLocatedStatus(recursive=true) could be fixed).
I might be wrong on this last point, but I think the S3 object API fetches every object under a prefix (not just the current level) and filters them out. If that is the case, there should be an opportunity for an efficient recursive listStatus implementation for S3 that uses paginated calls on the top-level folder only.
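
The idea in the last question, listing by prefix without a delimiter and paging through the results, can be sketched as follows. This is an illustration under assumptions, not the S3A implementation: `list_page` is a hypothetical stand-in for a paginated prefix listing (it is not an AWS SDK call), the continuation token is simply the last key of the previous page, and keys ending in a slash are treated as directory markers.

```python
def list_page(keys, prefix, start_after=None, page_size=1000):
    """Hypothetical paginated LIST: sorted keys under prefix, no delimiter."""
    matching = sorted(k for k in keys if k.startswith(prefix))
    if start_after is not None:
        matching = [k for k in matching if k > start_after]
    page = matching[:page_size]
    # A continuation token is returned only if keys remain after this page.
    token = page[-1] if len(matching) > page_size else None
    return page, token


def list_files_recursive(keys, prefix, page_size=1000):
    """Collect every file under prefix with one request per page, not per directory."""
    files, requests, token = [], 0, None
    while True:
        page, token = list_page(keys, prefix, token, page_size)
        requests += 1
        # Keys ending in "/" are assumed to be directory markers, not files.
        files.extend(k for k in page if not k.endswith("/"))
        if token is None:
            return files, requests
```

With this approach the request count depends on the number of objects divided by the page size rather than on the number of folders, and no per-directory status probe is needed.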

Note: all of this is in the context of Spark jobs reading hundreds of thousands of Parquet files organized and partitioned hierarchically, as recommended. Every time we read the data, Spark recursively lists all files and folders to discover the partitions (folder names).
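
Since the partitions are just folder names, they can in principle be recovered from a flat list of object keys without visiting each folder. A minimal sketch, assuming folder names of the form `name=value` (an assumption, since the layout is not spelled out here); `partitions_from_keys` is a hypothetical helper, not a Spark or Hadoop function.

```python
def partitions_from_keys(keys, root):
    """Derive the distinct partition specs from flat object keys under root."""
    specs = set()
    for key in keys:
        if not key.startswith(root) or key.endswith("/"):
            continue  # skip keys outside the table and directory markers
        # Drop the file name; the remaining segments are folder names.
        folders = key[len(root):].split("/")[:-1]
        # Assumption: partition folders look like "name=value".
        spec = tuple(tuple(f.split("=", 1)) for f in folders if "=" in f)
        if spec:
            specs.add(spec)
    return sorted(specs)
```

For example, the keys `t/year=2017/month=01/a.parquet` and `t/year=2017/month=02/b.parquet` under root `t/` yield two specs, one per `month` folder.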

Issue Links

is superseded by

HADOOP-13208 S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the pseudo-tree of directories

Resolved

relates to

SPARK-16736 remove redundant FileSystem status checks calls from Spark codebase

Resolved

People

Assignee: Unassigned

Reporter: Michel Lemay

Votes: 0

Watchers: 3

Dates

Created: 25/Jan/18 13:07

Updated: 26/Jan/18 22:23

Resolved: 25/Jan/18 21:33