- CloudWatch Metrics for S3 showing an unexpectedly large number of 4xx errors in our bucket
- Performance when listing files recursively is abysmal: 15 minutes on our bucket, compared to less than 2 minutes using the CLI (`aws s3 ls`).
- In the CloudTrail logs for this bucket, we found that the recursive listing generates one 404 (NoSuchKey) error per folder listed.
- Spark recursively calls FileSystem::listStatus (the S3AFileSystem implementation from hadoop-aws:2.7.3), which in turn calls getFileStatus to determine whether the path is a directory.
- It turns out that this call to getFileStatus yields a 404 when the path is a directory but does not end with a slash. It then retries with the slash appended, incurring one extra, unneeded call to S3.
- Why was the trailing slash removed in the first place? The Hadoop Path class normalizes the path by removing trailing slashes when it is constructed.
- S3AFileSystem::listStatus needs to know whether the path is a directory. However, when listing files recursively, the caller commonly already has that FileStatus object in hand, so the lookup is an unneeded performance penalty. The base FileSystem class could offer an optimized API that relies on this assumption (or the unoptimized call to listStatus in listLocatedStatus(recursive=true) could be fixed).
- I might be wrong on this last bullet, but I think the S3 object API fetches every object under a prefix (not just the current level) and then filters them. If that is the case, there should be an opportunity for an efficient recursive listStatus implementation for S3 that uses paginated calls on the top-level folder only.
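
To make the request overhead in the bullets above concrete, here is a minimal Python sketch that replays the probe sequence against a toy in-memory bucket. It is not the hadoop-aws code: `FakeS3`, `head`, `list_level`, `list_status` and `list_files_recursive` are hypothetical names, the key layout is invented, and the sketch assumes every folder has a marker object whose key ends with a slash.

```python
class FakeS3:
    """Toy in-memory stand-in for a bucket; counts requests and 404s."""

    def __init__(self, keys):
        self.keys = set(keys)
        self.requests = 0
        self.not_found = 0

    def head(self, key):
        """Simulate a metadata lookup on one key; a miss counts as a 404."""
        self.requests += 1
        if key in self.keys:
            return True
        self.not_found += 1
        return False

    def list_level(self, prefix):
        """Simulate listing one level under prefix as (path, is_dir) pairs."""
        self.requests += 1
        children = set()
        for key in self.keys:
            if key.startswith(prefix) and key != prefix:
                name, sep, _ = key[len(prefix):].partition("/")
                children.add((prefix + name + sep, sep == "/"))
        return sorted(children)


def list_status(s3, path):
    """Probe the path without the slash first, then retry with it."""
    path = path.rstrip("/")              # mimics the Path normalization
    if s3.head(path):                    # probe 1: 404 for a folder
        return [(path, False)]
    if not s3.head(path + "/"):          # probe 2: finds the folder marker
        raise FileNotFoundError(path)
    return s3.list_level(path + "/")


def list_files_recursive(s3, path):
    """Recurse into folders only; each folder pays for the failed probe."""
    files = []
    for child, is_dir in list_status(s3, path):
        if is_dir:
            files.extend(list_files_recursive(s3, child))
        else:
            files.append(child)
    return files


s3 = FakeS3([
    "data/",
    "data/year=2016/",
    "data/year=2016/a.parquet",
    "data/year=2017/",
    "data/year=2017/b.parquet",
])
files = list_files_recursive(s3, "data")
print(len(files), "files,", s3.requests, "requests,", s3.not_found, "404s")
```

In this toy model, each folder costs three requests, and the first of them is always a miss. That matches the one-404-per-folder pattern seen in CloudTrail, under the assumptions stated above.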
Note: all of this is in the context of Spark jobs reading hundreds of thousands of Parquet files organized and partitioned hierarchically, as recommended. Every time we read the data, Spark recursively lists all files and folders to discover the partitions (the folder names).
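
As a rough illustration of the flat-listing idea from the last bullet, the sketch below derives the partition folders from a paginated list of keys under the top-level prefix, with no per-folder calls. It assumes a flat listing really does return every key under the prefix, which is the unverified premise of that bullet. `paginate` is a hypothetical stand-in for the paginated S3 listing call, and `discover_partitions`, the page size and the key layout are likewise made up for the example.

```python
def paginate(keys, page_size):
    """Hypothetical stand-in for a paginated flat listing under a prefix."""
    ordered = sorted(keys)
    for start in range(0, len(ordered), page_size):
        yield ordered[start:start + page_size]


def discover_partitions(pages, root):
    """Derive data files and partition folders from flat pages of keys."""
    files = []
    partitions = set()
    for page in pages:
        for key in page:
            # Skip folder marker objects and anything outside the root.
            if key.endswith("/") or not key.startswith(root):
                continue
            files.append(key)
            # Everything between the root and the file name is a folder.
            folders = key[len(root):].split("/")[:-1]
            partitions.add(tuple(folders))
    return files, sorted(partitions)


keys = [
    "data/",
    "data/year=2016/",
    "data/year=2016/month=01/a.parquet",
    "data/year=2016/month=02/b.parquet",
    "data/year=2017/month=01/c.parquet",
]
pages = list(paginate(keys, page_size=2))
found_files, found_partitions = discover_partitions(pages, "data/")
for partition in found_partitions:
    print("/".join(partition))
```

The number of calls here grows with the number of pages rather than the number of folders, which is where the potential saving would come from if the premise holds.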