Description
Logs of LocatedFileStatus/FileInputFormat scans show a needless call to getFileStatus whenever S3AFileSystem.listLocatedStatus() is called.
- S3AFileSystem.listLocatedStatus() makes a getFileStatus call first and, if the path is a file, returns that file's status
- But every use in the MR code (FileInputFormat and LocatedFileStatusFetcher) only calls this method when it already knows the destination is a directory
This means that for every unguarded S3 path there are two needless HEAD requests and a single-entry LIST before the real LIST is initiated.
If the S3A FS can assume that the destination is a non-empty directory, it can go straight to the LIST operation, falling back to the HEAD probes (HEAD on the path, then HEAD on the path + "/") only if that fails.
We could also think about doing the same for listStatus.
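The request counts above can be illustrated with a toy model. This is only a sketch under stated assumptions: `ListFirstSketch`, `probeThenList` and `listThenProbe` are hypothetical names, the "store" is an in-memory sorted set that counts HEAD and LIST requests, and none of it is the actual S3AFileSystem code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.TreeSet;

/** Toy in-memory object store that counts requests. Hypothetical; not S3A code. */
class ListFirstSketch {
    final TreeSet<String> keys = new TreeSet<>();
    int heads = 0;
    int lists = 0;

    /** Simulated HEAD: does this exact key exist? */
    boolean head(String key) {
        heads++;
        return keys.contains(key);
    }

    /** Simulated LIST: up to maxKeys keys under the prefix, excluding the prefix key itself. */
    List<String> list(String prefix, int maxKeys) {
        lists++;
        List<String> out = new ArrayList<>();
        for (String k : keys.tailSet(prefix)) {
            if (!k.startsWith(prefix)) break;   // past the prefix range
            if (k.equals(prefix)) continue;     // skip a directory marker
            if (out.size() >= maxKeys) break;
            out.add(k);
        }
        return out;
    }

    /** Probe-first ordering as described in the issue: HEAD, HEAD, 1-entry LIST, real LIST. */
    List<String> probeThenList(String path) {
        if (head(path)) return Collections.singletonList(path);  // it is a file
        boolean marker = head(path + "/");                       // directory marker?
        if (!marker && list(path + "/", 1).isEmpty()) {
            throw new NoSuchElementException("not found: " + path);
        }
        return list(path + "/", Integer.MAX_VALUE);              // the real LIST
    }

    /** Proposed ordering: LIST first, probe only when the listing is empty. */
    List<String> listThenProbe(String path) {
        List<String> children = list(path + "/", Integer.MAX_VALUE);
        if (!children.isEmpty()) return children;                // non-empty directory
        if (head(path)) return Collections.singletonList(path);  // it was a file
        if (head(path + "/")) return Collections.emptyList();    // empty directory
        throw new NoSuchElementException("not found: " + path);
    }

    public static void main(String[] args) {
        ListFirstSketch store = new ListFirstSketch();
        store.keys.add("data/part-0000");
        store.keys.add("data/part-0001");

        store.probeThenList("data");
        System.out.println("probe first: " + store.heads + " HEAD, " + store.lists + " LIST");

        store.heads = 0;
        store.lists = 0;
        store.listThenProbe("data");
        System.out.println("list first: " + store.heads + " HEAD, " + store.lists + " LIST");
    }
}
```

In this model a non-empty directory without a marker costs 2 HEAD and 2 LIST requests with the probe-first ordering, and 0 HEAD and 1 LIST with the list-first ordering. The cost moves to the file and missing-path cases, which now pay for a LIST before the HEAD probes.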
Issue Links
- causes
  - HADOOP-17134 S3AFileSystem.listLocatedStatus(file) does a LIST even with S3Guard (Resolved)
- depends upon
  - HADOOP-16458 LocatedFileStatusFetcher scans failing intermittently against S3 store (Resolved)
- is blocked by
  - HADOOP-16697 audit/tune s3a authoritative flag in s3guard DDB Table (Resolved)
- links to