Details
- Type: Sub-task
- Status: Resolved
- Priority: Minor
- Resolution: Duplicate
- Affects Version/s: 2.8.0
- Fix Version/s: None
- Component/s: None
Description
FS shell -count uses getContentSummary to summarise the contents; this slows significantly as the directory tree gets deeper. On wide directories, the full FileStatus[] array for each directory is built up before recursing into its children, so with many millions of files memory use also becomes an issue.
Moving to a flat listFiles listing with iterator-based scanning would make directory depth a near-non-issue and avoid the memory problems. We would need to reverse-construct the directory tree for the count summary; a hash map of parent paths could build that up while iterating through the files and adding up their sizes.
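A minimal sketch of what that iterator-based scan could look like against the FileSystem API (the FlatCount class name and the exact aggregation are illustrative, not from an actual patch; it also does not count empty directories, which listFiles(recursive) never returns):
{code:java}
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.fs.ContentSummary;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class FlatCount {

  /**
   * Count files, directories and bytes under a path using a single
   * recursive listFiles() iterator instead of a getContentSummary()
   * tree walk.
   */
  public static ContentSummary count(FileSystem fs, Path root) throws IOException {
    Path qualifiedRoot = fs.makeQualified(root);
    long length = 0;
    long fileCount = 0;
    // Directories seen so far, reverse-constructed from file parent paths.
    // The root is pre-seeded so ancestor walks stop there.
    Set<Path> dirs = new HashSet<>();
    dirs.add(qualifiedRoot);

    RemoteIterator<LocatedFileStatus> it = fs.listFiles(qualifiedRoot, true);
    while (it.hasNext()) {
      LocatedFileStatus status = it.next();
      fileCount++;
      length += status.getLen();
      // Walk up the parent chain until a directory already seen (or the
      // pre-seeded root) is reached; add(dir) returns false in that case.
      for (Path dir = status.getPath().getParent();
          dir != null && dirs.add(dir);
          dir = dir.getParent()) {
        // nothing to do; the set insertion is the work
      }
    }
    // Caveat: empty directories never appear in listFiles(recursive)
    // output, so they are missed in directoryCount here.
    return new ContentSummary.Builder()
        .length(length)
        .fileCount(fileCount)
        .directoryCount(dirs.size())
        .build();
  }
}
{code}
Memory in this scheme scales with the number of distinct directories rather than the width of any single directory listing, which is the intended improvement.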
Attachments
Issue Links
- duplicates
  - HADOOP-13704 S3A getContentSummary() to move to listFiles(recursive) to count children; instrument use (Resolved)