[HADOOP-13829] S3A getContentSummary to use flat listFiles instead of treewalk - ASF JIRA

Details

Type: Sub-task
Status: Resolved
Priority: Minor
Resolution: Duplicate
Affects Version/s: 2.8.0
Fix Version/s: None
Component/s: fs/s3
Labels:
None

Description

FS shell -count uses getContentSummary to summarise the contents; this slows significantly as directory tree depth increases. On wide directories, the FileStatus[] array is built up before recursing down, so memory use becomes an issue when there are many millions of files.

Moving to a flat listFiles listing with iterator-based scanning would make directory depth a near-non-issue and avoid the memory problems. We'd need to reverse-construct the directory tree for its count summary; a hash map of parent paths could build that up while iterating through the files and adding up their sizes.
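A minimal sketch of the idea above, not the S3A implementation. To stay self-contained it does not use the Hadoop API: FileEntry, Summary, parentOf and summarize are hypothetical names, and the Iterator<FileEntry> stands in for the iterator returned by a recursive listFiles call. A hash set of parent paths is used rather than a hash map, on the assumption that only the directory count is needed. The sketch also assumes that the root counts as one directory, and that every directory contains at least one file somewhere beneath it; empty directories never show up in a flat file listing and would need separate handling.

```java
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

/** Sketch only: summarises a flat file listing without walking the directory tree. */
class FlatContentSummary {

    /** Hypothetical stand-in for one file returned by a flat recursive listing. */
    static final class FileEntry {
        final String path;  // absolute, '/'-separated path of a file
        final long length;  // file size in bytes

        FileEntry(String path, long length) {
            this.path = path;
            this.length = length;
        }
    }

    /** Hypothetical result holder: totals for everything under the root. */
    static final class Summary {
        final long fileCount;
        final long directoryCount;
        final long length;

        Summary(long fileCount, long directoryCount, long length) {
            this.fileCount = fileCount;
            this.directoryCount = directoryCount;
            this.length = length;
        }
    }

    /** Returns the parent of an absolute path, or null if it has none. */
    static String parentOf(String path) {
        int idx = path.lastIndexOf('/');
        if (idx < 0 || path.length() == 1) {
            return null;
        }
        return idx == 0 ? "/" : path.substring(0, idx);
    }

    /**
     * Consumes the listing one entry at a time, so memory use grows with the
     * number of distinct directories rather than the number of files.
     * Assumes root has no trailing slash (unless it is "/") and that every
     * listed file lies under root.
     */
    static Summary summarize(String root, Iterator<FileEntry> files) {
        Set<String> dirs = new HashSet<>();  // directories seen below the root
        long fileCount = 0;
        long length = 0;
        while (files.hasNext()) {
            FileEntry file = files.next();
            fileCount++;
            length += file.length;
            // Walk up from the file's parent towards the root. Stop at the
            // root, or at the first directory already recorded: its
            // ancestors were recorded when it was first added.
            String parent = parentOf(file.path);
            while (parent != null
                    && parent.length() > root.length()
                    && dirs.add(parent)) {
                parent = parentOf(parent);
            }
        }
        // Assumption: the root itself counts as one directory.
        return new Summary(fileCount, dirs.size() + 1, length);
    }
}
```

Because the upward walk stops at the first directory it has already seen, the per-file cost is close to constant when files arrive grouped by directory, which is typically the case for a listing in key order.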

Issue Links

duplicates HADOOP-13704 S3A getContentSummary() to move to listFiles(recursive) to count children; instrument use (Resolved)

People

Assignee: Unassigned

Reporter: Steve Loughran

Votes: 0

Watchers: 3

Dates

Created: 23/Nov/16 12:03

Updated: 16/Jun/17 09:55

Resolved: 16/Jun/17 09:55