  1. Hadoop Common
  2. HADOOP-13204 Über-jira: S3a phase III: scale and tuning
  3. HADOOP-13829

S3A getContentSummary to use flat listFiles instead of treewalk

    Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Minor
    • Resolution: Duplicate
    • Affects Version/s: 2.8.0
    • Fix Version/s: None
    • Component/s: fs/s3
    • Labels:
      None

      Description

      FS shell -count uses getContentSummary to summarise the contents; this slows down significantly as directory tree depth increases. On wide directories, the FileStatus[] array is built up before recursing down, so if there are many millions of files, memory use becomes an issue.

      Moving to a flat listFiles listing with iterator-based scanning would make directory depth a near-non-issue and avoid the memory problems. We'd need to reverse-construct the directory tree for its count summary; a hash map of parent paths could build that up while iterating through the files and adding up their sizes.
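
      A minimal sketch of that idea, not the S3A implementation. To stay self-contained it uses plain Java types instead of Hadoop classes: FlatSummary, FileEntry and Summary are hypothetical names, with FileEntry standing in for the file status entries that a recursive FileSystem.listFiles(path, true) iterator would return. It assumes absolute, '/'-separated paths under the root, counts the root itself as a directory, and uses a hash set rather than a map because only the directory count is needed here.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

/** Hypothetical sketch: build a content summary from a flat file listing. */
final class FlatSummary {

  /** Hypothetical stand-in for a file status: absolute path and length in bytes. */
  static final class FileEntry {
    final String path;
    final long length;

    FileEntry(String path, long length) {
      this.path = path;
      this.length = length;
    }
  }

  /** The three totals a count summary needs. */
  static final class Summary {
    long length;
    long fileCount;
    long directoryCount;
  }

  /** Scan the files one at a time; only the set of directories is held in memory. */
  static Summary summarise(String root, Iterator<FileEntry> files) {
    Summary summary = new Summary();
    Set<String> dirs = new HashSet<>();
    dirs.add(root); // assumption: the root counts as one directory
    while (files.hasNext()) {
      FileEntry file = files.next();
      summary.length += file.length;
      summary.fileCount++;
      // Walk up the parent chain, stopping at the first directory already seen.
      String parent = parentOf(file.path);
      while (parent != null && dirs.add(parent)) {
        parent = parentOf(parent);
      }
    }
    summary.directoryCount = dirs.size();
    return summary;
  }

  /** Parent of an absolute path, or null if there is none. */
  static String parentOf(String path) {
    int slash = path.lastIndexOf('/');
    if (slash < 0 || path.length() == 1) {
      return null; // no separator, or the path is "/" itself
    }
    return slash == 0 ? "/" : path.substring(0, slash);
  }

  public static void main(String[] args) {
    Summary s = summarise("/data", Arrays.asList(
        new FileEntry("/data/a/f1", 10),
        new FileEntry("/data/a/b/f2", 20),
        new FileEntry("/data/c/f3", 30)).iterator());
    // prints "60 bytes, 3 files, 4 directories"
    System.out.println(s.length + " bytes, " + s.fileCount + " files, "
        + s.directoryCount + " directories");
  }
}
```

      Memory use in this sketch grows with the number of distinct directories rather than the number of files. Directories with no files anywhere beneath them never appear in a files-only listing, so this sketch would not count them; a real implementation would have to decide how to handle that.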

              People

              • Assignee:
                Unassigned
                Reporter:
                stevel@apache.org Steve Loughran
              • Votes:
                0
                Watchers:
                3
