  1. Hadoop Common
  2. HADOOP-13204 Über-jira: S3a phase III: scale and tuning
  3. HADOOP-13829

S3A getContentSummary to use flat listFiles instead of treewalk

Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Minor
    • Resolution: Duplicate
    • 2.8.0
    • None
    • fs/s3
    • None

    Description

      The FS shell -count command uses getContentSummary to summarise the contents of a path; this slows significantly as directory tree depth grows. On wide directories, memory use also becomes an issue when there are many millions of files, because the whole FileStatus[] array is built up before recursing down.

      Moving to a flat listFiles listing with iterator-based scanning would make directory depth a near-non-issue and avoid the memory problems. We would need to reverse-construct the directory tree for its count summary; a hash map of parent paths could build that up while iterating through the files and adding up their sizes.
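
      The reverse construction described above could look roughly like the sketch below. It is not the S3A implementation: FileEntry, Summary and summarize are hypothetical names, and a plain Iterator stands in for the RemoteIterator that a recursive FileSystem.listFiles(path, true) call would return. Because only totals are needed here, a hash set of parent paths is enough; a map from parent path to per-directory totals would be the natural extension. The sketch assumes every listed path lies under the root, and it cannot count empty directories, since they never appear in a files-only listing.

```java
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

class FlatContentSummary {

    /** Hypothetical stand-in for one entry of a flat, files-only listing. */
    static final class FileEntry {
        final String path;  // absolute path, e.g. "/data/a/b/part-0000"
        final long length;  // file size in bytes

        FileEntry(String path, long length) {
            this.path = path;
            this.length = length;
        }
    }

    /** Hypothetical totals holder: bytes, files and directories. */
    static final class Summary {
        long length;
        long fileCount;
        long directoryCount;
    }

    /**
     * Single pass over a flat listing. Only the set of distinct parent
     * directories is held in memory, never the full list of files.
     */
    static Summary summarize(String root, Iterator<FileEntry> files) {
        // Strip trailing slashes so that "/" becomes "" and "/data/" becomes "/data".
        String base = root;
        while (base.endsWith("/")) {
            base = base.substring(0, base.length() - 1);
        }

        Set<String> seenDirs = new HashSet<>();
        Summary summary = new Summary();

        while (files.hasNext()) {
            FileEntry entry = files.next();
            summary.fileCount++;
            summary.length += entry.length;

            // Walk up from the file's parent towards the root, recording each
            // directory. Once a directory is already known, so are all of its
            // ancestors, so the walk can stop early.
            int slash = entry.path.lastIndexOf('/');
            while (slash > base.length()) {
                String dir = entry.path.substring(0, slash);
                if (!seenDirs.add(dir)) {
                    break;
                }
                slash = entry.path.lastIndexOf('/', slash - 1);
            }
        }

        // Assumption: the root itself counts as one directory.
        summary.directoryCount = seenDirs.size() + 1;
        return summary;
    }
}
```

      Calling FlatContentSummary.summarize("/data", listing.iterator()) on any collection of FileEntry objects returns the three totals in one pass. Memory grows with the number of distinct directories rather than the number of files, which is the property the proposal is after.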

            People

              Assignee: Unassigned
              Reporter: Steve Loughran (stevel@apache.org)
              Votes: 0
              Watchers: 3