  1. Hadoop Common
  2. HADOOP-11694 Über-jira: S3a phase II: robustness, scale and performance
  3. HADOOP-13208

S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the pseudo-tree of directories

    Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.8.0, 3.0.0-alpha1
    • Component/s: fs/s3
    • Labels:
      None
    • Target Version/s:
    • Hadoop Flags:
      Reviewed
    • Release Note:
      S3A has optimized the listFiles method by doing a bulk listing of all entries under a path in a single S3 operation instead of recursively walking the directory tree. The listLocatedStatus method has been optimized by fetching results from S3 lazily as the caller traverses the returned iterator instead of doing an eager fetch of all possible results.

      Description

      A major cost in split calculation against object stores turns out to be listing the directory tree itself. Against S3, S3A needs two HEAD requests and two LIST requests to list the contents of any directory path: two HEADs plus one LIST for getFileStatus(), then a second LIST to query the contents.

      Listing a single directory could be improved slightly by combining the final two listings. However, listing a directory tree would still cost O(directories) requests. In contrast, a recursive listFiles() operation should be implementable as a bulk listing of all descendant paths: one LIST operation per thousand descendants.
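
      The difference in request counts can be sketched with some simple arithmetic. This is only an illustrative model, not S3A code: the class and method names are hypothetical, the four-requests-per-directory figure comes from the description above, and the page size of 1000 is taken from "one List operation per thousand descendants".

```java
public class ListingCost {
    // From the description: 2 HEADs + 2 LISTs to list one directory path.
    static final int REQUESTS_PER_DIRECTORY = 4;
    // Assumed page size: one LIST call returns up to 1000 entries.
    static final int KEYS_PER_LIST = 1000;

    /** Requests needed to walk a tree one directory at a time. */
    static long treeWalkRequests(long directories) {
        return directories * REQUESTS_PER_DIRECTORY;
    }

    /** Requests needed for one flat, paged listing of every descendant. */
    static long bulkListRequests(long descendants) {
        // Ceiling division; even an empty listing costs one request.
        return Math.max(1, (descendants + KEYS_PER_LIST - 1) / KEYS_PER_LIST);
    }

    public static void main(String[] args) {
        long directories = 500;   // hypothetical tree shape
        long descendants = 20_500;
        System.out.println("tree walk: " + treeWalkRequests(directories));
        System.out.println("bulk list: " + bulkListRequests(descendants));
    }
}
```

      For this hypothetical tree, the walk costs 2000 requests while the bulk listing costs 21, and the bulk figure depends only on the number of entries, not on how deeply they are nested.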

      As the result of this call is an iterator, the ongoing listing can be implemented within the iterator itself, which requests each further page of results only as the caller advances.
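
      A minimal sketch of that idea follows, assuming a flat, sorted key space that can be listed one page at a time. It is not the patch's implementation: FakeStore, PagingIterator and the startAfter parameter are hypothetical stand-ins, written in plain Java so the example runs without Hadoop or an S3 client.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.TreeSet;

public class LazyListing {

    /** Hypothetical in-memory stand-in for an object store with sorted keys. */
    static class FakeStore {
        private final TreeSet<String> keys = new TreeSet<>();
        int listCalls = 0; // counts simulated LIST requests

        void put(String key) { keys.add(key); }

        /** Up to pageSize keys under prefix, strictly after startAfter (null = start). */
        List<String> listPage(String prefix, String startAfter, int pageSize) {
            listCalls++;
            List<String> page = new ArrayList<>();
            String from = (startAfter == null) ? prefix : startAfter;
            for (String key : keys.tailSet(from, startAfter == null)) {
                if (!key.startsWith(prefix) || page.size() == pageSize) break;
                page.add(key);
            }
            return page;
        }
    }

    /** Fetches one page at a time, and only when the caller needs more entries. */
    static class PagingIterator implements Iterator<String> {
        private final FakeStore store;
        private final String prefix;
        private final int pageSize;
        private Iterator<String> current = null;
        private String lastKey = null;
        private boolean exhausted = false;

        PagingIterator(FakeStore store, String prefix, int pageSize) {
            this.store = store;
            this.prefix = prefix;
            this.pageSize = pageSize;
        }

        @Override
        public boolean hasNext() {
            while ((current == null || !current.hasNext()) && !exhausted) {
                List<String> page = store.listPage(prefix, lastKey, pageSize);
                // A short page means nothing further exists under this prefix.
                exhausted = page.size() < pageSize;
                current = page.iterator();
            }
            return current != null && current.hasNext();
        }

        @Override
        public String next() {
            if (!hasNext()) throw new NoSuchElementException();
            lastKey = current.next();
            return lastKey;
        }
    }

    public static void main(String[] args) {
        FakeStore store = new FakeStore();
        for (String k : new String[] {"data/a/1", "data/a/2", "data/b/c/3", "data/d", "other/x"}) {
            store.put(k);
        }
        Iterator<String> it = new PagingIterator(store, "data/", 2);
        while (it.hasNext()) {
            System.out.println(it.next());
        }
        System.out.println("list calls: " + store.listCalls);
    }
}
```

      Constructing the iterator issues no requests at all; the number of simulated LIST calls depends on the number of entries and the page size, not on the depth of the pseudo-directory tree.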

        Attachments

        1. HADOOP-13208-branch-2-001.patch
          89 kB
          Steve Loughran
        2. HADOOP-13208-branch-2-007.patch
          90 kB
          Steve Loughran
        3. HADOOP-13208-branch-2-008.patch
          90 kB
          Steve Loughran
        4. HADOOP-13208-branch-2-009.patch
          105 kB
          Steve Loughran
        5. HADOOP-13208-branch-2-010.patch
          105 kB
          Steve Loughran
        6. HADOOP-13208-branch-2-011.patch
          112 kB
          Steve Loughran
        7. HADOOP-13208-branch-2-012.patch
          113 kB
          Steve Loughran
        8. HADOOP-13208-branch-2-017.patch
          165 kB
          Steve Loughran
        9. HADOOP-13208-branch-2-018.patch
          42 kB
          Steve Loughran
        10. HADOOP-13208-branch-2-019.patch
          43 kB
          Steve Loughran
        11. HADOOP-13208-branch-2-020.patch
          50 kB
          Steve Loughran
        12. HADOOP-13208-branch-2-021.patch
          50 kB
          Chris Nauroth

              People

              • Assignee:
                stevel@apache.org Steve Loughran
                Reporter:
                stevel@apache.org Steve Loughran
              • Votes:
                0
                Watchers:
                15

                Dates

                • Created:
                  Updated:
                  Resolved:

                  Time Tracking

                  Estimated:
                  24h
                  Remaining:
                  24h
                  Logged:
                  Not Specified