  1. Hadoop Common
  2. HADOOP-17400

Optimize S3A for maximum performance in directory listings

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.3.0
    • Fix Version/s: None
    • Component/s: fs/s3
    • Labels: None
    • Target Version/s:

      Description

      Make directory listing in applications as fast as possible, especially for query planning.

      • Optimize every operation used to list directories during query planning for its primary use: being passed a directory, not a file. Make that case faster, even at the expense of more remote IO when the path turns out to be a file or an empty directory.
      • Remove needless calls to S3 wherever possible (e.g. drop getFileStatus("/") calls, make bucket existence probes optional).
      • Support and enable asynchronous IO where possible.
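
      The directory-first trade-off in the first bullet can be illustrated with a minimal sketch. It is not the S3A implementation: DirectoryFirstProbe, its in-memory key map and the call counter are hypothetical stand-ins for a remote object store, and only the file case of the trade-off is modelled. A path is probed with a simulated LIST first, so a non-empty directory costs one call, while a file costs a second, HEAD-style call.

```java
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical in-memory stand-in for an object store; counts simulated remote calls. */
class DirectoryFirstProbe {
    enum Kind { DIRECTORY, FILE, NOT_FOUND }

    // Object key -> length; sorted, so a prefix lookup is a single ceilingKey() call.
    private final TreeMap<String, Long> objects = new TreeMap<>();
    final AtomicInteger remoteCalls = new AtomicInteger();

    void put(String key, long length) {
        objects.put(key, length);
    }

    /** Simulated LIST request: is there at least one key under the prefix? */
    private boolean listHasEntries(String prefix) {
        remoteCalls.incrementAndGet();
        String first = objects.ceilingKey(prefix);
        return first != null && first.startsWith(prefix);
    }

    /** Simulated HEAD request: does an object with exactly this key exist? */
    private boolean headExists(String key) {
        remoteCalls.incrementAndGet();
        return objects.containsKey(key);
    }

    /** Assume the path is a directory: LIST first, fall back to HEAD only if LIST is empty. */
    Kind probe(String path) {
        String prefix = path.endsWith("/") ? path : path + "/";
        if (listHasEntries(prefix)) {
            return Kind.DIRECTORY;   // one remote call
        }
        if (headExists(path)) {
            return Kind.FILE;        // two remote calls
        }
        return Kind.NOT_FOUND;       // two remote calls
    }

    public static void main(String[] args) {
        DirectoryFirstProbe store = new DirectoryFirstProbe();
        store.put("tables/t1/part-0000", 10L);
        store.put("tables/readme.txt", 5L);
        System.out.println(store.probe("tables/t1") + " after " + store.remoteCalls.get() + " call(s)");
        System.out.println(store.probe("tables/readme.txt") + " after " + store.remoteCalls.get() + " call(s)");
    }
}
```

      Under these assumptions the common query-planning case (a directory with children) pays for a single request, and the extra request is only spent on files and missing paths.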

      Review the higher-level APIs (glob status) and their uses in FsShell, and optimize them by minimizing the number of FS API calls, with the bonus goal of reducing the risk of 404 caching.

      Work with downstream projects to move them to the FS APIs which work best in this world, primarily the recursive listing operations and those which return RemoteIterator<FileStatus>, so that any asynchronous page fetching actually pays off.
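
      To show why iterator-returning APIs make asynchronous page fetching useful, here is a minimal, self-contained sketch. It does not use Hadoop classes: RemoteIter only mirrors the shape of Hadoop's RemoteIterator (both methods may throw IOException), and PagedIterator, the page-fetching function and the "empty page means end of listing" convention are assumptions made for illustration. While the caller consumes one page, the fetch of the next page is already running in the background.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.function.IntFunction;

class PrefetchingListing {
    /** Hypothetical interface with the same shape as Hadoop's RemoteIterator. */
    interface RemoteIter<T> {
        boolean hasNext() throws IOException;
        T next() throws IOException;
    }

    /** Iterates over pages of paths, fetching page N+1 while page N is being consumed. */
    static final class PagedIterator implements RemoteIter<String> {
        private final IntFunction<List<String>> fetchPage; // assumed: empty list == end of listing
        private int nextIndex = 0;
        private CompletableFuture<List<String>> pending;
        private Iterator<String> current = Collections.emptyIterator();

        PagedIterator(IntFunction<List<String>> fetchPage) {
            this.fetchPage = fetchPage;
            this.pending = startFetch(); // first page is requested immediately
        }

        private CompletableFuture<List<String>> startFetch() {
            final int index = nextIndex++;
            return CompletableFuture.supplyAsync(() -> fetchPage.apply(index));
        }

        @Override
        public boolean hasNext() throws IOException {
            while (!current.hasNext() && pending != null) {
                List<String> page;
                try {
                    page = pending.join(); // blocks only if the prefetch has not finished yet
                } catch (CompletionException e) {
                    throw new IOException("page fetch failed", e.getCause());
                }
                // Kick off the next fetch before handing this page to the caller.
                pending = page.isEmpty() ? null : startFetch();
                current = page.iterator();
            }
            return current.hasNext();
        }

        @Override
        public String next() throws IOException {
            if (!hasNext()) {
                throw new NoSuchElementException();
            }
            return current.next();
        }
    }

    /** Drains an iterator into a list, wrapping IOException for convenience. */
    static List<String> toList(RemoteIter<String> it) {
        List<String> out = new ArrayList<>();
        try {
            while (it.hasNext()) {
                out.add(it.next());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> pages = Arrays.asList(
                Arrays.asList("a/1", "a/2"), Arrays.asList("b/1"));
        RemoteIter<String> it = new PagedIterator(
                i -> i < pages.size() ? pages.get(i) : Collections.<String>emptyList());
        toList(it).forEach(System.out::println);
    }
}
```

      An API that returns a fully materialized array has to wait for every page before returning anything, so there is nothing to overlap; an iterator lets the caller's per-entry work hide the latency of the next page request.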

            People

            • Assignee: Mukund Thakur (mukund-thakur)
            • Reporter: Steve Loughran (stevel@apache.org)
            • Votes: 0
            • Watchers: 6

                Time Tracking

                Estimated: Not Specified
                Remaining: 0h
                Logged: 10h 20m