  1. Hadoop Common
  2. HADOOP-4339

Improve FsShell -du/-dus and FileSystem.getContentSummary efficiency

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.18.1
    • Fix Version/s: 0.20.0
    • Component/s: fs
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      FsShell.du has two inefficiencies:

      • calling getContentSummary twice for each top-level item rather than calling it once and saving the result
      • calling getContentSummary for files rather than using the size it already has in FileStatus

      getContentSummary has one:

      • calling itself for files rather than using the length it already has in FileStatus

      Every call to getContentSummary results in a call to getFileStatus, which may be expensive (e.g. with NativeS3FileSystem each call incurs both network latency and an actual monetary cost).

      The simple solution:

      • FsShell.du calls once per item and saves the ContentSummary
      • FsShell.du uses FileStatus.getLen for files
      • getContentSummary only calls itself for directories
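
      The three changes above can be sketched roughly as follows. This is a minimal, self-contained model rather than the attached patch: DuSketch, Entry, summaryNaive, summaryImproved, du and the statusCalls counter are hypothetical stand-ins for FsShell, FileStatus and FileSystem, introduced only to count how many status lookups each approach makes.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-ins for illustration only; these are NOT Hadoop's FileSystem/FileStatus classes.
class DuSketch {
    // Hypothetical model of a listed entry, loosely mirroring FileStatus (length + directory flag).
    static final class Entry {
        final String name;
        final long len;
        final boolean dir;
        final List<Entry> children = new ArrayList<>();

        Entry(String name, long len, boolean dir) {
            this.name = name;
            this.len = len;
            this.dir = dir;
        }

        Entry add(Entry child) {
            children.add(child);
            return this;
        }
    }

    // Counts simulated getFileStatus round trips (the potentially expensive operation).
    static int statusCalls = 0;

    static Entry getFileStatus(Entry e) {
        statusCalls++;
        return e;
    }

    // Behavior described as inefficient: recurse into every child, files included.
    static long summaryNaive(Entry path) {
        Entry status = getFileStatus(path);
        if (!status.dir) {
            return status.len;
        }
        long total = 0;
        for (Entry child : status.children) {
            total += summaryNaive(child);
        }
        return total;
    }

    // Proposed behavior: recurse only for directories; files use the length from the listing.
    static long summaryImproved(Entry path) {
        Entry status = getFileStatus(path);
        if (!status.dir) {
            return status.len;
        }
        long total = 0;
        for (Entry child : status.children) {
            total += child.dir ? summaryImproved(child) : child.len;
        }
        return total;
    }

    // du over top-level items: one summary call per directory, saved in a local;
    // files are reported from the length already known.
    static List<String> du(Entry dir) {
        List<String> lines = new ArrayList<>();
        for (Entry item : dir.children) {
            long size = item.dir ? summaryImproved(item) : item.len;
            lines.add(size + "\t" + item.name);
        }
        return lines;
    }

    public static void main(String[] args) {
        Entry root = new Entry("root", 0, true)
            .add(new Entry("a", 10, false))
            .add(new Entry("b", 20, false))
            .add(new Entry("sub", 0, true).add(new Entry("c", 5, false)));

        statusCalls = 0;
        long naive = summaryNaive(root);
        int naiveCalls = statusCalls;

        statusCalls = 0;
        long improved = summaryImproved(root);

        System.out.println(naive + " bytes, " + naiveCalls + " status calls (naive)");
        System.out.println(improved + " bytes, " + statusCalls + " status calls (improved)");
    }
}
```

      In this model the naive traversal performs one status lookup per entry (files and directories alike), while the improved traversal performs one per directory only; both report the same total size.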

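      Another solution, rather than adding special casing to callers, is to add a getContentSummary overload that takes a FileStatus, so that a caller already holding a FileStatus does not trigger another getFileStatus call.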
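
      A rough sketch of this alternative, under the same caveat: SummaryOverloadSketch, Status, put and the in-memory maps are hypothetical stand-ins, and the summary is reduced to a byte count. The path-based method does the single status lookup and delegates to the status-based overload, which never looks up a status it was already handed.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative stand-ins only; not Hadoop's real classes or signatures.
class SummaryOverloadSketch {
    // Hypothetical status record: path, length, directory flag.
    static final class Status {
        final String path;
        final long len;
        final boolean dir;

        Status(String path, long len, boolean dir) {
            this.path = path;
            this.len = len;
            this.dir = dir;
        }
    }

    private final Map<String, Status> byPath = new HashMap<>();
    private final Map<String, List<Status>> listing = new HashMap<>();

    // Counts simulated getFileStatus round trips.
    int statusCalls = 0;

    // Registers an entry under an optional parent directory path.
    void put(String parent, Status s) {
        byPath.put(s.path, s);
        if (parent != null) {
            listing.computeIfAbsent(parent, k -> new ArrayList<>()).add(s);
        }
    }

    Status getFileStatus(String path) {
        statusCalls++;
        return byPath.get(path);
    }

    // A listing already returns full status records for each child.
    List<Status> listStatus(String path) {
        return listing.getOrDefault(path, new ArrayList<>());
    }

    // Path-based entry point: one status lookup, then delegate.
    long getContentSummary(String path) {
        return getContentSummary(getFileStatus(path));
    }

    // Status-based overload: reuses the status it is given, so no further lookups are needed.
    long getContentSummary(Status status) {
        if (!status.dir) {
            return status.len;
        }
        long total = 0;
        for (Status child : listStatus(status.path)) {
            total += getContentSummary(child);
        }
        return total;
    }
}
```

      With this shape, a caller such as du that iterates over a listing can pass each status straight to the overload, and the special casing for files lives in one place instead of in every caller.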

        Attachments

        1. hadoop-fsshell-du-simple.patch
          2 kB
          David Phillips
        2. hadoop-fsshell-du-simple.patch
          2 kB
          David Phillips

            People

            • Assignee:
              electrum David Phillips
              Reporter:
              electrum David Phillips
            • Votes:
              0
              Watchers:
              3
