Hadoop Common / HADOOP-4339

Improve FsShell -du/-dus and FileSystem.getContentSummary efficiency

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.18.1
    • Fix Version/s: 0.20.0
    • Component/s: fs
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      FsShell.du has two inefficiencies:

      • calling getContentSummary twice for each top-level item rather than calling it once and saving the result
      • calling getContentSummary for files rather than using the size it already has in FileStatus

      getContentSummary has one:

      • calling itself for files rather than using the length it already has in FileStatus

      Every call to getContentSummary results in a call to getFileStatus, which may be expensive (e.g. on NativeS3FileSystem each call incurs both network latency and an actual monetary cost).

      The simple solution:

      • FsShell.du calls getContentSummary once per item and saves the ContentSummary
      • FsShell.du uses FileStatus.getLen for files
      • getContentSummary only calls itself for directories
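      The three points above can be sketched roughly as follows. This is a minimal illustration, not the attached patch: Status, Fs, contentLength and du are hypothetical stand-ins for FileStatus, FileSystem, getContentSummary and FsShell.du, and the in-memory tree exists only so that the number of getFileStatus calls can be counted.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: Status and Fs are hypothetical stand-ins, not Hadoop's
// FileStatus and FileSystem classes.
final class DuSketch {
    /** Stand-in for FileStatus: a listing already carries type and length. */
    static final class Status {
        final String path;
        final boolean dir;
        final long len;

        Status(String path, boolean dir, long len) {
            this.path = path;
            this.dir = dir;
            this.len = len;
        }
    }

    /** Tiny in-memory file system that counts getFileStatus calls. */
    static final class Fs {
        private final Map<String, Status> entries = new HashMap<>();
        private final Map<String, List<Status>> children = new HashMap<>();
        int statusCalls = 0;

        Fs() {
            entries.put("/", new Status("/", true, 0));
        }

        void add(String parent, String path, boolean dir, long len) {
            Status s = new Status(path, dir, len);
            entries.put(path, s);
            children.computeIfAbsent(parent, k -> new ArrayList<>()).add(s);
        }

        Status getFileStatus(String path) {
            statusCalls++; // models the potentially expensive round trip
            return entries.get(path);
        }

        List<Status> listStatus(String dir) {
            return children.getOrDefault(dir, new ArrayList<>());
        }

        /** Total bytes under path; recurses only into directories. */
        long contentLength(String path) {
            Status s = getFileStatus(path);
            if (!s.dir) {
                return s.len;
            }
            long total = 0;
            for (Status child : listStatus(path)) {
                // Files contribute the length the listing already returned.
                total += child.dir ? contentLength(child.path) : child.len;
            }
            return total;
        }
    }

    /** du: one size per top-level item, computed once and saved. */
    static Map<String, Long> du(Fs fs, String dir) {
        Map<String, Long> sizes = new LinkedHashMap<>();
        for (Status item : fs.listStatus(dir)) {
            long size = item.dir ? fs.contentLength(item.path) : item.len;
            sizes.put(item.path, size);
        }
        return sizes;
    }

    public static void main(String[] args) {
        Fs fs = new Fs();
        fs.add("/", "/d", true, 0);
        fs.add("/d", "/d/a", false, 10);
        fs.add("/d", "/d/b", false, 20);
        fs.add("/", "/f", false, 5);
        System.out.println(du(fs, "/") + " statusCalls=" + fs.statusCalls);
    }
}
```

      In this toy tree, only the directory /d triggers a status lookup; the three files are sized from the lengths their listings already carry.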

      Another solution, rather than adding special casing to callers, is to add a getContentSummary that takes a FileStatus.
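      A rough sketch of that alternative, under the same caveats: Status and Fs are hypothetical stand-ins rather than Hadoop classes, and contentLength plays the role of getContentSummary. The path-based method performs a single lookup and delegates to an overload that accepts the status a caller already holds, so no caller needs its own file-versus-directory special case.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: Status and Fs are hypothetical stand-ins, not Hadoop's API.
final class SummaryOverloadSketch {
    /** Stand-in for FileStatus. */
    static final class Status {
        final String path;
        final boolean dir;
        final long len;

        Status(String path, boolean dir, long len) {
            this.path = path;
            this.dir = dir;
            this.len = len;
        }
    }

    /** Tiny in-memory file system that counts getFileStatus calls. */
    static final class Fs {
        private final Map<String, Status> entries = new HashMap<>();
        private final Map<String, List<Status>> children = new HashMap<>();
        int statusCalls = 0;

        Fs() {
            entries.put("/", new Status("/", true, 0));
        }

        void add(String parent, String path, boolean dir, long len) {
            Status s = new Status(path, dir, len);
            entries.put(path, s);
            children.computeIfAbsent(parent, k -> new ArrayList<>()).add(s);
        }

        Status getFileStatus(String path) {
            statusCalls++; // models the potentially expensive round trip
            return entries.get(path);
        }

        List<Status> listStatus(String dir) {
            return children.getOrDefault(dir, new ArrayList<>());
        }

        /** Path-based entry point: one lookup, then delegate. */
        long contentLength(String path) {
            return contentLength(getFileStatus(path));
        }

        /** Status-based overload: needs no lookup for files or directories. */
        long contentLength(Status s) {
            if (!s.dir) {
                return s.len;
            }
            long total = 0;
            for (Status child : listStatus(s.path)) {
                total += contentLength(child); // reuses the listed status
            }
            return total;
        }
    }

    public static void main(String[] args) {
        Fs fs = new Fs();
        fs.add("/", "/d", true, 0);
        fs.add("/d", "/d/a", false, 10);
        fs.add("/", "/f", false, 5);
        System.out.println(fs.contentLength("/") + " statusCalls=" + fs.statusCalls);
    }
}
```

      A caller such as du, which already holds the statuses from a directory listing, could pass each one straight to the overload and avoid the per-item lookup entirely.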

      Attachments

        1. hadoop-fsshell-du-simple.patch
          2 kB
          David Phillips
        2. hadoop-fsshell-du-simple.patch
          2 kB
          David Phillips

          People

            Assignee: David Phillips (electrum)
            Reporter: David Phillips (electrum)
            Votes: 0
            Watchers: 3