Key: HADOOP-4339
Type: Improvement
Status: Closed
Resolution: Fixed
Priority: Major
Assignee: David Phillips
Reporter: David Phillips
Votes: 0
Watchers: 3
Hadoop Common

Improve FsShell -du/-dus and FileSystem.getContentSummary efficiency

Created: 03/Oct/08 09:22 PM   Updated: 23/Apr/09 07:17 PM
Component/s: fs
Affects Version/s: 0.18.1
Fix Version/s: 0.20.0

Time Tracking:
Not Specified

File Attachments:
hadoop-fsshell-du-simple.patch   2008-11-24 11:47 PM   David Phillips   2 kB   (licensed for inclusion in ASF works)
hadoop-fsshell-du-simple.patch   2008-10-16 08:08 PM   David Phillips   2 kB   (licensed for inclusion in ASF works)

Hadoop Flags: Reviewed
Resolution Date: 25/Nov/08 02:55 AM


Description
FsShell.du has two inefficiencies:
  • calling getContentSummary twice for each top-level item rather than calling it once and saving the result
  • calling getContentSummary for files rather than using the size it already has in FileStatus

getContentSummary has one:

  • calling itself for files rather than using the length it already has in FileStatus

Every call to getContentSummary results in a call to getFileStatus, which may be expensive (e.g. NativeS3FileSystem has both network latency and actual monetary cost).

The simple solution:

  • FsShell.du calls once per item and saves the ContentSummary
  • FsShell.du uses FileStatus.getLen for files
  • getContentSummary only calls itself for directories
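
The effect of the simple solution on getContentSummary can be sketched as follows. This is a minimal illustration, not the committed patch: Status and CountingFs are hypothetical stand-ins for FileStatus and a FileSystem, and the in-memory tree only exists to count how often getFileStatus is invoked under each strategy.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ContentSummarySketch {

    /** Hypothetical stand-in for Hadoop's FileStatus. */
    static final class Status {
        final String path;
        final long len;
        final boolean isDir;

        Status(String path, long len, boolean isDir) {
            this.path = path;
            this.len = len;
            this.isDir = isDir;
        }
    }

    /** Hypothetical in-memory file system that counts getFileStatus calls. */
    static final class CountingFs {
        final Map<String, Status> entries = new HashMap<>();
        final Map<String, List<Status>> children = new HashMap<>();
        int statusCalls = 0;

        void add(String parent, Status status) {
            entries.put(status.path, status);
            if (status.isDir) {
                children.put(status.path, new ArrayList<>());
            }
            if (parent != null) {
                children.get(parent).add(status);
            }
        }

        /** The expensive call: counted once per invocation. */
        Status getFileStatus(String path) {
            statusCalls++;
            return entries.get(path);
        }

        /** A listing already carries each child's length and type. */
        List<Status> listStatus(String dir) {
            return children.get(dir);
        }
    }

    /** Old behavior: recurse into every child, files included. */
    static long naiveLength(CountingFs fs, String path) {
        Status status = fs.getFileStatus(path);
        if (!status.isDir) {
            return status.len;
        }
        long total = 0;
        for (Status child : fs.listStatus(path)) {
            total += naiveLength(fs, child.path);
        }
        return total;
    }

    /** Proposed behavior: recurse only into directories, reuse listed lengths for files. */
    static long improvedLength(CountingFs fs, String path) {
        Status status = fs.getFileStatus(path);
        if (!status.isDir) {
            return status.len;
        }
        long total = 0;
        for (Status child : fs.listStatus(path)) {
            total += child.isDir ? improvedLength(fs, child.path) : child.len;
        }
        return total;
    }

    /** Sample tree: /data holds files a (10) and b (20) plus directory sub with file c (30). */
    static CountingFs sampleTree() {
        CountingFs fs = new CountingFs();
        fs.add(null, new Status("/data", 0, true));
        fs.add("/data", new Status("/data/a", 10, false));
        fs.add("/data", new Status("/data/b", 20, false));
        fs.add("/data", new Status("/data/sub", 0, true));
        fs.add("/data/sub", new Status("/data/sub/c", 30, false));
        return fs;
    }

    public static void main(String[] args) {
        CountingFs naive = sampleTree();
        System.out.println(naiveLength(naive, "/data") + " bytes, " + naive.statusCalls + " calls");
        CountingFs improved = sampleTree();
        System.out.println(improvedLength(improved, "/data") + " bytes, " + improved.statusCalls + " calls");
    }
}
```

On this sample tree both strategies report 60 bytes, but the naive version makes five getFileStatus calls (one per file and directory) while the improved version makes two (one per directory).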

Another solution, rather than adding special casing to callers, is to add a getContentSummary that takes a FileStatus.



David Phillips added a comment - 03/Oct/08 10:39 PM
Patch for the simple solution. This reduces calls to getFileStatus to one per directory for -du and -dus [1], as opposed to the previous one (-dus) or two (-du) per file and directory.

[1] -dus still has an extra call for the base directory due to the initial call to globStatus.


David Phillips made changes - 03/Oct/08 10:39 PM
Field Original Value New Value
Attachment simple.patch [ 12391448 ]
David Phillips made changes - 08/Oct/08 12:28 AM
Status Open [ 1 ] Patch Available [ 10002 ]
David Phillips made changes - 16/Oct/08 08:08 PM
Attachment hadoop-fsshell-du-simple.patch [ 12392279 ]
David Phillips made changes - 16/Oct/08 08:08 PM
Attachment simple.patch [ 12391448 ]
Hadoop QA added a comment - 07/Nov/08 05:21 PM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12392279/hadoop-fsshell-du-simple.patch
against trunk revision 712102.

+1 @author. The patch does not contain any @author tags.

-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no tests are needed for this patch.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs warnings.

+1 Eclipse classpath. The patch retains Eclipse classpath integrity.

+1 core tests. The patch passed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3549/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3549/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3549/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3549/console

This message is automatically generated.


Chris Douglas added a comment - 22/Nov/08 04:43 AM
In FsShell, it makes more sense to save the length instead of the ContentSummary. The FileSystem change looks good.

Chris Douglas made changes - 22/Nov/08 04:45 AM
Fix Version/s 0.20.0 [ 12313438 ]
Assignee David Phillips [ electrum ]
Status Patch Available [ 10002 ] Open [ 1 ]
David Phillips added a comment - 24/Nov/08 11:47 PM
Good point, Chris. Patch updated.

David Phillips made changes - 24/Nov/08 11:47 PM
Attachment hadoop-fsshell-du-simple.patch [ 12394615 ]
Chris Douglas added a comment - 25/Nov/08 12:09 AM
+1 Looks good

Chris Douglas made changes - 25/Nov/08 12:09 AM
Hadoop Flags [Reviewed]
Status Open [ 1 ] Patch Available [ 10002 ]
Repository Revision Date User Message
ASF #720386 Tue Nov 25 02:54:38 UTC 2008 cdouglas HADOOP-4339. Remove redundant calls from FileSystem/FsShell when
generating/processing ContentSummary. Contributed by David Phillips.
Files Changed
MODIFY /hadoop/core/trunk/CHANGES.txt
MODIFY /hadoop/core/trunk/src/core/org/apache/hadoop/fs/FsShell.java
MODIFY /hadoop/core/trunk/src/core/org/apache/hadoop/fs/FileSystem.java

Chris Douglas made changes - 25/Nov/08 02:54 AM
Issue Type Bug [ 1 ] Improvement [ 4 ]
Chris Douglas added a comment - 25/Nov/08 02:55 AM
I just committed this. Thanks, David.

Chris Douglas made changes - 25/Nov/08 02:55 AM
Resolution Fixed [ 1 ]
Status Patch Available [ 10002 ] Resolved [ 5 ]
Hudson added a comment - 25/Nov/08 06:39 PM
Integrated in Hadoop-trunk #670 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/670/)
HADOOP-4339. Remove redundant calls from FileSystem/FsShell when
generating/processing ContentSummary. Contributed by David Phillips.

Nigel Daley made changes - 23/Apr/09 07:17 PM
Status Resolved [ 5 ] Closed [ 6 ]