
Key: HADOOP-2219
Type: New Feature
Status: Closed
Resolution: Fixed
Priority: Major
Assignee: Tsz Wo (Nicholas), SZE
Reporter: Koji Noguchi
Votes: 0
Watchers: 1
Hadoop Common

du like command to count number of files under a given directory

Created: 17/Nov/07 01:39 AM   Updated: 08/Jul/09 04:42 PM
Component/s: None
Affects Version/s: None
Fix Version/s: 0.17.0

Time Tracking:
Not Specified

File Attachments:
2219_20080226.patch (text file, 25 kB, licensed for inclusion in ASF works), 2008-02-27 12:16 AM, Tsz Wo (Nicholas), SZE
2219_20080227.patch (text file, 27 kB, licensed for inclusion in ASF works), 2008-02-29 07:18 PM, Tsz Wo (Nicholas), SZE
2219_20080229.patch (text file, 27 kB, licensed for inclusion in ASF works), 2008-02-29 09:21 PM, Tsz Wo (Nicholas), SZE

Hadoop Flags: Incompatible change
Release Note:
Added a new fs command fs -count for counting the number of bytes, files and directories under a given path.

Added a new RPC getContentSummary(String path) to ClientProtocol.
Resolution Date: 03/Mar/08 07:10 AM


Description
To keep the total number of files on DFS low, we would like users to be able to easily find out how many files each of their directories contains.

Currently, we only have fsck or dfs -lsr, both of which take time.

Can I ask for an option for du to show the total number of files (as well as the total size) of a given directory?



Nigel Daley made changes - 22/Jan/08 07:32 PM
Field Original Value New Value
Fix Version/s 0.16.0 [ 12312740 ]
Tsz Wo (Nicholas), SZE made changes - 22/Feb/08 11:09 PM
Assignee Tsz Wo (Nicholas), SZE [ szetszwo ]
Tsz Wo (Nicholas), SZE added a comment - 22/Feb/08 11:18 PM
I will add an option to du.

For the implementation, I am thinking about adding a method getContentSummary(Path) in the FileSystem. It returns a ContentSummary (a new class) object which contains length, number of files and number of directories.

Similar to FileSystem.getContentLength(Path), a default implementation of getContentSummary(Path) that uses only the FileSystem API will be provided in FileSystem. DistributedFileSystem will then override getContentSummary(Path) to provide a NameNode-side implementation.

Since content length can be obtained by getContentSummary(Path), I will deprecate getContentLength(Path).
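
As a rough illustration of the generic, client-side variant described above, the sketch below walks a directory tree through a public file API and accumulates the three totals. It is not the patch's code: it uses java.nio.file on a local file system as a stand-in for the Hadoop FileSystem API, and the class name ContentSummarySketch and the method summarize are hypothetical. It counts the starting directory itself, which appears to be consistent with the sample output later in this thread.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;

/** Hypothetical stand-in for the proposed ContentSummary: three totals for a subtree. */
final class ContentSummarySketch {
    final long length;          // total bytes of all non-directory entries
    final long fileCount;       // number of non-directory entries
    final long directoryCount;  // number of directories, including the root if it is one

    ContentSummarySketch(long length, long fileCount, long directoryCount) {
        this.length = length;
        this.fileCount = fileCount;
        this.directoryCount = directoryCount;
    }

    /** Generic computation: visit every entry under root using only the public file API. */
    static ContentSummarySketch summarize(Path root) throws IOException {
        final long[] totals = new long[3]; // [0]=bytes, [1]=files, [2]=directories
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes attrs) {
                totals[2]++;
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                // Called for every entry that is not a traversed directory.
                totals[0] += attrs.size();
                totals[1]++;
                return FileVisitResult.CONTINUE;
            }
        });
        return new ContentSummarySketch(totals[0], totals[1], totals[2]);
    }
}
```

A walk like this touches every entry from the client side, which is the cost a NameNode-side override is meant to avoid.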


Tsz Wo (Nicholas), SZE added a comment - 23/Feb/08 01:14 AM - edited
Instead of adding an option, how about changing the output to the following?
bash-3.2$ ./bin/hadoop fs -du /
Found 3 items
              byte         file directory
               159            1         0   hdfs://host:9000/a.txt
             44198            1         0   hdfs://host:9000/build.xml
               318            2         2   hdfs://host:9000/user
bash-3.2$ ./bin/hadoop fs -dus /
             44675            4         3   /

Allen Wittenauer added a comment - 23/Feb/08 01:34 AM
Why not keep du like du and make a df command instead?

Tsz Wo (Nicholas), SZE added a comment - 25/Feb/08 10:41 PM
I am fine with making a new command. However, df in unix is for disk space usage, not directory space usage. So, it may be confusing. It seems to me that there is no specific unix command for counting files and directories.

Allen Wittenauer added a comment - 25/Feb/08 10:46 PM
The UNIX way of doing that is to use find with a pipe to wc (amongst other ways). But df -i is probably the closest to a single command.

Tsz Wo (Nicholas), SZE added a comment - 26/Feb/08 09:25 PM
Since the new feature is provided by neither du nor df in unix, let's create a new command. How about making a new command "count"? The usage would be
bash-3.2$ ./bin/hadoop fs -count /
             44675            4         3   /

Tsz Wo (Nicholas), SZE added a comment - 27/Feb/08 12:16 AM
2219_20080226.patch: adding a new command "fs -count"

Tsz Wo (Nicholas), SZE made changes - 27/Feb/08 12:16 AM
Attachment 2219_20080226.patch [ 12376578 ]
dhruba borthakur added a comment - 29/Feb/08 09:04 AM
1. It might be a good idea to deprecate getContentLen in ClientProtocol and Namenode.
2. INode.computeContentSummary can take a ContentSummary object as a parameter rather than an array of three longs.
3. This patch removes the optimization in DistributedFileSystem.getContentLength(). In the original code, if the path object is a DfsPath object, then no additional RPC is required. In the patch, an RPC is always required.

Tsz Wo (Nicholas), SZE added a comment - 29/Feb/08 07:18 PM
2219_20080227.patch:

1. Deprecated getContentLength() in ClientProtocol, NameNode, FileSystem, DistributedFileSystem and DFSClient. The ones in FSNamesystem and INode are removed directly since they are not public APIs.

2. The reason for using an array of longs is efficiency. computeContentSummary(...) recursively goes through the INode tree. If a ContentSummary object is used, the values have to be updated by two method calls (get, set) for each recursive call. If we use longs, we only have to do a +=.

3. Reverted DistributedFileSystem.getContentLength() to keep the optimization.
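
The array-based accumulation from point 2 can be sketched roughly as follows. This is only an illustration under assumptions: NodeSketch is a hypothetical, simplified stand-in for an inode, not the actual INode class, and the slot order of the array (bytes, files, directories) is assumed here rather than taken from the patch. Each recursive call updates the shared long[] in place with ++ or +=, with no getter or setter calls.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical, simplified inode; not the Hadoop INode class. */
final class NodeSketch {
    final boolean isDirectory;
    final long length;                                   // bytes; 0 for directories
    final List<NodeSketch> children = new ArrayList<>(); // empty for files

    private NodeSketch(boolean isDirectory, long length) {
        this.isDirectory = isDirectory;
        this.length = length;
    }

    static NodeSketch file(long length) {
        return new NodeSketch(false, length);
    }

    static NodeSketch dir(NodeSketch... kids) {
        NodeSketch d = new NodeSketch(true, 0L);
        d.children.addAll(Arrays.asList(kids));
        return d;
    }

    /** Adds this subtree's totals into summary: [0]=bytes, [1]=files, [2]=directories. */
    long[] computeContentSummary(long[] summary) {
        if (isDirectory) {
            summary[2]++;
            for (NodeSketch child : children) {
                child.computeContentSummary(summary);
            }
        } else {
            summary[0] += length;
            summary[1]++;
        }
        return summary;
    }
}
```

For example, a tree shaped like the sample listing earlier in this thread (files of 159 and 44198 bytes at the root, plus a directory holding 318 bytes in two files and one subdirectory; the 300/18 split is made up) can be built with NodeSketch.dir(NodeSketch.file(159), NodeSketch.file(44198), NodeSketch.dir(NodeSketch.file(300), NodeSketch.dir(NodeSketch.file(18)))). Calling computeContentSummary(new long[3]) on it yields 44675 bytes, 4 files and 3 directories.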


Tsz Wo (Nicholas), SZE made changes - 29/Feb/08 07:18 PM
Attachment 2219_20080227.patch [ 12376861 ]
dhruba borthakur added a comment - 29/Feb/08 07:41 PM
+1. Code looks good. A minor typo: Count.DISCRIPTION should be Count.DESCRIPTION.

Tsz Wo (Nicholas), SZE added a comment - 29/Feb/08 09:21 PM
2219_20080229.patch: fixed the typos.

Tsz Wo (Nicholas), SZE made changes - 29/Feb/08 09:21 PM
Attachment 2219_20080229.patch [ 12376867 ]
Tsz Wo (Nicholas), SZE made changes - 29/Feb/08 09:22 PM
Status Open [ 1 ] Patch Available [ 10002 ]
Hadoop QA added a comment - 01/Mar/08 01:14 AM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12376867/2219_20080229.patch
against trunk revision 619744.

@author +1. The patch does not contain any @author tags.

tests included +1. The patch appears to include 7 new or modified tests.

javadoc +1. The javadoc tool did not generate any warning messages.

javac -1. The applied patch generated 628 javac compiler warnings (more than the trunk's current 614 warnings).

release audit +1. The applied patch does not generate any new release audit warnings.

findbugs +1. The patch does not introduce any new Findbugs warnings.

core tests +1. The patch passed core unit tests.

contrib tests +1. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1880/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1880/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1880/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1880/console

This message is automatically generated.


Tsz Wo (Nicholas), SZE added a comment - 01/Mar/08 01:39 AM
The additional javac warnings are due to the newly deprecated APIs.

dhruba borthakur added a comment - 03/Mar/08 07:10 AM
I just committed this. Thanks Nicholas!

dhruba borthakur made changes - 03/Mar/08 07:10 AM
Resolution Fixed [ 1 ]
Fix Version/s 0.17.0 [ 12312913 ]
Status Patch Available [ 10002 ] Resolved [ 5 ]
Repository Revision Date User Message
ASF #632960 Mon Mar 03 07:10:44 UTC 2008 dhruba HADOOP-2219. A new command "df -count" that counts the number of
files and directories. (Tsz Wo (Nicholas), SZE via dhruba)
Files Changed
MODIFY /hadoop/core/trunk/src/java/org/apache/hadoop/dfs/FSNamesystem.java
MODIFY /hadoop/core/trunk/src/java/org/apache/hadoop/dfs/DistributedFileSystem.java
ADD /hadoop/core/trunk/src/java/org/apache/hadoop/fs/ContentSummary.java
MODIFY /hadoop/core/trunk/src/java/org/apache/hadoop/fs/FileSystem.java
MODIFY /hadoop/core/trunk/src/java/org/apache/hadoop/fs/FsShell.java
MODIFY /hadoop/core/trunk/src/java/org/apache/hadoop/dfs/INode.java
MODIFY /hadoop/core/trunk/src/java/org/apache/hadoop/dfs/ClientProtocol.java
MODIFY /hadoop/core/trunk/src/test/org/apache/hadoop/dfs/TestDFSShell.java
ADD /hadoop/core/trunk/src/java/org/apache/hadoop/fs/shell/CommandUtils.java
ADD /hadoop/core/trunk/src/java/org/apache/hadoop/fs/shell
MODIFY /hadoop/core/trunk/src/java/org/apache/hadoop/dfs/FSDirectory.java
MODIFY /hadoop/core/trunk/CHANGES.txt
MODIFY /hadoop/core/trunk/src/java/org/apache/hadoop/dfs/DFSClient.java
MODIFY /hadoop/core/trunk/src/java/org/apache/hadoop/dfs/DFSFileInfo.java
MODIFY /hadoop/core/trunk/src/java/org/apache/hadoop/dfs/LocatedBlocks.java
ADD /hadoop/core/trunk/src/java/org/apache/hadoop/fs/shell/Count.java
MODIFY /hadoop/core/trunk/src/java/org/apache/hadoop/dfs/NameNode.java

Hudson added a comment - 03/Mar/08 12:35 PM

Tsz Wo (Nicholas), SZE made changes - 17/Apr/08 12:30 AM
Release Note Added a new fs command fs -count for counting the number of bytes, files and directories under a given path.

Added a new RPC getContentSummary(String path) to ClientProtocol.
Hadoop Flags [Incompatible change]
Nigel Daley made changes - 21/May/08 08:05 PM
Status Resolved [ 5 ] Closed [ 6 ]
Owen O'Malley made changes - 08/Jul/09 04:42 PM
Component/s dfs [ 12310710 ]