HADOOP-2219 (Hadoop Common)

du-like command to count the number of files under a given directory

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.17.0
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Incompatible change
    • Release Note:
      Added a new fs command, fs -count, for counting the number of bytes, files, and directories under a given path. Added a new RPC, getContentSummary(String path), to ClientProtocol.

      Description

      To keep the total number of files on dfs low, we would like users to be able to easily find out how many files each of their directories contains.

      Currently, we only have fsck or dfs -lsr, both of which take time.

      Can I ask for an option for du to show the total number of files (as well as the total size) of a given directory?

      Attachments

      1. 2219_20080229.patch (27 kB) - Tsz Wo Nicholas Sze
      2. 2219_20080227.patch (27 kB) - Tsz Wo Nicholas Sze
      3. 2219_20080226.patch (25 kB) - Tsz Wo Nicholas Sze

        Activity

        Tsz Wo Nicholas Sze added a comment -

        I will add an option to du.

        For the implementation, I am thinking about adding a method getContentSummary(Path) to FileSystem. It returns a ContentSummary object (a new class) containing the length, the number of files, and the number of directories.

        As with FileSystem.getContentLength(Path), an implementation of getContentSummary(Path) that uses only the FileSystem API will be provided in FileSystem. Then DistributedFileSystem will override getContentSummary(Path) to provide a NameNode-side implementation.

        Since content length can be obtained by getContentSummary(Path), I will deprecate getContentLength(Path).
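
        A minimal sketch of the idea (illustrative only, not the patch itself): ContentSummary as a simple value class, plus a generic implementation inside FileSystem that recurses with the public API, which DistributedFileSystem would then override with a single RPC.

          /** Value class holding the three totals. */
          public class ContentSummary {
            private final long length;          // total bytes
            private final long fileCount;       // total files
            private final long directoryCount;  // total directories

            public ContentSummary(long length, long fileCount, long directoryCount) {
              this.length = length;
              this.fileCount = fileCount;
              this.directoryCount = directoryCount;
            }
            public long getLength() { return length; }
            public long getFileCount() { return fileCount; }
            public long getDirectoryCount() { return directoryCount; }
          }

          /** Generic FileSystem-side implementation: recurse with listStatus. */
          public ContentSummary getContentSummary(Path f) throws IOException {
            FileStatus status = getFileStatus(f);
            if (!status.isDir()) {
              return new ContentSummary(status.getLen(), 1, 0);  // a plain file
            }
            long length = 0, files = 0, dirs = 1;  // count this directory itself
            for (FileStatus child : listStatus(f)) {
              ContentSummary c = getContentSummary(child.getPath());
              length += c.getLength();
              files += c.getFileCount();
              dirs += c.getDirectoryCount();
            }
            return new ContentSummary(length, files, dirs);
          }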

        Tsz Wo Nicholas Sze added a comment - edited

        Instead of adding an option, how about changing the output to the following?

        bash-3.2$ ./bin/hadoop fs -du /
        Found 3 items
                      byte         file directory
                       159            1         0   hdfs://host:9000/a.txt
                     44198            1         0   hdfs://host:9000/build.xml
                       318            2         2   hdfs://host:9000/user
        bash-3.2$ ./bin/hadoop fs -dus /
                     44675            4         3   /
        
        Allen Wittenauer added a comment -

        Why not keep du like du and make a df command instead?

        Tsz Wo Nicholas Sze added a comment -

        I am fine with making a new command. However, df in unix reports disk space usage, not directory space usage, so it may be confusing. It seems to me that there is no specific unix command for counting files and directories.

        Allen Wittenauer added a comment -

        The UNIX way of doing that is to use find with a pipe to wc (amongst other ways). But df -i is probably the closest to a single command.

        Tsz Wo Nicholas Sze added a comment -

        Since the new feature is provided by neither du nor df in unix, let's create a new command. How about a new command "count"? The usage would be:

        bash-3.2$ ./bin/hadoop fs -count /
                     44675            4         3   /
        
        Tsz Wo Nicholas Sze added a comment -

        2219_20080226.patch: adds a new command "fs -count" (sketched below).
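
        For reference, a stripped-down sketch of what such a command does on top of the new API (column order as in the proposed output above; names here are illustrative, not the patch itself):

          // Hypothetical standalone equivalent of "fs -count <path> ...":
          // resolve each path, ask its FileSystem for a ContentSummary,
          // and print bytes, files, and directories.
          public static void count(Configuration conf, String... srcs) throws IOException {
            for (String src : srcs) {
              Path path = new Path(src);
              FileSystem fs = path.getFileSystem(conf);
              ContentSummary summary = fs.getContentSummary(path);
              System.out.printf("%21d %12d %9d   %s%n",
                  summary.getLength(), summary.getFileCount(),
                  summary.getDirectoryCount(), src);
            }
          }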

        dhruba borthakur added a comment -

        1. It might be a good idea to deprecate getContentLen in ClientProtocol and NameNode.
        2. INode.computeContentSummary can take a ContentSummary object as a parameter rather than an array of three longs.
        3. This patch removes the optimization in DistributedFileSystem.getContentLength(). In the original code, if the path object is a DfsPath object, no additional RPC is required; in the patch, an RPC is always required (see the sketch after this list).
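
        The fast path referred to in point 3 looks roughly like this (DfsPath caches metadata obtained from a directory listing; exact method names may differ):

          public long getContentLength(Path f) throws IOException {
            if (f instanceof DfsPath) {
              // The path came from a listing, so its length is already
              // cached on the client; no NameNode round trip is needed.
              return ((DfsPath) f).getContentsLength();
            }
            return dfs.getContentLength(getPathName(f));  // falls back to an RPC
          }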

        Tsz Wo Nicholas Sze added a comment -

        2219_20080227.patch:

        1. Deprecated getContentLength() in ClientProtocol, NameNode, FileSystem, DistributedFileSystem and DFSClient. The ones in FSNamesystem and INode are removed directly since they are not public APIs.

        2. The reason for using an array of longs is efficiency. computeContentSummary(...) recursively walks the INode tree. If a ContentSummary object were used, the values would have to be updated by two method calls (get, set) in each recursive call. With longs, we only need a += (see the sketch below).

        3. Reverted DistributedFileSystem.getContentLength() to keep the optimization.
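
        The accumulation pattern in point 2, sketched with illustrative INode classes:

          // summary layout: [0] = length in bytes, [1] = files, [2] = directories.
          // Each node adds into the caller's array in place, so the traversal
          // costs a plain += per field instead of get/set method calls on a
          // ContentSummary object at every inode.
          abstract class INode {
            abstract long[] computeContentSummary(long[] summary);
          }

          class INodeFile extends INode {
            long length;
            long[] computeContentSummary(long[] summary) {
              summary[0] += length;
              summary[1]++;      // one more file
              return summary;
            }
          }

          class INodeDirectory extends INode {
            final java.util.List<INode> children = new java.util.ArrayList<INode>();
            long[] computeContentSummary(long[] summary) {
              for (INode child : children) {
                child.computeContentSummary(summary);
              }
              summary[2]++;      // one more directory
              return summary;
            }
          }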

        dhruba borthakur added a comment -

        +1. Code looks good. A minor typo: Count.DISCRIPTION should be Count.DESCRIPTION.

        Tsz Wo Nicholas Sze added a comment -

        2219_20080229.patch: fixed the typos.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12376867/2219_20080229.patch
        against trunk revision 619744.

        @author +1. The patch does not contain any @author tags.

        tests included +1. The patch appears to include 7 new or modified tests.

        javadoc +1. The javadoc tool did not generate any warning messages.

        javac -1. The applied patch generated 628 javac compiler warnings (more than the trunk's current 614 warnings).

        release audit +1. The applied patch does not generate any new release audit warnings.

        findbugs +1. The patch does not introduce any new Findbugs warnings.

        core tests +1. The patch passed core unit tests.

        contrib tests +1. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1880/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1880/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1880/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1880/console

        This message is automatically generated.

        Tsz Wo Nicholas Sze added a comment -

        The additional javac warnings are due to the newly deprecated APIs.

        dhruba borthakur added a comment -

        I just committed this. Thanks Nicholas!

        Hudson added a comment -

        Integrated in Hadoop-trunk #418 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/418/)

          People

          • Assignee: Tsz Wo Nicholas Sze
          • Reporter: Koji Noguchi
          • Votes: 0
          • Watchers: 1

            Dates

            • Created:
            • Updated:
            • Resolved:
