Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.21.0
    • Fix Version/s: 0.21.0
    • Component/s: test, tools
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      This is a tool for analyzing file size distribution in HDFS using a modified Offline Image Viewer tool.

      1. OIV-FileDistr.patch
        8 kB
        Konstantin Shvachko
      2. OIV-FileDistr.patch
        13 kB
        Konstantin Shvachko
      3. OIV-FileDistr.patch
        14 kB
        Konstantin Shvachko

        Activity

        Konstantin Shvachko created issue -
        Konstantin Shvachko added a comment -

        To run the tool, define a range of integers [0, maxSize] by specifying maxSize and a step.
        The range is divided into segments of size step: [0, s_1, ..., s_n-1, maxSize].
        The tool calculates how many files in the system fall into each segment [s_i-1, s_i).
        Note that files larger than maxSize always fall into the very last segment.
        The result is a two-column table. The first column lists the segments and the second contains the number of files in each segment.
        The results can easily be visualized using Excel, R, or other graphing software.
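
        The segment counting described above can be sketched roughly as follows. This is not the code from the patch: the class name FileSizeHistogramSketch and the method countBySegment are hypothetical, and the sketch assumes that a size exactly equal to maxSize is counted in the last segment together with the larger files.

```java
/**
 * Illustrative sketch only; not the code from OIV-FileDistr.patch.
 * Counts how many file sizes fall into each segment of width `step`
 * within the range [0, maxSize].
 */
final class FileSizeHistogramSketch {
    private FileSizeHistogramSketch() {}

    /**
     * Returns counts where counts[i] is the number of sizes in
     * [i * step, (i + 1) * step). Sizes at or above maxSize are
     * counted in the last segment (an assumption of this sketch).
     */
    static int[] countBySegment(long[] fileSizes, long maxSize, long step) {
        if (maxSize <= 0 || step <= 0) {
            throw new IllegalArgumentException("maxSize and step must be positive");
        }
        // Number of segments needed to cover [0, maxSize), rounded up.
        int segments = (int) ((maxSize + step - 1) / step);
        int[] counts = new int[segments];
        for (long size : fileSizes) {
            if (size < 0) {
                throw new IllegalArgumentException("negative file size: " + size);
            }
            int index = size >= maxSize ? segments - 1 : (int) (size / step);
            counts[index]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        long step = 10;
        long[] sizes = {0, 5, 10, 19, 20, 25, 1000};
        int[] counts = countBySegment(sizes, 30, step);
        // Two-column table: segment start, number of files in the segment.
        for (int i = 0; i < counts.length; i++) {
            System.out.println((i * step) + "\t" + counts[i]);
        }
    }
}
```

        With maxSize = 30 and step = 10, the sample sizes produce three segments, and the file of size 1000 is counted in the last one.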

        Konstantin Shvachko made changes -
        Field Original Value New Value
        Component/s tools [ 12312944 ]
        Konstantin Shvachko made changes -
        Attachment OIV-FileDistr.patch [ 12412331 ]
        Konstantin Shvachko made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Jakob Homan added a comment -

        Looks good. How was it to implement the processor? I think I see some places where I can improve the interface for those extending the oiv.

        Concerns:

        • The Javadoc is great, but since oiv is a user-facing tool, most of the documentation should go into the forrest docs on the oiv page.
        • Currently the tool only counts inodes, not inodes-under-construction. If this is intended, it should be documented.
        • The command-line processor uses equalsIgnoreCase, unlike the other options, which use the equals method. All options should use the same comparison method, and I believe case sensitivity is standard.
        • There doesn't appear to be any documentation of the step or maxSize options in the usage section that is printed to the terminal.

        Nits:

        • The modulo division results in output at every 1000001, 2000001, 3000001, ..., which is reasonable, but multiples of one million might be more standard.
        Jakob Homan added a comment -

        Also, the -p line under the optional command-line arguments should be changed to list FileDistribution.

        Konstantin Shvachko made changes -
        Status Patch Available [ 10002 ] Open [ 1 ]
        Konstantin Shvachko added a comment -

        Fixed the problems mentioned by Jakob: the tool now counts inodes under construction and uses case-sensitive equals, and I updated the OIV documentation and the online usage help.
        The idea behind printing from 1 rather than 0 is that I wanted to have something printed at the beginning. If you print on round millions, you have to wait until the first million files have been processed, which may take longer than you have patience for.
        I also included a test case, which tests the new visitor explicitly.
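
        The difference between the two reporting schemes discussed here can be illustrated with a small sketch. The patch's actual code is not shown in this thread, so the class ProgressReportSketch, the constant REPORT_INTERVAL, and both method names are hypothetical.

```java
/**
 * Illustrative sketch only; names are hypothetical and not taken from the patch.
 * Compares reporting at counts 1, 1000001, 2000001, ... with reporting
 * at round millions.
 */
final class ProgressReportSketch {
    static final long REPORT_INTERVAL = 1_000_000L;

    private ProgressReportSketch() {}

    /** True at 1, 1000001, 2000001, ...: the very first file triggers a report. */
    static boolean reportsFromOne(long processed) {
        return processed % REPORT_INTERVAL == 1;
    }

    /** True at 1000000, 2000000, ...: nothing is reported until a full million. */
    static boolean reportsOnRoundMillions(long processed) {
        return processed > 0 && processed % REPORT_INTERVAL == 0;
    }

    public static void main(String[] args) {
        long[] samples = {1L, 1_000_000L, 1_000_001L, 2_000_001L};
        for (long n : samples) {
            System.out.println(n + "\tfromOne=" + reportsFromOne(n)
                + "\troundMillions=" + reportsOnRoundMillions(n));
        }
    }
}
```

        With the first scheme a line appears as soon as one file has been processed; with the second, the first line appears only after a million files.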

        Konstantin Shvachko made changes -
        Attachment OIV-FileDistr.patch [ 12412684 ]
        Konstantin Shvachko made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12412684/OIV-FileDistr.patch
        against trunk revision 790733.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        -1 contrib tests. The patch failed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-vesta.apache.org/5/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-vesta.apache.org/5/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-vesta.apache.org/5/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-vesta.apache.org/5/console

        This message is automatically generated.

        Jakob Homan added a comment -

        Looks good. +1 with two minor quibbles:

        • On line 25 of your patch, the console help should be wrapped to fit with the rest of the text; currently it runs very long:
            * XML: This processor creates an XML document with all elements of
              the fsimage enumerated, suitable for further analysis by XML
              tools.
            * FileDistribution: This processor analyzes the file size
              distribution in the image.
              -maxSize specifies the range [0, maxSize] of file sizes to be analyzed (128GB by default).
              -step defines the granularity of the distribution. (2MB by default)
          
        • In the forrest docs, the new processor is referred to as a visitor, which is an implementation detail that shouldn't leak out into userland.
        Jakob Homan added a comment -

        Also, the -p line of the console help still doesn't list the FileDistribution processor...

        Konstantin Shvachko added a comment -

        Fixed documentation formatting, wording, and online help.

        Konstantin Shvachko made changes -
        Attachment OIV-FileDistr.patch [ 12412801 ]
        Jakob Homan added a comment -

        Looks great. +1.

        Jakob Homan made changes -
        Hadoop Flags [Reviewed]
        Konstantin Shvachko added a comment -

        I just committed this.

        Konstantin Shvachko made changes -
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        Hudson added a comment -

        Integrated in Hadoop-Hdfs-trunk #17 (see http://hudson.zones.apache.org/hudson/job/Hadoop-Hdfs-trunk/17/).
        Tool to analyze file size distribution in HDFS. Contributed by Konstantin Shvachko.

        Tom White made changes -
        Status Resolved [ 5 ] Closed [ 6 ]

          People

          • Assignee:
            Konstantin Shvachko
            Reporter:
            Konstantin Shvachko
          • Votes:
            0
            Watchers:
            6
