Hadoop Common / HADOOP-2816

Cluster summary at name node web has confusing report for space utilization

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.19.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Incompatible change, Reviewed
    • Release Note:
      Improved space reporting for NameNode Web UI. Applications that parse the Web UI output should be reviewed.

      Description

      In one example:
      Cluster Summary
      Capacity : 1.15 PB
      DFS Remaining : 192 TB
      DFS Used : 717 TB
      DFS Used% : 62 %

      Why is Capacity not equal to Used plus Remaining?

      (The answer is that there is an estimated reserve for local files.)

      The presentation should be easily understood by the user.

      1. HADOOP-2816.patch
        14 kB
        Suresh Srinivas
      2. HADOOP-2816.patch
        14 kB
        Suresh Srinivas
      3. HADOOP-2816.patch
        13 kB
        Suresh Srinivas

          Activity

          Robert Chansler added a comment -

          Consensus recommendation from M, K, A, H, S, R:

          Allocation and management algorithms will not change, but reporting on the Name Node home page should be modified:

          In the node table, four statistics will be reported:

          "Configured Capacity" is the sum, over all volumes V named in config:dfs.data.dir that exist, of (df.Size V - config:dfs.datanode.du.reserved)
          "Present Capacity" is df.Size V - MAX(df.Used V - [space used to store block and metadata files], config:dfs.datanode.du.reserved)
          "Used (%)" is the ratio of [space used to store block and metadata files] and Present Capacity (the little gauge reflects this value)
          "Remaining" is the difference Present Capacity - [space used to store block and metadata files]

          The "cluster summary" will report 5 statistics:

          "Configured Capacity" is the sum over all data nodes D of D.ConfiguredCapacity
          "Present Capacity" is the sum over all data nodes D of D.PresentCapacity
          "Used" is the sum over all data nodes D of D.[space used to store block and metadata files]
          "Remaining" is the difference Present Capacity - Used
          "Used %" is the ratio of Used to Present Capacity

          A key will explain these calculations for the user.
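          As an illustration only (not part of the patch), the node-table formulas above can be sketched in Java. The class and method names are hypothetical, and summing Present Capacity over volumes is an assumption, since the comment states that formula for a single volume V.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of the node-table statistics; names are illustrative, not Hadoop APIs. */
final class NodeSpaceSketch {

    /** One data volume: df-style size and used bytes, plus bytes held by block and metadata files. */
    static final class Volume {
        final long dfSize;
        final long dfUsed;
        final long dfsUsed;

        Volume(long dfSize, long dfUsed, long dfsUsed) {
            this.dfSize = dfSize;
            this.dfUsed = dfUsed;
            this.dfsUsed = dfsUsed;
        }
    }

    /** Sum over volumes of (df.Size - reserved). */
    static long configuredCapacity(List<Volume> volumes, long reserved) {
        long sum = 0;
        for (Volume v : volumes) {
            sum += v.dfSize - reserved;
        }
        return sum;
    }

    /** Sum over volumes of df.Size - MAX(non-DFS usage, reserved). Per-volume summing is an assumption. */
    static long presentCapacity(List<Volume> volumes, long reserved) {
        long sum = 0;
        for (Volume v : volumes) {
            long nonDfsUsed = v.dfUsed - v.dfsUsed;
            sum += v.dfSize - Math.max(nonDfsUsed, reserved);
        }
        return sum;
    }

    /** Space used to store block and metadata files, summed over volumes. */
    static long dfsUsed(List<Volume> volumes) {
        long sum = 0;
        for (Volume v : volumes) {
            sum += v.dfsUsed;
        }
        return sum;
    }

    /** Remaining = Present Capacity - DFS used. */
    static long remaining(List<Volume> volumes, long reserved) {
        return presentCapacity(volumes, reserved) - dfsUsed(volumes);
    }

    /** Used (%) = DFS used / Present Capacity; this sketch assumes Present Capacity > 0. */
    static double usedPercent(List<Volume> volumes, long reserved) {
        return dfsUsed(volumes) * 100.0 / presentCapacity(volumes, reserved);
    }

    public static void main(String[] args) {
        List<Volume> volumes = Arrays.asList(
            new Volume(1000, 300, 200),   // non-DFS usage (100) is below the reserve
            new Volume(1000, 600, 200));  // non-DFS usage (400) exceeds the reserve
        long reserved = 150;
        System.out.println("Configured Capacity: " + configuredCapacity(volumes, reserved));
        System.out.println("Present Capacity:    " + presentCapacity(volumes, reserved));
        System.out.println("Remaining:           " + remaining(volumes, reserved));
        System.out.println("Used (%):            " + usedPercent(volumes, reserved));
    }
}
```

          The second volume shows why Present Capacity can fall below Configured Capacity: once non-DFS files occupy more than the reserve, the excess is subtracted as well.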

          Suresh Srinivas added a comment -

          For reporting the following info needs to be considered:

          Total capacity - Capacity of all the data directories
          Reserved space - Space reserved for non DFS usage
          dfs.datanode.du.pct - When calculating DFS remaining space, only use this percentage of the real available space

          Here is how DFS remaining space is calculated:
          Available space is the minimum of (available space on the local file system) and (Total capacity - DFS used space - Reserved space)
          DFS remaining = (dfs.datanode.du.pct) * Available space

          The current proposal does not consider the factor dfs.datanode.du.pct. I am not sure why du.pct is being used. If it is meant to reduce the disk space available to DFS, to account for factors such as disk fragmentation, it is not serving that purpose. Available space keeps decreasing, and the percentage is applied to the shrinking available space. Eventually DFS ends up using all the available space anyway (in theory), and du.pct serves no purpose.

          My proposal:
          1) Remove du.pct configuration option

          or

          2) If du.pct is used, it is calculated on Total capacity and not on available space. This helps set aside a percentage of total capacity.
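          The argument above can be illustrated with a small simulation. This is a sketch under stated assumptions, not Hadoop code: the names are hypothetical, the percentage is an integer, the local file system's own free space is ignored, and option 2 is read as "DFS may use at most pct percent of total capacity".

```java
import java.util.function.LongUnaryOperator;

/** Hypothetical sketch contrasting two readings of dfs.datanode.du.pct; not Hadoop code. */
final class DuPctSketch {

    /** Behaviour described above: the percentage is applied to the space still available. */
    static long remainingOnAvailable(long capacity, long dfsUsed, long reserved, int pct) {
        long available = Math.max(0, capacity - dfsUsed - reserved);
        return available * pct / 100;
    }

    /** Option 2, as one possible reading: DFS may use at most pct percent of total capacity. */
    static long remainingOnTotal(long capacity, long dfsUsed, long reserved, int pct) {
        long available = Math.max(0, capacity - dfsUsed - reserved);
        long budgetLeft = capacity * pct / 100 - dfsUsed;
        return Math.max(0, Math.min(available, budgetLeft));
    }

    /** Repeatedly writes as much as is reported remaining until nothing more is offered. */
    static long fillUntilFull(LongUnaryOperator remainingFor) {
        long used = 0;
        for (long r = remainingFor.applyAsLong(used); r > 0; r = remainingFor.applyAsLong(used)) {
            used += r;
        }
        return used;
    }

    public static void main(String[] args) {
        long capacity = 1000;
        long reserved = 50;
        int pct = 90;
        // Percentage of the shrinking available space: usage creeps up to the reserve.
        System.out.println(fillUntilFull(u -> remainingOnAvailable(capacity, u, reserved, pct)));
        // Percentage of total capacity: a fixed share stays set aside.
        System.out.println(fillUntilFull(u -> remainingOnTotal(capacity, u, reserved, pct)));
    }
}
```

          With a capacity of 1000, a reserve of 50, and 90 percent, the first variant ends at 949 of the 950 usable units, so the percentage sets almost nothing aside. The second variant stops at 900.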

          Suresh Srinivas added a comment -

          After discussing this with Hairong, it looks like the issue of dfs.datanode.du.pct is unrelated to reporting data. The du.pct issue will be tracked in a separate JIRA.

          The data displayed is changed as follows:

          Cluster Summary
          Capacity : Currently, this is the sum of the file system capacity of all the data directories. This will be changed to exclude reserved space and will be calculated as (Sum of the file system capacity of all the data directories - Reserved space)

          Present Capacity: This is newly added and represents the present capacity available for DFS use. This is the sum of DFS Remaining and DFS Used given below

          DFS Remaining : This will remain as it is
          DFS Used : This will remain as it is
          DFS Used% : This will remain as it is
          Live Nodes : This will remain as it is
          Dead Nodes : This will remain as it is

          Node data currently prints:
          Node | Last Contact | Admin State | Size (TB) | Used (%) | Used (%) | Remaining (TB) | Blocks

          It will be changed to:
          Node | Last Contact | Admin State | Capacity (TB) | Present Capacity (TB) | Used (%) | Used (%) | Remaining (TB) | Blocks

          The Size column is renamed Capacity. Previously this was calculated as the sum of the file system capacity of all the data directories. It is changed to exclude reserved space and will be calculated as (sum of file system capacity of all the data directories - reserved space)

          A new column, Present Capacity, is added. It will be the sum of Used and Remaining.

          Suresh Srinivas added a comment -

          The attached file makes the proposed changes. One change from my previous comment: the used percentages are calculated based on the Present Capacity instead of Total Capacity.

          Hairong Kuang added a comment -

          1. FSDataSet.java: getCapacity() should make sure that it does not return a negative number.
          2. FSNamesystem.java: In getCapacityUsedPercent(), used space should be divided by the present capacity. FSNamesystem probably should not have this public method since it is only used in a test.
          3. DatanodeInfo.getDfsUsedPercent should check the case that the present capacity is zero. Again, I do not think this public method needs to be added to the class since it is only used in webUI and the test.
          4. In webUI, better to rename "Total Capacity" to be "Configured Capacity" to show that it is different from the old definition.
          5. Since the capacity field in the heartbeat has a new definition, should we bump up the DatanodeProtocol version?
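          Review points 1 to 3 amount to two small guards, sketched below. This is a hypothetical illustration and not the code in the patch: the method names are invented, and reporting 100 percent when present capacity is zero is an assumption of this sketch.

```java
/** Hypothetical sketch of the guards requested in the review; not the patch itself. */
final class CapacityGuards {

    /** Capacity after subtracting the reserve, clamped so it is never negative. */
    static long nonNegativeCapacity(long fileSystemCapacity, long reserved) {
        long capacity = fileSystemCapacity - reserved;
        return capacity > 0 ? capacity : 0;
    }

    /**
     * Used percentage relative to present capacity.
     * Assumption: a node with no present capacity is reported as 100 percent used.
     */
    static float usedPercent(long dfsUsed, long presentCapacity) {
        if (presentCapacity <= 0) {
            return 100.0f;
        }
        return dfsUsed * 100.0f / presentCapacity;
    }

    public static void main(String[] args) {
        System.out.println(nonNegativeCapacity(100, 150)); // reserve larger than the disk
        System.out.println(usedPercent(50, 200));
        System.out.println(usedPercent(0, 0));             // zero present capacity
    }
}
```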

          Hairong Kuang added a comment -

          Suresh, could you please also change the command line cluster report? This is extra work, but I think it is better to make the command line report and the web UI report consistent in the same release. Please take a look at DFSAdmin.report() and DatanodeInfo.getDatanodeReport(). Thanks.

          Suresh Srinivas added a comment -

          Thanks for the review. I have uploaded new patch with the changes.

          1. FSDataSet.java: getCapacity() should make sure that it does not return a negative number.
          > Done

          2. FSNamesystem.java: In getCapacityUsedPercent(), used space should be divided by the present capacity. FSNamesystem probably should not have this public method since it is only used in a test.
          > Thanks for the catch. The test case passed because used was a very small number, hence used/remaining ~= used/present capacity.
          >
          > I think it is a good idea to keep the method public. It ensures that the used percentage is calculated from present capacity, and users of the capacity information need not know how that is done. This helps keep the calculation of the percentage used consistent.

          3. DatanodeInfo.getDfsUsedPercent should check the case that the present capacity is zero. Again, I do not think this public method needs to be added to the class since it is only used in webUI and the test.
          > I think it is probably a good idea to keep it as a public method.

          4. In webUI, better to rename "Total Capacity" to be "Configured Capacity" to show that it is different from the old definition.
          > Changed

          5. Since the capacity field in the heartbeat has a new definition, should we bump up the DatanodeProtocol version?
          > Done. Updated the protocol version number

          Additionally a new JIRA will be created to keep track of reporting the capacity for DFSAdmin report and other CLIs that are impacted by this change. This change only addresses the Web UI.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12390442/HADOOP-2816.patch
          against trunk revision 696846.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 6 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3312/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3312/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3312/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3312/console

          This message is automatically generated.

          Suresh Srinivas added a comment -

          Fix for failed test case

          Hairong Kuang added a comment -

          +1 The patch looks good.

          Suresh Srinivas added a comment -

          New patch passed all the unit tests.

          Test results for the test-patch:
          [exec] +1 overall.

          [exec] +1 @author. The patch does not contain any @author tags.

          [exec] +1 tests included. The patch appears to include 6 new or modified tests.

          [exec] +1 javadoc. The javadoc tool did not generate any warning messages.

          [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.

          Hairong Kuang added a comment -

          I've committed this. Thanks, Suresh!

          Hudson added a comment -

          Integrated in Hadoop-trunk #611 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/611/)
          Hudson added a comment -

          Integrated in Hadoop-trunk #620 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/620/)
          HADOOP-4281. Change dfsadmin to report available disk space in a format
          consistent with the web interface as defined in . Contributed by
          Suresh Srinivas

          Suresh Srinivas added a comment -

          Changes are made as described in the proposed solution (in the previous comment).

          Here is the test-patch result:
          [exec] +1 overall.

          [exec] +1 @author. The patch does not contain any @author tags.

          [exec] +1 tests included. The patch appears to include 3 new or modified tests.

          [exec] +1 javadoc. The javadoc tool did not generate any warning messages.

          [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.

          Suresh Srinivas added a comment -

          Ignore the previous comment. It was intended for another issue.

          Robert Chansler added a comment -

          This fix changes the following:
          1) The Capacity information reported in the datanode heartbeat is changed. Previously, Capacity was the sum of the disk space of all data directories. With this change, it is the sum of the disk space of all data directories minus the reserved space configured using the dfs.datanode.du.reserved config param. This change is reflected by changing the protocol version from 17 to 18.

          2) The Namenode Web UI is changed accordingly as detailed below...

          Cluster Summary
          Capacity : Currently, this is the sum of the file system capacity of all the data directories. This is changed to the sum of the file system capacity of all the data directories minus Reserved space. The name is changed to "Configured Capacity".

          Present Capacity: This is newly added and represents the present capacity available for DFS use. This is the sum of DFS Remaining and DFS Used given below

          DFS Remaining : This will remain as it is
          DFS Used : This will remain as it is
          DFS Used% : This is changed. It is calculated based on Present Capacity and not Configured Capacity.
          Live Nodes : This will remain as it is
          Dead Nodes : This will remain as it is

          Node data currently prints:
          Node | Last Contact | Admin State | Size (TB) | Used (%) | Used (%) | Remaining (TB) | Blocks

          It will be changed to:
          Node | Last Contact | Admin State | Capacity (TB) | Present Capacity (TB) | Used (%) | Used (%) | Remaining (TB) | Blocks

          Size column is renamed as Total Capacity. Previously this was calculated as sum of file system capacity of all the data directories. It is changed to exclude reserved space and will be calculated as (sum of file system capacity of all the data directories - reserved space)
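          The cluster summary described in this comment can be sketched as a simple aggregation over per-node reports. The sketch below is illustrative only: the class, field, and method names are hypothetical, and it assumes Present Capacity is greater than zero.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of the cluster summary aggregation; names are not Hadoop APIs. */
final class ClusterSummarySketch {

    /** What one data node reports: configured capacity (reserve already excluded), DFS used, DFS remaining. */
    static final class NodeReport {
        final long configuredCapacity;
        final long dfsUsed;
        final long remaining;

        NodeReport(long configuredCapacity, long dfsUsed, long remaining) {
            this.configuredCapacity = configuredCapacity;
            this.dfsUsed = dfsUsed;
            this.remaining = remaining;
        }
    }

    static long configuredCapacity(List<NodeReport> nodes) {
        long sum = 0;
        for (NodeReport n : nodes) {
            sum += n.configuredCapacity;
        }
        return sum;
    }

    static long dfsUsed(List<NodeReport> nodes) {
        long sum = 0;
        for (NodeReport n : nodes) {
            sum += n.dfsUsed;
        }
        return sum;
    }

    static long remaining(List<NodeReport> nodes) {
        long sum = 0;
        for (NodeReport n : nodes) {
            sum += n.remaining;
        }
        return sum;
    }

    /** Present Capacity is the sum of DFS Used and DFS Remaining. */
    static long presentCapacity(List<NodeReport> nodes) {
        return dfsUsed(nodes) + remaining(nodes);
    }

    /** DFS Used% after the change: relative to Present Capacity (assumed > 0 here). */
    static double usedPercent(List<NodeReport> nodes) {
        return dfsUsed(nodes) * 100.0 / presentCapacity(nodes);
    }

    /** For contrast only: the same usage relative to Configured Capacity. */
    static double usedPercentOfConfigured(List<NodeReport> nodes) {
        return dfsUsed(nodes) * 100.0 / configuredCapacity(nodes);
    }

    public static void main(String[] args) {
        List<NodeReport> nodes = Arrays.asList(
            new NodeReport(1000, 400, 400),
            new NodeReport(1000, 200, 600));
        System.out.println("Configured Capacity: " + configuredCapacity(nodes));
        System.out.println("Present Capacity:    " + presentCapacity(nodes));
        System.out.println("DFS Used%:           " + usedPercent(nodes));
    }
}
```

          Because Present Capacity is defined as Used plus Remaining, the summary figures add up by construction, which addresses the question raised in the description.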


            People

            • Assignee:
              Suresh Srinivas
              Reporter:
              Robert Chansler
            • Votes:
              0
              Watchers:
              1
