
Key: HADOOP-2816
Type: Improvement
Status: Closed
Resolution: Fixed
Priority: Major
Assignee: Suresh Srinivas
Reporter: Robert Chansler
Votes: 0
Watchers: 1
Hadoop Common

Cluster summary at name node web has confusing report for space utilization

Created: 14/Feb/08 02:32 AM   Updated: 08/Jul/09 04:42 PM
Component/s: None
Affects Version/s: None
Fix Version/s: 0.19.0

Time Tracking:
Not Specified

File Attachments:
HADOOP-2816.patch (14 kB, licensed for inclusion in ASF works), 2008-09-19 04:00 AM, Suresh Srinivas
HADOOP-2816.patch (14 kB, licensed for inclusion in ASF works), 2008-09-19 12:54 AM, Suresh Srinivas
HADOOP-2816.patch (13 kB, licensed for inclusion in ASF works), 2008-09-18 02:45 AM, Suresh Srinivas

Hadoop Flags: Reviewed, Incompatible change
Release Note: Improved space reporting for NameNode Web UI. Applications that parse the Web UI output should be reviewed.
Resolution Date: 19/Sep/08 10:43 PM


Description
In one example:
Cluster Summary
Capacity : 1.15 PB
DFS Remaining : 192 TB
DFS Used : 717 TB
DFS Used% : 62 %

Why is Capacity not equal to Used plus Remaining?

(The answer is that there is an estimated reserve for local files.)

The presentation should be easily understood by the user.



Robert Chansler added a comment - 05/Aug/08 11:28 PM
Consensus recommendation from M, K, A, H, S, R:

Allocation and management algorithms will not change, but reporting on the Name Node home page should be modified:

In the node table, four statistics will be reported:

"Configured Capacity" is the sum, over all volumes V named in config:dfs.data.dir that exist, of (df.Size V - config:dfs.datanode.du.reserved)
"Present Capacity" is df.Size V - MAX(df.Used V - [space used to store block and metadata files], config:dfs.datanode.du.reserved)
"Used (%)" is the ratio of [space used to store block and metadata files] to Present Capacity (the little gauge reflects this value)
"Remaining" is the difference Present Capacity - [space used to store block and metadata files]
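
As an illustration only, the four node statistics can be sketched in Java as below. The class and field names (VolumeStats, NodeReport) are hypothetical and are not taken from the Hadoop source. The sketch assumes that Present Capacity is summed over volumes in the same way as Configured Capacity, and that a node with zero Present Capacity reports 0% used; neither point is stated in the recommendation.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical per-volume figures; these names are illustrative, not Hadoop APIs. */
final class VolumeStats {
    final long size;     // df "Size" of the volume, in bytes
    final long used;     // df "Used" of the volume (DFS plus non-DFS files), in bytes
    final long dfsUsed;  // space used to store block and metadata files, in bytes

    VolumeStats(long size, long used, long dfsUsed) {
        this.size = size;
        this.used = used;
        this.dfsUsed = dfsUsed;
    }
}

/** The four per-node statistics from the recommendation, under the stated assumptions. */
final class NodeReport {
    final long configuredCapacity;
    final long presentCapacity;
    final long remaining;
    final double usedPercent;

    NodeReport(List<VolumeStats> volumes, long reserved) {
        long configured = 0;
        long present = 0;
        long dfsUsed = 0;
        for (VolumeStats v : volumes) {
            // Configured Capacity: df.Size V - dfs.datanode.du.reserved
            configured += v.size - reserved;
            // Present Capacity: df.Size V - MAX(non-DFS usage, reserved)
            long nonDfsUsed = v.used - v.dfsUsed;
            present += v.size - Math.max(nonDfsUsed, reserved);
            dfsUsed += v.dfsUsed;
        }
        this.configuredCapacity = configured;
        this.presentCapacity = present;
        this.remaining = present - dfsUsed;
        // The zero-capacity guard is an assumption of this sketch.
        this.usedPercent = present == 0 ? 0.0 : 100.0 * dfsUsed / present;
    }

    public static void main(String[] args) {
        NodeReport r = new NodeReport(Arrays.asList(
                new VolumeStats(1000, 400, 300),
                new VolumeStats(1000, 200, 190)), 50);
        System.out.println(r.configuredCapacity + " " + r.presentCapacity + " "
                + r.remaining + " " + r.usedPercent);
    }
}
```

With these definitions, Used plus Remaining always equals Present Capacity, which is the property the original report was missing.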

The "cluster summary" will report 5 statistics:

"Configured Capacity" is the sum over all data nodes D of D.ConfiguredCapacity
"Present Capacity" is the sum over all data nodes D of D.PresentCapacity
"Used" is the sum over all data nodes D of D.[space used to store block and metadata files]
"Remaining" is the difference Present Capacity - Used
"Used %" is the ratio of Used to Present Capacity

A key will explain these calculations for the user.
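
A minimal sketch of the cluster summary aggregation, again in Java. DataNodeTotals and ClusterSummary are hypothetical names used only for this illustration, and the 0% result for an empty cluster is an assumption.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical per-node totals; not a Hadoop class. */
final class DataNodeTotals {
    final long configuredCapacity;
    final long presentCapacity;
    final long dfsUsed;

    DataNodeTotals(long configuredCapacity, long presentCapacity, long dfsUsed) {
        this.configuredCapacity = configuredCapacity;
        this.presentCapacity = presentCapacity;
        this.dfsUsed = dfsUsed;
    }
}

/** The five cluster summary statistics described above. */
final class ClusterSummary {
    final long configuredCapacity;
    final long presentCapacity;
    final long used;
    final long remaining;
    final double usedPercent;

    ClusterSummary(List<DataNodeTotals> nodes) {
        long configured = 0;
        long present = 0;
        long dfsUsed = 0;
        for (DataNodeTotals d : nodes) {
            configured += d.configuredCapacity;
            present += d.presentCapacity;
            dfsUsed += d.dfsUsed;
        }
        this.configuredCapacity = configured;
        this.presentCapacity = present;
        this.used = dfsUsed;
        this.remaining = present - dfsUsed;          // Present Capacity - Used
        // Ratio of Used to Present Capacity; the zero guard is an assumption.
        this.usedPercent = present == 0 ? 0.0 : 100.0 * dfsUsed / present;
    }

    public static void main(String[] args) {
        ClusterSummary s = new ClusterSummary(Arrays.asList(
                new DataNodeTotals(1900, 1850, 490),
                new DataNodeTotals(2000, 1500, 1000)));
        System.out.println(s.configuredCapacity + " " + s.presentCapacity + " "
                + s.used + " " + s.remaining + " " + s.usedPercent);
    }
}
```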


Suresh Srinivas added a comment - 03/Sep/08 09:47 PM
For reporting, the following information needs to be considered:

Total capacity - Capacity of all the data directories
Reserved space - Space reserved for non DFS usage
dfs.datanode.du.pct - When calculating DFS remaining space, only use this percentage of the real available space

Here is how DFS remaining space is calculated:
Available space = minimum of (Available space on local file system) and (Total capacity - DFS used space - Reserved space)
DFS remaining = (dfs.datanode.du.pct) * Available space
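
The two formulas above can be written out as a small Java sketch. The class and method names are hypothetical, and truncating the result to a whole number of bytes is an assumption; the thread does not say how rounding or negative values are handled.

```java
/** Illustrative only; mirrors the two formulas in this comment, not the Hadoop source. */
final class DfsRemainingSketch {
    /** Minimum of local file system free space and (total - DFS used - reserved). */
    static long availableSpace(long localFsAvailable, long totalCapacity,
                               long dfsUsed, long reserved) {
        return Math.min(localFsAvailable, totalCapacity - dfsUsed - reserved);
    }

    /** DFS remaining = du.pct * available space, truncated to whole bytes (assumption). */
    static long dfsRemaining(double duPct, long availableSpace) {
        return (long) (duPct * availableSpace);
    }

    public static void main(String[] args) {
        long available = availableSpace(500, 1000, 300, 100);
        System.out.println(available + " " + dfsRemaining(0.75, available));
    }
}
```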

The current proposal does not consider dfs.datanode.du.pct. I am not sure why du.pct is used. If the intent is to hold back some disk space from DFS, to allow for factors such as disk fragmentation, it does not serve that purpose. Available space keeps decreasing, and the percentage is applied to the shrinking available space. Eventually DFS ends up using all the available space anyway (in theory), so du.pct serves no purpose.
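
To illustrate the argument, here is a toy Java simulation (hypothetical names, reserved space ignored, local file system free space assumed not to be the limiting factor). In each round DFS fills exactly the space reported as DFS remaining, and du.pct is then applied again to whatever is left.

```java
/** Toy model of du.pct being applied to a shrinking available space. */
final class DuPctSimulation {
    /** Free space left after DFS has filled its reported remaining space `rounds` times. */
    static double freeAfterRounds(double totalCapacity, double duPct, int rounds) {
        double used = 0;
        for (int i = 0; i < rounds; i++) {
            double available = totalCapacity - used;   // shrinks every round
            double dfsRemaining = duPct * available;   // pct applied to what is left
            used += dfsRemaining;                      // DFS fills all of it
        }
        return totalCapacity - used;
    }

    public static void main(String[] args) {
        for (int rounds = 0; rounds <= 10; rounds += 2) {
            System.out.println(rounds + " rounds: free = "
                    + freeAfterRounds(1024, 0.5, rounds));
        }
    }
}
```

In this model the free space after n rounds is totalCapacity * (1 - duPct)^n, which tends to zero, so no fixed share of the disk is ever set aside.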

My proposal:
1) Remove du.pct configuration option

or

2) If du.pct is kept, apply it to Total capacity rather than to available space. This sets aside a fixed percentage of total capacity.


Suresh Srinivas added a comment - 15/Sep/08 05:38 PM
After discussing this with Hairong, it looks like the dfs.datanode.du.pct issue is unrelated to the reporting data. The du.pct issue will be tracked in a separate JIRA.

The data displayed is changed as follows:

Cluster Summary
Capacity : Currently, this is the sum of the file system capacity of all the data directories. This will be changed to exclude reserved space and will be calculated as (Sum of the file system capacity of all the data directories - Reserved space)

Present Capacity: This is newly added and represents the present capacity available for DFS use. It is the sum of DFS Remaining and DFS Used, given below.

DFS Remaining : This will remain as it is
DFS Used : This will remain as it is
DFS Used% : This will remain as it is
Live Nodes : This will remain as it is
Dead Nodes : This will remain as it is

The node table currently prints:
Node Last Contact Admin State Size (TB) Used (%) Used (%) Remaining (TB) Blocks

It will be changed to:
Node Last Contact Admin State Capacity (TB) Present Capacity (TB) Used (%) Used (%) Remaining (TB) Blocks

The Size column is renamed to Capacity. Previously this was calculated as the sum of the file system capacity of all the data directories. It is changed to exclude reserved space and will be calculated as (sum of file system capacity of all the data directories - reserved space)

A new column, Present Capacity, is added. This will be the sum of Used and Remaining.


Suresh Srinivas added a comment - 18/Sep/08 02:45 AM
The attached file makes the proposed changes. One change from my previous comment: the used percentages are calculated from the Present Capacity instead of the Total Capacity.

Hairong Kuang added a comment - 18/Sep/08 10:26 PM
1. FSDataSet.java: getCapacity() should make sure that it does not return a negative number.
2. FSNamesystem.java: In getCapacityUsedPercent(), used space should be divided by the present capacity. FSNamesystem probably should not have this public method since it is only used in a test.
3. DatanodeInfo.getDfsUsedPercent should check the case that the present capacity is zero. Again, I do not think this public method needs to be added to the class since it is only used in webUI and the test.
4. In webUI, better to rename "Total Capacity" to be "Configured Capacity" to show that it is different from the old definition.
5. Since the capacity field in the heartbeat has a new definition, should we bump up the DatanodeProtocol version?
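
Review points 1 and 3 amount to two small guards. The following Java sketch shows one possible shape; it is not the code from the patch, the names are hypothetical, and returning 100% when the present capacity is zero is an assumption made here for illustration.

```java
/** Illustrative guards for review points 1 and 3; not the actual patch code. */
final class CapacityGuards {
    /** Capacity minus reserved space, never negative (point 1). */
    static long configuredCapacity(long fileSystemCapacity, long reserved) {
        long capacity = fileSystemCapacity - reserved;
        return capacity > 0 ? capacity : 0;
    }

    /** Used percentage based on present capacity, safe when it is zero (point 3). */
    static float dfsUsedPercent(long dfsUsed, long presentCapacity) {
        if (presentCapacity <= 0) {
            return 100.0f;  // assumption: a node with no usable capacity is shown as full
        }
        return dfsUsed * 100.0f / presentCapacity;
    }

    public static void main(String[] args) {
        System.out.println(configuredCapacity(100, 150));  // reserved exceeds capacity
        System.out.println(dfsUsedPercent(50, 200));
        System.out.println(dfsUsedPercent(10, 0));
    }
}
```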

Hairong Kuang added a comment - 18/Sep/08 11:23 PM
Suresh, could you please also change the command line cluster report? This is extra work, but I think it is better to make the command line report and the web UI report consistent within one release. Please take a look at DFSAdmin.report() and DatanodeInfo.getDatanodeReport(). Thanks.

Suresh Srinivas added a comment - 19/Sep/08 12:54 AM
Thanks for the review. I have uploaded a new patch with the changes.

1. FSDataSet.java: getCapacity() should make sure that it does not return a negative number.
> Done

2. FSNamesystem.java: In getCapacityUsedPercent(), used space should be divided by the present capacity. FSNamesystem probably should not have this public method since it is only used in a test.
> Thanks for the catch. The testcase passed because used was a very small number. Hence the used/remaining ~= used/present capacity
>
> I think it is a good idea to keep the method public. This ensures that the used percentage is always calculated from the present capacity, and users of the capacity information do not need to know how it is done. It keeps the percentage-used figure consistent wherever it is reported.

3. DatanodeInfo.getDfsUsedPercent should check the case that the present capacity is zero. Again, I do not think this public method needs to be added to the class since it is only used in webUI and the test.
> I think it is probably a good idea to keep it as a public method

4. In webUI, better to rename "Total Capacity" to be "Configured Capacity" to show that it is different from the old definition.
> Changed

5. Since the capacity field in the heartbeat has a new definition, should we bump up the DatanodeProtocol version?
> Done. Updated the protocol version number
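
The reply to point 2 notes that the test passed because used was very small. A short Java sketch with made-up numbers (the helper name is hypothetical) shows why the choice of denominator is invisible in that case and obvious once used is large; present capacity is taken as used plus remaining.

```java
/** Compares used/remaining with used/present capacity for small and large usage. */
final class PercentDenominators {
    static double percent(long used, long denominator) {
        return 100.0 * used / denominator;
    }

    public static void main(String[] args) {
        long remaining = 1_000_000L;
        for (long used : new long[] {10L, 500_000L}) {
            long present = used + remaining;  // present capacity = used + remaining
            System.out.printf("used=%d  used/remaining=%.4f%%  used/present=%.4f%%%n",
                    used, percent(used, remaining), percent(used, present));
        }
    }
}
```

A test that only writes a few small files therefore cannot distinguish the two formulas; it needs usage that is a noticeable fraction of capacity.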

Additionally a new JIRA will be created to keep track of reporting the capacity for DFSAdmin report and other CLIs that are impacted by this change. This change only addresses the Web UI.


Hadoop QA added a comment - 19/Sep/08 03:20 AM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12390442/HADOOP-2816.patch
against trunk revision 696846.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 6 new or modified tests.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs warnings.

-1 core tests. The patch failed core unit tests.

-1 contrib tests. The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3312/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3312/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3312/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3312/console

This message is automatically generated.


Suresh Srinivas added a comment - 19/Sep/08 04:00 AM
Fix for failed test case

Hairong Kuang added a comment - 19/Sep/08 05:28 PM
+1 The patch looks good.

Suresh Srinivas added a comment - 19/Sep/08 06:54 PM
New patch passed all the unit tests.

Test results for the test-patch:
[exec] +1 overall.

[exec] +1 @author. The patch does not contain any @author tags.

[exec] +1 tests included. The patch appears to include 6 new or modified tests.

[exec] +1 javadoc. The javadoc tool did not generate any warning messages.

[exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.

[exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.


Hairong Kuang added a comment - 19/Sep/08 10:43 PM
I've committed this. Thanks, Suresh!

Hudson added a comment - 22/Sep/08 03:18 PM

Hudson added a comment - 01/Oct/08 01:29 PM
Integrated in Hadoop-trunk #620 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/620/)
HADOOP-4281. Change dfsadmin to report available disk space in a format
consistent with the web interface as defined in . Contributed by
Suresh Srinivas

Suresh Srinivas added a comment - 17/Oct/08 06:42 PM
Changes are made as described in the proposed solution (in the previous comment).

Here is the test-patch result:
[exec] +1 overall.

[exec] +1 @author. The patch does not contain any @author tags.

[exec] +1 tests included. The patch appears to include 3 new or modified tests.

[exec] +1 javadoc. The javadoc tool did not generate any warning messages.

[exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.

[exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.


Suresh Srinivas added a comment - 17/Oct/08 06:56 PM
Ignore the previous comment. It was intended for another issue.

Robert Chansler added a comment - 21/Oct/08 10:47 PM
This fix changes the following:
1) The Capacity reported in the datanode heartbeat is changed. Previously, Capacity was the sum of the disk space of all data directories. With this change, it is that sum minus the reserved space configured with the dfs.datanode.du.reserved config param. The change is reflected by changing the protocol version from 17 to 18.

2) The Namenode Web UI is changed accordingly as detailed below...

Cluster Summary
Capacity : Previously, this was the sum of the file system capacity of all the data directories. It is changed to the sum of the file system capacity of all the data directories minus Reserved space. The name is changed to "Configured Capacity".

Present Capacity: This is newly added and represents the present capacity available for DFS use. It is the sum of DFS Remaining and DFS Used, given below.

DFS Remaining : This will remain as it is
DFS Used : This will remain as it is
DFS Used% : This is changed. It is calculated based on Present Capacity and not Configured Capacity.
Live Nodes : This will remain as it is
Dead Nodes : This will remain as it is

The node table previously printed:
Node Last Contact Admin State Size (TB) Used (%) Used (%) Remaining (TB) Blocks

It is changed to:
Node Last Contact Admin State Capacity (TB) Present Capacity (TB) Used (%) Used (%) Remaining (TB) Blocks

The Size column is renamed to Total Capacity. Previously this was calculated as the sum of the file system capacity of all the data directories. It is changed to exclude reserved space and is calculated as (sum of file system capacity of all the data directories - reserved space)