Issue Details (XML | Word | Printable)

Key: HADOOP-1687
Type: Improvement Improvement
Status: Closed Closed
Resolution: Fixed
Priority: Major Major
Assignee: Konstantin Shvachko
Reporter: Konstantin Shvachko
Votes: 0
Watchers: 1
Operations

If you were logged in you would be able to see more operations.
Hadoop Common

Name-node memory size estimates and optimization proposal.

Created: 06/Aug/07 09:42 PM   Updated: 08/Jul/09 04:42 PM
Return to search
Component/s: None
Affects Version/s: 0.13.1
Fix Version/s: 0.15.0

Time Tracking:
Not Specified

File Attachments:
  Size
Text File Licensed for inclusion in ASF works DNBlockList.patch 2007-08-31 11:11 PM Konstantin Shvachko 28 kB
Issue Links:
Incorporates
 

Resolution Date: 07/Sep/07 08:55 PM


 Description  « Hide
I've done some estimates on how much space our data structures take on the name-node per block, file and directory.

Brief overview of the data structures:
Directory tree (FSDirectory) is built of inodes. Each INode points either to an array of blocks
if it corresponds to a file or to a TreeMap<String, INode> of children INodes if it is a directory.
[Note: this estimates were made before Dhruba replaced the children TreeMap by ArrayList.]
Each block participates also in at least 2 more data structures.
BlocksMap contains a HashMap<Block, BlockInfo> of all blocks mapping a Block into a BlockInfo.
DatanodeDescriptor contains a TreeMap<Block, Block> of all blocks belonging to this data-node.
A block may or may not be contained also in other data-structures, like

UnderReplicatedBlocks
PendingReplicationBlocks
recentInvalidateSets
excessReplicateMap

Presence of a block in any of these structures is temporary and therefore I do not count them in my estimates.
The estimates can be viewed as lower bounds.

These are some classes that we are looking at here

class INode {
   String name;
   INode parent;
   TreeMap<String, INode> children;
   Block blocks[];
   short blockReplication;
   long modificationTime;
}

class Block {
   long blkid;
   long len;
}

class BlockInfo {
   FSDirectory.INode       inode;
   DatanodeDescriptor[]   nodes;
   Block                          block;
}

The calculations are made for a 64-bit java vm based on that
Reference size = 8 bytes
Object header size = 16 bytes
Array header size = 24 bytes

Commonly used objects:
TreeMap.Entry = 64 bytes. It has 5 reference fields
HashMap.Entry = 48 bytes. It has 3 reference fields
String header = 64 bytes.

The size of a file includes:

  1. Size of an empty file INode: INode.children = null, INode.blocks is a 0-length array, and file name is empty. (152 bytes)
  2. A directory entry of the parent INode that points to this file, which is a TreeMap.Entry. (64 bytes)
  3. file name length times 2, because String represents each name character by 2 bytes.
  4. Reference to the outer FSDirectory class (8 bytes)

The total: 224 + 2 * fileName.length

The size of a directory includes:

  1. Size of an empty directory INode: INode.children is an empty TreeMap, INode.blocks = null, and file name is empty. (192 bytes)
  2. A directory entry of the parent INode that points to this file, which is a TreeMap.Entry. (64 bytes)
  3. file name length times 2
  4. Reference to the outer FSDirectory class (8 bytes)

The total: 264 + 2 * fileName.length

The size of a block includes:

  1. Size of Block. (32 bytes)
  2. Size of BlockInfo. (64 + 8*replication bytes)
  3. Reference to the block from INode.blocks (8 bytes)
  4. HashMap.Entry referencing the block from BlocksMap. (48 bytes)
  5. References to the block from all DatanodeDescriptors it belongs to.
    This is a TreeMap.Entry size times block replication. (64 * replication)

The total: 152 + 72 * replication

Typical object sizes:
Taking into account that typical file name is 10-15 bytes and our default replication is 3 we can say that typical sizes are
File size: 250
Directory size: 290
Block size: 368

Object size estimate (bytes) typical size (bytes)
File 224 + 2 * fileName.length 250
Directory 264 + 2 * fileName.length 290
Block 152 + 72 * replication 368

One of our clusters has

  1. Files: 10 600 000
  2. Dirs: 310 000
  3. Blocks: 13 300 000

Total Size (estimate): 7,63 GB
Memory used on the name-node (actual reported by jconsole after gc): 9 GB

This means that other data structures like NetworkTopology, heartbeats, datanodeMap, Host2NodesMap,
leases, sortedLeases, and multiple block replication maps occupy ~1.4 GB, which seems to be pretty high
and need to be investigated as well.

Based on the above estimates blocks should be the main focus of the name-node memory reduction effort.
Space used by a block is 50% larger compared to a file, and there is more blocks than files or directories.

Some ideas on how we can reduce the name-node size without substantially changing the data structures.

  1. INode.children should be an ArrayList instead of a TreeMap. Already done HADOOP-1565. (-48 bytes)
  2. Factor out the INode class into a separate class (-8 bytes)
  3. Create base INode class and derive file inode and directory inode classes from the base.
    Directory inodes do not need to contain blocks and replication fields (-16 bytes)
    File inodes do not need to contain children field (-8 bytes)
  4. String name should be replaced by a mere byte[]. (-(40 + fileName.length) ~ -50 bytes)
  5. Eliminate the Block object.
    We should move Block fields into into BlockInfo and completely get rid of the Block object. (-16 bytes)
  6. Block object is referenced at least 5 times in our structures for each physical block.
    The number of references should be reduced to just 2. (-24)
  7. Remove name field from INode. File or directory name is stored in the corresponding directory
    entry and does need to be duplicated in the INode (-8 bytes)
  8. Eliminate INode.parent field. INodes are accessed through the directory tree, and the parent can
    be remembered in a local variable while browsing the tree. There is no need to persistently store
    the parent reference for each object. (-8 bytes)
  9. Need to optimize data-node to block map. Currently each DatanodeDescriptor holds a TreeMap of
    blocks contained in the node, and we have an overhead of one TreeMap.Entry per block replica.
    I expect we can reorganize datanodeMap in a way that it stores only 1 or 2 references per replica
    instead of an entire TreeMap.Entry. (-48 * replication)

Note: In general TreeMaps turned out to be very expensive, we should avoid using them if possible.
Or implement a custom map structure, which would avoid using objects for each map entry.

This is what we will have after all the optimizations

Object size estimate (bytes) typical size (bytes) current typical size (bytes)
File 112 + fileName.length 125 250
Directory 144 + fileName.length 155 290
Block 112 + 24 * replication 184 368


 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Konstantin Shvachko made changes - 06/Aug/07 09:48 PM
Field Original Value New Value
Description I've done some estimates on how much space our data structures take on the name-node per block, file and directory.

Brief overview of the data structures:
Directory tree (FSDirectory) is built of inodes. Each INode points either to an array of blocks
if it corresponds to a file or to a TreeMap<String, INode> of children INodes if it is a directory.
[Note: this estimates were made before Dhruba replaced the children TreeMap by ArrayList.]
Each block participates also in at least 2 more data structures.
BlocksMap contains a HashMap<Block, BlockInfo> of all blocks mapping a Block into a BlockInfo.
DatanodeDescriptor contains a TreeMap<Block, Block> of all blocks belonging to this data-node.
A block may or may not be contained also in other data-structures, like
{code}
UnderReplicatedBlocks
PendingReplicationBlocks
recentInvalidateSets
excessReplicateMap
{code}
Presence of a block in any of these structures is temporary and therefore I do not count them in my estimates.
The estimates can be viewed as lower bounds.

These are some classes that we are looking at here
{code}
class INode {
   String name;
   INode parent;
   TreeMap<String, INode> children;
   Block blocks[];
   short blockReplication;
   long modificationTime;
}

class Block {
   long blkid;
   long len;
}

class BlockInfo {
   FSDirectory.INode inode;
   DatanodeDescriptor[] nodes;
   Block block;
}
{code}

The calculations are made for a 64-bit java vm based on that
Reference size = 8 bytes
Object header size = 16 bytes
Array header size = 24 bytes

Commonly used objects:
TreeMap.Entry = 64 bytes. It has 5 reference fields
HashMap.Entry = 48 bytes. It has 3 reference fields
String header = 64 bytes.

The size of a file includes:
# Size of an empty file INode: INode.children = null, INode.blocks is a 0-length array, and file name is empty. (152 bytes)
# A directory entry of the parent INode that points to this file, which is a TreeMap.Entry. (64 bytes)
# file name length times 2, because String represents each name character by 2 bytes.
# Reference to the outer FSDirectory class (8 bytes)
The total: 224 + 2 * fileName.length

The size of a directory includes:
# Size of an empty directory INode: INode.children is an empty TreeMap, INode.blocks = null, and file name is empty. (192 bytes)
# A directory entry of the parent INode that points to this file, which is a TreeMap.Entry. (64 bytes)
# file name length times 2
# Reference to the outer FSDirectory class (8 bytes)
The total: 264 + 2 * fileName.length

The size of a block includes:
# Size of Block. (32 bytes)
# Size of BlockInfo. (64 + 8*replication bytes)
# Reference to the block from INode.blocks (8 bytes)
# HashMap.Entry referencing the block from BlocksMap. (48 bytes)
# References to the block from all DatanodeDescriptors it belongs to.
This is a TreeMap.Entry size times block replication. (64 * replication)
The total: 152 + 72 * replication

Typical object sizes:
Taking into account that typical file name is 10-15 bytes and our default replication is 3 we can say that typical sizes are
File size: 250
Directory size: 290
Block size: 368

||Object||size estimate (bytes)||typical size (bytes)||
|File|224 + 2 * fileName.length|250|
|Directory|264 + 2 * fileName.length|290|
|Block|152 + 72 * replication|368|

One of our clusters has
# Files: 10 600 000
# Dirs: 310 000
# Blocks: 13 300 000
Total Size (estimate): 7,63 GB
Memory used on the name-node (actual reported by jconsole after gc): 9 GB

This means that other data structures like NetworkTopology, heartbeats, datanodeMap, Host2NodesMap,
leases, sortedLeases, and multiple block replication maps occupy ~1.4 GB, which seems to be pretty high
and need to be investigated as well.

Based on the above estimates blocks should be the main focus of the name-node memory reduction effort.
Space used by a block is 50% larger compared to a file, and there is more blocks than files or directories.

Some ideas on how we can reduce the name-node size without substantially changing the data structures.
# INode.children should be an ArrayList instead of a TreeMap. Already done HADOOP-1565. (-48 bytes)
# Factor out the INode class into a separate class (-8 bytes)
# Create base INode class and derive file inode and directory inode classes from the base.
Directory inodes do not need to contain blocks and replication fields (-16 bytes)
File inodes do not need to contain children field (-8 bytes)
# String name should be replaced by a mere byte[]. (-(40 + fileName.length) ~ -50 bytes)
# Eliminate the Block object.
We should move Block fields into into BlockInfo and completely get rid of the Block object. (-16 bytes)
# Block object is referenced at least 5 times in our structures for each physical block.
The number of references should be reduced to just 2. (-24)
# Remove name field from INode. File or directory name is stored in the corresponding directory
entry and does need to be duplicated in the INode (-8 bytes)
# Eliminate INode.parent field. INodes are accessed through the directory tree, and the parent can
be remembered in a local variable while browsing the tree. There is no need to persistently store
the parent reference for each object. (-8 bytes)
# Need to optimize data-node to block map. Currently each DatanodeDescriptor holds a TreeMap of
blocks contained in the node, and we have an overhead of one TreeMap.Entry per block replica.
I expect we can reorganize datanodeMap in a way that it stores only 1 or 2 references per replica
instead of an entire TreeMap.Entry. (-48 * replication)

Note: In general TreeMaps turned out to be very expensive, we should avoid using them if possible.
Or implement a custom map structure, which would avoid using objects for each map entry.


This is what we will have after all the optimizations
||Object||size estimate (bytes)||typical size (bytes)||current typical size (bytes)||
|File|112 + fileName.length|125|250|
|Directory|144 + fileName.length|155|290|
|Block|112 + 24 * replication|184|368|
I've done some estimates on how much space our data structures take on the name-node per block, file and directory.

Brief overview of the data structures:
Directory tree (FSDirectory) is built of inodes. Each INode points either to an array of blocks
if it corresponds to a file or to a TreeMap<String, INode> of children INodes if it is a directory.
[Note: this estimates were made before Dhruba replaced the children TreeMap by ArrayList.]
Each block participates also in at least 2 more data structures.
BlocksMap contains a HashMap<Block, BlockInfo> of all blocks mapping a Block into a BlockInfo.
DatanodeDescriptor contains a TreeMap<Block, Block> of all blocks belonging to this data-node.
A block may or may not be contained also in other data-structures, like
{code}
UnderReplicatedBlocks
PendingReplicationBlocks
recentInvalidateSets
excessReplicateMap
{code}
Presence of a block in any of these structures is temporary and therefore I do not count them in my estimates.
The estimates can be viewed as lower bounds.

These are some classes that we are looking at here
{code}
class INode {
   String name;
   INode parent;
   TreeMap<String, INode> children;
   Block blocks[];
   short blockReplication;
   long modificationTime;
}

class Block {
   long blkid;
   long len;
}

class BlockInfo {
   FSDirectory.INode inode;
   DatanodeDescriptor[] nodes;
   Block block;
}
{code}

The calculations are made for a 64-bit java vm based on that
Reference size = 8 bytes
Object header size = 16 bytes
Array header size = 24 bytes

Commonly used objects:
TreeMap.Entry = 64 bytes. It has 5 reference fields
HashMap.Entry = 48 bytes. It has 3 reference fields
String header = 64 bytes.

The size of a file includes:
# Size of an empty file INode: INode.children = null, INode.blocks is a 0-length array, and file name is empty. (152 bytes)
# A directory entry of the parent INode that points to this file, which is a TreeMap.Entry. (64 bytes)
# file name length times 2, because String represents each name character by 2 bytes.
# Reference to the outer FSDirectory class (8 bytes)

The total: 224 + 2 * fileName.length

The size of a directory includes:
# Size of an empty directory INode: INode.children is an empty TreeMap, INode.blocks = null, and file name is empty. (192 bytes)
# A directory entry of the parent INode that points to this file, which is a TreeMap.Entry. (64 bytes)
# file name length times 2
# Reference to the outer FSDirectory class (8 bytes)

The total: 264 + 2 * fileName.length

The size of a block includes:
# Size of Block. (32 bytes)
# Size of BlockInfo. (64 + 8*replication bytes)
# Reference to the block from INode.blocks (8 bytes)
# HashMap.Entry referencing the block from BlocksMap. (48 bytes)
# References to the block from all DatanodeDescriptors it belongs to.
This is a TreeMap.Entry size times block replication. (64 * replication)

The total: 152 + 72 * replication

Typical object sizes:
Taking into account that typical file name is 10-15 bytes and our default replication is 3 we can say that typical sizes are
File size: 250
Directory size: 290
Block size: 368

||Object||size estimate (bytes)||typical size (bytes)||
|File|224 + 2 * fileName.length|250|
|Directory|264 + 2 * fileName.length|290|
|Block|152 + 72 * replication|368|

One of our clusters has
# Files: 10 600 000
# Dirs: 310 000
# Blocks: 13 300 000

Total Size (estimate): 7,63 GB
Memory used on the name-node (actual reported by jconsole after gc): 9 GB

This means that other data structures like NetworkTopology, heartbeats, datanodeMap, Host2NodesMap,
leases, sortedLeases, and multiple block replication maps occupy ~1.4 GB, which seems to be pretty high
and need to be investigated as well.

Based on the above estimates blocks should be the main focus of the name-node memory reduction effort.
Space used by a block is 50% larger compared to a file, and there is more blocks than files or directories.

Some ideas on how we can reduce the name-node size without substantially changing the data structures.
# INode.children should be an ArrayList instead of a TreeMap. Already done HADOOP-1565. (-48 bytes)
# Factor out the INode class into a separate class (-8 bytes)
# Create base INode class and derive file inode and directory inode classes from the base.
Directory inodes do not need to contain blocks and replication fields (-16 bytes)
File inodes do not need to contain children field (-8 bytes)
# String name should be replaced by a mere byte[]. (-(40 + fileName.length) ~ -50 bytes)
# Eliminate the Block object.
We should move Block fields into into BlockInfo and completely get rid of the Block object. (-16 bytes)
# Block object is referenced at least 5 times in our structures for each physical block.
The number of references should be reduced to just 2. (-24)
# Remove name field from INode. File or directory name is stored in the corresponding directory
entry and does need to be duplicated in the INode (-8 bytes)
# Eliminate INode.parent field. INodes are accessed through the directory tree, and the parent can
be remembered in a local variable while browsing the tree. There is no need to persistently store
the parent reference for each object. (-8 bytes)
# Need to optimize data-node to block map. Currently each DatanodeDescriptor holds a TreeMap of
blocks contained in the node, and we have an overhead of one TreeMap.Entry per block replica.
I expect we can reorganize datanodeMap in a way that it stores only 1 or 2 references per replica
instead of an entire TreeMap.Entry. (-48 * replication)

Note: In general TreeMaps turned out to be very expensive, we should avoid using them if possible.
Or implement a custom map structure, which would avoid using objects for each map entry.


This is what we will have after all the optimizations
||Object||size estimate (bytes)||typical size (bytes)||current typical size (bytes)||
|File|112 + fileName.length|125|250|
|Directory|144 + fileName.length|155|290|
|Block|112 + 24 * replication|184|368|
Konstantin Shvachko made changes - 21/Aug/07 03:15 AM
Link This issue incorporates HADOOP-1743 [ HADOOP-1743 ]
Konstantin Shvachko made changes - 22/Aug/07 06:12 PM
Link This issue incorporates HADOOP-1759 [ HADOOP-1759 ]
Konstantin Shvachko made changes - 23/Aug/07 01:45 AM
Link This issue incorporates HADOOP-1766 [ HADOOP-1766 ]
Konstantin Shvachko added a comment - 24/Aug/07 02:03 AM

Block map optimization.

Block map (implemented as a HashMap) is a mapping of blocks to data-nodes that contain this block.
Currently a block map entry BlockInfo contains an array of references to DatanodeDescriptors corresponding to the data-nodes this block is replicated to.
Each DatanodeDescriptors in turn contains a collection (implemented as a TreeMap) of references to all blocks that belong to the data-node forming a data-node map.
The TreeMap entries in DatanodeDescriptors are the main contributors to the space overhead the system bears for blocks.
The block map and the data-node map are to be kept consistent at all times with respect to following three operations:

  1. add block;
  2. remove block;
  3. process block report.

Add/ remove block are individual block operations and they happen often.
The block report is a fairly rare operation but involves processing of all blocks belonging to the data-node. A block report is an array of blocks received by the name-node from the data-node. The name-node scans the array and verifies that its current knowledge of the data-node blocks matches the actual state reported.
I propose the following changes to the existing data structures.

  1. Each DatanodeDescriptor should maintain a doubly-linked list of references to the blocks instead of a TreeMap. DatanodeDescriptor will contain only one reference to the head element of the list.
  2. The list itself is formed from BlockInfo entries. In order to main lists we will add a pair of references - next and previous - to the BlockInfo for each data-node this block is replicated to. In common case (replication = 3) each BlockInfo will have three pairs of references.

When a block is removed from the block map we re-link adjacent BlockInfo entries for each data-node the block belongs to.
Addition of a block should include it into the corresponding data-node lists. We always include new blocks into the beginning of the list.
When the name-node receives a block report it as usually locks the whole name system so that other updates could not intervene and starts processing the report one block at a time. Before processing the first block it inserts a fake BlockInfo entry into the head of the block list, which will serve as a delimiter between the blocks that have been reported by the data-node and those that were not.
For each block b in the block report we check whether it is contained in the block map. If yes we move it to the head of this data-node block list. If not then we schedule it for removal from this data-node.
When the whole report is processed the blocks that were considered by the name-node belonging to this data-node but turned out be missing there will remain to the right of the delimiting (fake) element initially inserted into the list. The name-node needs either to schedule them for re-replication if insufficient number replicas is present or otherwise just remove the blocks (along with the delimiting entry) from the list corresponding to the reporting data-node.

The BlockInfo structure will be as follows:

BlockInfo {
    private INodeFile inode;
    private DatanodeDescriptor[replication] nodes;
    private BlockInfo[2*replication] neighbors;
}

assuming that neighbors[2*i] and neighbors[2*i+1] are the previous and the next references respectively, corresponding to the data-node nodes[i].

Space overhead per block is now limited to:

  • reference to array of next/previous pairs (8 bytes)
  • the array header (24 bytes)
  • two references for each replica (16 bytes)

That is 32 + 16*replication total.
Versus 64*replication (see "the size of a block" (5) above).
In a typical case (replication = 3) this saves us 192 - 80 = 112.

Optional further optimization would be to merge two arrays of DatanodeDescriptors and the previous/next array in BlockInfo into one Object[] array of size 3*replication such that for the i-th data-node Object[3i] is the DatanodeDescriptor reference and Object[3i+1] and Object[3i+2] are the respective references to previous and next.
I am not quite sure we should go this way but if we do we will save additional 32 bytes per block.


Raghu Angadi added a comment - 28/Aug/07 10:42 PM
Since we want to compact BlockInfo, we could also merge DatanodeDescriptor array and list references array. This will save 24 bytes per block. So the array would be Object[3*replication]. Of course all the hairy details of management of this array would be private to this class.

Konstantin Shvachko added a comment - 31/Aug/07 11:11 PM
This is the block map optimization patch.
With this patch all name-node memory optimization items will be done except
for one #8 (removing INode.parent field). This one depends on changes to the crc upgrade code.
Will do it whenever the changes are complete.

Konstantin Shvachko made changes - 31/Aug/07 11:11 PM
Attachment DNBlockList.patch [ 12364934 ]
Konstantin Shvachko made changes - 01/Sep/07 03:50 PM
Status Open [ 1 ] Patch Available [ 10002 ]
Hadoop QA added a comment - 04/Sep/07 04:25 PM
-1, build or testing failed

2 attempts failed to build and test the latest attachment http://issues.apache.org/jira/secure/attachment/12364934/DNBlockList.patch against trunk revision r571711.

Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/672/testReport/
Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/672/console

Please note that this message is automatically generated and may represent a problem with the automation system and not the patch.


Konstantin Shvachko made changes - 05/Sep/07 12:53 AM
Status Patch Available [ 10002 ] Open [ 1 ]
Konstantin Shvachko added a comment - 05/Sep/07 12:56 AM
Hudson reports failure of the TestMiniMRWithDFS test.
Cannot reproduce it. Resubmitting the patch as it is.

Konstantin Shvachko made changes - 05/Sep/07 12:56 AM
Status Open [ 1 ] Patch Available [ 10002 ]
Hadoop QA added a comment - 05/Sep/07 05:41 PM
-1, build or testing failed

2 attempts failed to build and test the latest attachment http://issues.apache.org/jira/secure/attachment/12364934/DNBlockList.patch against trunk revision r572826.

Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/690/testReport/
Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/690/console

Please note that this message is automatically generated and may represent a problem with the automation system and not the patch.


Doug Cutting added a comment - 06/Sep/07 09:07 PM
Failure again does not look related to patch.

Doug Cutting made changes - 06/Sep/07 09:07 PM
Status Patch Available [ 10002 ] Open [ 1 ]
Doug Cutting made changes - 06/Sep/07 09:08 PM
Status Open [ 1 ] Patch Available [ 10002 ]
Hadoop QA added a comment - 07/Sep/07 04:21 AM
-1, build or testing failed

2 attempts failed to build and test the latest attachment http://issues.apache.org/jira/secure/attachment/12364934/DNBlockList.patch against trunk revision r573413.

Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/702/testReport/
Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/702/console

Please note that this message is automatically generated and may represent a problem with the automation system and not the patch.


Repository Revision Date User Message
ASF #573698 Fri Sep 07 20:55:26 UTC 2007 cutting HADOOP-1687. Save memory on namenode by optimizing BlockMap representation. Contributed by Konstantin.
Files Changed
MODIFY /lucene/hadoop/trunk/src/java/org/apache/hadoop/dfs/INode.java
MODIFY /lucene/hadoop/trunk/src/java/org/apache/hadoop/dfs/BlocksMap.java
MODIFY /lucene/hadoop/trunk/src/java/org/apache/hadoop/dfs/DatanodeDescriptor.java
MODIFY /lucene/hadoop/trunk/src/java/org/apache/hadoop/dfs/FSNamesystem.java
MODIFY /lucene/hadoop/trunk/src/java/org/apache/hadoop/dfs/FSDirectory.java
MODIFY /lucene/hadoop/trunk/CHANGES.txt

Doug Cutting added a comment - 07/Sep/07 08:55 PM
I just committed this. Thanks, Konstantin!

Doug Cutting made changes - 07/Sep/07 08:55 PM
Resolution Fixed [ 1 ]
Status Patch Available [ 10002 ] Resolved [ 5 ]
Konstantin Shvachko added a comment - 27/Sep/07 02:39 AM

P.S.

The benchmarks I run confirm that this changes doubled the name-node space utilization compared to hadoop-0.14.1
and quadrupled it compared to hadoop-0.13.1. The latter is because crc files are not a part of the name space anymore.

My benchmark runs the name-node with a small fixed maximum heap size.
For each hadoop version I run it 3 times separately for the heap size values of -Xmx = 1, 4, and 8 MB.
The data-nodes never send block reports, since the report processing requires extra memory.
The client part of the benchmark generates new files with small blocks until the name-node runs out of memory (OutOfMemoryError).
The benchmark is written in such a way that it keeps the ratio of #files / #block / #directories close to a real cluster distribution.
The file name length is also set to average the name lengths of the real cluster.

Table 1 summarizes the benchmark reults:
Note that for hadoop-0.13.1 # files includes crc files as reported by fsck.
In order to get the "real" number of files one should devided that number by 2.

1 MB hadoop 0.13.1 hadoop 0.14.1 hadoop 0.15
files 1016 816 1327
blocks 1268 1205 1478
dirs 19 28 46
total objects 2303 2049 2851
4 MB hadoop 0.13.1 hadoop 0.14.1 hadoop 0.15
files 4914 4230 8265
blocks 6146 6310 12287
dirs 83 137 262
total objects 11143 10677 20814
8 MB hadoop 0.13.1 hadoop 0.14.1 hadoop 0.15
files 12818 11241 22039
blocks 16003 16893 32969
dirs 206 358 694
total objects 29027 28492 55702

hadoop14 has a little bit less files than hadoop13, which is consistent with the introduction of mod time.
1M heap is not representative because it's too small. 4mB and 8MB heaps show advantages of hadoop15 in all components.
The numbers are consistent with my estimates and expectations.

I also checked the performance of file creation on all three versions.
I run the same benchmark, but the name-node does not have memory restrictions and
the client creates exactly 2048 files in 64 directories and measures the elapsed time.

Table 1 summarizes the benchmark reults:

  hadoop 0.13.1 hadoop 0.14.1 hadoop 0.15
time in sec 354 186 269
files per sec 6 11 8
blocks per sec 8 16 12
objects per sec 14 27 20

hadoop15 is faster than hadoop13, but slower than hadoop14.
There is a field for optimization here.


Doug Cutting made changes - 05/Nov/07 06:12 PM
Status Resolved [ 5 ] Closed [ 6 ]
Owen O'Malley made changes - 08/Jul/09 04:42 PM
Component/s dfs [ 12310710 ]