Hadoop Common

Reducing memory consumption on Namenode : Part 1

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.12.0
    • Component/s: None
    • Labels: None

      Description

      There appear to be some places in the Namenode where memory consumption can be reduced without intrusive code or feature changes. This bug is an initial attempt at making those changes. Please include your thoughts as well.

      One change I am planning to make :

      Currently one copy of each block exists for each of the replicas and one copy for blockMap. I think they are all supposed to be the same.

      1. block-refs-2.patch
        6 kB
        Raghu Angadi
      2. block-refs-3.patch
        8 kB
        Raghu Angadi
      3. block-refs-5.patch
        8 kB
        Raghu Angadi
      4. HADOOP-803.patch
        13 kB
        Raghu Angadi
      5. HADOOP-803-2.patch
        13 kB
        Raghu Angadi
      6. HADOOP-803_3.patch
        18 kB
        Raghu Angadi
      7. NameNodeMemoryHogs.txt
        1 kB
        Raghu Angadi
      8. HADOOP-803_4.patch
        18 kB
        Doug Cutting
      9. HADOOP-803_5.patch
        18 kB
        Raghu Angadi

        Activity

        Doug Cutting added a comment -

        I just committed this. Thanks, Raghu!

        Hadoop QA added a comment -

        +1, because http://issues.apache.org/jira/secure/attachment/12351093/HADOOP-803_5.patch applied and successfully tested against trunk revision r507276.

        Raghu Angadi added a comment -

        attached 5.patch: Updated patch for current trunk.

        Thanks for reviewing, Hairong.

        Hairong Kuang added a comment -

        +1 code reviewed. Raghu, you might need to regenerate the patch.

        Raghu Angadi added a comment -

        TreeMap<Block, Block> allows us to get the object that was inserted into the map. With a TreeSet we cannot get the object that was inserted. This is the basis for removing extra Block objects in this patch.
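The point about TreeMap versus TreeSet can be sketched as follows (a minimal illustration; this simplified Block class and the getStoredBlock helper are hypothetical stand-ins, not the actual HDFS code):

```java
import java.util.TreeMap;

// Simplified stand-in for org.apache.hadoop.dfs.Block; illustrative only.
class Block implements Comparable<Block> {
    final long blkid;
    long numBytes;

    Block(long blkid, long numBytes) {
        this.blkid = blkid;
        this.numBytes = numBytes;
    }

    public int compareTo(Block b) {
        // branch on the values directly to avoid subtraction overflow
        return (blkid < b.blkid) ? -1 : ((blkid > b.blkid) ? 1 : 0);
    }
}

public class StoredBlockDemo {
    // key and value are the same reference, so a lookup with any equal
    // key yields the one canonical stored instance
    static final TreeMap<Block, Block> blocks = new TreeMap<Block, Block>();

    static Block store(Block b) {
        blocks.put(b, b);
        return b;
    }

    // look up with a throwaway key; the returned object is the stored one,
    // which replicas and blockMap can then share
    static Block getStoredBlock(long blockId) {
        return blocks.get(new Block(blockId, 0));
    }

    public static void main(String[] args) {
        Block stored = store(new Block(42L, 1024L));
        Block found = getStoredBlock(42L);
        System.out.println(found == stored); // true: same object, not just equal
    }
}
```

A TreeSet's contains()/remove() can only tell you that an equal element exists; it has no call that hands back the stored element itself, which is why the map form is needed here.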

        Hairong Kuang added a comment -

        In the patch the type of the field "blocks" in DatanodeDescriptor is changed from TreeSet<Block> to TreeMap<Block, Block>. I think type TreeSet<Block> is the same as TreeMap<Block, Block>, but with a cleaner interface. Is there any reason that we need the change?

        Raghu Angadi added a comment -

        > Even after a name node starts up, a block in inode does not share the object with blockMap.
        > So a name node contains two Block instantiations per block.

        In the patch, when a file is closed, we do use the reference in blockMap (because a datanode that has the block would have informed the namenode already).

        Yes, we can get rid of one of the blockMap and activeBlocks maps (not both). This also removes the extra block object we have for files that existed before the namenode restarted. The changes for this are a bit more intrusive; I am wondering if I should do them as part of this patch..

        Hairong Kuang added a comment -

        > If the blocks were created after the Namenode is started, the one in the inode will share the object with blockMap. When the name node starts up, it initially creates all the blocks in the INode while reading the image file.. does not seem easy to share that reference.

        Even after a name node starts up, a block in an inode does not share the object with blockMap. So a name node contains two Block instantiations per block.

        To reduce the number of block instantiations to 1, I think we can create a set of blocks which maps a block id to its block object. When a block is initially created, simply add the block to this block set. When later a reference to the block is needed, we can get it from the block set.

        We can further remove the blockMap and activeBlocks list by adding two non-persistent fields to the Block data structure: one is a reference to all its containing datanodes and one to the inode representing the file that the block belongs to.

        Raghu Angadi added a comment -

        Thanks Doug. The new patch looks fine.
        For now I am removing the 'patch available' state until this patch gets reviewed.

        Doug Cutting added a comment -

        Patch now applies to current trunk, assuming I resolved the conflict correctly...

        Doug Cutting added a comment -

        Updated patch.

        Doug Cutting added a comment -

        Patch doesn't apply to latest trunk.

        Raghu Angadi added a comment -

        see attached NameNodeMemoryHogs.txt for a better formatted list. The following is a cut-and-paste of the attachment.

        Main FS wide datastructures :
        -----------------------------

        blockMap       : block --> containing nodes : HashMap
        activeBlocks   : block --> INode : HashMap
        INode.children : FileName --> INode : TreeMap ( in dir INode )
        DN.blocks      : block --> self : TreeMap ( in each DatanodeDescriptor )
        datanodeMap    : String --> DatanodeDescriptor ( linear cost with number of datanodes )

        Memory cost per Block :
        ----------------------
        Before this patch :
          Block Objects : 3 ( one per replica in DN.blocks )
                          1 ( in blockMap )
                          1 ( in INode )
          TreeMap       : 1 ( containingNodes TreeSet )
          TreeMap$Entry : 3 ( containingNodes TreeSet in blockMap )
                          3 ( in DN.blocks in each of the nodes )
          HashMap$Entry : 1 ( blockMap )
                          1 ( activeBlocks )

        After this patch :
          Block Objects : 1 ( in blockMap )
                          1 ( in INodes that existed before NameNode restarted )
          ArrayList     : 1 ( containingNodes ArrayList )
          TreeMap$Entry : 3 ( in DN.blocks in each of the nodes )
          HashMap$Entry : 1 ( blockMap )
                          1 ( activeBlocks )

        Memory cost per INode :
        ----------------------
          TreeMap       : 1 ( children; removed in this issue for regular files )
          TreeMap$Entry : 1 ( in parent.children )
          INode object  : 1
          char[128]     : 1 ( char[] used in String )
          String        : 1 ( name )

        Raghu Angadi added a comment -

        The patch is estimated to save around 25% of memory on a large NameNode on a 32-bit JVM.

        Raghu Angadi added a comment -

        > A big per file consumer of memory is INode.name. It stores full path.
        > We can save hundred or more bytes per file if we store only the file name.
        > Full path name can always be constructed from parent. — (b) .

        Konstantin pointed out that INode.name is in fact just the file name. Since it is declared as a String it still seems to take around 128 bytes. I will check if the size comes down if it is declared as char[]. Not sure if it can be declared as byte[].

        Raghu Angadi added a comment -

        803_3.patch adds the following : remove allocation of a TreeMap in INode by default.

        Raghu Angadi added a comment -

        Few more thoughts: (these are not intended to be included in patch for this issue)

        A big per-file consumer of memory is INode.name. It stores the full path. We can save a hundred or more bytes per file if we store only the file name. The full path name can always be constructed from the parent. — (b)

        Each directory has 'activeBlocks' which is a HashMap from block to INode. We already have a global blockMap (block to containingNodes). This also implies that every call to getBlock(File) results in recursing from the root to the node, each step of which involves a TreeMap lookup in the children map. I think we should have just one map : block --> { INode, self-ref, containingNodes ... }. This will save a HashMap entry (30+ bytes) and a block object (20-30 bytes) for each block. It also improves getFile() by many times. This will also let us use an ArrayList instead of a TreeMap for INode.children (30-40 bytes per file) — (c)

        Raghu Angadi added a comment -

        Another relatively simpler change :

        each INode allocates a TreeMap for children. Each TreeMap takes around 40 bytes (from the profiler; not sure if it includes GC overhead). Since most nodes in the FS don't have any children, we can postpone allocating the TreeMap until it is needed. — (a)

        INode.children does not strictly need to be a TreeMap. Each TreeMap entry seems to be around 30 bytes. I am not planning to include this change in this bug, but that would be another 30 bytes per node.
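The lazy-allocation idea above can be sketched like this (a hedged illustration; LazyINode and its methods are hypothetical names, not the actual INode code):

```java
import java.util.TreeMap;

// Hypothetical slimmed-down INode: the children map is allocated lazily,
// so the many leaf (file) nodes never pay the ~40-byte TreeMap cost.
public class LazyINode {
    private final String name;
    private TreeMap<String, LazyINode> children;  // stays null for leaf nodes

    public LazyINode(String name) {
        this.name = name;
    }

    public LazyINode addChild(LazyINode child) {
        if (children == null) {
            // allocate only on the first child; directories pay, files don't
            children = new TreeMap<String, LazyINode>();
        }
        children.put(child.name, child);
        return child;
    }

    public LazyINode getChild(String childName) {
        return (children == null) ? null : children.get(childName);
    }

    public boolean hasChildren() {
        return children != null && !children.isEmpty();
    }
}
```

Since most inodes in a typical namespace are files, the per-node saving multiplies across the whole tree.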

        Raghu Angadi added a comment -

        Attaching the same patch; I did not grant the ASF license by mistake last time.

        Raghu Angadi added a comment -

        Attaching patch for review. Changes between 803.patch and 803-2.patch are minor.

        Raghu Angadi added a comment -

        On a small cluster with 3035 blocks, I verified using the NetBeans profiler that block references work as expected:

        Number of Block objects :
          before the patch : 5 * 3035 ( 3 replicas, 1 in blockMap, 1 in INode/File )
          after the patch  : 2 * 3035 ( in blockMap and INode/File )

        If the blocks were created after the Namenode is started, the one in the inode will share the object with blockMap. When the name node starts up, it initially creates all the blocks in the INode while reading the image file.. does not seem easy to share that reference.

        TreeMap.Entry objects were also reduced from 20k to around 11.5k, due to changing containingNodes to an ArrayList instead of a TreeMap.

        TreeMap.Entry and Block used to take the most memory after byte[] and char[] in the profiler. Now Block has gone down in the list.

        I will submit the patch today. We could wait till the current trunk is more stable to check it in.

        Raghu Angadi added a comment -

        > e.g: 'diff' won't be < 0 when block ids are LONG_MAX and -10.
        I meant 'won't be > 0'.

        Raghu Angadi added a comment -

        The bug is in the following patch. Very costly oversight: it does not affect equals() but affects Block.compareTo().

          public int compareTo(Object o) {
        -     Block b = (Block) o;
        -     if (getBlockId() < b.getBlockId()) {
        -       return -1;
        -     } else if (getBlockId() == b.getBlockId()) {
        -       return 0;
        -     } else {
        -       return 1;
        -     }
        +     long diff = getBlockId() - ((Block)o).getBlockId();
        +     return ( diff < 0 ) ? -1 : ( ( diff > 0 ) ? 1 : 0 );
          }

        e.g: 'diff' won't be < 0 when block ids are LONG_MAX and -10.

        changing this to the following fixes it :

        +     Block b = (Block)o;
        +     return ( blkid < b.blkid ) ? -1 :
        +            ( ( blkid > b.blkid ) ? 1 : 0 );

        Now the TestSmallBlocks test does not fail.
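The overflow can be demonstrated in isolation (a small standalone sketch; the method names are ours, not from the patch):

```java
// Demonstrates why subtraction-based compareTo is unsafe for long ids:
// Long.MAX_VALUE - (-10) overflows and wraps negative, so the buggy
// version orders Long.MAX_VALUE *before* -10.
public class CompareToOverflow {

    static int buggyCompare(long a, long b) {
        long diff = a - b;  // can silently overflow for distant values
        return (diff < 0) ? -1 : ((diff > 0) ? 1 : 0);
    }

    static int fixedCompare(long a, long b) {
        // branch on the values directly; no arithmetic, no overflow
        return (a < b) ? -1 : ((a > b) ? 1 : 0);
    }

    public static void main(String[] args) {
        System.out.println(buggyCompare(Long.MAX_VALUE, -10L)); // -1 (wrong)
        System.out.println(fixedCompare(Long.MAX_VALUE, -10L)); // 1 (correct)
    }
}
```

A broken compareTo violates the TreeMap ordering contract, which is exactly how lookups like node.getBlock(blockId) can return null for a block that is actually in the map.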

        Raghu Angadi added a comment -

        This is pretty weird. Both the NPE and SmallBlocks test failures in HADOOP-898 are caused by the same problem : node.getBlock(blockId) sometimes returns null. But I verified that node.blocks contains this block both earlier and right after the failure. Any ideas?

        The blocks map in DatanodeDescriptor is changed like this :

        -     private volatile Collection<Block> blocks = new TreeSet<Block>();
        +     private volatile SortedMap<Block, Block> blocks = new TreeMap<Block, Block>();

        and getBlock(long blockId) is defined as :

              return blocks.get( new Block(blockId, 0) );
        Doug Cutting added a comment -

        This was causing problems and has been reverted in HADOOP-898.

        Doug Cutting added a comment -

        I just committed this. Thanks, Raghu!

        Hadoop QA added a comment -

        +1, because http://issues.apache.org/jira/secure/attachment/12348783/HADOOP-803.patch applied and successfully tested against trunk revision r495045.

        Raghu Angadi added a comment -

        The latest patch changes the datanode container associated with each block to an ArrayList instead of a SortedSet. The ArrayList's initial size is set to the number of replicas for the file.

        Raghu Angadi added a comment -

        Modified containingNodes to be an ArrayList instead of a SortedSet. On a lightly loaded 500-node cluster (each node has 500-600 blocks), memory (in MB) with the patch was in the low-to-mid 40s after GC, and without the patch it was in the mid 50s. I will submit a new patch.

        Raghu Angadi added a comment -

        The attached patch seems to save on the order of 1 MB with around 3000 blocks (9000 total blocks / 3). Without the patch JConsole shows around 5.3-5.57 MB of heap in use just after GC. With the patch it is around 4.4-4.8 MB. The calculation would be more accurate on a larger cluster. I will try.

        Hairong Kuang added a comment -

        A LinkedList entry contains a pointer to the previous entry and a pointer to the next entry, so it is more expensive entry-wise. One problem with ArrayList is that it creates an array of size 10 by default. But because most of the time the set size is 3, we could set the initial array size to 3.

        Raghu Angadi added a comment -

        +1
        I was thinking the same.
        How much do you think the memory difference is between a LinkedList entry and an ArrayList entry?

        Hairong Kuang added a comment -

        Currently blockMap maps a block to a TreeSet of DatanodeDescriptors. I would suggest that we use an ArrayList in order to reduce memory use. Most of the time the set size is 3 because the default replication factor of a file is 3. So in terms of speed, there is no benefit to using a TreeSet. However, in terms of memory a TreeSet is far more expensive than an ArrayList. An entry in a TreeSet is at least 6 times as expensive as an entry in an ArrayList, and we have approximately 3 * total_#_of_blocks such entries in FSNamesystem.
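This suggestion amounts to something like the following sketch (names are illustrative; the real DatanodeDescriptor is much more involved, and newContainingNodes is a hypothetical helper):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the suggested change: containing nodes kept in an ArrayList
// sized to the replication factor (usually 3) instead of a TreeSet.
public class ContainingNodesDemo {

    // minimal stand-in for the real DatanodeDescriptor
    static class DatanodeDescriptor {
        final String name;
        DatanodeDescriptor(String name) { this.name = name; }
    }

    // one list per block; initial capacity matches the expected replica
    // count, avoiding ArrayList's default backing array of 10 slots
    static List<DatanodeDescriptor> newContainingNodes(int replication) {
        return new ArrayList<DatanodeDescriptor>(replication);
    }

    public static void main(String[] args) {
        List<DatanodeDescriptor> nodes = newContainingNodes(3);
        nodes.add(new DatanodeDescriptor("dn1"));
        nodes.add(new DatanodeDescriptor("dn2"));
        nodes.add(new DatanodeDescriptor("dn3"));
        System.out.println(nodes.size()); // 3
    }
}
```

With only ~3 elements per block, the linear scans an ArrayList requires are as fast in practice as TreeSet lookups, so the memory saving costs nothing in speed.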

        Raghu Angadi added a comment -

        please disregard patch 4. diff between patch 5 and patch 3 :

        -     block = containingNodes.first().getBlock(block.getBlockId());
        +     Block storedBlock =
        +         containingNodes.first().getBlock(block.getBlockId());
        +     // update stored block's length.
        +     if ( block.getNumBytes() > 0 ) {
        +       storedBlock.setNumBytes( block.getNumBytes() );
        +       block = storedBlock;
        +     }
        Raghu Angadi added a comment -

        block-refs-4.patch : adds Konstantin's suggestion above. Diff between patch 3 and patch 4 :

        -     block = containingNodes.first().getBlock(block.getBlockId());
        +     Block storedBlock =
        +         containingNodes.first().getBlock(block.getBlockId());
        +     // update stored block's length.
        +     if ( block != storedBlock && block.getNumBytes() > 0 ) {
        +       storedBlock.setNumBytes( block.getNumBytes() );
        +     }

        We now update the block length with the length reported by the latest datanode.

        As before, this does not affect the block lengths of blocks that belong to files created during previous runs of the namenode.

        Raghu Angadi added a comment -

        Ok, so the length reported in the latest block report is the correct length. I will attach a new patch with this change.

        Konstantin Shvachko added a comment -

        Different replicas having different lengths should be detected by checksums.
        Your patch should update the length imo, since before it was updated with every block report.
        If an append occurs the file length should change.

        Raghu Angadi added a comment -

        This patch does not. I was wondering about this. Nowhere in the code do we check or enforce that the lengths reported by datanodes are the same. For example, when a file is closed all the blocks for the file use the length reported by the first datanode that has that block. This patch does not change that behavior. Block length is rarely considered.

        Konstantin Shvachko added a comment -

        If the new report contains different block length do you update it in the stored block?

        Raghu Angadi added a comment -

        block-refs-3.patch:

        1) changed getBlock(Block) to getBlock( long blockId ). Note that this version includes a 'new'.
        2) Konstantin found a bug caused by the change in semantics of removeStoredBlock() and addStoredBlock() :
        previously they did not modify the node's map. Restored.

        This also slightly modifies Block.compareTo().

        Why not have TreeMap<BlockId, Block> instead of TreeMap<Block, Block>?
        BlockId is a long and it would need to be a Long for this generic class. That implies another allocation of a Long for each element in the map.

        Raghu Angadi added a comment -

        > I do not see how replacing TreeSet=TreeMap<Block, static final Object> by TreeMap<Block, Block> in DatanodeDescriptor
        > would reduce memory consumption.

        This does not reduce memory .This will let us get hold of original object that was inserted in to the map. That fact that all nodes that have a particular block and blockMap reference the same object reduces memory. Now we have 1 Block object instead of 4 (in the case of 3 replicas). This references argument might be flawed since I am kind of new to Java.

        > Block getBlock( Block b)
        > looks very strange. So you need a block in order to get it ..........

        We could call it getStoredBlock() or getBlock(blockId). We want a function that returns the Block object that was stored in the map.
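The canonical-instance idea above can be illustrated with a small sketch (hypothetical, simplified names; not the real namenode classes): given any Block equal to a stored one, e.g. one parsed from a datanode's block report, the map hands back the single instance the namenode already keeps, so every replica entry can point at that one object.

```java
import java.util.TreeMap;

// Hypothetical, simplified block ordered by id only.
class StoredBlock implements Comparable<StoredBlock> {
    final long id;

    StoredBlock(long id) { this.id = id; }

    @Override
    public int compareTo(StoredBlock o) { return Long.compare(id, o.id); }
}

class BlockStore {
    private final TreeMap<StoredBlock, StoredBlock> map = new TreeMap<>();

    void add(StoredBlock b) { map.put(b, b); }

    // Given any block equal to a stored one (e.g. from a block report),
    // return the single canonical instance kept by the namenode.
    StoredBlock getStoredBlock(StoredBlock reported) {
        return map.get(reported);
    }
}
```

With TreeSet this is impossible: contains() only returns a boolean, which is what makes the TreeMap<Block, Block> replacement necessary.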

        Konstantin Shvachko added a comment -

        I do not see how replacing TreeSet=TreeMap<Block, static final Object> by TreeMap<Block, Block> in DatanodeDescriptor
        would reduce memory consumption.

        Block getBlock( Block b)
        looks very strange. So you need a block in order to get it ..........

        Raghu Angadi added a comment -

        2) Another candidate: there are two global maps for blocks:

        a) blockMap : block --> nodes that have this block
        b) activeBlock : block --> file Inode

        We could have a single map instead:

        blockMap : block --> BlockInfo { Block, nodes, Inode, ... }

        Of course this is a bigger code change.

        Is a TreeSet much bigger in size than a LinkedList? We have one TreeSet of nodes for each block.
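The merged-map proposal might look roughly like the sketch below. All of the class and field names here are hypothetical stand-ins (the real namenode types differ); the point is only that one lookup then reaches the block, its replica list, and its owning file together, instead of two separate map lookups.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical stubs; real namenode classes and fields differ.
class BlockKey implements Comparable<BlockKey> {
    final long id;

    BlockKey(long id) { this.id = id; }

    @Override
    public int compareTo(BlockKey o) { return Long.compare(id, o.id); }
}

class DatanodeRef { final String name; DatanodeRef(String n) { name = n; } }
class INodeRef   { final String path; INodeRef(String p)   { path = p; } }

// One entry replaces a lookup in blockMap plus a lookup in activeBlock.
class BlockInfo {
    final BlockKey block;                               // the block itself
    final List<DatanodeRef> nodes = new ArrayList<>();  // replica locations
    INodeRef inode;                                     // owning file

    BlockInfo(BlockKey b) { block = b; }
}

class UnifiedBlockMap {
    final TreeMap<BlockKey, BlockInfo> map = new TreeMap<>();
}
```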

        Raghu Angadi added a comment -

        block-refs-2.patch: this removes another copy of the block for files created in the current instance of the running Namenode. It also removes some iterations when a file is added.

        An extra copy still exists for files created at startup while reading the fsimage.

        Raghu Angadi added a comment -

        The datanode's blocks set is now defined as a TreeMap.

        I am attaching the proposed patch for review. I have done brief testing, and will also test by manually deleting some blocks from a datanode to trigger the new code paths.

        Doug Cutting added a comment -

        TreeSet wraps a TreeMap, so you might as well just use a TreeMap whose keys and values are identical, or some such.
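That suggestion can be packaged up as a tiny "set with get()" wrapper — a hypothetical sketch, not code from the patch. A TreeSet can only answer contains(); a TreeMap whose key and value are the same object can additionally hand back the stored element, which is exactly what sharing one Block across replicas needs.

```java
import java.util.TreeMap;

// A set that can return the stored (canonical) instance of an element,
// built on a TreeMap with identical keys and values, as suggested above.
class InternSet<T extends Comparable<T>> {
    private final TreeMap<T, T> map = new TreeMap<>();

    // Insert if absent; either way, return the stored canonical instance.
    T intern(T candidate) {
        T stored = map.get(candidate);
        if (stored != null) {
            return stored;
        }
        map.put(candidate, candidate);
        return candidate;
    }
}
```

Interning every equal element down to one shared instance is precisely the memory saving discussed earlier in the thread.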

        Raghu Angadi added a comment -

        With TreeSet there does not seem to be any method to get a reference to an object that already exists in the set.


          People

          • Assignee:
            Raghu Angadi
            Reporter:
            Raghu Angadi