Hadoop Common
  1. Hadoop Common
  2. HADOOP-1687

Name-node memory size estimates and optimization proposal.

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.13.1
    • Fix Version/s: 0.15.0
    • Component/s: None
    • Labels:
      None

      Description

      I've done some estimates on how much space our data structures take on the name-node per block, file and directory.

      Brief overview of the data structures:
      Directory tree (FSDirectory) is built of inodes. Each INode points either to an array of blocks
      if it corresponds to a file or to a TreeMap<String, INode> of children INodes if it is a directory.
      [Note: this estimates were made before Dhruba replaced the children TreeMap by ArrayList.]
      Each block participates also in at least 2 more data structures.
      BlocksMap contains a HashMap<Block, BlockInfo> of all blocks mapping a Block into a BlockInfo.
      DatanodeDescriptor contains a TreeMap<Block, Block> of all blocks belonging to this data-node.
      A block may or may not be contained also in other data-structures, like

      UnderReplicatedBlocks
      PendingReplicationBlocks
      recentInvalidateSets
      excessReplicateMap
      

      Presence of a block in any of these structures is temporary and therefore I do not count them in my estimates.
      The estimates can be viewed as lower bounds.

      These are some classes that we are looking at here

      class INode {
         String name;
         INode parent;
         TreeMap<String, INode> children;
         Block blocks[];
         short blockReplication;
         long modificationTime;
      }
      
      class Block {
         long blkid;
         long len;
      }
      
      class BlockInfo {
         FSDirectory.INode       inode;
         DatanodeDescriptor[]   nodes;
         Block                          block;
      }
      

      The calculations are made for a 64-bit java vm based on that
      Reference size = 8 bytes
      Object header size = 16 bytes
      Array header size = 24 bytes

      Commonly used objects:
      TreeMap.Entry = 64 bytes. It has 5 reference fields
      HashMap.Entry = 48 bytes. It has 3 reference fields
      String header = 64 bytes.

      The size of a file includes:

      1. Size of an empty file INode: INode.children = null, INode.blocks is a 0-length array, and file name is empty. (152 bytes)
      2. A directory entry of the parent INode that points to this file, which is a TreeMap.Entry. (64 bytes)
      3. file name length times 2, because String represents each name character by 2 bytes.
      4. Reference to the outer FSDirectory class (8 bytes)

      The total: 224 + 2 * fileName.length

      The size of a directory includes:

      1. Size of an empty directory INode: INode.children is an empty TreeMap, INode.blocks = null, and file name is empty. (192 bytes)
      2. A directory entry of the parent INode that points to this file, which is a TreeMap.Entry. (64 bytes)
      3. file name length times 2
      4. Reference to the outer FSDirectory class (8 bytes)

      The total: 264 + 2 * fileName.length

      The size of a block includes:

      1. Size of Block. (32 bytes)
      2. Size of BlockInfo. (64 + 8*replication bytes)
      3. Reference to the block from INode.blocks (8 bytes)
      4. HashMap.Entry referencing the block from BlocksMap. (48 bytes)
      5. References to the block from all DatanodeDescriptors it belongs to.
        This is a TreeMap.Entry size times block replication. (64 * replication)

      The total: 152 + 72 * replication

      Typical object sizes:
      Taking into account that typical file name is 10-15 bytes and our default replication is 3 we can say that typical sizes are
      File size: 250
      Directory size: 290
      Block size: 368

      Object size estimate (bytes) typical size (bytes)
      File 224 + 2 * fileName.length 250
      Directory 264 + 2 * fileName.length 290
      Block 152 + 72 * replication 368

      One of our clusters has

      1. Files: 10 600 000
      2. Dirs: 310 000
      3. Blocks: 13 300 000

      Total Size (estimate): 7,63 GB
      Memory used on the name-node (actual reported by jconsole after gc): 9 GB

      This means that other data structures like NetworkTopology, heartbeats, datanodeMap, Host2NodesMap,
      leases, sortedLeases, and multiple block replication maps occupy ~1.4 GB, which seems to be pretty high
      and need to be investigated as well.

      Based on the above estimates blocks should be the main focus of the name-node memory reduction effort.
      Space used by a block is 50% larger compared to a file, and there is more blocks than files or directories.

      Some ideas on how we can reduce the name-node size without substantially changing the data structures.

      1. INode.children should be an ArrayList instead of a TreeMap. Already done HADOOP-1565. (-48 bytes)
      2. Factor out the INode class into a separate class (-8 bytes)
      3. Create base INode class and derive file inode and directory inode classes from the base.
        Directory inodes do not need to contain blocks and replication fields (-16 bytes)
        File inodes do not need to contain children field (-8 bytes)
      4. String name should be replaced by a mere byte[]. (-(40 + fileName.length) ~ -50 bytes)
      5. Eliminate the Block object.
        We should move Block fields into into BlockInfo and completely get rid of the Block object. (-16 bytes)
      6. Block object is referenced at least 5 times in our structures for each physical block.
        The number of references should be reduced to just 2. (-24)
      7. Remove name field from INode. File or directory name is stored in the corresponding directory
        entry and does need to be duplicated in the INode (-8 bytes)
      8. Eliminate INode.parent field. INodes are accessed through the directory tree, and the parent can
        be remembered in a local variable while browsing the tree. There is no need to persistently store
        the parent reference for each object. (-8 bytes)
      9. Need to optimize data-node to block map. Currently each DatanodeDescriptor holds a TreeMap of
        blocks contained in the node, and we have an overhead of one TreeMap.Entry per block replica.
        I expect we can reorganize datanodeMap in a way that it stores only 1 or 2 references per replica
        instead of an entire TreeMap.Entry. (-48 * replication)

      Note: In general TreeMaps turned out to be very expensive, we should avoid using them if possible.
      Or implement a custom map structure, which would avoid using objects for each map entry.

      This is what we will have after all the optimizations

      Object size estimate (bytes) typical size (bytes) current typical size (bytes)
      File 112 + fileName.length 125 250
      Directory 144 + fileName.length 155 290
      Block 112 + 24 * replication 184 368
      1. DNBlockList.patch
        28 kB
        Konstantin Shvachko

        Issue Links

          Activity

          Owen O'Malley made changes -
          Component/s dfs [ 12310710 ]
          Doug Cutting made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Doug Cutting made changes -
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Doug Cutting made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Doug Cutting made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Konstantin Shvachko made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Konstantin Shvachko made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Konstantin Shvachko made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Konstantin Shvachko made changes -
          Attachment DNBlockList.patch [ 12364934 ]
          Konstantin Shvachko made changes -
          Link This issue incorporates HADOOP-1766 [ HADOOP-1766 ]
          Konstantin Shvachko made changes -
          Link This issue incorporates HADOOP-1759 [ HADOOP-1759 ]
          Konstantin Shvachko made changes -
          Link This issue incorporates HADOOP-1743 [ HADOOP-1743 ]
          Konstantin Shvachko made changes -
          Field Original Value New Value
          Description I've done some estimates on how much space our data structures take on the name-node per block, file and directory.

          Brief overview of the data structures:
          Directory tree (FSDirectory) is built of inodes. Each INode points either to an array of blocks
          if it corresponds to a file or to a TreeMap<String, INode> of children INodes if it is a directory.
          [Note: this estimates were made before Dhruba replaced the children TreeMap by ArrayList.]
          Each block participates also in at least 2 more data structures.
          BlocksMap contains a HashMap<Block, BlockInfo> of all blocks mapping a Block into a BlockInfo.
          DatanodeDescriptor contains a TreeMap<Block, Block> of all blocks belonging to this data-node.
          A block may or may not be contained also in other data-structures, like
          {code}
          UnderReplicatedBlocks
          PendingReplicationBlocks
          recentInvalidateSets
          excessReplicateMap
          {code}
          Presence of a block in any of these structures is temporary and therefore I do not count them in my estimates.
          The estimates can be viewed as lower bounds.

          These are some classes that we are looking at here
          {code}
          class INode {
             String name;
             INode parent;
             TreeMap<String, INode> children;
             Block blocks[];
             short blockReplication;
             long modificationTime;
          }

          class Block {
             long blkid;
             long len;
          }

          class BlockInfo {
             FSDirectory.INode inode;
             DatanodeDescriptor[] nodes;
             Block block;
          }
          {code}

          The calculations are made for a 64-bit java vm based on that
          Reference size = 8 bytes
          Object header size = 16 bytes
          Array header size = 24 bytes

          Commonly used objects:
          TreeMap.Entry = 64 bytes. It has 5 reference fields
          HashMap.Entry = 48 bytes. It has 3 reference fields
          String header = 64 bytes.

          The size of a file includes:
          # Size of an empty file INode: INode.children = null, INode.blocks is a 0-length array, and file name is empty. (152 bytes)
          # A directory entry of the parent INode that points to this file, which is a TreeMap.Entry. (64 bytes)
          # file name length times 2, because String represents each name character by 2 bytes.
          # Reference to the outer FSDirectory class (8 bytes)
          The total: 224 + 2 * fileName.length

          The size of a directory includes:
          # Size of an empty directory INode: INode.children is an empty TreeMap, INode.blocks = null, and file name is empty. (192 bytes)
          # A directory entry of the parent INode that points to this file, which is a TreeMap.Entry. (64 bytes)
          # file name length times 2
          # Reference to the outer FSDirectory class (8 bytes)
          The total: 264 + 2 * fileName.length

          The size of a block includes:
          # Size of Block. (32 bytes)
          # Size of BlockInfo. (64 + 8*replication bytes)
          # Reference to the block from INode.blocks (8 bytes)
          # HashMap.Entry referencing the block from BlocksMap. (48 bytes)
          # References to the block from all DatanodeDescriptors it belongs to.
          This is a TreeMap.Entry size times block replication. (64 * replication)
          The total: 152 + 72 * replication

          Typical object sizes:
          Taking into account that typical file name is 10-15 bytes and our default replication is 3 we can say that typical sizes are
          File size: 250
          Directory size: 290
          Block size: 368

          ||Object||size estimate (bytes)||typical size (bytes)||
          |File|224 + 2 * fileName.length|250|
          |Directory|264 + 2 * fileName.length|290|
          |Block|152 + 72 * replication|368|

          One of our clusters has
          # Files: 10 600 000
          # Dirs: 310 000
          # Blocks: 13 300 000
          Total Size (estimate): 7,63 GB
          Memory used on the name-node (actual reported by jconsole after gc): 9 GB

          This means that other data structures like NetworkTopology, heartbeats, datanodeMap, Host2NodesMap,
          leases, sortedLeases, and multiple block replication maps occupy ~1.4 GB, which seems to be pretty high
          and need to be investigated as well.

          Based on the above estimates blocks should be the main focus of the name-node memory reduction effort.
          Space used by a block is 50% larger compared to a file, and there is more blocks than files or directories.

          Some ideas on how we can reduce the name-node size without substantially changing the data structures.
          # INode.children should be an ArrayList instead of a TreeMap. Already done HADOOP-1565. (-48 bytes)
          # Factor out the INode class into a separate class (-8 bytes)
          # Create base INode class and derive file inode and directory inode classes from the base.
          Directory inodes do not need to contain blocks and replication fields (-16 bytes)
          File inodes do not need to contain children field (-8 bytes)
          # String name should be replaced by a mere byte[]. (-(40 + fileName.length) ~ -50 bytes)
          # Eliminate the Block object.
          We should move Block fields into into BlockInfo and completely get rid of the Block object. (-16 bytes)
          # Block object is referenced at least 5 times in our structures for each physical block.
          The number of references should be reduced to just 2. (-24)
          # Remove name field from INode. File or directory name is stored in the corresponding directory
          entry and does need to be duplicated in the INode (-8 bytes)
          # Eliminate INode.parent field. INodes are accessed through the directory tree, and the parent can
          be remembered in a local variable while browsing the tree. There is no need to persistently store
          the parent reference for each object. (-8 bytes)
          # Need to optimize data-node to block map. Currently each DatanodeDescriptor holds a TreeMap of
          blocks contained in the node, and we have an overhead of one TreeMap.Entry per block replica.
          I expect we can reorganize datanodeMap in a way that it stores only 1 or 2 references per replica
          instead of an entire TreeMap.Entry. (-48 * replication)

          Note: In general TreeMaps turned out to be very expensive, we should avoid using them if possible.
          Or implement a custom map structure, which would avoid using objects for each map entry.


          This is what we will have after all the optimizations
          ||Object||size estimate (bytes)||typical size (bytes)||current typical size (bytes)||
          |File|112 + fileName.length|125|250|
          |Directory|144 + fileName.length|155|290|
          |Block|112 + 24 * replication|184|368|
          I've done some estimates on how much space our data structures take on the name-node per block, file and directory.

          Brief overview of the data structures:
          Directory tree (FSDirectory) is built of inodes. Each INode points either to an array of blocks
          if it corresponds to a file or to a TreeMap<String, INode> of children INodes if it is a directory.
          [Note: this estimates were made before Dhruba replaced the children TreeMap by ArrayList.]
          Each block participates also in at least 2 more data structures.
          BlocksMap contains a HashMap<Block, BlockInfo> of all blocks mapping a Block into a BlockInfo.
          DatanodeDescriptor contains a TreeMap<Block, Block> of all blocks belonging to this data-node.
          A block may or may not be contained also in other data-structures, like
          {code}
          UnderReplicatedBlocks
          PendingReplicationBlocks
          recentInvalidateSets
          excessReplicateMap
          {code}
          Presence of a block in any of these structures is temporary and therefore I do not count them in my estimates.
          The estimates can be viewed as lower bounds.

          These are some classes that we are looking at here
          {code}
          class INode {
             String name;
             INode parent;
             TreeMap<String, INode> children;
             Block blocks[];
             short blockReplication;
             long modificationTime;
          }

          class Block {
             long blkid;
             long len;
          }

          class BlockInfo {
             FSDirectory.INode inode;
             DatanodeDescriptor[] nodes;
             Block block;
          }
          {code}

          The calculations are made for a 64-bit java vm based on that
          Reference size = 8 bytes
          Object header size = 16 bytes
          Array header size = 24 bytes

          Commonly used objects:
          TreeMap.Entry = 64 bytes. It has 5 reference fields
          HashMap.Entry = 48 bytes. It has 3 reference fields
          String header = 64 bytes.

          The size of a file includes:
          # Size of an empty file INode: INode.children = null, INode.blocks is a 0-length array, and file name is empty. (152 bytes)
          # A directory entry of the parent INode that points to this file, which is a TreeMap.Entry. (64 bytes)
          # file name length times 2, because String represents each name character by 2 bytes.
          # Reference to the outer FSDirectory class (8 bytes)

          The total: 224 + 2 * fileName.length

          The size of a directory includes:
          # Size of an empty directory INode: INode.children is an empty TreeMap, INode.blocks = null, and file name is empty. (192 bytes)
          # A directory entry of the parent INode that points to this file, which is a TreeMap.Entry. (64 bytes)
          # file name length times 2
          # Reference to the outer FSDirectory class (8 bytes)

          The total: 264 + 2 * fileName.length

          The size of a block includes:
          # Size of Block. (32 bytes)
          # Size of BlockInfo. (64 + 8*replication bytes)
          # Reference to the block from INode.blocks (8 bytes)
          # HashMap.Entry referencing the block from BlocksMap. (48 bytes)
          # References to the block from all DatanodeDescriptors it belongs to.
          This is a TreeMap.Entry size times block replication. (64 * replication)

          The total: 152 + 72 * replication

          Typical object sizes:
          Taking into account that typical file name is 10-15 bytes and our default replication is 3 we can say that typical sizes are
          File size: 250
          Directory size: 290
          Block size: 368

          ||Object||size estimate (bytes)||typical size (bytes)||
          |File|224 + 2 * fileName.length|250|
          |Directory|264 + 2 * fileName.length|290|
          |Block|152 + 72 * replication|368|

          One of our clusters has
          # Files: 10 600 000
          # Dirs: 310 000
          # Blocks: 13 300 000

          Total Size (estimate): 7,63 GB
          Memory used on the name-node (actual reported by jconsole after gc): 9 GB

          This means that other data structures like NetworkTopology, heartbeats, datanodeMap, Host2NodesMap,
          leases, sortedLeases, and multiple block replication maps occupy ~1.4 GB, which seems to be pretty high
          and need to be investigated as well.

          Based on the above estimates blocks should be the main focus of the name-node memory reduction effort.
          Space used by a block is 50% larger compared to a file, and there is more blocks than files or directories.

          Some ideas on how we can reduce the name-node size without substantially changing the data structures.
          # INode.children should be an ArrayList instead of a TreeMap. Already done HADOOP-1565. (-48 bytes)
          # Factor out the INode class into a separate class (-8 bytes)
          # Create base INode class and derive file inode and directory inode classes from the base.
          Directory inodes do not need to contain blocks and replication fields (-16 bytes)
          File inodes do not need to contain children field (-8 bytes)
          # String name should be replaced by a mere byte[]. (-(40 + fileName.length) ~ -50 bytes)
          # Eliminate the Block object.
          We should move Block fields into into BlockInfo and completely get rid of the Block object. (-16 bytes)
          # Block object is referenced at least 5 times in our structures for each physical block.
          The number of references should be reduced to just 2. (-24)
          # Remove name field from INode. File or directory name is stored in the corresponding directory
          entry and does need to be duplicated in the INode (-8 bytes)
          # Eliminate INode.parent field. INodes are accessed through the directory tree, and the parent can
          be remembered in a local variable while browsing the tree. There is no need to persistently store
          the parent reference for each object. (-8 bytes)
          # Need to optimize data-node to block map. Currently each DatanodeDescriptor holds a TreeMap of
          blocks contained in the node, and we have an overhead of one TreeMap.Entry per block replica.
          I expect we can reorganize datanodeMap in a way that it stores only 1 or 2 references per replica
          instead of an entire TreeMap.Entry. (-48 * replication)

          Note: In general TreeMaps turned out to be very expensive, we should avoid using them if possible.
          Or implement a custom map structure, which would avoid using objects for each map entry.


          This is what we will have after all the optimizations
          ||Object||size estimate (bytes)||typical size (bytes)||current typical size (bytes)||
          |File|112 + fileName.length|125|250|
          |Directory|144 + fileName.length|155|290|
          |Block|112 + 24 * replication|184|368|
          Konstantin Shvachko created issue -

            People

            • Assignee:
              Konstantin Shvachko
              Reporter:
              Konstantin Shvachko
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development