Hadoop HDFS
HDFS-4489

Use InodeID as an identifier of a file in HDFS protocols and APIs

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.1.0-beta
    • Component/s: namenode
    • Labels:
      None

      Description

      Using InodeID to uniquely identify a file has multiple benefits. Here are a few of them:

      1. Uniquely identify a file across renames; related JIRAs include HDFS-4258 and HDFS-4437.
      2. Modification checks in tools like distcp. Since a file could have been replaced or renamed, the combination of file name and size is not reliable, but the combination of file ID and size is unique.
      3. ID-based protocol support (e.g., NFS).
      4. Make the pluggable block placement policy use the file ID instead of the file name (HDFS-385).

      Attachments

      1. 4434.optimized.patch (82 kB, Suresh Srinivas)

        Issue Links

          Activity

          Suresh Srinivas added a comment -

          Introduction

          This jira proposes to introduce an InodeID in HDFS. An InodeID uniquely identifies an instance of an inode, such as a directory, file, or symbolic link, independent of the file path name. This helps in several use cases:

          1. HDFS can evolve to support ID-based protocols such as NFS. We plan to add an experimental NFS V3 gateway to HDFS using this mechanism. Will post a github link soon.
          2. InodeID can be used by tools such as distcp to track a single instance of a file, for example for caching data or for checking whether a file has been modified.
          3. A path cannot identify a unique instance of a file. This causes issues as described in HDFS-4258 and HDFS-4437. It has also been a requirement of many other jiras such as HDFS-385.
          4. Using InodeID as an identifier instead of a path can be more efficient than path-based access.

          New Inode identifier

          It is a 64-bit number with the following properties:

          1. Number 0 is reserved and identifies an invalid/default value.
          2. Numbers 1-1023 are reserved for unforeseen future requirements. InodeIDs start from 1024.
          3. An InodeID is never re-used in a single namenode namespace.
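
          A minimal sketch of an allocator with these properties (illustrative only; the class and method names are assumptions, not the actual HDFS implementation):

            // Sequential 64-bit inode ID allocator: 0 is the invalid/default
            // value, 1-1023 are reserved, the first allocated ID is 1024 and
            // IDs are never reused within a namenode namespace.
            class InodeIdAllocator {
              static final long INVALID_ID = 0L;
              static final long LAST_RESERVED_ID = 1023L;

              private long lastId = LAST_RESERVED_ID;

              synchronized long nextId() {
                return ++lastId;   // first call returns 1024
              }

              synchronized long getLastId() {
                return lastId;     // persisted so IDs are not reused after a restart
              }
            }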

          General overview of the changes required

          Applications discover the InodeID by getting the FileStatus for an Inode or when an Inode is created or opened for append. FileStatus will be changed to include InodeID. Create and append will be changed to return FileStatus as well.
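
          As an illustration of the intended client-side flow (the accessor name below is an assumption for this sketch, not a committed API):

            // Hypothetical usage; assumes the status object returned to the
            // client exposes the new inode ID through a getFileId() accessor.
            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus status = fs.getFileStatus(new Path("/user/foo/data.txt"));
            long inodeId = status.getFileId();   // assumed accessor, see note above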

          For other APIs that use path (Path or String) to identify a file we have two choices:

          1. API Change: Add another variant of the API that uses InodeID to identify a file, or add an additional InodeID parameter to the API.
          2. No API Change: Use the path to send the InodeID and keep the API changes to a minimum. For example, one could use the path /.inodes/<inodeID>, where ".inodes" is a reserved name identifying paths that pass an InodeID instead of a regular path. This is similar to /proc, which is used for special purposes on *nix (see the sketch below).
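
          A sketch of how a client could build such a path under this convention (the helper is illustrative, not an existing API):

            // Builds a "/.inodes/<inodeID>" path for the no-API-change option.
            // The reserved ".inodes" prefix is from the proposal above; the
            // helper itself is hypothetical and not part of any patch.
            static Path inodeIdPath(long inodeId) {
              return new Path("/.inodes/" + inodeId);
            }

            // e.g. fs.open(inodeIdPath(1024)) would open the file whose inode
            // ID is 1024, regardless of its current path name.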

          InodeID to Inode map

          A new map (based on GSet) will be introduced in the namenode to map a given InodeID to an Inode. This is in addition to the existing map of BlockID to BlockInfo.
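
          Conceptually, the new map behaves like the sketch below (a simplified view; the patch reuses the GSet machinery already used for the BlocksMap rather than a HashMap, and the names here are illustrative):

            // Simplified view of the proposed inode-ID-to-inode map in the
            // namenode. The actual proposal uses a LightWeightGSet-style
            // structure, which chains elements intrusively to save memory.
            class INodeMap {
              private final java.util.Map<Long, INode> map = new java.util.HashMap<>();

              void put(INode inode)    { map.put(inode.getId(), inode); }
              INode get(long inodeId)  { return map.get(inodeId); }
              void remove(INode inode) { map.remove(inode.getId()); }
            }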

          Additional memory consumption

          Adding all this will require additional memory in the namenode.

          • The 8-byte InodeID added to the Inode object results in a cost of 8 bytes per Inode. As proposed in HDFS-4258, this can be folded into the existing modification and access time fields.
          • Introducing the InodeID-to-Inode GSet results in additional memory of 16 bytes per Inode:
            • 8 * size of the GSet (where the size of the GSet could be as big as the number of Inodes)
            • 8 bytes per Inode for a java reference (pointer to the next element, as required by GSet)

          Inodes and related objects consume approximately 1/3 of the JVM heap in a system that is full. Inode size is ~180 bytes and this proposal adds 16-24 bytes per Inode. With this change the JVM heap needs to be increased by 3% to 4.5%. While further optimizations are possible to reduce this size, they add needless code complexity.
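
          The back-of-the-envelope arithmetic behind those percentages, under the stated assumptions (Inodes take roughly 1/3 of the heap at ~180 bytes each):

            // Rough estimate only, using the numbers above.
            double inodeShareOfHeap = 1.0 / 3.0;  // Inodes and related objects
            double inodeSizeBytes   = 180.0;      // approximate current Inode cost
            double lowIncrease  = (16.0 / inodeSizeBytes) * inodeShareOfHeap;  // ~0.030, i.e. ~3%
            double highIncrease = (24.0 / inodeSizeBytes) * inodeShareOfHeap;  // ~0.044, i.e. ~4.5%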

          Security concerns

          An InodeID is an alternate name, similar to a path. All the existing security mechanisms that apply to paths (that is, ensuring permissions are checked from the root to the Inode) will also apply to InodeID-based access.
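
          One possible shape of such a check (a sketch only; the field and method names below are illustrative, not the actual NameNode code):

            // Resolve the inode ID back to a full path and reuse the existing
            // root-to-leaf path permission check, so ID-based access is
            // enforced the same way as path-based access.
            void checkAccessById(long inodeId, FsAction access) throws IOException {
              INode inode = inodeMap.get(inodeId);             // assumed ID -> INode map
              if (inode == null) {
                throw new FileNotFoundException("No inode with id " + inodeId);
              }
              String path = inode.getFullPathName();           // walk parents up to "/"
              permissionChecker.checkPermission(path, access); // existing path-based check
            }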

          Suresh Srinivas added a comment -

          Given that the subtasks do not break the trunk, I plan to start reviewing individual jiras and committing the patches attached to subtasks. Some of these patches are already committed to trunk.

          Aaron T. Myers added a comment -

          Why not do this on a branch? That makes the most sense to me, given that the individual patches themselves don't make a lot of sense when considered individually.

          Suresh Srinivas added a comment -

          Aaron T. Myers, can you describe which of the individual patches do not make sense to you? I thought the previous comments indicated that it was not clear what the overall design is and how the pieces fit together. Now that this jira describes the overall motivation and the approach being taken, I hope there is more clarity.

          Aaron T. Myers added a comment -

          Sorry if I wasn't clear - all the patches make sense to me, it's just that several of them don't really stand on their own, so it seems like we should do the whole work on a separate branch, get the feature in shape there, and then merge it back to trunk once the whole project is completed.

          Suresh Srinivas added a comment -

          all the patches make sense to me, it's just that several of them don't really stand on their own

          I am not sure I understand this. One of the main reasons for a feature branch (at least for me) is that during development we may break trunk. But in this case that is not the case.

          I have cleaned up the list of subtasks in this jira. Hopefully the subtasks make it clearer.

          Let me add some details about individual jiras and that should help in understanding them better:

          1. HDFS-4334 - Adds a unique ID to each INode.
          2. HDFS-4346 - Refactors the code to remove duplication between INode ID generation and block ID generation.
          3. HDFS-4339 - Persists the INode ID in the fsimage.
          4. HDFS-4434 - Introduces a map of inode ID to inode so that the inode ID/file ID can be used as an identifier to address a file.
          Aaron T. Myers added a comment -

          One of the main reasons for a feature branch (at least for me) is that during development we may break trunk. But in this case that is not the case.

          At one point in time I believe the INodeID work did indeed break WebHDFS on trunk.

          Another reason for using a development branch is because the feature isn't necessarily complete without certain patches having been committed. The fact that HDFS-4339 (persist INodeIDs in the fsimage) isn't committed yet suggests that this feature won't really work as-intended until that's committed, but yet we've already committed other patches involving INodeIDs to trunk. That doesn't make much sense to me.

          I don't understand the resistance to doing this on a feature branch. What's the concern with doing so?

          Suresh Srinivas added a comment -

          At one point in time I believe the INodeID work did indeed break WebHDFS on trunk.

          That is because of a bug introduced in the change. When I say break in my previous comment, I mean breaking functionality because of an incomplete set of changes that leaves HDFS non-functional, not a bug in the committed code.

          I don't understand the resistance to doing this on a feature branch. What's the concern with doing so?

          I am not resisting it, I do not see a need for it. I believe we have two more jiras to go in. Other jiras are already in. I think moving those commits to a separate branch, adding a mere two more commits in that branch, and calling for a merge vote is an unnecessary waste of time and I want to avoid it.

          Aaron T. Myers added a comment -

          I am not resisting it, I do not see a need for it. I believe we have two more jiras to go in. Other jiras are already in. I think moving those commits to a separate branch, adding a mere two more commits in that branch, and calling for a merge vote is an unnecessary waste of time and I want to avoid it.

          Alright, go for it. I'll repeat my claim that this work should have been done on a branch to begin with, but c'est la vie.

          Todd Lipcon added a comment -

          Inode size is ~180 bytes and this proposal adds 16-24 bytes per Inode.

          How is this calculated? I see the following 5 fields:

            private byte[] name = null;
            private long permission = 0L;
            protected INodeDirectory parent = null;
            protected long modificationTime = 0L;
            protected long accessTime = 0L;
          

          for a total of 40 bytes on a 64-bit JVM. So, adding 16-24 bytes is a pretty substantial new memory use.

          I agree with ATM that this should go on a branch since it's fairly invasive. Once the branch is working, we can evaluate the benefit of the new feature vs the measured cost (both memory and additional CPU to manage this new structure)

          Todd Lipcon added a comment -

          (I guess I should add the subclass fields, in which case INodeFile has another two 8-byte fields, plus the associated array object for BlockInfo, etc)... but still seems to come in a lot less than 180 bytes.

          Suresh Srinivas added a comment -

          for a total of 40 bytes on a 64-bit JVM. So, adding 16-24 bytes is a pretty substantial new memory use.

          Here are the things that go into ~180 bytes:
          INode is an object. It comes with the cost of a 16-byte object header overhead. Members include:

          1. byte[] name - I assume typically ~56 bytes for this. That is (16 bytes object overhead, 8 byte length + bytes that make up file name, say 32)
          2. reference to byte[] name - 8 bytes
          3. long permission at the cost of 8 bytes.
          4. parent reference at 8 bytes cost
          5. modification time at 8 bytes cost
          6. accessTime at 8 bytes cost

          That is roughly ~112 bytes.

          Typically most of the INodes are INode files (I will leave the other type of inodes as an exercise).

          1. It has BlockInfo[]. This is again 16 bytes of object, 8 bytes length, say two blocks in a file with two references, with a cost of 40 bytes.
          2. It has long header that adds another 8 bytes.

          Total ~160 bytes. So it is not very far off and the number I had posted was based on what I had calculated long back.

          That said, 16-24 bytes might seem like a huge percentage (10 to 15%) of the INode size. But what is the amount of memory in the NN heap that is allocated for Inodes? Assuming Inodes make up 1/3, blocks make up another 1/3, and the remaining 1/3 is floating garbage, head room, etc., the net impact on the NN heap is 3 to 5%. That is not far off from the analysis posted above.

          I believe half of the work is already in trunk. Remaining two jiras need to go in. I believe doing a branch at this point in time is unnecessary work.

          If you are concerned about memory usage of your installs, I can add a config option and not instantiate the map.

          Suresh Srinivas added a comment -

          One more thing I forgot to add. There are many optimizations that are possible to reduce the memory consumed. They come at the cost of code complexity and not-so-clean abstractions. I would rather avoid them and go for the additional memory, given that newer machines are coming with more memory, than make the code unclean.

          Todd Lipcon added a comment -

          byte[] name - I assume typically ~56 bytes for this. That is (16 bytes object overhead, 8 byte length + bytes that make up file name, say 32)

          According to your comment here: https://issues.apache.org/jira/browse/HDFS-1110?focusedCommentId=12861548&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12861548 a typical image with ~50M files will only need ~5M unique name byte[] objects, so I think it's unfair to count the above against the inode.

          I think you're also adding an extra 8 bytes on the arrays – the array length as I understand it is a field within the 16byte object header (occupying the second half of the klassId field).

          Regardless, this seems like something that's very easy to test rather than try to solve analytically. Do you have results for the additional memory overhead of this map on a large production image? If it's truly 3-5%, that seems reasonable, but I'm afraid it may look closer to 10+% in practice.

          Suresh Srinivas added a comment -

          I think you're also adding an extra 8 bytes on the arrays – the array length as I understand it is a field within the 16byte object header (occupying the second half of the klassId field).

          If you have an authoritative source, please send me that. I cannot understand how a 16-byte object header would have a spare 8 bytes to track the array length. Some of my previous instrumentation had led me to conclude that the array length is 4 bytes for a 32-bit JVM and 8 bytes for a 64-bit JVM. See discussion here - http://www.javamex.com/tutorials/memory/object_memory_usage.shtml.

          a typical image with ~50M files will only need ~5M unique name byte[] objects, so I think it's unfair to count the above against the inode.

          That is a fair point. But my own estimate that inodes occupy 1/3rd of the java heap is also an approximation, and in practice I would think inodes occupy less than that.

          I would like to run an experiment on a large production image. But I do not have ready access to it and will have to spend time getting to it. Do you have any?

          but I'm afraid it may look closer to 10+% in practice.

          I do not think it will be close to 10%, but let's say it is. I do not see much of an issue with it. When we did some of the optimizations earlier, we were not sure how the JVM would do if the heap went close to 64G and hence wanted to keep the heap size down. But since then many large installations have successfully gone beyond that size without any issues. Smaller installations should be able to spare, say, 10% extra heap. But if that is not acceptable, here are the alternatives I see:

          1. Add configuration options to turn this feature off. Not instantiating GSet will reduce the overhead by 1/3rd. This is simple to do.
          2. Make more optimizations at the expense of code complexity. I would like to avoid this. But if it is deemed very important, with some optimizations, we can get it close to 0%.
          Todd Lipcon added a comment -

          If you have an authoritative source, please send me that

          Sure, from the JDK7 source code hotspot/src/share/vm/oops/arrayOop.hpp:

          // The layout of array Oops is:
          //
          //  markOop
          //  klassOop  // 32 bits if compressed but declared 64 in LP64.
          //  length    // shares klass memory or allocated after declared fields.
          

          Important to note that the length of arrays is 32-bit, since array.length is an int rather than a long. So given a 64-bit field for klassId, it can use 32-bits for the actual class and 32 bits for the array length.

          I would like to run an experiment on a large production image. But I do not have ready access to it and will have to spend time getting to it. Do you have any?

          Yes, I can run the experiment on a large image. Is HDFS-4434's patch ready to apply so I can test it?

          Suresh Srinivas added a comment -

          or allocated after declared fields.

          Not sure what this means though.

          HDFS-4434 patch is ready. Thanks in advance for running the tests.

          Todd Lipcon added a comment -

          // The _length field is not declared in C++. It is allocated after the
          // declared nonstatic fields in arrayOopDesc if not compressed, otherwise
          // it occupies the second half of the _klass field in oopDesc.
          static int length_offset_in_bytes() {
            return UseCompressedOops ? klass_gap_offset_in_bytes() : sizeof(arrayOopDesc);
          }

          Basically if CompressedOops are on, then klassids are only 32-bits, but there's already a 64-bit field for it, so it just uses the latter 4 bytes for the array length. Otherwise it's an extra 4 bytes that comes after the standard oop header (oopDesc). So, without compressed oops, arrays take 20 bytes base. With them (on by default on heaps <32GB since 6u18 I believe), the array header is the same size as normal objects (16 bytes).

          Will take a look at loading a big image with that patch now.

          Suresh Srinivas added a comment -

          Ran some quick tests for object sizes, using https://github.com/dweiss/java-sizeof (pretty neat stuff!)

            public static void main(String[] args) {
              System.out.println(RamUsageEstimator.sizeOf(new Object()));
              System.out.println(RamUsageEstimator.sizeOf(new Object[0]));
              System.out.println(RamUsageEstimator.sizeOf(new Object[1000000]));
            }
          

          With compressed oops on I get:
          16
          16
          4000016

          After turning it off:
          16
          24
          8000024

          Todd Lipcon added a comment -

          Neat. I'm setting up those tests now... taking a while to clone/build hadoop onto the right machine that has enough RAM.

          Todd Lipcon added a comment -

          Here's the results from the latest patch:

          Setup

          • Java 6u31 configured with a 24gb heap (-Xms24g -Xmx24g)
          • fsimage is 4.1GB on disk, a snapshot from a mid-size production cluster which runs both hbase and some MR workloads.
          • 31249022 files and directories, 26525575 blocks = 57774597 total filesystem objects.

          In each test, I started the NameNode, waited until it had loaded the image and opened its IPC port, and then used "jmap -histo:live", which issues a full GC and reports heap usage statistics.

          2.0.3-beta release

          Total heap: 7069MB

          Top consumers

           num     #instances         #bytes  class name
          ----------------------------------------------
             1:      38421509     2049194112  [Ljava.lang.Object;
             2:      26525179     1485410024  org.apache.hadoop.hdfs.server.blockmanagement.BlockInfo
             3:      19134601     1071537656  org.apache.hadoop.hdfs.server.namenode.INodeFile
             4:      16228949      753517120  [B
             5:      12113580      581451840  org.apache.hadoop.hdfs.server.namenode.INodeDirectory
             6:      19135442      484175352  [Lorg.apache.hadoop.hdfs.server.blockmanagement.BlockInfo;
             7:       1621399      403948560  [I
             8:      11895039      285480936  java.util.ArrayList
             9:             1      268435472  [Lorg.apache.hadoop.hdfs.util.LightWeightGSet$LinkedElement;
          

          Patched trunk with the map turned off

          Total heap: 7528MB (6.5% increase from 2.0.3)

          Top consumers

           num     #instances         #bytes  class name
          ----------------------------------------------
             1:      38421427     2049187584  [Ljava.lang.Object;
             2:      26525179     1485410024  org.apache.hadoop.hdfs.server.blockmanagement.BlockInfo
             3:      19134601     1377691272  org.apache.hadoop.hdfs.server.namenode.INodeFile
             4:      12113580      775269120  org.apache.hadoop.hdfs.server.namenode.INodeDirectory
             5:      16228690      753509864  [B
             6:      19135442      484175352  [Lorg.apache.hadoop.hdfs.server.blockmanagement.BlockInfo;
             7:       1654298      384726200  [I
             8:      11895040      285480960  java.util.ArrayList
             9:             1      268435472  [Lorg.apache.hadoop.hdfs.util.LightWeightGSet$LinkedElement;
          

          Patched trunk with the map turned on

          Total heap: 7696MB (8.9% increase from 2.0.3)

          Top consumers

           num     #instances         #bytes  class name
          ----------------------------------------------
             1:      38421429     2049187632  [Ljava.lang.Object;
             2:      26525179     1485410024  org.apache.hadoop.hdfs.server.blockmanagement.BlockInfo
             3:      19134601     1377691272  org.apache.hadoop.hdfs.server.namenode.INodeFile
             4:      12113580      775269120  org.apache.hadoop.hdfs.server.namenode.INodeDirectory
             5:      16228746      753515976  [B
             6:      19135442      484175352  [Lorg.apache.hadoop.hdfs.server.blockmanagement.BlockInfo;
             7:       1499494      426158720  [I
             8:             2      402653216  [Lorg.apache.hadoop.hdfs.util.LightWeightGSet$LinkedElement;
             9:      11895040      285480960  java.util.ArrayList
          

          I don't think this increased memory is necessarily unacceptable; I just wanted to see a true measurement of the overhead instead of hypotheses. It looks like the increased memory cost is about twice what was estimated above.

          Suresh Srinivas added a comment -

          Todd Lipcon Thanks for running the tests.

          I personally am not concerned about this increased memory. If there are others with concerns, I can try reducing memory consumption further at the expense of more complex code. Thoughts?

          Suresh Srinivas added a comment -

          BTW my calculation of increased memory is against the total java heap allocated to the process rather than the memory used in the old generation alone. That is a better way to quantify the impact on users, right?

          Suresh Srinivas added a comment -

          BTW my calculation of increased memory is against the total java heap allocated to the process rather than the memory used in the old generation alone. That is a better way to quantify the impact on users, right?

          Sorry, my previous comments may not be clear to everyone. An increase of 625MB, from 7069MB to 7696MB, is 8.9%; the way I was quantifying it was as a percentage of the entire java heap memory, that is 625MB out of 24G, which is 2.6%.

          Suresh Srinivas added a comment -

          Any further comments? I plan to wrap up HDFS-4334 soon. If there are no further concerns, I do not plan on optimizing memory further at the expense of code complexity.

          Daryn Sharp added a comment -

          I've only skimmed this jira, but a 9% increase is fairly substantial for large namespaces. Are there any performance metrics available?

          Suresh Srinivas added a comment -

          9% increase is fairly substantial for large namespaces.

          Please look at the overall increase in memory usage instead of increase over used memory. As I said, that is close to 2.6%.

          Are there any performance metrics available?

          I do not see much concern here. In fact I removed the flag to turn this feature on or off. If you think based on the code this is a concern, I could add the flag back. What metrics would you like to see?

          Daryn Sharp added a comment -

          Maybe something simple like GridMix to get a rough feel for the overhead of the extra resolution. I don't expect it to be much, but it'd be nice to know.

          Kihwal Lee added a comment -

          Please look at the overall increase in memory usage instead of increase over used memory.

          Your point would be valid only if the overhead was entirely a fixed amount (e.g. GSet). Since the extra memory consumption increases as the size of namespace grows, factoring the arbitrary max heap size into this can be misleading. But I agree that the 9% figure does not have an absolute meaning either. If the inode-to-block ratio is different, the number will be different. For the clusters I have seen, it will be a lower number. The GSet used for InodeID to INode map is also semi-fixed. Is it allocated similarly to BlocksMap?

          In any case, I would not call this insignificant. We have a namenode which will not work well if we upgrade to a release with this feature since it will need extra 4-6GB for the steady-state operation. Even if it could absorb the extra memory requirement, we would have to tell users that the namespace limit is X% worse.

          Simply saying the overhead is insignificant won't convince users. We should explain why the benefit from having this feature justifies the overhead. I don't think on/off switch is necessary.

          Suresh Srinivas added a comment -

          The GSet used for InodeID to INode map is also semi-fixed. Is it allocated similarly to BlocksMap?

          Yes. Please see the patch in HDFS-4434. About 1% of heap is used for the GSet.

          Simply saying the overhead is insignificant won't convince users. We should explain why the benefit from having this feature justifies the overhead. I don't think on/off switch is necessary.

          I think the assertion here is not that the overhead is insignificant. Depending on the details of how the namespace of a system is laid out, I would think this would be anywhere from 2 to 5%.

          As far as the benefits go, I laid this out in the main description:


          This helps in several use cases:

          1. HDFS can evolve to support ID-based protocols such as NFS. We plan to add an experimental NFS V3 gateway to HDFS using this mechanism. Will post a github link soon.
          2. InodeID can be used by tools such as distcp to track a single instance of a file, for example for caching data or for checking whether a file has been modified.
          3. A path cannot identify a unique instance of a file. This causes issues as described in HDFS-4258 and HDFS-4437. It has also been a requirement of many other jiras such as HDFS-385.
          4. Using InodeID as an identifier instead of a path can be more efficient than path-based access.

          We have a namenode which will not work well if we upgrade to a release with this feature since it will need extra 4-6GB for the steady-state operation. Even if it could absorb the extra memory requirement, we would have to tell users that the namespace limit is X% worse.

          Is this because namenode does not have RAM? With this change, it is expected that NN is allocated more memory, say 5%. If this is done I am not sure why users should be told namespace limit is X% worse?

          My rationale, repeating what I said earlier, is that machines are becoming available with more RAM. Adding 5% JVM heap should not be a problem. In fact most namenodes are configured with enough head room already and might not even need a change. But if this is a big concern, I am okay making additional changes to bring the memory consumption down close to zero.

          Daryn Sharp added a comment -

          Perhaps ASN.1 encoding the long for the inode id will significantly decrease the memory consumption?

          Suresh Srinivas added a comment -

          Perhaps ASN.1 encoding the long for the inode id will significantly decrease the memory consumption?

          Can you add more details on how this would decrease memory consumption? BTW the inodeID was added as a part of HDFS-4334. See the discussion about how to reduce the impact of adding the inode ID - https://issues.apache.org/jira/browse/HDFS-4258?focusedCommentId=13508432&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13508432.

          But I am not sure if that optimization is necessary at the expense of code. Thoughts?

          Kihwal Lee added a comment -

          With this change, it is expected that NN is allocated more memory, say 5%. If this is done I am not sure why users should be told namespace limit is X% worse?

          In many use cases, allocating more heap may not be a problem since machines typically have more memory available. But if you approach from the view point of owners of existing hardware that was spec'ed to hold certain size of namespace, it can be viewed as a decrease of capacity. I am not saying it is a showstopper. I just felt it should be given more thought.

          I will review the implementation and try to understand your concerns about more memory efficient design.

          Suresh Srinivas added a comment -

          But if you approach from the view point of owners of existing hardware that was spec'ed to hold certain size of namespace, it can be viewed as a decrease of capacity.

          Again, I do not believe anyone runs the NN very tightly configured, given the nature of garbage collection. That said, to make further progress, the following optimizations can be done:

          1. Initialize the map only when this feature is enabled. Should take away roughly 1/3 of extra memory.
          2. Reuse existing bits in INodeId - https://issues.apache.org/jira/secure/EditComment!default.jspa?id=12618468&commentId=13508432. Should take away roughly 1/3 of extra memory.
          3. Use first block ID of the file (after ensuring even empty file has an associated block) as the InodeID. This is very ugly and mixing two abstractions that should not be mixed. I am reluctant to make this optimization.

          My vote is to keep the code simple, abstractions clean. If folks think the above optimizations is worth pursuing, I will update the patch.

          Brandon Li added a comment -

          I am not saying it is a showstopper. I just felt it should be given more thought.

          In many cases, a trade-off is involved with the introduction of a new feature or enhancement.
          This JIRA was forked from HDFS-4258 and the discussion/experiment has been going on for more than 4 months.

          As shown in the theoretical analysis and experiment results, the memory overhead of this change is not significant. It doesn't seem worthwhile for now to complicate the NameNode code with the extra optimizations.

          Daryn Sharp added a comment -

          Perhaps ASN.1 encoding the long for the inode id will significantly decrease the memory consumption?

          Can you add more details on how this would decrease memory consumption?

          If the long is encoded as a variable length byte array, it should take a long time to exceed 4-5 bytes. With minimal effort & complexity, the memory increase would nominally be cut in half for many deployments. Just a suggestion.

          Arpit Agarwal added a comment -

          If the long is encoded as a variable length byte array, it should take a long time to exceed 4-5 bytes. With minimal effort & complexity, the memory increase would nominally be cut in half for many deployments.

          This would save space when serializing the fsImage. I am not sure if we can reduce in-memory usage below the size of a primitive long since the byte array is an object.
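
          To make the variable-length idea concrete, here is a small self-contained sketch of a protobuf-style base-128 varint encoder for a long; as noted, this only shrinks the serialized form (fsimage or wire), while the in-memory field remains a primitive long:

          import java.io.ByteArrayOutputStream;

          public class VarLongDemo {
            // Writes v using 7 bits per byte, with the high bit set on all but the last byte.
            static void writeVarLong(ByteArrayOutputStream out, long v) {
              while ((v & ~0x7FL) != 0) {
                out.write((int) ((v & 0x7F) | 0x80));
                v >>>= 7;
              }
              out.write((int) v);
            }

            public static void main(String[] args) {
              ByteArrayOutputStream out = new ByteArrayOutputStream();
              writeVarLong(out, 123456789L); // any id below 2^28 fits in 4 bytes
              System.out.println(out.size() + " bytes instead of 8"); // prints "4 bytes instead of 8"
            }
          }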

          Suresh Srinivas added a comment -

          For people who are following this jira, HDFS-4434 is now ready for review and commit. Please provide any feedback you have soon; otherwise, comments that come late will have to be incorporated in a subsequent jira.

          Brandon Li added a comment -

          Closing this JIRA since all its sub-issues have been resolved.

          Suresh Srinivas added a comment -

          I am planning to push the subtasks of this jira to release 2.0.

          Suresh Srinivas added a comment -

          I have merged the subtasks of this jira into branch-2.

          Konstantin Shvachko added a comment -

          Posted a request for the bases for porting this to branch 2.0.5 in HDFS-4434.

          suresh> What is the concern?

          My concern is that you committed an incompatible change, which is a new feature and a large change, into the stabilization branch without a vote or a release plan discussed with the community.
          This is a bad practice in general, and I think it is a particularly wrong move now, because people are discussing the stabilization of 2.0.5.
          This feature totals about 150K of code in patches (counting subtasks only). That does not help stabilization. And you didn't give any reasons for the merge.

          I would like to ask to revert this merge from branch 2.0.5 and follow the procedures for merging features into new release branches if you decide to proceed.

          Suresh Srinivas added a comment -

          My concern is that you committed an incompatible change

          Konstantin, I am not sure if you looked at the release notes. This change disallows a file or directory named .reserved under the root. That is the only reason I marked it as incompatible; it is not related to wire or API incompatibility. That said, one of the goals for 2.0.5 is to drive towards a state where incompatible changes are not allowed after it.

          which is a new feature and a large change, into the stabilization branch without a vote or a release plan discussed with the community.

          I agree that this is a new feature. Committers routinely promote changes that they consider okay to branch-2. I believe this does not add instability. Let me know if you disagree based on a code review/testing.

          Also, merging to branch-2 is in a lot of cases done based on a committer's judgement. Please look at the various other jiras that were merged into branch-2 without a vote thread. I do not consider this a large feature. However, for the Snapshot feature, I would have brought that up in a release thread.

          And you didn't give any reasons for the merge.

          I think there is enough motivation for the feature posted in the jira.

          I would like to ask to revert this merge from branch 2.0.5 and follow the procedures for merging features into new release branches if you decide to proceed.

          I have spent more than 12 hours merging the chain of jiras required and resolving conflicts before getting to the 4 changes that introduced the file id. Is your concern about HDFS-4434 or all the related changes?

          Nathan Roberts added a comment -

          Sorry, this is a really late comment, but I'd really like to see some performance numbers before and after. While a 6.5% increase in overall heap size is not massive, my main concern is the 25% increase in a very core data structure within the NN (1.07G->1.37G in Todd's measurement of INodeFile). This could cause significant cache pollution and therefore could have a very measurable impact on performance. I don't know for sure that it will, but it seems reasonable to verify.

          Suresh Srinivas added a comment - edited

          Nathan Roberts, what performance test would you like to see run with and without this change? NNBench?

          Suresh Srinivas added a comment -

          I have reverted HDFS-4434 from branch-2. Will post the performance numbers and then commit the change to branch-2, based on that discussion.

          Sanjay Radia added a comment -

          Nathan. A question.
          Suresh is willing to do the performance benchmark, but I am trying to understand where you are coming from. Yahoo and FB create very large namespaces by simply buying more memory and increasing the size of the heap. Do you worry about cache pollution when you create 50K more files? Given that the NN heap (many GBs) is so much larger than the cache, does the additional inode and inode-map size impact the overall system performance? Suresh has argued that a 24GB heap grows by 625MB. Looking at the growth in memory of this feature as a percentage of the total heap size is a more realistic way of looking at the impact of the growth than the growth of an individual data structure like the inode.

          IMHO, not having an inode-map and inode number was a serious limitation in the original implementation of the NN. I am willing to pay for the extra memory given the value the inode-id and inode-map bring (as described by suresh in the beginning of this Jira). Permissions, access time, etc. added to the memory cost of the NN and were accepted because of the value they bring.

          Konstantin Shvachko added a comment -

          Suresh, whatever the reason for the incompatibility, it should go through the approval process.
          You also committed the LayoutVersion change HDFS-4296. Now it requires an upgrade.

          > one of the goal for 2.0.5 is drive towards a state where incompatible changes are not allowed after it.

          That was the goal for Hadoop 0.20.
          I thought the goal for 2.0.5 is stabilization.

          > Also merging to branch-2 in a lot of cases is done based on a committer's judgement.

          I think it is wrong. Especially for the stabilization release.

          > I think there is enough motivation for the feature posted in the jira.

          Not arguing about the value of the feature, but about its necessity for 2.0.5.

          > Is your concern about HDFS-4434 or all the related changes?

          Most of them. I would have reviewed if I had a proper warning.
          So again why is it necessary for 2.0.5?

          Suresh Srinivas added a comment -

          That was the goal for Hadoop 0.20.
          I thought the goal for 2.0.5 is stabilization.

          I am not sure if 0.20 is a typo. If it is not, I have a hard time parsing that statement. See the previous discussion about 2.0.4-beta (now called 2.0.5) in this thread:
          http://hadoop.markmail.org/thread/v44nqp466p76jpkj

          I think it is wrong. Especially for the stabilization release.

          I disagree. I want to get some of the features I have been working on into this release. I think the goal of this release is to get API and wire compatibility stable.

          Most of them. I would have reviewed if I had a proper warning.

          I am not sure what kind of warning you are talking about. HDFS-4434 has been in development for a long time with more than 32 iterations of the patch.

          So again why is it necessary for 2.0.5?

          The Snapshot and NFS features depend on this. I would like to see it become available in 2.0.5.

          Daryn Sharp added a comment -

          I don't think Nathan and I are questioning the utility of the feature, but we need to get a feel for the possible performance impact. If there is a significant degradation then it will delay our adoption of 2.x until it's optimized.

          I think a good performance test is to create a namespace of 150M paths. Flood the NN with thousands of concurrent file & directory adds/deletes per second throughout the namespace. Hopefully there is an existing benchmark with those properties.

          Suresh Srinivas added a comment -

          I think a good performance test is to create a namespace of 150M paths. Flood the NN with thousands of concurrent file & directory adds/deletes per second throughout the namespace. Hopefully there is an existing benchmark with those properties.

          Other than the increased memory, I think we are talking about hash map entry addition and deletion during file adds and deletes. I am not sure I understand the cache pollution part of the performance impact, given that the namenode core objects run into GBs in a large setup.

          I am currently running some Slive tests. But I do not currently have the bandwidth to set up a namenode with 150M paths (that would require more than 64GB of JVM heap). Do you have some bandwidth to do these tests?

          Konstantin Shvachko added a comment -

          Suresh,
          0.20 is not a typo. You should parse it as sarcasm, sorry. Wire compatibility was a target for many previous releases and the train is still there.
          We clearly have a disagreement about what should be in the release. Other people may have other opinions. And that is my point.
          All I ask is to play by the rules. Make a release plan and put it to a vote. See the bylaws under "Release Plan". I'll be glad to discuss your plan.
          Here you act like it's your own branch where you commit what you want and nobody else cares.
          Does it make sense?

          Suresh Srinivas added a comment - edited

          Here you act like it's your own branch where you commit what you want and nobody else cares.

          I fail to understand the need for such a hostile tone. That said, please look at the many small features, improvements and numerous bug fixes that were committed by me and other committers into many of the 2.0.x releases, without any discussion or need for a vote, entirely based on their judgement.

          To be clear, a committer can commit to any branch. It is up to the release manager to include it or not in a release.

          Instead of stating your objection to a change as being big, 150K lines of code, etc., it would be great if you could really look at the patch and express more concrete technical concerns related to stability.

          I have reverted HDFS-4434. I have also responded on the thread related to 2.0.5 on including the features that many have been working for many months.

          It seems to me that suddenly, in the past week or so, you have decided that stability is the only paramount thing, disregarding all the discussions that have happened. Please see my earlier comment on the discussion related to API and wire protocol stability that we had months ago.

          Nathan Roberts added a comment -

          Suresh is willing to do the performance benchmark, but I am trying to understand where you are coming from. Yahoo and FB create very large namespaces by simply buying more memory and increasing the size of the heap.

          This is not always possible. Some of our namenodes are running at the maximum configuration for the box (maximum memory, maximum heap, near maximum namespace). For these clusters, upgrading to this feature will require new boxes.

          Do you worry about cache pollution when you create 50K more files?

          I don't worry about cache pollution when I create 50K more files. What's important is the size of the working set. Inodes are a very popular object within the NN; if inodes make up a significant part of our working set, then it matters. I don't know whether this is the case or not, which is why I think it makes sense to run some benchmarks to make sure we don't see any ill effects. With the introduction of YARN, the central RM is rarely the bottleneck. Now it's much more common for the NN to be the bottleneck of the cluster, and slowing down the bottleneck always needs to be looked at carefully.

          Given that the NN heap (many GBs) is so much larger than the cache, does the additional inode and inode-map size impact the overall system performance?

          Good question. Let's find out.

          Suresh has argued that a 24GB heap grows by 625MB.

          I was using the numbers Todd gathered, where a 7G heap grew by 600MB. When we looked at one of our key clusters, we calculated something like a 7.5% increase.

          Looking at the growth in memory of this feature as a percentage of the total heap size is a more realistic way of looking at the impact of the growth than the growth of an individual data structure like the inode.

          Maybe.

          IMHO, not having an inode-map and inode number was a serious limitation in the original implementation of the NN. I am willing to pay for the extra memory given the value the inode-id and inode-map bring (as described by suresh in the beginning of this Jira). Permissions, access time, etc. added to the memory cost of the NN and were accepted because of the value they bring.

          Certainly agree it is a limitation. We just need to make sure we fully quantify all of the costs.

          Konstantin Shvachko added a comment -

          > hostile tone.

          I apologize.
          I guess what I really wanted to say is that it is hostile to commit incompatible changes into a stabilization branch before the release plan is proposed.

          > would be great if you can really look at the patch

          You know I did.
          Thanks for responding on the thread related to 2.0.5. I understand the plan much better.
          I appreciate your reverting HDFS-4434.

          There is still an incompatible change, HDFS-4296. It is listed under new features for some reason.
          Do you still need HDFS-4296 once HDFS-4434 is reverted?
          We have not changed the LayoutVersion since branch 0.23.

          Suresh Srinivas added a comment -

          There is still an incompatible change HDFS-4296. It is listed in new features for some reason.

          It is not incompatible and hence not marked as incompatible in the jira or CHANGES.txt. It is currently listed as a New Feature in CHANGES.txt. I do not think it should be listed under the New Features section (though it does not qualify as an Improvement, Bug fix, or any of that). I will move it to the bug fix section.

          Do you still need HDFS-4296 once HDFS-4434 is reverted?

          It is needed because it corresponds to a layout version reserved in branch-1 for concat. It is not related to HDFS-4434.

          Suresh Srinivas added a comment -

          I ran Slive tests. Even with a very small amount of data written, I could not find a perceptible difference between the test runs, given that any additional time in NN methods is dwarfed by the overall time of calling the NN over RPC, etc.

          So I decided to run NNThroughputBenchmark. For folks new to it, it is a micro benchmark that does not use RPC and directly executes operations on the namenode class. Hence it gives comparisons sharply limited to NN method calls alone. I ran the NNThroughputBenchmark to create 100K files using 100 threads in each iteration, using the command below:

          bin/hadoop jar share/hadoop/hdfs/hadoop-hdfs-2.0.5-SNAPSHOT-tests.jar org.apache.hadoop.hdfs.server.namenode.NNThroughputBenchmark -op create -threads 100 -files 100000 -filesPerDir 100 
          

          Without this patch:

          Operations Elapsed OpsPerSec AvgTime
          100000 20327 4919.565110444237 20
          100000 19199 5208.604614823688 19
          100000 19287 5184.839529216571 19
          100000 19128 5227.9381012128815 19
          100000 19082 5240.540823813018 19
          100000 18785 5323.396326856535 18
          100000 18947 5277.880403230063 18
          100000 18963 5273.427200337499 18
          100000 19206 5206.706237634073 19
          100000 19434 5145.621076463929 19
          Average 19235.8 5200.851942 18.8

          With this patch:

          Operations Elapsed OpsPerSec AvgTime
          100000 20104 4974.134500596896 19
          100000 19498 5128.731151913017 19
          100000 19449 5141.652527122217 19
          100000 19530 5120.327700972863 19
          100000 20067 4983.305925150745 19
          100000 19703 5075.369233111709 19
          100000 19595 5103.342689461598 19
          100000 19418 5149.860953754249 19
          100000 19932 5017.057997190447 19
          100000 20596 4855.311711011847 20
          Average 19789.2 5054.909439 19.1

          With this patch + an additional change to turn off INodeMap:

          Operations Elapsed OpsPerSec AvgTime
          100000 19615 5098.139179199592 19
          100000 19349 5168.225748100677 19
          100000 19136 5225.752508361204 19
          100000 19347 5168.760014472528 19
          100000 20096 4976.114649681529 19
          100000 19248 5195.344970906068 19
          100000 18916 5286.529921759357 18
          100000 19217 5203.7258677212885 19
          100000 20105 4973.887092762994 20
          100000 19882 5029.675082989639 19
          Average 19491.1 5132.615504 19
          Suresh Srinivas added a comment -

          I made changes to the code to reuse the byte[][] pathComponents for file creation (and made some optimizations in that method; there are other optimizations available in terms of permission checks that I did not venture to do). The throughput with those partial optimizations is:

          Operations Elapsed OpsPerSec AvgTime
          100000 19591 5104.384666428462 19
          100000 18969 5271.759186040382 18
          100000 19206 5206.706237634073 19
          100000 18652 5361.35535063264 18
          100000 19218 5203.455094182537 19
          100000 19179 5214.036185411127 19
          100000 19302 5180.810278727593 19
          100000 19388 5157.829585310501 19
          100000 19099 5235.876223886067 19
          100000 19591 5104.384666428462 19
          Average 19219.5 5204.059747 18.8
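
          A hedged sketch of the pathComponents reuse described above: split the path into byte[][] components once, before taking the namesystem lock, and pass the array down instead of re-splitting the string inside the synchronized section. The names here are illustrative, not the actual FSDirectory code.

          import java.nio.charset.StandardCharsets;

          public class PathComponentsDemo {
            // Split "/a/b/c" into {"a", "b", "c"} once, outside the lock.
            static byte[][] getPathComponents(String path) {
              String[] parts = path.substring(1).split("/");
              byte[][] components = new byte[parts.length][];
              for (int i = 0; i < parts.length; i++) {
                components[i] = parts[i].getBytes(StandardCharsets.UTF_8);
              }
              return components;
            }

            public static void main(String[] args) {
              byte[][] components = getPathComponents("/user/foo/bar.txt");
              // The same array can then be reused for both path resolution and the
              // reserved-name check, instead of splitting the string twice.
              System.out.println(components.length + " components"); // prints "3 components"
            }
          }
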
          Suresh Srinivas added a comment -

          Attaching a patch to give an idea of how to reuse the path components added in HDFS-4434.

          Suresh Srinivas added a comment - edited

          Here is NNBench for delete operations (run with 100 threads running simultaneously):

          Operations Elapsed OpsPerSec AvgTime
          100000 19243 5196.694902 19
          100000 18598 5376.92225 18
          100000 17819 5611.987205 17
          100000 17953 5570.099705 17
          100000 18077 5531.891354 18
          100000 17948 5571.651437 17
          100000 18080 5530.973451 18
          100000 18032 5545.696539 18
          100000 18431 5425.641582 18
          100000 17735 5638.567804 17
          Average 18191.6 5500.01262 17.7
          Operations Elapsed OpsPerSec AvgTime
          100000 18029 5546.619336 17
          100000 18527 5397.527932 18
          100000 18164 5505.395287 18
          100000 18486 5409.49908 18
          100000 18053 5539.245555 18
          100000 18313 5460.601758 18
          100000 18299 5464.779496 18
          100000 17878 5593.466831 17
          100000 18178 5501.155243 18
          100000 18084 5529.750055 18
          Average 18201.1 5494.804057 17.7
          Suresh Srinivas added a comment - edited

          Summary of results in the tests:

          1. File create tests - these perform additional reserved name processing, inode map addition and reserved name checks. This is where the maximum additional work from the patch is being done.
            • In the micro benchmark, by just calling the create-file related methods, the time went from 19235.8 to 19789.2, roughly a 2.8% difference. This can be further reduced to 1.3% by turning off the map. The patch moves splitting paths into components outside the lock. Based on this, further optimizations are possible that improve throughput by reducing the synchronized sections. The end result with those optimizations can make running times much smaller than what they are today.
            • I would also point out that this is a micro benchmark. The % difference observed here will be dwarfed by RPC times, network round-trip time, etc. Also, the system will spend time on other operations which should not be affected by this patch.
          2. File delete tests - these perform reserved name processing and only inode map deletion.
            • There is very little difference in the benchmark results.
          Suresh Srinivas added a comment - edited

          Given the above tests, here are the issues that were brought up:

          1. Introducing an incompatible change
            • This is not a major incompatibility. As I said earlier, creating a file or directory named /.reserved is not allowed. That said, this should get into 2.0.5, given that its main goal is compatibility.
          2. This patch could be destabilizing
            • This patch adds an Inode map and support for a path scheme which allows addressing files by inode. Most of the code added in this patch is to support the new addressing mechanism and the extensive unit tests associated with it. The regular code path should be largely unaffected by this, with the exception of adding and deleting entries in the inode map. Please bring up any concerns that I might have overlooked.
          3. Performance impact - based on the results, there is very little performance impact. I see two options:
            • The difference observed in the microbenchmarks amounts to a much smaller difference in a real system, and only for a few write operations such as create. Hence, is it acceptable?
            • Make further optimizations to reduce the synchronized section size based on the mechanism added in this patch. Nathan Roberts, if you feel this is important, I will undertake the work of optimizing this. Daryn Sharp also expressed interest in it; not sure if he has the bandwidth.

          Given this, I would like to merge this into branch-2.0.5. I hope the concerns people expressed have been addressed.
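
          As a concrete illustration of the inode addressing mentioned in point 2, a client-side sketch is shown below. The "/.reserved/.inodes/<id>" layout is an assumption made here for illustration; check the HDFS-4434 patch for the exact reserved-path format.

          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.fs.FileStatus;
          import org.apache.hadoop.fs.FileSystem;
          import org.apache.hadoop.fs.Path;

          public class InodePathDemo {
            public static void main(String[] args) throws Exception {
              FileSystem fs = FileSystem.get(new Configuration());

              long inodeId = 16392L; // hypothetical id recorded earlier, e.g. by a tool like distcp
              Path byId = new Path("/.reserved/.inodes/" + inodeId);

              // Resolves the same inode even if the file was renamed after the id was recorded.
              FileStatus status = fs.getFileStatus(byId);
              System.out.println(status.getPath() + " len=" + status.getLen());
            }
          }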

          Tsz Wo Nicholas Sze added a comment -

          The performance numbers look good. Since the RPC time is not counted, a small percentage difference is nothing. Besides, the Inode ID feature is very useful. It also helps with implementing the Snapshot feature.

          +1 on merging it to branch-2.

          Suresh Srinivas added a comment -

          Konstantin Shvachko, Nathan Roberts, Daryn Sharp, please let me know if your concerns are addressed. If I do not hear back, I plan on committing this change by today or tomorrow.

          Nathan Roberts added a comment -

          Thanks for running some basic performance tests! Looks like minimal impact.

          Daryn Sharp added a comment -

          100K files via 100 threads seems like a very small sampling when we're running namespaces well over 100M. I think the only detail that might make performance worse is how well the inode map performs as the bucket chains get longer. If it's a problem we can probably fix it later.

          I did notice that unprotectedConcat appears to leak inodes in the map - it unlinks the concat'ed files but doesn't remove them from the map. The business logic for keeping the inode map in sync with the namespace is high enough up the call stack that it makes it a bit tough to prove all delete paths are safe, so you may want to double check.
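
          A hedged illustration of the invariant being described, with made-up names rather than the actual FSDirectory code: any operation that unlinks inodes from the namespace tree (delete, concat) must also drop them from the inode-id map, or the map leaks entries.

          import java.util.HashMap;
          import java.util.List;
          import java.util.Map;

          class InodeMapSyncSketch {
            static class Inode {
              final long id;
              Inode(long id) { this.id = id; }
            }

            private final Map<Long, Inode> inodeMap = new HashMap<Long, Inode>();

            // Sketch of concat: source files are merged into target and unlinked from the tree.
            void concat(Inode target, List<Inode> srcs) {
              for (Inode src : srcs) {
                // ... move blocks from src to target, unlink src from its parent ...
                inodeMap.remove(src.id); // without this step the map entry is leaked
              }
            }
          }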

          Daryn Sharp added a comment -

          Might want to correct the misspelling: remvoedAllFromInodesFromMap

          Suresh Srinivas added a comment -

          Thanks Nathan Roberts and Daryn Sharp for commenting back.

          100K files via 100 threads seems like a very small sampling when we're running namespaces well over 100M. I think the only detail that might make performance worse is how well the inode map performs as the bucket chains get longer. If it's a problem we can probably fix it later.

          100 threads is quite considerable and matches well with a typical big cluster's RPC handler count. Also, the inodeMap size is computed as a percentage of total memory, which means it is sized based on the namenode heap size. I agree that the performance impact should be minimal, and we should be able to fix it if we find any issues.

          I did notice that unprotectedConcat appears to leak inodes in the map - it unlinks the concat'ed files but doesn't remove them from the map.

          Nice catch. Created HDFS-4785.

          ...so you may want to double check.

          Yes. I will run through one more review.

          Might want to correct the misspelling: remvoedAllFromInodesFromMap

          Will be addressed in another jira.
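
          A rough sketch of the sizing remark above, deriving the map capacity from a percentage of the maximum heap; the 1% figure, the per-entry estimate, and the power-of-two rounding are assumptions for illustration, not the actual namenode code.

          public class CapacityDemo {
            // Derive a hash table capacity from a percentage of the maximum heap (-Xmx).
            static int computeCapacity(double percentage, int bytesPerEntry) {
              long maxHeap = Runtime.getRuntime().maxMemory();
              long entries = (long) (maxHeap * percentage / 100.0) / bytesPerEntry;
              // Round down to a power of two so lookups can mask instead of using modulo.
              int capacity = Integer.highestOneBit((int) Math.min(entries, Integer.MAX_VALUE));
              return Math.max(capacity, 16);
            }

            public static void main(String[] args) {
              // For example, a 24 GB heap at 1% and ~48 bytes per entry gives a few million slots.
              System.out.println("inode map capacity: " + computeCapacity(1.0, 48));
            }
          }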

          Suresh Srinivas added a comment -

          I have merged HDFS-4434 back into branch-2.

          Arpit Agarwal added a comment -

          Resolving to avoid spurious version updates.


            People

            • Assignee:
              Brandon Li
              Reporter:
              Brandon Li
            • Votes:
              0
              Watchers:
              17
