HDFS-1110 (Hadoop HDFS)

Namenode heap optimization - reuse objects for commonly used file names

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.22.0
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      There are a lot of common file names used in HDFS, mainly created by mapreduce, such as file names starting with "part". Reusing the byte[] corresponding to these recurring file names will save significant heap space currently used to store the file names in millions of INodeFile objects.
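      The gist, as a minimal sketch (hypothetical helper, not the committed code): map each distinct name to a single shared byte[] so that repeated names stop costing one array per INodeFile.

        import java.util.HashMap;
        import java.util.Map;

        // Hypothetical illustration of the reuse idea: identical names share one byte[].
        class NameInterner {
          private final Map<String, byte[]> shared = new HashMap<String, byte[]>();

          // Returns a canonical byte[] for the given file name; repeated names reuse it.
          byte[] intern(String name) {
            byte[] bytes = shared.get(name);
            if (bytes == null) {
              bytes = name.getBytes();   // first occurrence: store a single copy
              shared.put(name, bytes);
            }
            return bytes;                // later occurrences: the same array instance
          }
        }

      With millions of INodeFile objects named e.g. part-00000, this collapses millions of identical arrays into one array per distinct name.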

      Attachments

      1. hdfs-1110.2.patch
        16 kB
        Suresh Srinivas
      2. hdfs-1110.3.patch
        15 kB
        Suresh Srinivas
      3. hdfs-1110.4.patch
        16 kB
        Suresh Srinivas
      4. hdfs-1110.5.patch
        20 kB
        Suresh Srinivas
      5. hdfs-1110.6.patch
        20 kB
        Suresh Srinivas
      6. hdfs-1110.patch
        12 kB
        Suresh Srinivas
      7. hdfs-1110.y20.patch
        13 kB
        Suresh Srinivas

        Activity

        Suresh Srinivas created issue -
        Suresh Srinivas added a comment -

        Approach:

        • Commonly used file names are defined by regexes in a config file. The config file is preconfigured with the part-00.* and part-?-00.* regexes for file names created by mapreduce. This covers part-00000 to part-00999, part-m-00000 to part-m-00999, and part-r-00000 to part-r-00999.
        • When creating an INodeFile, for names that match a regex, add an entry mapping the file name to its byte[] the first time it is seen. For subsequent creations of an INodeFile with the same name, reuse the existing byte[]. (A minimal sketch of this check follows the list below.)
        • On clusters where there are other common file names, those names can be added to the config file by the cluster admin.
        • A maximum size to which the dictionary can grow will be set, to prevent a poor choice of regex (for example .*) from overusing the heap.
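        A rough sketch of the regex check described above, assuming java.util.regex and the two patterns quoted (the class name is hypothetical):

          import java.util.regex.Pattern;

          // Sketch only: a real deployment would read the patterns from the config file.
          class CommonNameMatcher {
            // part-00.* plus a single-character wildcard form; note that the glob-style
            // "part-?-00.*" quoted above is written "part-.-00.*" in java.util.regex.
            private static final Pattern[] PATTERNS = {
                Pattern.compile("part-00.*"),
                Pattern.compile("part-.-00.*")
            };

            static boolean isCommonName(String name) {   // e.g. true for "part-m-00042"
              for (Pattern p : PATTERNS) {
                if (p.matcher(name).matches()) {
                  return true;                           // candidate for the dictionary
                }
              }
              return false;
            }
          }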

        Alternative approach:

        • During startup, while loading the fsimage, the number of times each file name occurs can be counted (which uses a lot of heap) and the dictionary can be set up with the top N recurring names.

        The alternative approach has the advantage that a regex file defining the names to be added to the dictionary is not required. However, it does not work when the namenode starts fresh, or when recurring names show up only after startup.

        I am planning to go with approach 1.

        Suresh Srinivas added a comment -

        preliminary patch.

        Suresh Srinivas made changes -
        Field Original Value New Value
        Attachment hdfs-1110.patch [ 12442881 ]
        dhruba borthakur added a comment -

        Hi Suresh, awesome idea. Can we somehow make this automatic (instead of yet another regex configuration that an administrator has to set up)? What if we make the NN do some periodic housekeeping that looks for the top-N matching path components?

        Also, do you have a measurement of how much space this could save on your cluster?

        Suresh Srinivas added a comment -

        Here is an analysis of an fsimage from one of Yahoo's production clusters with 54 million files:

        File names used > 100000 times:           24
        File names used 10001 to 100000 times:    467
        File names used 1001 to 10000 times:      4335
        File names used 101 to 1000 times:        40031
        File names used 10 to 100 times:          403975
        File names used 2 to 9 times:             606579
        File names used exactly once:             4114531
        Total distinct file names:                5169942

        The dictionary stores only names that are used more than 10 times, to benefit from reuse. Names used fewer than 10 times may not yield much saving, considering each cached name is stored in the HashMap as a HashMap.Entry (28 bytes) plus other HashMap overhead.

        Given that, my proposal is:

        1. From my previous proposal, the name dictionary will use the regexes listed in a config file "hdfs-filenames-regex" to track common file names. This file is set up with part-.*. This will be an admin-configured file.
        2. Build a tool to generate the list of names used more than 10 times from the fsimage. The generated list excludes names already covered by the regexes, to reduce the size of the list and its loading time. This list is saved as the config file "hdfs-filenames".
        3. NameDictionary loads "hdfs-filenames-regex" and "hdfs-filenames". The NN stores the list of regexes and a HashMap from name (String) to the internal object (byte[]). A file name is first looked up in the map. If the name is not found, a check is made to see whether it matches any regex. A name matching a regex is added to the map.

        Optional:

        1. In the longer run:
          • the tool runs offline periodically (on any machine) on the namenode image.
          • the namenode could refresh its dictionary with the new list of common names. It could lazily walk the directory structure and update each INodeFile to reuse the cached byte[].
        Suresh Srinivas added a comment -

        Dhruba, see my comments on what can be done in the longer run. I feel the analysis of which names are common can be done offline to reduce the load on the NN. Also, if you have ideas on doing it online, let me know.

        Suresh Srinivas added a comment -

        Per the above table, 448832 distinct file names account for ~47 million files. So instead of 47 million byte[] we should be able to use just 448832 byte[]. Assuming a file name size of 10 bytes, this translates to 4.7GB for a namenode with a 40G heap.

        Jakob Homan added a comment -

        Build a tool to generate list names that is used more than 10 times from fsimage.

        Also, we don't actually have to build a separate tool, as the offline image viewer can quickly be extended to provide these numbers and generate the new file. The numbers above were generated from just a few lines in a new viewer.

        Konstantin Shvachko added a comment -

        File names used > 100000 times 24

        What are the names of these 24 files? Do they fall under the proposed default pattern? How big is the noise if we use the default pattern?

        On the one hand I see the point of providing a generic approach for people to specify their own patterns.
        But I also agree with Dhruba that we need to optimize only for the top ten (or so) file names, which will give us 5% saving in the meta-data memory footprint. The rest should be ignored; it would be a waste of resources to optimize for them. Your approach 2 would be a move in this direction.

        So maybe it would be useful to have the tool Jakob mentions (OIV-based), so that admins could run it offline on the image and get the top N frequently used names, with an estimate of how much space this saves. Then they will be able to formulate the regexp. Otherwise, it is going to be a painful guessing game.

        Suresh Srinivas added a comment -

        What are the names of these 24 files? Do they fall under the proposed default pattern. How big is the noise if we use the default pattern.

        Of 24, 22 are part-* files.

        we need to optimize only for the top ten (or so) file names, which will give us 5% saving in the meta-data memory footprint

        I do not think the top 10 will save 5% of the meta-data memory footprint. See the results posted below.

        I had a bug in my previous calculation that made the savings seem too good to be true. With 47 million files optimized to use the dictionary, a saving of 10 bytes per file gives 470MB, not 4.7GB. Also, I did not account for the byte[] overhead of 24 bytes.

        Anyway, the new patch includes a tool, NamespaceDedupe. You can run it on an fsimage to see the frequency of occurrence and the savings in heap size. Dhruba, you can run this on images from your production cluster to see how the savings compare with what I have posted below.

        23 names are used by 3343781 files (100000 to 360461 times each). Saved space: 114962311
        468 names are used by 12944154 files (10000 to 100000 times each). Saved space: 448255164
        4335 names are used by 10522601 files (1000 to 10000 times each). Saved space: 391364352
        40031 names are used by 10654372 files (100 to 1000 times each). Saved space: 382273386
        403974 names are used by 10722689 files (10 to 100 times each). Saved space: 354416484
        Total saved space: 1691271697 bytes (~1.6 GB)

        Suresh Srinivas made changes -
        Attachment hdfs-1110.2.patch [ 12443257 ]
        Eli Collins added a comment -

        Won't this require a lot of synchronization that previously didn't exist? I.e. every time you delete a file you'll need to atomically decrement the count in the map and possibly mutate the map to remove the file. It would be nice if different parts of the namespace didn't require synchronization just because they happen to share files with the same name.

        Suresh Srinivas added a comment -

        Access to the map happens only during write operations (create, and while digesting the edits log). All of that is synchronized by the large FSNamesystem lock. Also, I am not proposing to do reference counting: once a name is in this map, it will remain there. The map itself should not have a large number of entries, only the top N popular names.

        dhruba borthakur added a comment -

        Hi Suresh, are you saying that we save 470MB on a 40GB heap? If that is the case, then the savings might be too meagre compared to the code complexity of the patch. Do you agree?

        Suresh Srinivas added a comment -

        470MB was based on my rough calculation; look at the detailed info I posted earlier. If we use the dictionary for names used more than 10 times, the saving is 1.6G in an old generation of size 37G (BTW, not all of the 37G was used in the old gen; ~30G was used and the rest was headroom).

        dhruba borthakur added a comment -

        Thanks Suresh for the numbers.

        > agree with Dhruba that we need to optimize only for the top ten (or so) file names, which will give us 5% saving in the meta

        One fear I have is that the regex matching inside the FSNamesystem lock could increase CPU usage in such a way that the 5% gain in memory might not be a good tradeoff. Any thoughts on this?

        Suresh Srinivas added a comment -

        agree with Dhruba that we need to optimize only for the top ten (or so) file names, which will give us 5% saving in the meta

        As I commented earlier, we need a dictionary with around 500K entries to see gains, not just the top ten file names; see the breakdown from my previous analysis. Regex matching should not be a big issue: the lookup is always made first in the dictionary hashmap (for most of the frequently used file names there will be a hit), and the regex is consulted only for names not found in the dictionary. This is done only while creating a file.

        After thinking about this solution, here is an alternate solution I am leaning towards (a minimal sketch follows the lists below). Let me know what you guys think:

        1. During startup, maintain two maps. One is a transient map used for counting the number of times a file name is used, along with the corresponding byte[]. The second is the dictionary map, which maps file name to byte[].
        2. While consuming the fsimage and edits log:
          • If the name is found in the dictionary map, use the byte[] corresponding to it.
          • If the name is found in the transient map, increment the number of times the name is used. If the name is used more than the threshold (10 times), delete it from the transient map and promote it to the dictionary.
          • If the name is not found in the transient map, add it to the transient map with a use count of 1.
          • At the end of consuming the edits log, delete the transient map.

        Advantages:

        1. No configuration files and no regex. Simplified administration.

        Disadvantages:

        1. The dictionary is initialized only during startup. Hence it does not react to, or optimize for, file names that become popular after startup.
        2. It impacts startup time due to two hashmap lookups per name (though this should be a small fraction of the disk I/O time during startup).
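        A minimal sketch of the two-map scheme above (class and method names are hypothetical; the committed NameCache additionally reuses the byte[] for any name seen more than once during startup, which this sketch omits):

          import java.util.HashMap;
          import java.util.Map;

          // Sketch of the startup-time promotion scheme described above (illustrative only).
          class StartupNameCache {
            private static final int THRESHOLD = 10;      // promote names used > 10 times

            private Map<String, Integer> transientCounts = new HashMap<String, Integer>();
            private final Map<String, byte[]> dictionary = new HashMap<String, byte[]>();

            // Called for each name while loading fsimage/edits; returns the byte[] to store.
            byte[] put(String name, byte[] bytes) {
              byte[] cached = dictionary.get(name);
              if (cached != null) {
                return cached;                            // already promoted: reuse it
              }
              Integer count = transientCounts.get(name);
              if (count == null) {
                transientCounts.put(name, 1);             // first use
              } else if (count + 1 > THRESHOLD) {
                transientCounts.remove(name);             // promote to the dictionary
                dictionary.put(name, bytes);
              } else {
                transientCounts.put(name, count + 1);
              }
              return bytes;
            }

            // Called once startup is done: drop the counting map entirely.
            void initialized() {
              transientCounts = null;
            }
          }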
        dhruba borthakur added a comment -

        Awesome. I like this idea because it has no configuration whatsoever and is automatic. My opinion is that it is OK to do this only during NN startup time. Maybe you can have only one dictionary where you store the count of occurrences as well, i.e. move the count from the transient map into the dictionary; otherwise the same logic as you described. When the image is fully loaded, we have to purge the dictionary of all items whose count is less than 10.

        Also, this has some relationship to HDFS-1140.

        Suresh Srinivas added a comment -

        Maybe you can have only one dictionary where you store the count-of-occurances as well. i.e move the count from the transient map to the dictionary...

        I prefer having two different hashmaps for the following reasons:

        1. If a single hashmap is used, it grows to the size of all the names, not just the names kept in the dictionary. Either we have to shrink the map after purging entries (by making another copy), or we use more heap than necessary by retaining a map much larger than it needs to be.
        2. In the dictionary, I do not intend to track the number of times a name is used. This information is unnecessary: all we care about is whether a name is used more than a certain threshold, not the exact count.
        Suresh Srinivas added a comment -

        The attached patch implements the solution I proposed previously. There is an additional optimization due to the chosen mechanism: during startup, if a name is used more than once, the byte[] is reused (not just for the names used more than 10 times).

        Only the names used more than 10 times are added to the dictionary. Adding other names would undo the space gained, because of the heap used for storing hashmap entries (as described earlier).

        I ran tests to benchmark startup time and total heap size (measured by triggering a full GC after startup). Here are the results:

                         Without patch   With patch
          Startup time   880s            892s
          Heap size      24.197G         22.372G

        Startup time increased by 12s, with 1.825G saved out of the 24.197G heap.

        BTW, I am thinking of removing the NamespaceDedupe tool attached in the patch. Any thoughts?

        Suresh Srinivas made changes -
        Attachment hdfs-1110.3.patch [ 12444377 ]
        Suresh Srinivas added a comment -

        Dhruba or Jakob, can you please comment...

        Jakob Homan added a comment -

        Looking good.

        Review:

        • If you keep NamespaceDedupe, which I would recommend as I do think it adds value in and of itself, it's probably best to move its user-facing bits with the rest of the offline image viewers. OfflineImageViewer.java handles all the command line arguments and such.
        • NamespaceDedupe.java:51: the line goes over 80 characters.
        • Nit: TestNameDictionary::testNameReuse() at first looked to me like a unit test that hadn't been annotated. Maybe verifyNameReuse?
        • The static class ByteArray seems like a candidate either for being a stand-alone class or wrapped by NameDictionary; it's not really an integral part of FSDirectory.
        • The NameDictionary.lookup(name, value) method seems a bit odd in its usage. Both times it's used via dictionary.lookup(name, name), which makes me wonder if this is the right API. Do we expect NameDictionary to be used elsewhere such that this abstraction is worth the odd API?

        Overall I think this is a good thing to do. The 12-second startup cost compared to the almost 2 GB savings seems worth it to me. There should be a linear tradeoff such that small clusters see essentially no impact and large clusters pay a very small penalty at startup but have the benefits for their entire runtime.

        A useful improvement later on may be a safemode command to repopulate the dictionary, which would take into account changes since cluster startup, particularly newly popular filenames.

        dhruba borthakur added a comment -

        > The 12 second startup cost compared to the almost 2 gb savings seems worth it to me.

        I agree, sounds good.

        Suresh Srinivas added a comment -

        Thanks Jakob.

        The NameDictionary.lookup(name, value) method seems a bit odd in its usage. Both times it's used via dictionary.lookup(name, name), which makes me wonder if this is the right API. Do we expect NameDictionary to be used elsewhere such that this abstraction is worth the odd API?

        I see your point. I decided to use the same key and value to reduce the object count. I do not want to rule out using NameDictionary for other things, hence it is generic, with the possibility of key and value being different. I can add a comment to indicate that key and value are the same when doing a lookup. Let me know if you think of other alternatives.

        Suresh Srinivas added a comment -

        Alternatively we can follow the convention in java.util.Dictionary:

        1. Change the method name from lookup to put
        2. Add methods get and remove for completeness
        Jakob Homan added a comment -

        Change the method name from lookup to put

        That sounds good. To me, this seems more like a cache (or as Suresh pointed out, interning of Strings), than a dictionary, but the distinction is definitely blurry.

        Add methods get and remove for completeness

        This would be extra complexity that wouldn't be called by anyone, correct? I'd hold off on that functionality until it's needed.

        Suresh Srinivas added a comment -

        Named the class NameCache instead of NameDictionary. Also moved the ByteArray class to the util package.

        Suresh Srinivas made changes -
        Attachment hdfs-1110.4.patch [ 12446370 ]
        Suresh Srinivas made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Konstantin Shvachko added a comment -
        1. variable cache should read nameCache
        2. comment for it should be transformed to JavaDoc comments.
        3. FSDirectory.cache should be initialized in the constructor rather than during declaration.
          And 10 should be declared as a constant.
        4. I would consider using NameCache<byte[]> instead of NameCache<ByteArray>.
          You get fewer objects and conversions, if of course I didn't miss anything here.
        5. Introduce FSDirectory.cache(INode) method, which calls NameCache.put().
        6. In NameCache some comments need clarification
          • "This class has two phases"
            Probably something else has 2 phases.
          • "This class must be synchronized externally"
          • Member inline comments should be transformed into javadoc.
        7. NameCache.cache should be initialized in the constructor rather than during declaration.
        8. UseCount should probably be a private inner (rather than static) class,
          and should use the same parameter K with which NameCache<K> is parametrized.
          private class UseCount {
              int count;  // Number of times a name occurs
              final K value;  // Internal value for the name
          
              UseCount(final K value) {
                this.value = value;
              }
          }
          
        9. UseCount.count should be initialized in the constructor. It is better to have increment()
          and get() methods rather than accessing count directly from the outside.

        I like the idea of using the useThreshold to determine names that should be promoted to the nameCache.
        My main concern is that the threshold is 10. This means there will be a lot of names in the cache.
        And all these names are in a hash table, which has a huge overhead, as we know from another jira.
        We still save space, but for names that occur only 10 times the savings are probably negligible.

        I would imagine that only 5% or 10% of the most frequently used names get promoted.
        It is fine with me to use this simple promoting scheme as a starting point, with an intention to
        optimize it later. But I would increase the useThreshold to 1000 or so.

        Should we make it configurable? Could be useful for testing.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12446370/hdfs-1110.4.patch
        against trunk revision 951555.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        -1 contrib tests. The patch failed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/397/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/397/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/397/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/397/console

        This message is automatically generated.

        Suresh Srinivas added a comment -

        Removed NamespaceDedupe and added it as a "NameDistribution" processor to the oiv tool.

        Took care of all the comments from Konstantin (thanks!).

        I would consider using NameCache<byte[]> instead of NameCache<ByteArray>....

        The cache inserts byte[] into a HashMap. This requires wrapping the byte[] in another class to provide hashCode() and equals().
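        For illustration, a wrapper along these lines (hypothetical, not necessarily the patch's ByteArray) gives the HashMap content-based hashCode/equals, which a plain byte[] does not provide:

          import java.util.Arrays;

          // byte[] uses identity-based hashCode/equals, so it cannot key a HashMap directly.
          class ByteArrayKey {
            private final byte[] bytes;

            ByteArrayKey(byte[] bytes) {
              this.bytes = bytes;
            }

            @Override
            public int hashCode() {
              return Arrays.hashCode(bytes);               // content-based hash
            }

            @Override
            public boolean equals(Object o) {
              return o instanceof ByteArrayKey
                  && Arrays.equals(bytes, ((ByteArrayKey) o).bytes);
            }
          }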

        My main concern is, that the threshold is 10. This means there will a lot of names in the cache...

        For the fsimage I am working with, a threshold of 10 results in about 10% of the file names being added to the cache. See the analysis I posted.

        For each name cached, a HashMap.Entry takes 48 bytes. With a threshold of 10, the space equivalent of 9 byte arrays is saved per cached name. That is 9 * (24 + bytes in array) = 216 + 9*(bytes in array) bytes; for a 10-byte name, 216 + 90 = 306 bytes, a significant saving compared to the 48-byte cost of the HashMap.Entry.

        I have made the threshold configurable with a hidden option "dfs.namenode.name.cache.threshold". This could be used to run tests to see whether we can decrease the threshold further.
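        For example, a test could override the hidden option programmatically (a sketch assuming the standard Configuration API; the helper class and value are illustrative):

          import org.apache.hadoop.conf.Configuration;

          class NameCacheThresholdExample {
            // Illustrative only: build a Configuration with a non-default threshold.
            static Configuration withThreshold(int threshold) {
              Configuration conf = new Configuration();
              conf.setInt("dfs.namenode.name.cache.threshold", threshold);
              return conf;
            }
          }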

        Suresh Srinivas made changes -
        Attachment hdfs-1110.5.patch [ 12446554 ]
        Suresh Srinivas added a comment -

        Minor changes:

        • fixed typo "in in" in ByteArray.java
        • Removed NameCache.UseCount.count initialization during declaration.
        Suresh Srinivas made changes -
        Attachment hdfs-1110.6.patch [ 12446649 ]
        Suresh Srinivas made changes -
        Status Patch Available [ 10002 ] Open [ 1 ]
        Suresh Srinivas made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12446649/hdfs-1110.6.patch
        against trunk revision 952861.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        -1 contrib tests. The patch failed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/191/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/191/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/191/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/191/console

        This message is automatically generated.

        Suresh Srinivas made changes -
        Status Patch Available [ 10002 ] Open [ 1 ]
        Suresh Srinivas made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Suresh Srinivas made changes -
        Status Patch Available [ 10002 ] Open [ 1 ]
        Suresh Srinivas made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Suresh Srinivas added a comment -

        The failed tests TestBalancer and TestLargeDirectoryDelete are not related to this change...

        Konstantin Shvachko added a comment -

        +1 on the patch.

        We also talked about replacing HashMap with a TreeMap.
        The advantages of TreeMap are

        • you don't need to wrap byte[] in a class, since it lets you provide a comparator that compares byte[]s, and
        • it does not have the memory overhead of HashMap.

        The disadvantage is that file creation will require a log-time lookup in the TreeMap instead of a constant-time lookup in the HashMap. Besides, the HashMap memory overhead is small compared to the overall memory savings provided by the approach.
        The decision is to use a HashMap with the byte[] wrapped in a ByteArray class.
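        For reference, a sketch of the rejected TreeMap alternative (class and comparator are illustrative): byte[] keys compared by content, no wrapper class, at the cost of a log-time lookup per create.

          import java.util.Comparator;
          import java.util.TreeMap;

          class TreeMapNameCache {
            // Lexicographic comparison of byte[] contents (unsigned byte order).
            private static final Comparator<byte[]> BYTES_COMPARATOR = new Comparator<byte[]>() {
              public int compare(byte[] a, byte[] b) {
                int len = Math.min(a.length, b.length);
                for (int i = 0; i < len; i++) {
                  int diff = (a[i] & 0xff) - (b[i] & 0xff);
                  if (diff != 0) {
                    return diff;
                  }
                }
                return a.length - b.length;
              }
            };

            private final TreeMap<byte[], byte[]> cache =
                new TreeMap<byte[], byte[]>(BYTES_COMPARATOR);

            // Returns the canonical byte[] for the name, inserting it on first use.
            byte[] intern(byte[] name) {
              byte[] existing = cache.get(name);
              if (existing != null) {
                return existing;
              }
              cache.put(name, name);
              return name;
            }
          }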

        Suresh Srinivas added a comment -

        Attaching the 0.20 version of the patch. It does not include the OfflineImageViewer tool (since oiv is not available in 0.20).

        Suresh Srinivas made changes -
        Attachment hdfs-1110.y20.patch [ 12446772 ]
        Konstantin Shvachko added a comment -

        +1 for the patch for 0.20

        Suresh Srinivas added a comment -

        I committed the change to trunk.

        Suresh Srinivas made changes -
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Hadoop Flags [Reviewed]
        Resolution Fixed [ 1 ]
        Hudson added a comment -

        Integrated in Hadoop-Hdfs-trunk-Commit #310 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Hdfs-trunk-Commit/310/)

        Tsz Wo Nicholas Sze added a comment -

        Hi Suresh, new commits should be appended at the end of the list in CHANGES.txt, not inserted at the beginning.

        --- hadoop/hdfs/trunk/CHANGES.txt	2010/06/11 18:09:48	953802
        +++ hadoop/hdfs/trunk/CHANGES.txt	2010/06/11 18:15:17	953803
        @@ -9,6 +9,9 @@
         
           IMPROVEMENTS
         
        +    HDFS-1110. Reuses objects for commonly used file names in namenode to
        +    reduce the heap usage. (suresh)
        +
             HDFS-1096. fix for prev. commit. (boryas)
        
        Suresh Srinivas added a comment -

        Sure. Next time I commit something I will fix it.

        Konstantin Shvachko made changes -
        Status Resolved [ 5 ] Closed [ 6 ]

          People

          • Assignee:
            Suresh Srinivas
            Reporter:
            Suresh Srinivas
          • Votes: 0
            Watchers: 9
