Hadoop Common
  1. Hadoop Common
  2. HADOOP-1017

Optimization: Reduce Overhead from ReflectionUtils.newInstance

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.12.0
    • Component/s: util
    • Labels:
      None

      Description

      I found that a significant amount of time on my project was being spent in creating constructors for each row of data. I dramatically optimized this performance by creating a simple WeakHashMap to cache constructors by class. For example, in a sample job I find that ReflectionUtils.newInstance takes 200 ms (2% of total) with the cache enabled, but it uses 900 ms (6% of total) without the cache.

      1. cacheCtor.patch
        5 kB
        Ron Bodkin
      2. cacheCtor.patch
        5 kB
        Ron Bodkin
      3. TestReflectionUtils.java
        2 kB
        Ron Bodkin
      4. ReflectionUtils.patch.txt
        2 kB
        Ron Bodkin

        Activity

        Ron Bodkin created issue -
        Hide
        Ron Bodkin added a comment -

        Patch to ReflectionUtils to cache constructors.

        Show
        Ron Bodkin added a comment - Patch to ReflectionUtils to cache constructors.
        Ron Bodkin made changes -
        Field Original Value New Value
        Attachment ReflectionUtils.patch.txt [ 12351074 ]
        Hide
        Ron Bodkin added a comment -

        New file: tests for reflection utils.

        Show
        Ron Bodkin added a comment - New file: tests for reflection utils.
        Ron Bodkin made changes -
        Attachment TestReflectionUtils.java [ 12351075 ]
        Hide
        Doug Cutting added a comment -

        Overall, I like this. In most cases, where classes are not GC'd, the use of a WeakHashMap is moot, but it shouldn't hurt and may help in some cases. But I'd prefer we don't expose the cache through a public API, nor permit disabling it through a system property. A larger public API complicates long-term support. If the cache size ever becomes an issue then we can address that then.

        Show
        Doug Cutting added a comment - Overall, I like this. In most cases, where classes are not GC'd, the use of a WeakHashMap is moot, but it shouldn't hurt and may help in some cases. But I'd prefer we don't expose the cache through a public API, nor permit disabling it through a system property. A larger public API complicates long-term support. If the cache size ever becomes an issue then we can address that then.
        Hide
        Ron Bodkin added a comment -

        I'd be glad to remove the public API: I certainly appreciate the desire to avoid that complexity, or let me know if you'd rather do so. If I do so, I'll make the additional methods package default access to allow the tests.

        Show
        Ron Bodkin added a comment - I'd be glad to remove the public API: I certainly appreciate the desire to avoid that complexity, or let me know if you'd rather do so. If I do so, I'll make the additional methods package default access to allow the tests.
        Hide
        Doug Cutting added a comment -

        Package-private for access from unit tests sounds good. Yes, please submit another patch. The unit test can be included in the patch by using 'svn add <file>' before you use 'svn diff > my.patch' to create the patch, as described on http://wiki.apache.org/lucene-hadoop/HowToContribute. Thanks!

        Show
        Doug Cutting added a comment - Package-private for access from unit tests sounds good. Yes, please submit another patch. The unit test can be included in the patch by using 'svn add <file>' before you use 'svn diff > my.patch' to create the patch, as described on http://wiki.apache.org/lucene-hadoop/HowToContribute . Thanks!
        Hide
        Ron Bodkin added a comment -

        Here is the revised simplified patch following the contribution guidelines.

        Show
        Ron Bodkin added a comment - Here is the revised simplified patch following the contribution guidelines.
        Ron Bodkin made changes -
        Attachment cacheCtor.patch [ 12351387 ]
        Hide
        Doug Cutting added a comment -

        I just realized that this cache is not thread-safe: access to the cache should be synchronized. I can see a few ways of doing this:

        a. wrap 'synchronized (constructorCache)

        { ... }

        ' around the newInstance method body

        b. switch to using a synchronized map (Collections.synchronizedMap)

        c. keep a separate constructorCache per thread, via a threadLocal

        (a) is the simplest, but could result in more contention, since the cache stays locked while constructors are created. (b) would unlock the cache while constructors are created, but might sometimes create a given constructor twice. (c) would be fastest, but would always create a new constructor per thread. I'd probably opt for (b). What do others think?

        Show
        Doug Cutting added a comment - I just realized that this cache is not thread-safe: access to the cache should be synchronized. I can see a few ways of doing this: a. wrap 'synchronized (constructorCache) { ... } ' around the newInstance method body b. switch to using a synchronized map (Collections.synchronizedMap) c. keep a separate constructorCache per thread, via a threadLocal (a) is the simplest, but could result in more contention, since the cache stays locked while constructors are created. (b) would unlock the cache while constructors are created, but might sometimes create a given constructor twice. (c) would be fastest, but would always create a new constructor per thread. I'd probably opt for (b). What do others think?
        Hide
        Doug Cutting added a comment -

        A few more thoughts:

        • should constructorCache be 'static final', and hence named CACHE, all-caps?
        • the values of the WeakHashMap are of type Constructor, each which references its Class, the key. According to http://java.sun.com/j2se/1.5.0/docs/api/java/util/WeakHashMap.html, if values reference the keys, entries will never by collected. So we should either just use a HashMap, and not permit these to be collected, which is probably fine, or change the values to be WeakReferences to the constructors.
        Show
        Doug Cutting added a comment - A few more thoughts: should constructorCache be 'static final', and hence named CACHE, all-caps? the values of the WeakHashMap are of type Constructor, each which references its Class, the key. According to http://java.sun.com/j2se/1.5.0/docs/api/java/util/WeakHashMap.html , if values reference the keys, entries will never by collected. So we should either just use a HashMap, and not permit these to be collected, which is probably fine, or change the values to be WeakReferences to the constructors.
        Hide
        Owen O'Malley added a comment -

        I'd vote for the synchronized map. The failure mode would just be that you look up the constructor a second time.

        Show
        Owen O'Malley added a comment - I'd vote for the synchronized map. The failure mode would just be that you look up the constructor a second time.
        Hide
        Ron Bodkin added a comment -

        How about using a concurrent map if one is available (on Java 5+ or if the concurrent backport is on the classpath), but falling back to a synchronized map if one is not? I've implemented code like that before (in our environment we run single threaded Hadoop jobs so I wasn't aware of the need for thread safety).

        You are right about the values having a reference back to the Class -I think making the map just a HashMap is probably the right approach, since in most programs there would be only a handful and the classes won't need to be gc'd anyhow. it would be possible to make the values SoftReferences instead to allow collecting Classes but to make the cache less likely to lose useful data.

        The cache member certainly could be final and named CACHE, good idea.

        Show
        Ron Bodkin added a comment - How about using a concurrent map if one is available (on Java 5+ or if the concurrent backport is on the classpath), but falling back to a synchronized map if one is not? I've implemented code like that before (in our environment we run single threaded Hadoop jobs so I wasn't aware of the need for thread safety). You are right about the values having a reference back to the Class -I think making the map just a HashMap is probably the right approach, since in most programs there would be only a handful and the classes won't need to be gc'd anyhow. it would be possible to make the values SoftReferences instead to allow collecting Classes but to make the cache less likely to lose useful data. The cache member certainly could be final and named CACHE, good idea.
        Hide
        Ron Bodkin added a comment -

        See for example the methods makeConcurrentMap and makeConcurrentMapClass, which I wrote and contributed, at http://dev.eclipse.org/viewcvs/index.cgi/org.aspectj/modules/weaver/src/org/aspectj/weaver/ltw/LTWWorld.java?root=Tools_Project&view=markup

        Show
        Ron Bodkin added a comment - See for example the methods makeConcurrentMap and makeConcurrentMapClass, which I wrote and contributed, at http://dev.eclipse.org/viewcvs/index.cgi/org.aspectj/modules/weaver/src/org/aspectj/weaver/ltw/LTWWorld.java?root=Tools_Project&view=markup
        Hide
        Doug Cutting added a comment -

        Hadoop requires Java5 already, so, yes, let's use ConcurrentMap for thread-safety and not worry about GCing classes.

        Show
        Doug Cutting added a comment - Hadoop requires Java5 already, so, yes, let's use ConcurrentMap for thread-safety and not worry about GCing classes.
        Hide
        Ron Bodkin added a comment -

        Here is the revised patch that uses a concurrent map for thread safety and holds hard references instead.

        Show
        Ron Bodkin added a comment - Here is the revised patch that uses a concurrent map for thread safety and holds hard references instead.
        Ron Bodkin made changes -
        Attachment cacheCtor.patch [ 12351407 ]
        Hide
        Doug Cutting added a comment -

        +1 Looks good to me.

        Show
        Doug Cutting added a comment - +1 Looks good to me.
        Doug Cutting made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Fix Version/s 0.12.0 [ 12312293 ]
        Hide
        Doug Cutting added a comment -

        I just committed this. Thanks, Ron!

        Show
        Doug Cutting added a comment - I just committed this. Thanks, Ron!
        Doug Cutting made changes -
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        Doug Cutting made changes -
        Status Resolved [ 5 ] Closed [ 6 ]

          People

          • Assignee:
            Unassigned
            Reporter:
            Ron Bodkin
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development