VELOCITY-607

Runtime macro rendering very slow in Velocity 1.6-dev (679708) compared to 1.5

    Details

      Description

      The following test template (see VELOCITY-24):

      ## local macro, not global
      #macro(letter $char)
      This is the letter $char
      #end

      #letter("A")
      #letter("B")
      #letter("C")
      #letter("D")
      #letter("E")
      #letter("F")
      #letter("G")
      #letter("H")
      #letter("I")
      #letter("J")
      #letter("K")
      #letter("L")
      #letter("M")
      #letter("N")
      #letter("O")
      #letter("P")
      #letter("Q")
      #letter("R")
      #letter("S")
      #letter("T")
      #letter("U")
      #letter("V")
      #letter("W")
      #letter("X")
      #letter("Y")
      #letter("Z")

      Works quickly and correctly with Velocity 1.5, even with several concurrent threads. However, 1.6-dev is a LOT slower (as much as 20x).
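
      A minimal harness along the following lines reproduces the measurement (a sketch, not the actual Velocity24Test source; the template path and the thread/loop counts are illustrative):

        import java.io.StringWriter;

        import org.apache.velocity.Template;
        import org.apache.velocity.VelocityContext;
        import org.apache.velocity.app.VelocityEngine;

        // Sketch of a concurrent macro-rendering benchmark; "letters.vm"
        // stands in for the #letter template above.
        public class MacroLoadSketch {
            public static void main(String[] args) throws Exception {
                VelocityEngine engine = new VelocityEngine();
                engine.init();
                final Template template = engine.getTemplate("letters.vm");

                int threadCount = 50;   // the report used several hundred
                Thread[] workers = new Thread[threadCount];
                long start = System.currentTimeMillis();
                for (int i = 0; i < threadCount; i++) {
                    workers[i] = new Thread(new Runnable() {
                        public void run() {
                            for (int loop = 0; loop < 50; loop++) {
                                try {
                                    StringWriter writer = new StringWriter();
                                    template.merge(new VelocityContext(), writer);
                                } catch (Exception e) {
                                    e.printStackTrace();   // e.g. MacroOverflowException
                                }
                            }
                        }
                    });
                    workers[i].start();
                }
                for (int i = 0; i < threadCount; i++) {
                    workers[i].join();
                }
                System.out.println((System.currentTimeMillis() - start) + " ms");
            }
        }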

      The major performance bottlenecks seem to be:
      RuntimeMacro.render (60% of time)
      VelocimacroFactory.getVelocimacro (20% of time)

      With several threads this test also causes Velocity to throw error(s):

      org.apache.velocity.exception.MacroOverflowException: Exceed maximum 20 macro calls. Call Stack:letter->letter->letter->letter->letter
      at org.apache.velocity.runtime.VelocimacroFactory.startMacroRendering(VelocimacroFactory.java:179)
      at org.apache.velocity.runtime.RuntimeInstance.startMacroRendering(RuntimeInstance.java:1693)
      at org.apache.velocity.runtime.directive.VelocimacroProxy.render(VelocimacroProxy.java:200)
      at org.apache.velocity.runtime.directive.RuntimeMacro.render(RuntimeMacro.java:230)
      at org.apache.velocity.runtime.parser.node.ASTDirective.render(ASTDirective.java:178)
      at org.apache.velocity.runtime.parser.node.SimpleNode.render(SimpleNode.java:323)
      at org.apache.velocity.Template.merge(Template.java:324)
      at org.apache.velocity.Template.merge(Template.java:232)
      at org.apache.velocity.test.load.Velocity24Test.testRendering(Velocity24Test.java:51)

      This is related to VELOCITY-297 but the fix doesn't seem to work with the new, modified macro implementation.

      Attachments

      1. velocity-1.6-head-20080725-velocity24-test.PNG
        114 kB
        Jarkko Viinamäki
      2. velocity-1.5-velocity24-test.PNG
        110 kB
        Jarkko Viinamäki
      3. velocity-1.6-dev-macro-performance-IDEAS-v2.6.patch
        119 kB
        Jarkko Viinamäki
      4. velocity-1.6-dev-macro-performance-IDEAS-v2.7.patch
        128 kB
        Jarkko Viinamäki

        Activity

        Nathan Bubna added a comment -

        Ok, marking this resolved. Jarkko, feel free to re-open if i missed something or messed up something. I think i got it all though. Performance looks much better in my testing too.

        Now, with all these changes, it's probably time for a beta release to encourage people to start kicking the tires and make sure we didn't change something important...

        Will Glass-Husain added a comment -

        If there's no code in the project that references those classes I'm okay with removing them. They seem pretty internal to me.

        Nathan Bubna added a comment -

        Oh, i forgot to mention that it appeared your patch was not built against the trunk's head, as recent changes to VelocimacroManager were reverted. For simplicity, i went ahead and let them be reverted. I'll redo those soon, but not tonight. Dinner calls.

        Nathan Bubna added a comment -

        Ok, i've committed everything from your latest patch, with just a bit of cleanup here and there. All the tests pass, performance looks quite improved to me. This is awesome, Jarkko! Thanks!

        Before i resolve this, does anyone know of any code that could/would/should/does/did/might reference the VMProxyArg, VMContext, or VMReferenceMungeVisitor classes? It seems unlikely to me, so i'm tempted to just remove them, rather than merely deprecate them. Also, the Runtime class has been deprecated for some time now. I'm ready to yank that out too. Thoughts anyone?

        Christopher Schultz added a comment -

        Thanks for removing that dependency – I had forgotten that we had a JDK 1.4 requirement for compiling. I thought we were up to 1.5 these days.

        You are swallowing an exception in your MapFactory class and allowing the pre-JDK 1.5 code to kick in instead. Your in-text comment says "this should never happen". If that's the case, then you should throw some type of error. If someone finds that this error is being thrown, they'll know to come to us and we'll fix it. Otherwise, we may not be getting the speed improvement provided by ConcurrentMap.

        Good call on the difference between Hashtable and a synchronized HashMap: I had forgotten that they had slightly different semantics.

        If you choose to implement my "throw an error" strategy above, I would recommend taking out the additional null check for your "map" reference afterward – because it should always be null.
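
        A hypothetical sketch of the fail-loudly strategy suggested here (class and method names are illustrative, not the actual MapFactory code):

        import java.util.Collections;
        import java.util.HashMap;
        import java.util.Map;

        public class FailFastMapFactory {
            private static final Class CONCURRENT_MAP_CLASS = probe();

            private static Class probe() {
                try {
                    return Class.forName("java.util.concurrent.ConcurrentHashMap");
                } catch (ClassNotFoundException e) {
                    return null;   // pre-1.5 JRE: the synchronized fallback is expected
                }
            }

            public static Map create() {
                if (CONCURRENT_MAP_CLASS != null) {
                    try {
                        return (Map) CONCURRENT_MAP_CLASS.newInstance();
                    } catch (Exception e) {
                        // "This should never happen": the class was found at
                        // startup, so surface the failure rather than silently
                        // losing the ConcurrentMap speed-up.
                        throw new Error("ConcurrentHashMap present but not instantiable: " + e);
                    }
                }
                return Collections.synchronizedMap(new HashMap());
            }
        }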

        Jarkko Viinamäki added a comment -

        Inspired by Christopher's comment, I created a MapFactory which can create Java 5 concurrent maps if they are available. I implemented it in a slightly different way, though, to avoid a compile-time dependency.

        Since there are several maps in VelocimacroFactory and VelocimacroManager, I modified those to get their maps from the MapFactory.

        Running on JDK 1.6.0_07 in server mode (single core), Velocity 1.6-dev with this patch (2.7) was up to 25% faster than 1.5, and I believe the difference grows larger when there are more CPUs or concurrent threads.

        I'll leave the older 2.6 patch in place since these changes are a bit radical.

        I'll stop this optimizing madness and wait for Nathan's input.

        Christopher Schultz added a comment -

        Jarkko,

        You might be able to make this ConcurrentHashMap code perform faster if you were to check for it at class initialization, and then associate a map-factory class that used the chosen strategy thereafter. I'm not sure if the complexity is worth the (marginal) performance savings, though.

        Maybe something like this:

        private static final MapFactory _threadSafeMapFactory;

        static
        {
            MapFactory factory;
            try
            {
                // probe for the JDK 1.5+ class; if this succeeds we can use it
                Class.forName("java.util.concurrent.ConcurrentHashMap");
                factory = new ConcurrentMapFactory();
            }
            catch (Exception e)
            {
                factory = new SynchronizedMapFactory();
            }
            _threadSafeMapFactory = factory;
        }

        /* these go in their own files, so the ClassLoader won't choke on a missing ConcurrentHashMap */
        public interface MapFactory
        {
            public Map createMap();
        }

        public class ConcurrentMapFactory implements MapFactory
        {
            public Map createMap()
            {
                return new java.util.concurrent.ConcurrentHashMap();
            }
        }

        public class SynchronizedMapFactory implements MapFactory
        {
            public Map createMap()
            {
                // I would bet that Hashtable is slightly faster than a wrapped HashMap
                return new java.util.Hashtable();
            }
        }

        Will Glass-Husain added a comment -

        I can confirm we have a requirement for minimum compile-time JDK 1.4 and run-time JDK 1.3. Clever use of reflection to dynamically choose.

        Jarkko Viinamäki added a comment -

        The new patch does not require JDK 1.5. It uses dynamic classloading to check whether the code is running under JRE 1.5+; if it is, it uses ConcurrentHashMap, and if not, it falls back to a synchronized HashMap. See ResourceCacheImpl.

        Erron Austin added a comment -

        I thought I read somewhere that someone was against requiring JDK 1.5. Is this not the case?

        Jarkko Viinamäki added a comment -

        Some small additional improvements in patch 2.6. This patch detects whether ConcurrentHashMap is available and uses it in the resource cache (if the cache is configured to be unbounded). Also changed some StringBuffer usage to StrBuilder in render-time code.

        Some minor speed improvements might still be possible in VelocimacroManager and VelocimacroFactory (investigate the use of Hashtables and synchronized blocks).

        Erron Austin added a comment -

        I've verified that my bug has been fixed. I've also verified that my application is working much better than before.

        Not knowing the inner workings of Velocity, I can't comment on the way the code was written. But I can say that the results are very good.

        Thanks!

        Jarkko Viinamäki added a comment -

        Patch version 2.5 fixes a bug found by Erron Austin. Thanks for testing! The cause was that Template and ContentResource didn't initialize their type via setType in their constructors. There was also another bug in refreshResource (the call to newResource.setModificationCheckInterval was missing).

        I created a JUnit testcase but it's not completely automatic yet (you have to manually modify template files while the test is running) so I won't include it here yet.
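
        For illustration, a hypothetical sketch of the constructor fix described above (not the actual Velocity source; the type constants 1 and 2 match the "type of 1 or 2" expected by refreshResource):

        // Each concrete resource records its type at construction time, so
        // refreshResource() never sees an unset type of 0.
        abstract class ResourceSketch {
            public static final int RESOURCE_TEMPLATE = 1;
            public static final int RESOURCE_CONTENT = 2;

            private int type = 0;                     // 0 == "unset", the bug's symptom
            protected void setType(int type) { this.type = type; }
            public int getType() { return type; }
        }

        class TemplateSketch extends ResourceSketch {
            TemplateSketch() { setType(RESOURCE_TEMPLATE); }   // the missing call
        }

        class ContentResourceSketch extends ResourceSketch {
            ContentResourceSketch() { setType(RESOURCE_CONTENT); }
        }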

        Erron Austin added a comment -

        Is there a test case for macros kept in a separate file that is included using the template merge? I'm doing template.merge(context, writer, templates); where templates contains a path to the file containing my macros.

        I'm testing this patch (velocity-1.6-dev-macro-performance-IDEAS-v2.2.patch) and I'm getting a null pointer exception, but only when I update the file.

        The error occurs in ResourceManagerImpl.refreshResource(). The macros file is coming back with a resource type of 0 instead of 1 or 2, so a NullPointerException is thrown when it attempts to create a new resource.
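
        A minimal sketch of the merge call described above (file names are illustrative; this assumes the three-argument Template.merge(Context, Writer, List) overload being discussed):

        import java.io.StringWriter;
        import java.util.ArrayList;
        import java.util.List;

        import org.apache.velocity.Template;
        import org.apache.velocity.VelocityContext;
        import org.apache.velocity.app.VelocityEngine;

        // Merge a page template together with a separate macro library file.
        public class MacroLibraryMergeSketch {
            public static void main(String[] args) throws Exception {
                VelocityEngine engine = new VelocityEngine();
                engine.init();

                Template page = engine.getTemplate("page.vm");
                List macroLibraries = new ArrayList();   // template paths, not Templates
                macroLibraries.add("macros.vm");         // file defining the macros

                StringWriter writer = new StringWriter();
                page.merge(new VelocityContext(), writer, macroLibraries);
                System.out.println(writer.toString());
            }
        }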

        Jarkko Viinamäki added a comment -

        Removed some tweaks that didn't accomplish anything. This is now about as fast as I can make it.

        Jarkko Viinamäki added a comment -

        The latest attachment includes final tweaks to improve performance. I'll start waiting for feedback on these changes. With these changes 1.6-dev is 10-20% faster than 1.5 with the tests I ran (on a single-core laptop).

        Interestingly, the creation of the macro context in the VelocimacroProxy.render method costs a lot of time because of memory allocation. There's no easy way to avoid this, though.

        Jarkko Viinamäki added a comment (edited) -

        Some cleanup and Javadoc fixes. Also fixes a log problem which occurs when multiple concurrent threads parse and try to add the same macro simultaneously. This caused unnecessary spam in the log.

        I confirmed with Eclipse Memory Analyzer (which is a great tool BTW!) that Velocity 1.6-dev with this patch uses significantly less memory (because the macro AST is shared). I made a template with dozens of calls to a macro containing a huge amount of text. Velocity 1.5 ended up with an 80 MB heap; patched 1.6-dev with a 4.5 MB heap.

        Jarkko Viinamäki added a comment -

        OK. I tweaked my patch a bit and removed some bottlenecks. JRat revealed that a lot of time is spent in ASTReference.render because of the value = EventHandlerUtil.referenceInsert(rsvc, context, literal(), value); call. I also managed to make VelocimacroProxy.render totally unsynchronized and made some other little tweaks here and there.

        Tests still pass but I hope this doesn't break things too much.

        The good news is that with some of my test templates and the test in this issue the patched 1.6-dev is now in some cases even 20% faster than Velocity 1.5.

        Will Glass-Husain added a comment -

        Simple answer: it's because #parse is executed at runtime. Since macros can be loaded in a parsed file, the specific macro might change each time the page is rendered.

        Christopher Schultz added a comment -

        Perhaps this is not the proper place to discuss such matters (and I must admit that I'm largely ignorant of the inner workings of Velocity), but is there a reason that a macro is /ever/ re-parsed from text into an AST? Why aren't macros parsed once during the parsing of the containing template and stored along with the main template's AST in memory?

        Will Glass-Husain added a comment -

        Hi,

        Jarkko-- thanks so much for your great work, I'm looking forward to seeing this in production.

        I just wanted to throw in a reminder that the macro functionality in v1.6 is a tremendous leap forward from 1.5, at least for some of us. (Responding to: "If the memory usage isn't significantly lower, I don't see much point in using 1.6.") The new feature is that macros defined in a file included with #parse are now available to the parent page. This means that you can make libraries of macros for various parts of your app (separate from the global macro library). Secondly, macros can be added programmatically via a call to RuntimeInstance.addVelocimacro().

        I mention this mainly to ensure we don't scope this new functionality out with the performance improvements.

        Jarkko Viinamäki added a comment (edited) -

        First things first:

        1. This is not a real production-quality patch. It contains some ideas and major design changes that I tried while making the new Velocity 1.6 macro implementation perform better. I have done some functional, performance and load testing, but not nearly enough.

        2. Although the tests pass, I have most probably broken some functionality, especially regarding error messages and namespaces.

        3. I have tested these changes with several hundred concurrent threads, which does not mean they are free of race conditions, locking issues, and other nasty stuff.

        4. Even after this patch, Velocity 1.6 svn head is around 10-20% slower than 1.5. I didn't investigate the 1.5 code that much, but I suspect that Velocity 1.5 fully preconstructs templates by duplicating the macro AST and precalculates values for constant macro arguments. This patch instead uses shared macro ASTs, which should reduce memory usage but introduces a bunch of new problems.

        5. There are public interface changes (yep, bad).

        Main ideas of this patch:

        1. When a macro is parsed and handed to Macro.processAndRegister, it is stored as an AST rather than being flattened back into a String.

        2. VelocimacroManager does not create a new VelocimacroProxy for each request but creates one per macro / namespace. Thus threads share the macro AST (see the sketch after this list).

        3. VMContext and VMProxyArg have been replaced with ProxyVMContext. This allows us to skip parsing of macro arguments when a macro is invoked, and it reduces memory allocation overhead.

        4. "Render literal if null for references" is tricky to implement with a shared macro AST. Velocity 1.5 implements this by preprocessing the AST tree with visitor.VMReferenceMungeVisitor. As far as I understand, this is not possible with a shared macro AST. Therefore there's a rather ugly hack which puts macro argument references into the macro rendering context. ASTReference can then construct the literal form of arguments on demand.
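
        A hypothetical sketch of ideas 1-2 above (illustrative class and method names, not the patch source): parse each macro body to an AST once, wrap it in a single shared proxy per namespace and name, and hand the same instance to every rendering thread.

        import java.util.Collections;
        import java.util.HashMap;
        import java.util.Map;

        public class SharedMacroRegistrySketch {
            private final Map proxies = Collections.synchronizedMap(new HashMap());

            /** Called from the #macro directive at parse time. */
            public void register(String namespace, String name, Object macroAst) {
                String key = namespace + "::" + name;
                synchronized (proxies) {
                    if (!proxies.containsKey(key)) {   // parse once, share thereafter
                        proxies.put(key, new SharedProxy(macroAst));
                    }
                }
            }

            /** Called on every macro invocation at render time: no new objects. */
            public SharedProxy lookup(String namespace, String name) {
                return (SharedProxy) proxies.get(namespace + "::" + name);
            }

            static class SharedProxy {
                final Object ast;                      // shared, read-only AST
                SharedProxy(Object ast) { this.ast = ast; }
            }
        }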

        Other notes:

        1. The ExtendedProperties class is synchronized and very slow overall (it even says so in the JavaDocs). There are quite a few runtime calls to this class throughout the Velocity code.

        2. I haven't really investigated yet whether this version even uses less memory than 1.5. If the memory usage isn't significantly lower, I don't see much point in using 1.6 if one doesn't need the new macro functionality.

        3. I probably made some catastrophic design mistakes, but it takes time to understand Velocity's functionality and design. Please forgive me.

        Nathan Bubna added a comment -

        I look forward to seeing your patch, Jarkko! I'm a little wary about point c) in your list, but i'll wait until i see what you've done.

        Jarkko Viinamäki added a comment -

        Nathan is right. I have analyzed the code a lot and a massive amount of work is being redone. Every time a macro is used, a new VelocimacroProxy is created and the macro is reparsed from string to AST. Then each macro argument is reparsed from string to AST. For each macro argument a new VMProxyArg is created. In addition, a new VMContext is created every time a macro is called (2 HashMaps). This means a lot of memory allocation, which is slow and generates lots of garbage.

        Also, when the macro is first parsed (the Macro directive), the already-parsed AST is turned back into a String and stored.

        I mercilessly refactored the code and managed to create an implementation which passes all current tests and a) uses a shared AST for all macros, b) uses a shared VelocimacroProxy, and c) uses call by reference for macro arguments (instead of call by value). With these modifications the performance is very close to 1.5 with the template in this issue, but not quite there yet.

        I'll see if I can tweak the refactored design a bit more, but it's already several times faster than the current one.

        Nathan Bubna added a comment -

        So, it seems that synchronization is not the main problem. It seems like there is a lot of work being re-done. When the template is parsed, each use of a macro in that template creates an instance of RuntimeMacro and a VelocimacroProxy. That's fine and to be expected. What surprised me is that when you render the template (even when cached), each render of that template creates its own VelocimacroProxy instance and a VMProxyArg instance. In other words, it appears that every time a macro is rendered, its content is re-parsed into an AST. That seems like overkill.

        I don't really have any more time to investigate, but there's got to be a better way to do this. But, if we can't find a way, then i'd want to revert the changes that caused this. Given the choice between this constant re-parsing and not being able to include macros via #parse calls, i would prefer the latter. Inconvenience is better than terrible performance. Hopefully, it won't come to that, and we'll just find a way to avoid all this repetition.

        Nathan Bubna added a comment -

        Ok, i've got your testbench running for me now. Thanks! It's nice to be able to test improvements. I only wish it could tell me what is happening within those slow methods that is taking so long. I can assume it is thread blocking, but it would be nice to know for sure. I suppose if i can reduce the synchronization and get big performance gains, then that will tell me.

        Jarkko Viinamäki added a comment -

        Nice work, Nathan! I ran tests with 250 simultaneous threads using the test, and I didn't get the incorrect MacroOverflowException anymore. I also tested that infinite recursion correctly causes the exception. So your fixes seem to work well.

        The performance is still rotten, though (with 350 threads, 50 loops: 1.5: 1.3 seconds, 1.6-dev: 46 seconds).

        Nathan Bubna added a comment -

        Yeah, it was for VELOCITY-297. But really, you may have committed it, but we're all supposed to be reviewing commits. We all missed it.

        Will Glass-Husain added a comment -

        You're talking about VELOCITY-297, right? Mea culpa as the committer who reviewed the patch.

        I think it's fine to change InternalContextAdapter – we've done this before and I don't consider it part of our public API.

        Nathan Bubna added a comment -

        Sorry, forgot for a second that performance issues were listed for this issue too. Re-opening to look at those.

        Nathan Bubna added a comment -

        Ok, that should do it. I haven't installed and run your testbench though, Jarkko. If you wouldn't mind testing this, i'd be much obliged.

        Nathan Bubna added a comment -

        I'll make it so setting a value of <=0 turns it off, but even so, the synchronization all has to go for this to work properly. The macro call stack cannot work as it's supposed to if it is shared across threads. So, i'm moving it into the InternalHousekeepingContext interface (the impl is in InternalContextBase). This way, so long as no one shares a context directly (wrapped is fine) across threads (which is a Very Bad Idea to begin with), this will work as it is supposed to.

        Really, i'm a bit surprised this macro call depth guard got committed in the first place. It's got lots of problems and the testcase is atrociously broken. Fix will be checked in momentarily. If anyone out there is implementing InternalContextAdapter in their own apps (perhaps for a custom directive? but even that's unlikely), this will break their code. Sorry, the fix is absolutely necessary, and i don't believe there's any other way to do it.
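
        A hypothetical sketch of a per-context call stack along these lines (names are illustrative, and the configuration-key comment is an assumption): because each rendering thread has its own context, depth tracking needs no synchronization, and a configured limit of <=0 disables the check entirely.

        import java.util.ArrayList;
        import java.util.List;

        public class MacroCallStackSketch {
            private final List stack = new ArrayList();   // macro names, innermost last
            private final int maxDepth;                   // e.g. from velocimacro.max.depth

            public MacroCallStackSketch(int maxDepth) {
                this.maxDepth = maxDepth;
            }

            public void push(String macroName) {
                if (maxDepth > 0 && stack.size() >= maxDepth) {
                    throw new RuntimeException("Max macro call depth (" + maxDepth
                        + ") exceeded. Call stack: " + stack);
                }
                stack.add(macroName);
            }

            public void pop() {
                stack.remove(stack.size() - 1);
            }
        }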

        Jarkko Viinamäki added a comment -

        Also, I think it would be nice if one could switch that recursive call tracking off in the configuration file, since it's quite expensive (there are some tough synchronized blocks).

        Nathan Bubna added a comment -

        As i said in VELOCITY-606, the bug found above looks like VelocimacroFactory.startMacroRendering is mis-designed: it limits macro concurrency when it is trying to limit macro call depth. I don't think there's any way to fix the startMacroRendering(...) method. The macro call stack probably has to be stored in the VMContext/InternalHousekeepingContext, just like the template stack. I'll see if i can make that work right before i worry about performance.


          People

          • Assignee: Unassigned
          • Reporter: Jarkko Viinamäki
          • Votes: 0
          • Watchers: 1