XalanJ2
  1. XalanJ2
  2. XALANJ-2540

Very inefficient default behaviour for looking up DTMManager

    Details

    • Type: Improvement Improvement
    • Status: Open
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: 2.7.1, 2.7
    • Fix Version/s: None
    • Component/s: DTM, XPath
    • Security Level: No security risk; visible to anyone (Ordinary problems in Xalan projects. Anybody can view the issue.)
    • Labels:
      None

      Description

      I have analysed an issue that has been bothering me for some time. When executing XPath evaluations, it looks like a very significant amount of time is spent in the initialisation of the XPathContext. I have asked this question on Stack Overflow and answered it myself:

      http://stackoverflow.com/questions/6340802/java-xpath-apache-jaxp-implementation-performance

      I think the default behaviour of
      org.apache.xml.dtm.ObjectFactory.lookUpFactoryClassName() is quite sub-optimal and should be improved, statically. I imagine, it is unlikely that this configuration is going to change once classes have been loaded. Hence, the fallback lookup of META-INF/service/org.apache.xml.dtm.DTMManager should only be done once.

      For reference, here's the question and answer again in JIRA:


      I have come to an astonishing conclusion that this:

      Element e = (Element) document.getElementsByTagName("SomeElementName").item(0);
      String result = ((Element) e).getTextContent();

      Seems to be an incredible 100x faster than this:

      // Accounts for 30%, can be cached
      XPathFactory factory = XPathFactory.newInstance();

      // Negligible
      XPath xpath = factory.newXPath();

      // Accounts for 70% (caching a compiled expression doesn't change much...)
      String result = (String) xpath.evaluate(
      "//SomeElementName", document, XPathConstants.STRING);

      I'm using the JVM's default implementation of JAXP:

      org.apache.xpath.jaxp.XPathFactoryImpl
      org.apache.xpath.jaxp.XPathImpl

      I'm really confused, because it's easy to see how JAXP could optimise the above XPath query to actually execute a simple getElementsByTagName() instead. But it doesn't seem to do that. This problem is limited to around 5-6 frequently used XPath calls, that are abstracted and hidden by an API. Those queries involve simple paths (e.g. /a/b/c, no variables, conditions) against an always available DOM Document only. So, if an optimisation can be done, it will be quite easy to achieve.


      I have debugged and profiled my test-case and Xalan/JAXP in general. I managed to identify the big major problem in

      org.apache.xml.dtm.ObjectFactory.lookUpFactoryClassName()

      It can be seen that every one of the 10k test XPath evaluations led to the classloader trying to lookup the DTMManager instance in some sort of default configuration. This configuration is not loaded into memory but accessed every time. Furthermore, this access seems to be protected by a lock on the ObjectFactory.class itself. When the access fails (by default), then the configuration is loaded from the xalan.jar file's

      META-INF/service/org.apache.xml.dtm.DTMManager

      configuration file. Every time!:

      Fortunately, this behaviour can be overridden by specifying a JVM parameter like this:

      -Dorg.apache.xml.dtm.DTMManager=
      org.apache.xml.dtm.ref.DTMManagerDefault

      or

      -Dcom.sun.org.apache.xml.internal.dtm.DTMManager=
      com.sun.org.apache.xml.internal.dtm.ref.DTMManagerDefault

      So here's a performance improvement overview for 10k consecutive XPath evaluations of //SomeNodeName against a 90k XML file (measured with System.nanoTime():

      measured library : Xalan 2.7.0 | Xalan 2.7.1 | Saxon-HE 9.3 | jaxen 1.1.3
      --------------------------------------------------------------------------------
      without optimisation : 10400ms | 4717ms | | 25500ms
      reusing XPathFactory : 5995ms | 2829ms | |
      reusing XPath : 5900ms | 2890ms | |
      reusing XPathExpression : 5800ms | 2915ms | 16000ms | 25000ms
      adding the JVM param : 1163ms | 761ms | n/a |

        Issue Links

          Activity

          Hide
          Lukas Eder added a comment -

          This issue had actually been discovered already in 2009 by Siarhei Pisarenka. He published his discovery in an SAP blog post:
          http://scn.sap.com/community/java/blog/2009/12/04/performance-improvements-in-nw-java-applications-with-xml-processing

          Show
          Lukas Eder added a comment - This issue had actually been discovered already in 2009 by Siarhei Pisarenka. He published his discovery in an SAP blog post: http://scn.sap.com/community/java/blog/2009/12/04/performance-improvements-in-nw-java-applications-with-xml-processing
          Hide
          Fleur Kelpin added a comment -

          It's worse. The lookup opens a ZipFile and this synchronizes my threads across different one-per-thread precompiled xpathexpression evals.

          Show
          Fleur Kelpin added a comment - It's worse. The lookup opens a ZipFile and this synchronizes my threads across different one-per-thread precompiled xpathexpression evals.
          Hide
          Lukas Eder added a comment -

          True, I hadn't thought of that

          Show
          Lukas Eder added a comment - True, I hadn't thought of that
          Hide
          Michael Glavassevich added a comment -

          It's probably not safe to return the same NodeList since those objects returned from the DOM are live [1]. That could allow the NodeList to change after it's been returned from the XPath evaluator.

          [1] http://www.w3.org/TR/DOM-Level-3-Core/core.html#td-live

          Show
          Michael Glavassevich added a comment - It's probably not safe to return the same NodeList since those objects returned from the DOM are live [1] . That could allow the NodeList to change after it's been returned from the XPath evaluator. [1] http://www.w3.org/TR/DOM-Level-3-Core/core.html#td-live
          Hide
          Lukas Eder added a comment -

          @Michael, you're probably right and that's +1 for Xerces. On the other hand, there's no obvious reason why Xalan's XPath of the form "//xxx" should behave any differently from getElementsByTagName("xxx"). It could even return the very same NodeList, and the behaviour would still be correct, even if you iterate through all items...

          Show
          Lukas Eder added a comment - @Michael, you're probably right and that's +1 for Xerces. On the other hand, there's no obvious reason why Xalan's XPath of the form "//xxx" should behave any differently from getElementsByTagName("xxx"). It could even return the very same NodeList, and the behaviour would still be correct, even if you iterate through all items...
          Hide
          Michael Glavassevich added a comment -

          Element e = (Element) document.getElementsByTagName("SomeElementName").item(0);
          String result = ((Element) e).getTextContent();

          is very fast because Xerces' implementation of getElementsByTagName() is lazy. It immediately returns a NodeList which is initially empty and fills it as you call methods on it. If you only call item(0) on it, it only walks up to the first element which matches the tag name you specified. If that happens to be the root element or one very early in the document it will return quickly.

          Show
          Michael Glavassevich added a comment - Element e = (Element) document.getElementsByTagName("SomeElementName").item(0); String result = ((Element) e).getTextContent(); is very fast because Xerces' implementation of getElementsByTagName() is lazy. It immediately returns a NodeList which is initially empty and fills it as you call methods on it. If you only call item(0) on it, it only walks up to the first element which matches the tag name you specified. If that happens to be the root element or one very early in the document it will return quickly.
          Hide
          Murray Williams added a comment -

          I get a similar improvement with my app when setting -Dcom.sun.org.apache.xml.internal.dtm.DTMManager=com.sun.org.apache.xml.internal.dtm.ref.DTMManagerDefault

          • Thanks!
          Show
          Murray Williams added a comment - I get a similar improvement with my app when setting -Dcom.sun.org.apache.xml.internal.dtm.DTMManager=com.sun.org.apache.xml.internal.dtm.ref.DTMManagerDefault Thanks!

            People

            • Assignee:
              Unassigned
              Reporter:
              Lukas Eder
            • Votes:
              22 Vote for this issue
              Watchers:
              13 Start watching this issue

              Dates

              • Created:
                Updated:

                Development