Xerces2-J
XERCESJ-1276

Improve performance of XML Schema Identity-constraint validation --- XMLSchemaValidator$ValueStoreBase.contains() is painfully slow.

    Details

      Description

      Under certain conditions, the contains() method in XMLSchemaValidator$ValueStoreBase can cripple the performance of parsing and validation.

      I'm not sure exactly what those conditions are, but as a guide: I was using JAXB2 to deserialize a 22 MB XML file. Without schema validation it took 5 seconds; with validation it took over 3 minutes (JDK 1.5.0_10 on Win32). My profiler pointed the finger squarely at that method in XMLSchemaValidator.

      Suspicions were aroused further when seeing this comment in the source:

      public boolean contains() {
          // REVISIT: we can improve performance by using hash codes, instead of
          // traversing global vector that could be quite large.

      This is present in Xerces 2.6.2 as bundled with JDK 1.5.0_10, and also in the source for 2.9.1.
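
      For context, here is a minimal illustration (not the actual Xerces source) of why the method degrades: each call scans the entire vector of previously seen tuples, so checking n tuples costs O(n^2) overall, which is exactly the cost the REVISIT comment suggests avoiding via hash codes.

      import java.util.ArrayList;
      import java.util.List;

      // Hypothetical simplification of the ValueStoreBase lookup: every contains()
      // call walks the full list of previously stored key/unique tuples.
      class LinearTupleStore {
          private final List<List<Object>> seenTuples = new ArrayList<List<Object>>();

          // O(n) per call; with one call per tuple, identity-constraint checking
          // for the whole document becomes O(n^2).
          boolean contains(List<Object> tuple) {
              for (List<Object> stored : seenTuples) {
                  if (stored.equals(tuple)) {
                      return true;
                  }
              }
              return false;
          }

          void add(List<Object> tuple) {
              seenTuples.add(tuple);
          }
      }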

      1. XMLSchemaValidator.java
        192 kB
        Natan Cox
      2. Xerces-J-src.2.11.0_patch1276.txt
        18 kB
        Jens Dittrich
      3. xerces-value-store.txt
        18 kB
        Chris Simmons
      4. xerces-binaries-patched-over-2.11.0.zip
        1.38 MB
        Deepak Kumar

        Activity

        Natan Cox added a comment - edited

        I had the same problem. The cause is that contains() loops over a Vector for every comparison.

        I created a fix that improves performance from 5000 s to 27 s on my laptop for a big file.
        The file has an xs:key constraint with 500,000 values.

        Solution

        • I added an fValueSet to do faster lookups.
        • It is filled and cleared everywhere fValues is filled or cleared.
        • I added an "isContainsCandidate" check to rule out any candidates that do not have all of their values in fValueSet.
        • If a candidate cannot be ruled out (which should be the exception), the original algorithm kicks in again. A sketch of the idea follows after the note below.

        Note: I used the 2.11.0 code as my base!
        If the formatting is not up to the Xerces-J standard, I sincerely apologize.
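
        A rough sketch of the approach, assuming simplified value types; fValueSet and isContainsCandidate are the names from the description above, everything else is illustrative rather than the actual patch:

        import java.util.HashSet;
        import java.util.Set;
        import java.util.Vector;

        // Simplified stand-in for ValueStoreBase with the extra hash set described above.
        class ValueStoreSketch {
            // Existing flat storage of all tuple values (as in fValues).
            private final Vector<Object> fValues = new Vector<Object>();
            // New: every value also goes into a HashSet for cheap membership tests.
            private final Set<Object> fValueSet = new HashSet<Object>();

            void addValue(Object value) {
                fValues.add(value);   // filled as before
                fValueSet.add(value); // kept in sync wherever fValues is filled or cleared
            }

            void clear() {
                fValues.clear();
                fValueSet.clear();
            }

            // Cheap pre-check: a tuple can only be contained if every one of its
            // values occurs in fValueSet; most non-matching tuples are ruled out here.
            boolean isContainsCandidate(Object[] tuple) {
                for (Object v : tuple) {
                    if (!fValueSet.contains(v)) {
                        return false;
                    }
                }
                return true;
            }

            boolean contains(Object[] tuple) {
                if (!isContainsCandidate(tuple)) {
                    return false;                 // ruled out without scanning fValues
                }
                return originalLinearScan(tuple); // the exception: fall back to the old algorithm
            }

            private boolean originalLinearScan(Object[] tuple) {
                // Placeholder for the original Vector traversal in contains().
                return true;
            }
        }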

        Jens Dittrich added a comment - edited

        All,

        I did some work on the Xerces-J 2.11 sources and made the following changes:

        • I implemented a TreeSet<List<Comparable>> for storing the tuples of unique and key constraints in XMLSchemaValidator$ValueStoreBase. (Update, to be more precise: I store the value tuples occurring in unique/key constraints as List<Comparable> instances; e.g., if the fields are [@attribute, .], I store 2-tuples in instances of the mentioned class. The concrete classes are determined from the XSD types and may be, e.g., xsd:string and xsd:list (of xsd:decimal). The TreeSet contains the tuples, similar to a database index. A sketch of this tuple index follows after this list.)
        • I decided not to use a HashSet, as my use case requires a dynamically growing data structure: I am validating on the fly via JAXB and do not know the number of entries in advance. As I understand the HashMap::resize implementation, it runs in O(n), right?
        • I made every interface in org.apache.xerces.xs.datatypes extend java.lang.Comparable and implemented the respective operations in org.apache.xerces.impl.dv.* where required, via lexicographical comparison; see ObjectListImpl.compareTo, for instance.
        • I had to add some casts in the class XSSimpleTypeDecl and one in SchemaGrammar. That is where I would expect ClassCastExceptions during the Xerces unit tests, which are not available to me.
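
        A minimal sketch of the tuple index from the first bullet; the class and method names here are illustrative assumptions, not the actual patch:

        import java.util.ArrayList;
        import java.util.List;
        import java.util.TreeSet;

        // Illustrative tuple type: compares element-wise (lexicographically), so tuples
        // of Comparable values can be kept sorted in a TreeSet, like a database index.
        @SuppressWarnings({"rawtypes", "unchecked"})
        class ValueTuple implements Comparable<ValueTuple> {
            private final List<Comparable> values;

            ValueTuple(List<Comparable> values) {
                this.values = values;
            }

            public int compareTo(ValueTuple other) {
                // Tuples of one constraint always have the same arity (one entry per field).
                for (int i = 0; i < values.size(); i++) {
                    int c = values.get(i).compareTo(other.values.get(i));
                    if (c != 0) {
                        return c;
                    }
                }
                return 0;
            }
        }

        // Usage: duplicate detection for a key/unique constraint in O(log n) per tuple.
        @SuppressWarnings({"rawtypes", "unchecked"})
        class TupleIndexDemo {
            public static void main(String[] args) {
                TreeSet<ValueTuple> index = new TreeSet<ValueTuple>();
                List<Comparable> tuple = new ArrayList<Comparable>();
                tuple.add("id-1");
                tuple.add(Integer.valueOf(42));
                System.out.println(index.add(new ValueTuple(tuple))); // true: first occurrence
                System.out.println(index.add(new ValueTuple(tuple))); // false: duplicate key value
            }
        }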

        Regards,
        Jens.

        Jens Dittrich added a comment

        Michael, can you please patch 2.11 (downloadable from the web) and run your unit tests? Please let me know the outcome; see my other comment for reference.

        Jens.

        Chris Simmons added a comment

        Here's another patch. I took a different approach: I created a Value class to encapsulate the components being stored and put the instances in a hash set, which fixes the bottleneck without requiring values to implement an ordering (and it's probably quicker anyway).

        There is a subtle difference that is presumably a bug either in my patch or in the original code. See the TODO in the patch (line 3826). The old code copied the values but not the value types, whereas now they are inevitably paired together. They can't both be right. Was this deliberate or an oversight?
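
        For illustration, a bare-bones version of such a Value wrapper (the field names and the surrounding store are assumptions, not the actual patch):

        import java.util.HashSet;
        import java.util.Set;

        // Wrapper that pairs each stored value with its value type, so equality
        // and the hash code cover both components at once.
        final class Value {
            private final Object actualValue;
            private final Object valueType;

            Value(Object actualValue, Object valueType) {
                this.actualValue = actualValue;
                this.valueType = valueType;
            }

            @Override
            public boolean equals(Object o) {
                if (this == o) {
                    return true;
                }
                if (!(o instanceof Value)) {
                    return false;
                }
                Value other = (Value) o;
                return actualValue.equals(other.actualValue)
                        && valueType.equals(other.valueType);
            }

            @Override
            public int hashCode() {
                return 31 * actualValue.hashCode() + valueType.hashCode();
            }
        }

        // Membership tests against a HashSet<Value> are O(1) on average, and no
        // ordering (Comparable) has to be defined on the stored values.
        class HashedValueStore {
            private final Set<Value> seen = new HashSet<Value>();

            // Returns true if the (value, type) pair was not seen before.
            boolean addIfAbsent(Object actualValue, Object valueType) {
                return seen.add(new Value(actualValue, valueType));
            }
        }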

        Deepak Kumar added a comment - edited

        I am experiencing a very similar problem, but with a significantly larger impact. Attached is a zip holding the binaries with the patch pulled in (which makes effective use of hashCode() and equals()); below is a snippet of the binary's manifest:

        Manifest-Version: 1.0
        Ant-Version: Apache Ant(TM) version 1.8.3 compiled on February 26 2012
        Created-By: 1.6.0_32-ea (Sun Microsystems Inc.)

        Problem details:
        ----------------

        I have a compressed input stream file of roughly 25 MB (24.4 MiB) holding XML. Compression is done with the java.util.zip compression/decompression APIs using the default strategy, and the file can come close to 500 MB when inflated.

        A simple piece of code is deployed as a webapp in Tomcat 6.14 - Tomcat 7.0.50 (with Java 1.6.30 & 1.6.32) to read in the compressed file and run an XML parser on it. It takes nearly 30 minutes to parse fully on a 4-core i5 2.5 GHz laptop (nothing in this process is parallelized for any kind of optimization). This has been checked and confirmed both with the Xerces binaries (2.6 and 2.11) explicitly on the classpath, so that Xerces takes control of the entire parsing, and with the JDK's default parser implementation, which is largely the same code as Xerces.

        Across multiple runs, the code below in Xerces has been identified (via multiple profiling tools) as the hotspot. It chokes up entirely due to the nested looping over significantly larger value indexes (potentially megabytes in size), which also lines up with the REVISIT comment.

        org.apache.xerces.internal.impl.xs.XMLSchemaValidator#ValueStoreBase.contains()
        // REVISIT: we can improve performance by using hash codes, instead of
        // traversing global vector that could be quite large.
        ..........

        [NOTE] Interestingly, the same piece of code runs perfectly (with both the JDK and the Xerces implementations) within a minute from Eclipse, and even with a plain "java -classpath ... ParserTest", without any significant JVM hotspot indications, which raises the question of whether Tomcat is doing something internally during the parsing.

        With the patch I am now able to run it within a minute inside Tomcat as well; the attached binaries can be used as a drop-in replacement by people facing this problem.

        [ATTENTION] On a different note: with the existing Xerces binaries, if the application attempts to re-process the XML files, even in a different thread, it severely impacts the execution of the other operational threads. The entire webapp appears to start freezing randomly and, strangely, the parsing takes even longer (close to 2x) even with enough memory allocated. I am not sure whether the issue would persist with other application servers such as GlassFish or Jetty, or whether it is purely tied to Tomcat.

        --Deepak


          People

          • Assignee:
            Unassigned
          • Reporter:
            Kenny MacLeod
          • Votes:
            6
          • Watchers:
            9