Xerces2-J
  XERCESJ-970

Large comments are extremely slow to parse

    Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 2.2.0, 2.2.1, 2.3.0, 2.4.0, 2.5.0, 2.6.0, 2.6.1, 2.6.2
    • Fix Version/s: None
    • Component/s: XNI
    • Labels:
      None
    • Environment:
      Windows XP running Java 1.4.2

      Description

      Very large comments drastically increase the parsing time for both SAX and DOM implementations. Running the sax.Counter and dom.Counter samples with a 410KB file where the entire thing is uncommented results in parse times in the 100ms to 300ms range. However, if I comment out 95% of the file and run the same samples the parse times jump to between 40 and 50 seconds. I ran the same samples using the Aelfred parser shipped with Saxon 7.9 and, while the file with the large comment was slower than without the comment, it jumped by only 100ms or so.

      I briefly compared the code between the two parsers, and they don't look significantly different when it comes to handling comments. The only main difference I noticed was around low/high byte character checks. I suspect it is an inefficiency in the XMLStringBuffer class, but I'm not seeing anything.

      1. comments.txt
        3 kB
        Chris Simmons

        Activity

        Michael Glavassevich added a comment -

        What happens if you try parsing the file multiple times in the same JVM? If you're only doing a single run, my guess is that the performance difference may be due to class loading, initialization and other one-time costs.

        When an XMLStringBuffer reaches its capacity, it is doubled in size (a memory allocation), all of the characters from the old buffer are copied to the new one, and the old buffer is replaced with the new one (leaving the old buffer for garbage collection). On the first parse, the buffer used for holding the text of a comment would be resized over 10 times (its initial capacity is 32 characters) to accommodate 410 KB. On subsequent parses the buffer would already be large enough to handle comments of this size.
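As a sanity check on the numbers above, here is a small sketch (not the Xerces source; the 32-character initial capacity and the 410 KB comment size come from this thread) that counts the doublings a geometrically-growing buffer needs:

```java
public class ResizeCount {
    // Number of doublings needed to grow a buffer from `initial`
    // chars to at least `target` chars.
    static int resizesNeeded(int initial, int target) {
        int capacity = initial;
        int resizes = 0;
        while (capacity < target) {
            capacity *= 2;   // doubling strategy: geometric growth
            resizes++;
        }
        return resizes;
    }

    public static void main(String[] args) {
        // XMLStringBuffer starts at 32 chars; a 410 KB comment is
        // roughly 419,840 chars (assuming one char per byte).
        System.out.println(resizesNeeded(32, 410 * 1024)); // prints 14
    }
}
```

Fourteen resizes is consistent with the "over 10 times" estimate, and with doubling the total copy cost stays linear in the comment size.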

        Sean Griffin added a comment -

        You're right; the second parse was much faster. I ran the parsing through a profiler and found that the problem is localized to the XMLEntityScanner.scanData(String, XMLStringBuffer) method.

        Henning Moll added a comment - edited

        I stumbled upon the same problem; here is a small example to reproduce it. Please add Xerces2 explicitly to the classpath or make sure that you use a JRE 5. The integrated version of Xerces in JRE 6 does NOT show the problem (which suggests this bug has been fixed there):

        import java.io.BufferedInputStream;
        import java.io.FileInputStream;

        import javax.xml.parsers.SAXParser;
        import javax.xml.parsers.SAXParserFactory;

        import org.xml.sax.InputSource;
        import org.xml.sax.helpers.DefaultHandler;

        public class TTT extends DefaultHandler {

            public static void main(String[] args) throws Exception {
                new TTT().parse(null);
            }

            public void parse(String[] args) throws Exception {
                long before = System.currentTimeMillis();
                InputSource source = new InputSource(
                        new BufferedInputStream(new FileInputStream("garbage.xml")));
                SAXParserFactory factory = SAXParserFactory.newInstance();
                SAXParser parser = factory.newSAXParser();
                parser.parse(source, this);
                System.out.println(System.currentTimeMillis() - before);
            }
        }

        Create a dummy XML file "garbage.xml" with the following content:
        <?xml version='1.0' encoding='UTF-8' standalone='no'?>
        <blubb>
        ...
        [about 2Gig of CDATA]
        ...
        </blubb>

        The runtime of the example is very short with this version of the XML file. Now change the content to:
        <?xml version='1.0' encoding='UTF-8' standalone='no'?>
        <!--blubb>
        ...
        [about 2Gig of CDATA]
        ...
        </blubb-->

        Only one big comment, but the runtime increases extremely (on my system, about 110 ms vs. 50,000(!) ms).

        The interesting part: the Xerces bundled with JRE 5 does show this behaviour; the one from JRE 6 does NOT.

        Chris Simmons added a comment -

        Actually, the buffer doesn't double in size when you append a char array; it grows linearly. This char-array append method is what's used when parsing comments.

        The attached patch includes a test harness and fix.

        On my box, parsing the large comment takes over 8 seconds without the fix and about 1/8 of a second with it.

        We have a real life example where a client commented out 40k lines of XML with a similar resulting performance hit.
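The patch itself is in the attached file; as a rough illustration of the kind of change described above (the class, field, and method names here are hypothetical, not the actual Xerces code), a char-array append that doubles the backing array instead of growing it just enough looks like:

```java
// Hypothetical sketch of a growable char buffer whose char[] append
// uses geometric (doubling) growth instead of linear growth.
public class GrowableCharBuffer {
    private char[] ch = new char[32];  // same initial capacity as XMLStringBuffer
    private int length;

    public void append(char[] src, int offset, int count) {
        if (length + count > ch.length) {
            // Linear growth would use: newLength = length + count,
            // forcing a reallocation on nearly every append.
            int newLength = ch.length * 2;
            while (newLength < length + count) {
                newLength *= 2;        // keep doubling until the data fits
            }
            char[] tmp = new char[newLength];
            System.arraycopy(ch, 0, tmp, 0, length);
            ch = tmp;
        }
        System.arraycopy(src, offset, ch, length, count);
        length += count;
    }

    public int length() {
        return length;
    }
}
```

With doubling, the total number of characters copied during resizes stays proportional to the final buffer size, so the amortized cost per appended character is constant.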

        Chris Simmons added a comment -

        I think the execution time scales with the square of the size of the comment. Since the buffer grows linearly, appending a comment of n characters triggers on the order of n array copies, and because the array is also growing linearly, the average copy takes time linear in n, giving O(n^2) total work.
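A small simulation (hypothetical code, not taken from the patch) makes this concrete by counting the characters copied during resizes under each growth strategy:

```java
public class CopyCost {
    // Total characters copied during buffer resizes while appending n chars
    // in fixed-size chunks, under linear or doubling capacity growth.
    static long copiedChars(int n, int chunk, boolean doubling) {
        int capacity = 32;  // initial capacity, as in XMLStringBuffer
        int length = 0;
        long copied = 0;
        while (length < n) {
            int count = Math.min(chunk, n - length);
            if (length + count > capacity) {
                int newCap = doubling ? capacity * 2 : length + count;
                while (newCap < length + count) {
                    newCap *= 2;
                }
                copied += length;  // cost of copying the old buffer's contents
                capacity = newCap;
            }
            length += count;
        }
        return copied;
    }

    public static void main(String[] args) {
        int n = 1 << 20;  // append 1M chars in 64-char chunks
        // Linear growth copies on the order of n^2 / (2 * chunk) characters;
        // doubling copies fewer than 2n characters in total.
        System.out.println(copiedChars(n, 64, false));
        System.out.println(copiedChars(n, 64, true));
    }
}
```

For these parameters the linear strategy copies several billion characters while doubling copies only about a million, which matches the "seconds vs. a fraction of a second" difference reported above.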


          People

          • Assignee:
            Unassigned
          • Reporter:
            Sean Griffin
          • Votes:
            0
          • Watchers:
            2

            Dates

            • Created:
              Updated:
