Uploaded image for project: 'ODF Toolkit'
  1. ODF Toolkit

PERFORMANCE: Reuse XMLFactrories, currently created too frequently during insertNode and setTextContent.



    • Improvement
    • Status: Resolved
    • Minor
    • Resolution: Fixed
    • None
    • 0.6.2-incubating
    • odfdom
    • odfdom-java-0.8.11-incubating-SNAPSHOT, jdk1.8.0_77, MSWin7


      This spreadsheet filling and copying test runs more slowly than desired.

      One problem that contributes to the slow performance is that
      DOMRDFaParser is created too frequently, potentially for every cell,
      during insertNode and setTextContent.

      Possible fixes, that reuse the DOMRDFaParser and its XMLOutputFactory and XMLEventFactory, can reduce the run time to 36-40% of the current runtime on 0.8.11 snapshot.


      Profiling the test (in NetBeans) reveals the filesystem is being accessed proportional to the number of cells.

      A culprit appears to be that
      is called by

      • org.odftoolkit.odfdom.dom.element.text.TextParagraphElementBase.onInsertNode ()
      • org.odftoolkit.odfdom.dom.element.table.TableTableCellElementBase.onInsertNode ()
      • org.odftoolkit.odfdom.dom.element.text.TextParagraphElementBase.setTextContent (String)

      On every call updateInContentMetaDataCache calls

      • org.odftoolkit.odfdom.pkg.rdfa.DOMRDFaParser.createInstance(this.sink)
        which in turn calls
        • javax.xml.stream.XMLEventFactory.newInstance()
        • javax.xml.stream.XMLOutputFactory.newInstance()
        • org.odftoolkit.odfdom.pkg.rdfa.DOMRDFaParser(JenaSink, XMLOutputFactory, XMLEVentFactory, URIExtractor)
          which in turn calls
          • org.odftoolkit.odfdom.pkg.rdfa.RDFaParser(JenaSink, XMLOutputFactory, XMLEVentFactory, URIExtractor)
            which in turn calls
            • net.rootdev.javardfa.Parser(StatementSink)
              which in turn AGAIN calls
              • javax.xml.stream.XMLEventFactory.newInstance()
              • javax.xml.stream.XMLOutputFactory.newInstance()

      The newInstance methods are expensive because they access the file system looking for service providers. Repeating such expensive calls for every cell is one reason this test case is slow.


      A. One approach is to address three opportunities to improve:
      • 1. Make RDFaParser share its XMLEventFactory, XMLOutputFactory, and URIExtractor with its superclass javardfa.Parser, so that they are not duplicated within each parser.
      • 2. Reuse XMLEventFactory and XMLOutputFactory from JenaSink like URIExtractor, so they do not have to be recreated on every call that creates a SAXRDFaParser or DOMRDFaParser.
      • 3. Reuse SAXRDFaParser and DOMRDFaParser from the same JenaSink, so they do not have to be recreated on every call that needs an RDFa parser.
      B. Another approach is to do 1 and 2 without 3.

      This will eliminate repeatedly constructing the expensive factories for each cell, but leaves the repeated construction of the parsers.

      C. A simpler approach is to only reuse DOMRDFaParser:

      This will greatly reduce the frequency of creating the factories (including the duplicate ones). (SAXRDFaParser is only called by OdfFileDom in the initialize method, so it is not important to reuse.)


      Reducing computation:

      Rough figures from brief runs (not repeated runs):
      B reduces time to about 40% of the original.
      C reduces time to about 37% of the original.
      A reduces time to about 36% of the original.

      'A' saves more computation than 'B' or 'C'.

      Minimizing change:

      The RDFa parsers are all code that overrides library superclass net.rootdev.javardfa.Parser with some modifications. The superclass was not designed for the modifications, so there is a large duplication of code. To preserve the similarities between the superclass code and the copied subclass code, it may be best to minimize modifications to copied code in the subclass parsers.

      Approach 'C' meets this concern better than approach 'A' or 'B'.

      Safety of reusing parsers:

      Code inspection reveals fields:

      • sink: The primary result of the parser is to emitTriples, writing to the JenaSink that is already reused.
      • settings: The settings are currently not modified. (In fact RDFaParser.settings is not modifiable: the superclass enable/disable methods do not write the subclass settings, so the settings in use would not be changed even if they were called.)
      • extractor: is already reused.
      • locator: is initialized by setBase, which is called before each parse in OdfFileDom.
      • context: also initialized by setBase, which is called before each parse in OdfFileDom.
      • literalCollector: contains a stack of states grows on start tags and shrinks on end tags. If an exception is thrown for one parse, then it could be in an invalid state for the next parse.

      So reusing a parser looks safe as long as a patch resets the literalCollector when a reusable parser throws an exception. Since LiteralCollector is a library class with private fields, its internal state is not accessible to be reset. So the easiest approach may be to assume that parse exceptions occur rarely: simply forget the broken parser when an exception occurs, and recreate it the next time the parser is needed.

      Approach 'B' is safer than 'A' or 'C' because the parsers are still created for every cell.

      (If these criteria are all important, I suggest approach 'C' unless problems are later uncovered, then back off to approach 'B' if they cannot be fixed.)


        1. Issue-424-patches.zip
          6 kB

        Issue Links



              svanteschubert Svante Schubert
              nimarukan Nimarukan
              0 Vote for this issue
              2 Start watching this issue