[ODFTOOLKIT-424] PERFORMANCE: Reuse XMLFactrories, currently created too frequently during insertNode and setTextContent. - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.6.2-incubating
Component/s: odfdom
Labels:
- patch
- performance
Environment:
odfdom-java-0.8.11-incubating-SNAPSHOT, jdk1.8.0_77, MSWin7

Description

https://issues.apache.org/jira/secure/attachment/12795379/odftoolkit-performance-test.zip
This spreadsheet filling and copying test runs more slowly than desired.

One problem that contributes to the slow performance is that
DOMRDFaParser is created too frequently, potentially for every cell,
during insertNode and setTextContent.

Possible fixes, that reuse the DOMRDFaParser and its XMLOutputFactory and XMLEventFactory, can reduce the run time to 36-40% of the current runtime on 0.8.11 snapshot.

DIAGNOSIS

Profiling the test (in NetBeans) reveals the filesystem is being accessed proportional to the number of cells.

A culprit appears to be that
org.odftoolkit.odfdom.pkg.OdfFileDom.updateInContentMetadataCache(org.w3c.dom.Node)
is called by

org.odftoolkit.odfdom.dom.element.text.TextParagraphElementBase.onInsertNode ()
org.odftoolkit.odfdom.dom.element.table.TableTableCellElementBase.onInsertNode ()
org.odftoolkit.odfdom.dom.element.text.TextParagraphElementBase.setTextContent (String)

On every call updateInContentMetaDataCache calls

org.odftoolkit.odfdom.pkg.rdfa.DOMRDFaParser.createInstance(this.sink)
which in turn calls
- javax.xml.stream.XMLEventFactory.newInstance()
- javax.xml.stream.XMLOutputFactory.newInstance()
- org.odftoolkit.odfdom.pkg.rdfa.DOMRDFaParser(JenaSink, XMLOutputFactory, XMLEVentFactory, URIExtractor)
  which in turn calls
  - org.odftoolkit.odfdom.pkg.rdfa.RDFaParser(JenaSink, XMLOutputFactory, XMLEVentFactory, URIExtractor)
    which in turn calls
    - net.rootdev.javardfa.Parser(StatementSink)
      which in turn AGAIN calls
      - javax.xml.stream.XMLEventFactory.newInstance()
      - javax.xml.stream.XMLOutputFactory.newInstance()

The newInstance methods are expensive because they access the file system looking for service providers. Repeating such expensive calls for every cell is one reason this test case is slow.

POSSIBLE FIXES:

A. One approach is to address three opportunities to improve:

1. Make RDFaParser share its XMLEventFactory, XMLOutputFactory, and URIExtractor with its superclass javardfa.Parser, so that they are not duplicated within each parser.
2. Reuse XMLEventFactory and XMLOutputFactory from JenaSink like URIExtractor, so they do not have to be recreated on every call that creates a SAXRDFaParser or DOMRDFaParser.
3. Reuse SAXRDFaParser and DOMRDFaParser from the same JenaSink, so they do not have to be recreated on every call that needs an RDFa parser.

B. Another approach is to do 1 and 2 without 3.

This will eliminate repeatedly constructing the expensive factories for each cell, but leaves the repeated construction of the parsers.

C. A simpler approach is to only reuse DOMRDFaParser:

This will greatly reduce the frequency of creating the factories (including the duplicate ones). (SAXRDFaParser is only called by OdfFileDom in the initialize method, so it is not important to reuse.)

POSSIBLE CRITERIA:

Reducing computation:

Rough figures from brief runs (not repeated runs):
B reduces time to about 40% of the original.
C reduces time to about 37% of the original.
A reduces time to about 36% of the original.

'A' saves more computation than 'B' or 'C'.

Minimizing change:

The RDFa parsers are all code that overrides library superclass net.rootdev.javardfa.Parser with some modifications. The superclass was not designed for the modifications, so there is a large duplication of code. To preserve the similarities between the superclass code and the copied subclass code, it may be best to minimize modifications to copied code in the subclass parsers.

Approach 'C' meets this concern better than approach 'A' or 'B'.

Safety of reusing parsers:

Code inspection reveals fields:

sink: The primary result of the parser is to emitTriples, writing to the JenaSink that is already reused.

settings: The settings are currently not modified. (In fact RDFaParser.settings is not modifiable: the superclass enable/disable methods do not write the subclass settings, so the settings in use would not be changed even if they were called.)

extractor: is already reused.

locator: is initialized by setBase, which is called before each parse in OdfFileDom.

context: also initialized by setBase, which is called before each parse in OdfFileDom.

literalCollector: contains a stack of states grows on start tags and shrinks on end tags. If an exception is thrown for one parse, then it could be in an invalid state for the next parse.

So reusing a parser looks safe as long as a patch resets the literalCollector when a reusable parser throws an exception. Since LiteralCollector is a library class with private fields, its internal state is not accessible to be reset. So the easiest approach may be to assume that parse exceptions occur rarely: simply forget the broken parser when an exception occurs, and recreate it the next time the parser is needed.

Approach 'B' is safer than 'A' or 'C' because the parsers are still created for every cell.

(If these criteria are all important, I suggest approach 'C' unless problems are later uncovered, then back off to approach 'B' if they cannot be fixed.)

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

Issue-424-patches.zip
07/Apr/16 22:28
6 kB
Nimarukan

Issue Links

is depended upon by

ODFTOOLKIT-422 PERFORMANCE ISSUE: 'format load', 'put data' & (!) "APPEND SHEET" (!)

Open

PERFORMANCE: Reuse XMLFactrories, currently created too frequently during insertNode and setTextContent.

Details

Description

DIAGNOSIS

POSSIBLE FIXES:

A. One approach is to address three opportunities to improve:

B. Another approach is to do 1 and 2 without 3.

C. A simpler approach is to only reuse DOMRDFaParser:

POSSIBLE CRITERIA:

Reducing computation:

Minimizing change:

Safety of reusing parsers:

Attachments

Attachments

Issue Links

Activity

People

Dates