[ANY23-76] Improve runtime of the Microformat extractor on documents with many relations. - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Trivial
Resolution: Fixed
Affects Version/s: 0.7.0
Fix Version/s: 0.7.0
Component/s: core
Labels: None

Description

For some large documents with many Microformat tuples, the extensive use of XPath in the DomUtils class causes Microformat extraction to be slow. I've marked this as trivial as it's a corner case.

To reproduce the problem the patch addresses, run the Microformat extractor on the following URL:
http://en.wikipedia.org/wiki/List_of_Nike_missile_locations

I include a patch that improves performance at the cost of code simplicity. I hope someone who is more involved in the project can decide whether it's a good idea to use the patch, or maybe address this issue in another way. The patch replaces commonly used XPath queries with DOM tree traversals, e.g. getting all nodes with 'class' attributes. On my machine, the time to parse the given document is reduced from around 105 seconds to 14 seconds.
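To illustrate the kind of replacement described above, here is a minimal sketch that collects all elements carrying a 'class' attribute, once with an XPath query and once with a plain DOM walk. It is not the code from the attached patch: the class name ClassAttributeScan and its methods are hypothetical, and it uses only the standard JAXP and org.w3c.dom APIs rather than Any23's DomUtils.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Any23: compares two ways of finding
// every element that has a 'class' attribute.
public class ClassAttributeScan {

    // XPath variant: the expression is compiled and evaluated on every call.
    static List<Node> findWithXPath(Node root) throws Exception {
        NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate("descendant-or-self::*[@class]", root, XPathConstants.NODESET);
        List<Node> result = new ArrayList<>();
        for (int i = 0; i < hits.getLength(); i++) {
            result.add(hits.item(i));
        }
        return result;
    }

    // Traversal variant: one pre-order walk over the tree, no XPath engine.
    static List<Node> findWithTraversal(Node root) {
        List<Node> result = new ArrayList<>();
        collect(root, result);
        return result;
    }

    // Recursive walk; assumes the document is not nested deeply enough
    // to exhaust the call stack.
    private static void collect(Node node, List<Node> out) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && ((Element) node).hasAttribute("class")) {
            out.add(node);
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collect(child, out);
        }
    }

    // Parses a small well-formed XML/XHTML string into a DOM document.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(
                "<div class=\"vcard\"><span class=\"fn\">A</span><p>plain</p></div>");
        System.out.println("XPath hits:     " + findWithXPath(doc).size());
        System.out.println("Traversal hits: " + findWithTraversal(doc).size());
    }
}
```

Both methods should return the same elements in document order. The traversal version is longer, which matches the trade-off noted above: more code in exchange for avoiding repeated XPath evaluation.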

Attachments

MicroformatSpeed.patch
16/Apr/12 10:40
6 kB
Timothy Potter

People

Assignee: Michele Mostarda

Reporter: Timothy Potter

Votes: 0

Watchers: 1

Dates

Created: 16/Apr/12 10:36

Updated: 25/Jun/12 16:59

Resolved: 21/Apr/12 14:07