Apache Any23
ANY23-75

Improve runtime of the Microdata extractor on documents with many relations.

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.7.0
    • Fix Version/s: 0.7.0
    • Component/s: None
    • Labels:
      None

      Description

      I've been running Any23 on a large web crawler dump. For certain documents with many Microdata relations, the method MicrodataParser.getItemProps() becomes very slow, and processing a single document can take several minutes. An example of a problematic page can be seen here: http://dreamtime.fftunes.com/

      I'll attach a patch that greatly improves the performance of this method. Could someone have a look at it and include it in the next release if possible?

          Activity

          Lewis John McGibbney added a comment -

          Setting for 0.7.0-incubating.
          Is it possible for you to explain a bit about the patch and the underlying reason why the existing parser implementation seems to clog up? Also, this is really trivial, but can you please check whether the coding format differs from the existing code? My initial impression of the patch is great, and it's a nice one to have caught, but some additional explanation would really help us out. Thank you very much. Lewis

          Michele Mostarda added a comment -

          I read the patch and it is really clear: it is an optimization that improves performance when detecting nested node paths. I will integrate it in a couple of days.

          Timothy Potter added a comment -

          Thanks Michele.

          I'll explain the change anyway:

          The current implementation is slow on pages with lots of items because it does nested iterations over all itemscope and itemprop nodes under the given scopeNode. In the inner loop it builds XPath strings for each node to test whether the nodes are related. That relationship is already inherent in the tree structure of the DOM, so the patch changes the code to do a traversal of the DOM tree limited to the nodes under the given scopeNode. Reading the code itself is probably the best way to understand the change.
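
The idea described in this comment can be illustrated with a minimal sketch. This is not the attached patch or the actual MicrodataParser code. The class name ItemPropTraversalSketch and the helpers collectItemProps and parse are hypothetical names chosen for illustration. The rule of not descending into a nested itemscope is an assumption based on how Microdata items nest, and the sample markup is made up. The sketch visits each node under the scope node once and does not build or compare any XPath strings.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical illustration only; not the Any23 implementation.
final class ItemPropTraversalSketch {

    /** Collects the itemprop elements that belong to scopeNode in one pass over its subtree. */
    public static List<Element> collectItemProps(Node scopeNode) {
        List<Element> result = new ArrayList<>();
        collect(scopeNode, result);
        return result;
    }

    private static void collect(Node parent, List<Element> result) {
        for (Node child = parent.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child.getNodeType() != Node.ELEMENT_NODE) {
                continue; // skip text, comments, etc.
            }
            Element element = (Element) child;
            if (element.hasAttribute("itemprop")) {
                result.add(element);
            }
            // Assumption: a nested itemscope starts a new item, so its
            // descendants belong to that item and are not visited here.
            if (!element.hasAttribute("itemscope")) {
                collect(element, result);
            }
        }
    }

    /** Parses a well-formed XML/XHTML string into a DOM document (helper for the demo). */
    public static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sample markup", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<div itemscope=''>"
                + "<span itemprop='name'>A</span>"
                + "<div itemprop='author' itemscope=''><span itemprop='nick'>B</span></div>"
                + "<p><span itemprop='url'>C</span></p>"
                + "</div>";
        Document doc = parse(xml);
        // Prints name, author, url (one per line); 'nick' belongs to the nested item.
        for (Element e : collectItemProps(doc.getDocumentElement())) {
            System.out.println(e.getAttribute("itemprop"));
        }
    }
}
```

Because every node is visited at most once, the cost grows linearly with the size of the subtree, whereas comparing every itemscope against every itemprop grows with the product of the two counts. The recursion is kept for brevity; an explicit stack would avoid deep call chains on very deeply nested documents.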

          Michele Mostarda added a comment -

          Fixed @1345154

          Hudson added a comment -

          Integrated in Any23-trunk #220 (See https://builds.apache.org/job/Any23-trunk/220/)
          Improved MicrodataParser performances. Related to issue #ANY23-75. (Revision 1345154)

          Result = SUCCESS
          mostarda :
          Files :

          • /incubator/any23/trunk/core/src/main/java/org/apache/any23/extractor/microdata/MicrodataParser.java
          Lewis John McGibbney added a comment -

          Bulk close for 0.7.0-incubating release


            People

            • Assignee: Michele Mostarda
            • Reporter: Timothy Potter
            • Votes: 0
            • Watchers: 3
