Apache Any23 / ANY23-58

HCardExtractor infinite loop and memory exhaustion

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.7.0
    • Fix Version/s: 0.7.0
    • Component/s: core
    • Labels:
      None
    • Environment:
      OpenJDK Runtime Environment (IcedTea6 1.11pre) (6b23~pre11-0ubuntu1.11.10.2) on Ubuntu

      Description

      On some HTML files, the HCardExtractor enters an infinite loop in the method fixIncludes(), specifically at the line node.appendChild(header.cloneNode(true));, and eventually exhausts memory. Attached are a test case and an example HTML file.
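
      The report does not state the root cause. One plausible scenario for a line like node.appendChild(header.cloneNode(true)) running away is that the referenced header is the node itself or one of its ancestors, so each clone re-includes the subtree that triggered it. The sketch below is not the fix committed to Any23; it is a minimal illustration, using only the standard org.w3c.dom API, of guarding such an append with an ancestor check. The class IncludeGuardSketch and the methods isSelfOrAncestor, includeHeader, and newDocument are hypothetical names introduced for this example.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical illustration; not the Any23 implementation.
public class IncludeGuardSketch {

    /** Returns true if candidate is node itself or one of node's ancestors. */
    static boolean isSelfOrAncestor(Node candidate, Node node) {
        for (Node n = node; n != null; n = n.getParentNode()) {
            if (n == candidate) {
                return true;
            }
        }
        return false;
    }

    /**
     * Appends a deep clone of header to cell, unless header is missing or
     * contains cell (a self-inclusion). Returns whether a clone was appended.
     */
    static boolean includeHeader(Node cell, Node header) {
        if (header == null || isSelfOrAncestor(header, cell)) {
            return false; // skip: including it would copy cell into itself
        }
        cell.appendChild(header.cloneNode(true));
        return true;
    }

    /** Creates an empty DOM document with the JDK's default parser. */
    static Document newDocument() throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
    }

    public static void main(String[] args) throws Exception {
        Document doc = newDocument();
        Element table = doc.createElement("table");
        Element row = doc.createElement("tr");
        Element th = doc.createElement("th");
        Element td = doc.createElement("td");
        doc.appendChild(table);
        table.appendChild(row);
        row.appendChild(th);
        row.appendChild(td);

        System.out.println(includeHeader(td, th));    // sibling header: true
        System.out.println(includeHeader(td, table)); // ancestor of td: false
    }
}
```

      Whether a skipped inclusion should also be reported as an extraction issue is a design choice this sketch leaves open.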

      1. HCardFailTest.java
        1.0 kB
        Hannes Mühleisen
      2. fail.html
        188 kB
        Hannes Mühleisen

        Activity

        Hannes Mühleisen added a comment -

        Problematic HTML file

        Hannes Mühleisen added a comment -

        JUnit test case

        Lewis John McGibbney added a comment -

        Hi Hannes. Off the top of your head, do you have a suggestion for a fix that we could phase into the 0.7.0-incubating release?
        If not, I think we should mark this for 0.8.0-incubating and get on with pushing a release. I agree this is a major bug, but as I have not looked into it in detail, there is not much more I can say beyond proposing to bump it to the next development cycle.

        Hannes Mühleisen added a comment -

        Hi Lewis, sorry, but I have no fix ready. For our use case, I removed the call to fixIncludes(), but this might not be the right thing to do in general. It's fine with me if this is pushed to a later release.

        Michele Mostarda added a comment -

        The issue has been reproduced.

        It seems to be a Xerces-related problem. Investigating.

        Mar 23, 2012 11:34:40 AM org.apache.any23.rdf.PopularPrefixes getPrefixes
        INFO: Loading prefixes from /org/apache/any23/prefixes/prefixes.properties
        Mar 23, 2012 11:34:40 AM org.apache.any23.extractor.SingleDocumentExtraction run
        INFO: Processing http://bob.example.com/
        
        java.lang.OutOfMemoryError: Java heap space
        	at org.apache.xerces.dom.NodeImpl.cloneNode(Unknown Source)
        	at org.apache.xerces.dom.ChildNode.cloneNode(Unknown Source)
        	at org.apache.xerces.dom.ParentNode.cloneNode(Unknown Source)
        	at org.apache.xerces.dom.ElementImpl.cloneNode(Unknown Source)
        	at org.apache.xerces.dom.ParentNode.cloneNode(Unknown Source)
        	at org.apache.xerces.dom.ElementImpl.cloneNode(Unknown Source)
        	at org.apache.xerces.dom.ParentNode.cloneNode(Unknown Source)
        	at org.apache.xerces.dom.ElementImpl.cloneNode(Unknown Source)
        	at org.apache.html.dom.HTMLTableRowElementImpl.cloneNode(Unknown Source)
        	at org.apache.xerces.dom.ParentNode.cloneNode(Unknown Source)
        	at org.apache.xerces.dom.ElementImpl.cloneNode(Unknown Source)
        	at org.apache.html.dom.HTMLTableSectionElementImpl.cloneNode(Unknown Source)
        	at org.apache.xerces.dom.ParentNode.cloneNode(Unknown Source)
        	at org.apache.xerces.dom.ElementImpl.cloneNode(Unknown Source)
        	at org.apache.html.dom.HTMLTableElementImpl.cloneNode(Unknown Source)
        	at org.apache.xerces.dom.ParentNode.cloneNode(Unknown Source)
        	at org.apache.xerces.dom.ElementImpl.cloneNode(Unknown Source)
        	at org.apache.xerces.dom.ParentNode.cloneNode(Unknown Source)
        	at org.apache.xerces.dom.ElementImpl.cloneNode(Unknown Source)
        	at org.apache.xerces.dom.ParentNode.cloneNode(Unknown Source)
        	at org.apache.xerces.dom.ElementImpl.cloneNode(Unknown Source)
        	at org.apache.xerces.dom.ParentNode.cloneNode(Unknown Source)
        	at org.apache.xerces.dom.ElementImpl.cloneNode(Unknown Source)
        	at org.apache.xerces.dom.ParentNode.cloneNode(Unknown Source)
        	at org.apache.xerces.dom.ElementImpl.cloneNode(Unknown Source)
        	at org.apache.xerces.dom.ParentNode.cloneNode(Unknown Source)
        	at org.apache.xerces.dom.ElementImpl.cloneNode(Unknown Source)
        	at org.apache.xerces.dom.ParentNode.cloneNode(Unknown Source)
        	at org.apache.xerces.dom.ElementImpl.cloneNode(Unknown Source)
        	at org.apache.xerces.dom.ParentNode.cloneNode(Unknown Source)
        	at org.apache.xerces.dom.ElementImpl.cloneNode(Unknown Source)
        	at org.apache.xerces.dom.ParentNode.cloneNode(Unknown Source)
        
        Lewis John McGibbney added a comment -

        OK, thanks Hannes. I'll mark this for 0.8.0-incubating for the time being; if a solution comes up in the meantime, we can commit it before the 0.7.0-incubating release.

        Thanks

        Michele Mostarda added a comment -

        Fixed @ r1304362.

        Hudson added a comment -

        Integrated in Any23-trunk #144 (See https://builds.apache.org/job/Any23-trunk/144/)
        Fixed issue with loop while computing inclusions in HCardExtractor (ANY23-58).
        Added regression test in HCardExtractorTest.
        While fixing this issue, another bug in extractor issue reporting
        was discovered and fixed (ANY23-62). (Revision 1304362)

        Result = UNSTABLE
        mostarda:
        Files:

        • /incubator/any23/trunk/core/src/main/java/org/apache/any23/extractor/ExtractionResultImpl.java
        • /incubator/any23/trunk/core/src/main/java/org/apache/any23/extractor/html/HCardExtractor.java
        • /incubator/any23/trunk/core/src/test/java/org/apache/any23/extractor/ExtractionResultImplTest.java
        • /incubator/any23/trunk/core/src/test/java/org/apache/any23/extractor/html/HCardExtractorTest.java
        • /incubator/any23/trunk/core/src/test/resources/microformats/hcard/infinite-loop.html
        Lewis John McGibbney added a comment -

        Great work Michele, big +1.

        Hannes Mühleisen added a comment -

        +1 from me, too! thanks!

        Lewis John McGibbney added a comment -

        Bulk close for 0.7.0-incubating release


          People

          • Assignee:
            Michele Mostarda
            Reporter:
            Hannes Mühleisen
          • Votes:
            0
            Watchers:
            1
