Apache Any23 (Retired) / ANY23-115

Empty spans seem to break ANY23

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.7.0
    • Fix Version/s: 0.9.0
    • Component/s: html-scraper, microdata
    • Labels: None
    • Environment: Any23.org public scraper

    Description

      One of the 2,000 URLs that show the problem:
      http://www.oceanexpert.net/viewMemberRecord.php?&memberID=20045

      The HTML fragment that triggers the problem seems to be the following (note the empty honorificPrefix span):
      <h1>
      Details of<span itemprop="name"> <span itemprop="honorificPrefix"></span> <span itemprop="givenName">Laury</span>  <span itemprop="familyName">Miller</span></span>
      </h1>
      (this markup may disappear from the page, as we may work around the problem on our side)

      Error message:
      Internal error.
      ================================================================
      java.lang.IllegalArgumentException: Invalid content ''
      at org.apache.any23.extractor.microdata.ItemPropValue.<init>(ItemPropValue.java:89)
      at org.apache.any23.extractor.microdata.MicrodataParser.getPropertyValue(MicrodataParser.java:341)
      at org.apache.any23.extractor.microdata.MicrodataParser.getItemProps(MicrodataParser.java:394)
      at org.apache.any23.extractor.microdata.MicrodataParser.getItemScope(MicrodataParser.java:471)
      at org.apache.any23.extractor.microdata.MicrodataParser.getMicrodata(MicrodataParser.java:186)
      at org.apache.any23.extractor.microdata.MicrodataParser.getMicrodata(MicrodataParser.java:203)
      at org.apache.any23.extractor.microdata.MicrodataExtractor.run(MicrodataExtractor.java:100)
      at org.apache.any23.extractor.microdata.MicrodataExtractor.run(MicrodataExtractor.java:62)
      at org.apache.any23.extractor.SingleDocumentExtraction.runExtractor(SingleDocumentExtraction.java:477)
      at org.apache.any23.extractor.SingleDocumentExtraction.run(SingleDocumentExtraction.java:260)
      at org.apache.any23.Any23.extract(Any23.java:294)
      at org.apache.any23.Any23.extract(Any23.java:446)
      at org.apache.any23.servlet.WebResponder.runExtraction(WebResponder.java:113)
      at org.apache.any23.servlet.Servlet.doGet(Servlet.java:74)
      at javax.servlet.http.HttpServlet.service(HttpServlet.java:617)
      at javax.servlet.http.HttpServlet.service(HttpServlet.java:717)
      at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290)
      at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
      at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
      at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
      at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
      at com.googlecode.psiprobe.Tomcat60AgentValve.invoke(Tomcat60AgentValve.java:30)
      at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
      at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
      at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
      at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:859)
      at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:602)
      at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
      at java.lang.Thread.run(Thread.java:662)
      ================================================================
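
      The trace points at a value check in the ItemPropValue constructor that rejects empty content. The sketch below is only an illustration of that failure mode and of one possible lenient alternative; it is not the attached patch and not Any23 code. The class EmptySpanSketch and the methods strictValue, lenientValue, sampleProps and collect are hypothetical names, and the exact form of the real check is an assumption inferred from the exception message.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the microdata value handling; not Any23 code.
final class EmptySpanSketch {

    // Assumed shape of the strict check: blank text is treated as invalid,
    // which yields the same message as in the report ("Invalid content ''").
    static String strictValue(String content) {
        if (content == null) {
            throw new NullPointerException("content cannot be null");
        }
        if (content.trim().isEmpty()) {
            throw new IllegalArgumentException("Invalid content '" + content + "'");
        }
        return content;
    }

    // One possible lenient variant: only null is rejected, so an empty
    // <span itemprop="..."></span> produces an empty string value.
    static String lenientValue(String content) {
        if (content == null) {
            throw new NullPointerException("content cannot be null");
        }
        return content;
    }

    // Text content of the three inner spans from the HTML in the description.
    static Map<String, String> sampleProps() {
        Map<String, String> raw = new LinkedHashMap<>();
        raw.put("honorificPrefix", ""); // the empty span
        raw.put("givenName", "Laury");
        raw.put("familyName", "Miller");
        return raw;
    }

    // Validates every property value with either the strict or lenient check.
    static Map<String, String> collect(Map<String, String> raw, boolean lenient) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : raw.entrySet()) {
            String value = lenient ? lenientValue(e.getValue()) : strictValue(e.getValue());
            out.put(e.getKey(), value);
        }
        return out;
    }

    public static void main(String[] args) {
        try {
            collect(sampleProps(), false);
        } catch (IllegalArgumentException ex) {
            // The empty honorificPrefix aborts the whole extraction.
            System.out.println("strict: " + ex.getMessage());
        }
        System.out.println("lenient: " + collect(sampleProps(), true));
    }
}
```

      In this sketch the strict path fails on the first property, so givenName and familyName are never reached, which matches the observation that a single empty span breaks extraction for the whole page. Whether the real fix keeps empty values or skips them is not stated in this report.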

      Attachments

        1. 0001-ANY23-115-Empty-spans-seem-to-break-ANY23.patch
          2 kB
          Lewis John McGibbney
        2. json-pretty-printer.html
          10 kB
          Lewis John McGibbney

            People

              Assignee: Unassigned
              Reporter: Christophe Dupriez (christophedupriez)
              Votes: 1
              Watchers: 4
