Details
Description
Good afternoon,
This web page (see below) when crawled by ManifoldCF causes severe errors in Solr and causes ManifoldCF to abort the current job.
I verified the error by sending the URL to tika-app 1.2 and 1.3.
I can't find any kind of a fix for this.
Please advise...
P.S. can you also provide a list of all tika supporting jars? (i.e. poi, jempbox etc etc)
Thanks,
Here's the HTML
<div id="leftcol"> <ul> <li><a href="/mission/sec/sec.html"> Security and Information Sciences Home ›</a> </li> <li><a href="/mission/sec/publications/-publications.html">Publications ›</a> </li> <li><a href="/mission/sec/corpora/corpora.html">Corpora ›</a> </li> <li><a href="/mission/sec/softwaretools/tools.html">Software Tools ›</a></li> <li><a href="/mission/sec/CSO/CSO.html"> Systems and Operations ›</a> <ul> <li><a href="/mission/sec/publications/-publications.html">Publications ›</a></li> <li><a href="/mission/sec/CSO/biographies/CSObios.html">Biographies ›</a></li> </ul> </li> <li><a href="/mission/sec/CST/CST.html"> Systems and Technology ›</a> </li> <li><a href="/mission/sec/CSA/CSA.html"> System Assessments ›</a> </li> <li><a href="/mission/sec/HLT/HLT.html">Human Language Technology ›</a> <li><a href="/mission/sec/computing/computing.html">Computing and Analytics ›</a></li> </ul> </div>
Here's the error:
Apr 03, 2013 4:23:23 PM org.apache.solr.common.SolrException log SEVERE: org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: XML parse error at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:225) at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1699) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:455) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:276) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:222) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:123) at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:581) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:171) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:99) at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:936) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118) at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:407) at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1004) at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:589) at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.run(NioEndpoint.java:1686) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:722) Caused by: org.apache.tika.exception.TikaException: XML parse error at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:78) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219) ... 21 more Caused by: org.xml.sax.SAXParseException; lineNumber: 4; columnNumber: 105; The entity "nbsp" was referenced, but not declared. at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(ErrorHandlerWrapper.java:198) at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.fatalError(ErrorHandlerWrapper.java:177) at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:441) at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:368) at com.sun.org.apache.xerces.internal.impl.XMLScanner.reportFatalError(XMLScanner.java:1388) at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEntityReference(XMLDocumentFragmentScannerImpl.java:1861) at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2994) at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:607) at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:116) at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:489) at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:835) at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:764) at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:123) at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1210) at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:568) at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl.parse(SAXParserImpl.java:302) at javax.xml.parsers.SAXParser.parse(SAXParser.java:195) at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:72) ... 25 more
Attachments
Issue Links
- is related to
-
TIKA-1102 Can we add <div> to the list of heuristics for bad html fragments?
- Resolved