TIKA-1101

XML parse error caused by org.xml.sax.SAXParseException; The entity "nbsp" was referenced, but not declared

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Not A Problem
    • 1.2, 1.3
    • 1.2, 1.3
    • None
    • None
    • I'm using Solr 4.0 final with Tika 1.2 and ManifoldCF v1.2 dev on Tomcat 7 (RHL).

    Description

      Good afternoon,
      When ManifoldCF crawls the web page below, Solr logs SEVERE errors and ManifoldCF aborts the current job.
      I verified the error by sending the URL to tika-app 1.2 and 1.3.
      I can't find a fix for this.
      Please advise...
      P.S. Can you also provide a list of all the supporting JARs Tika depends on (e.g. POI, JempBox, etc.)?
      Thanks,

      Here's the HTML:

      <div id="leftcol">
      	  <ul>
              <li><a href="/mission/sec/sec.html"> Security and Information Sciences Home&nbsp;&rsaquo;</a>        </li>
              <li><a href="/mission/sec/publications/-publications.html">Publications&nbsp;&rsaquo;</a> </li>
              <li><a href="/mission/sec/corpora/corpora.html">Corpora&nbsp;&rsaquo;</a> </li>
              <li><a href="/mission/sec/softwaretools/tools.html">Software Tools&nbsp;&rsaquo;</a></li>
              <li><a href="/mission/sec/CSO/CSO.html"> Systems and Operations&nbsp;&rsaquo;</a>
                <ul>
                  <li><a href="/mission/sec/publications/-publications.html">Publications &rsaquo;</a></li>
                  <li><a href="/mission/sec/CSO/biographies/CSObios.html">Biographies&nbsp;&rsaquo;</a></li>
                </ul>
              </li>
              <li><a href="/mission/sec/CST/CST.html"> Systems and Technology&nbsp;&rsaquo;</a> </li>
              <li><a href="/mission/sec/CSA/CSA.html"> System Assessments&nbsp;&rsaquo;</a> </li>
      	    <li><a href="/mission/sec/HLT/HLT.html">Human Language Technology&nbsp;&rsaquo;</a>
      <li><a href="/mission/sec/computing/computing.html">Computing and Analytics&nbsp;&rsaquo;</a></li>
        </ul>
      </div>
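
      The fragment above uses the HTML named entities nbsp and rsaquo. XML 1.0 predefines only five named entities (amp, lt, gt, apos, quot), so any other name has to be declared in a DTD before an XML parser will accept it. The following is a minimal sketch for listing such names in a piece of markup. The class name EntityScan and the method undeclaredNames are hypothetical, and the regular expression is a simplification that only matches ASCII alphanumeric entity names.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika, Solr, or ManifoldCF.
class EntityScan {

    // The five named entities that XML 1.0 predefines.
    private static final Set<String> XML_PREDEFINED =
            new HashSet<>(Arrays.asList("amp", "lt", "gt", "apos", "quot"));

    // Simplified pattern: matches &name; where name is ASCII alphanumeric.
    // Numeric references such as &#160; are intentionally not matched.
    private static final Pattern NAMED_REF = Pattern.compile("&([A-Za-z][A-Za-z0-9]*);");

    /** Returns the named entities in the markup that XML does not predefine, in order of first use. */
    static Set<String> undeclaredNames(String markup) {
        Set<String> names = new LinkedHashSet<>();
        Matcher m = NAMED_REF.matcher(markup);
        while (m.find()) {
            String name = m.group(1);
            if (!XML_PREDEFINED.contains(name)) {
                names.add(name);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // One line taken from the HTML sample in this report.
        String line = "<li><a href=\"/mission/sec/corpora/corpora.html\">Corpora&nbsp;&rsaquo;</a> </li>";
        System.out.println(undeclaredNames(line)); // prints [nbsp, rsaquo]
    }
}
```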
      

      Here's the error:

      Apr 03, 2013 4:23:23 PM org.apache.solr.common.SolrException log
      SEVERE: org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: XML parse error
      	at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:225)
      	at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
      	at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
      	at org.apache.solr.core.SolrCore.execute(SolrCore.java:1699)
      	at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:455)
      	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:276)
      	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
      	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
      	at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:222)
      	at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:123)
      	at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:581)
      	at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:171)
      	at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:99)
      	at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:936)
      	at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
      	at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:407)
      	at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1004)
      	at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:589)
      	at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.run(NioEndpoint.java:1686)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
      	at java.lang.Thread.run(Thread.java:722)
      Caused by: org.apache.tika.exception.TikaException: XML parse error
      	at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:78)
      	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
      	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
      	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
      	at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
      	... 21 more
      Caused by: org.xml.sax.SAXParseException; lineNumber: 4; columnNumber: 105; The entity "nbsp" was referenced, but not declared.
      	at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(ErrorHandlerWrapper.java:198)
      	at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.fatalError(ErrorHandlerWrapper.java:177)
      	at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:441)
      	at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:368)
      	at com.sun.org.apache.xerces.internal.impl.XMLScanner.reportFatalError(XMLScanner.java:1388)
      	at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEntityReference(XMLDocumentFragmentScannerImpl.java:1861)
      	at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2994)
      	at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:607)
      	at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:116)
      	at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:489)
      	at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:835)
      	at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:764)
      	at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:123)
      	at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1210)
      	at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:568)
      	at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl.parse(SAXParserImpl.java:302)
      	at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
      	at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:72)
      	... 25 more
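
      The innermost exception in the trace comes from the JDK's SAX parser, so it can probably be reproduced without Solr, ManifoldCF, or Tika. The sketch below is an illustration under that assumption, not the code path Tika itself runs: it parses one line of the sample as XML, which should fail on the undeclared nbsp entity, and then parses the same line again with both entities declared in an internal DTD subset. The class name NbspRepro, the method parseAsXml, and the DOCTYPE wrapper are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical reproduction class; uses only JDK classes.
class NbspRepro {

    /** Parses the given text as XML and returns the character data the parser reported. */
    static String parseAsXml(String xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // May be called several times per text node, so append each chunk.
                text.append(ch, start, length);
            }
        };
        parser.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // One line taken from the HTML sample in this report.
        String line = "<li><a href=\"/mission/sec/corpora/corpora.html\">Corpora&nbsp;&rsaquo;</a></li>";

        try {
            parseAsXml(line);
            System.out.println("Parsed without error");
        } catch (SAXParseException e) {
            // Expected: nbsp is not declared anywhere, so the XML parser rejects it.
            System.out.println("XML parse failed: " + e.getMessage());
        }

        // Same content, but with both entities declared in an internal DTD subset.
        String declared = "<!DOCTYPE li [ <!ENTITY nbsp \"&#160;\"> <!ENTITY rsaquo \"&#8250;\"> ]>" + line;
        System.out.println("Parsed text length: " + parseAsXml(declared).length());
    }
}
```

      If the first parse fails and the second succeeds, that suggests the page is being handed to an XML parser (XMLParser appears in the trace) rather than to an HTML parser, which would normally know the HTML entity names.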
      

            People

              Assignee: Kenneth William Krugler (kkrugler)
              Reporter: David Morana (dmorana)