ManifoldCF / CONNECTORS-1655

Web connector - UnsupportedEncodingException utf-8

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: ManifoldCF 2.17
    • Fix Version/s: ManifoldCF 2.18
    • Component/s: Web connector
    • Labels: None

    Description

      When crawling some sites (for instance http://www.antibes-juanlespins.com/ ), the job manages to index some documents but then stops with the following error:
      Error: IO error: utf-8; filename=rseventspro_rss20_56.xml

      Here is the MCF stack trace:
      Exception tossed: IO error: utf-8; filename=rseventspro_rss20_56.xml
      org.apache.manifoldcf.core.interfaces.ManifoldCFException: IO error: utf-8; filename=rseventspro_rss20_56.xml
      at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.handleXML(WebcrawlerConnector.java:4203) ~[?:?]
      at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.extractLinks(WebcrawlerConnector.java:3855) ~[?:?]
      at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.processDocuments(WebcrawlerConnector.java:746) ~[?:?]
      at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399) [mcf-pull-agent.jar:?]
      Caused by: java.io.UnsupportedEncodingException: utf-8; filename=rseventspro_rss20_56.xml
      at sun.nio.cs.StreamDecoder.forInputStreamReader(StreamDecoder.java:71) ~[?:1.8.0_212]
      at java.io.InputStreamReader.<init>(InputStreamReader.java:100) ~[?:1.8.0_212]
      at org.apache.manifoldcf.connectorcommon.fuzzyml.DecodingByteReceiver.dealWithBytes(DecodingByteReceiver.java:47) ~[?:?]
      at org.apache.manifoldcf.connectorcommon.fuzzyml.BOMEncodingDetector.dealWithRemainder(BOMEncodingDetector.java:250) ~[?:?]
      at org.apache.manifoldcf.connectorcommon.fuzzyml.SingleByteReceiver.dealWithBytes(SingleByteReceiver.java:52) ~[?:?]
      at org.apache.manifoldcf.connectorcommon.fuzzyml.Parser.parseWithCharsetDetection(Parser.java:74) ~[?:?]
      at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.handleXML(WebcrawlerConnector.java:4174) ~[?:?]
      ... 3 more
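
In the trace above, the charset name passed to InputStreamReader is the whole string "utf-8; filename=rseventspro_rss20_56.xml", which is not a legal charset name, so the JDK rejects it. The sketch below reproduces that rejection and shows one possible way to keep only the charset token. It assumes the value came from a Content-Type-style header whose charset parameter was followed by a filename parameter, which the trace does not confirm. The class and method names are hypothetical; this is neither ManifoldCF code nor the fix shipped in 2.18.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;

// Hypothetical illustration only; not part of ManifoldCF.
final class CharsetParamSketch {

    // Returns only the value of the charset parameter from a
    // Content-Type-style string, or null if there is none.
    static String extractCharset(String contentType) {
        if (contentType == null) {
            return null;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            // Case-insensitive match on the "charset=" prefix.
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String value = p.substring(8).trim();
                // Drop optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }

    // Resolves a charset name, falling back when the name is
    // missing, illegal, or not supported by this JVM.
    static Charset resolveOrDefault(String name, Charset fallback) {
        try {
            return (name != null && Charset.isSupported(name)) ? Charset.forName(name) : fallback;
        } catch (IllegalArgumentException e) {
            // Illegal names such as "utf-8; filename=..." end up here.
            return fallback;
        }
    }

    // True if InputStreamReader refuses the given charset name,
    // which is the failure shown in the stack trace.
    static boolean readerRejects(String charsetName) {
        try {
            new InputStreamReader(new ByteArrayInputStream(new byte[0]), charsetName).close();
            return false;
        } catch (UnsupportedEncodingException e) {
            return true;
        } catch (IOException e) {
            return false;
        }
    }
}
```

With these helpers, readerRejects("utf-8; filename=rseventspro_rss20_56.xml") reports the same UnsupportedEncodingException condition as the trace, while extractCharset applied to a header such as "application/xml; charset=utf-8; filename=rseventspro_rss20_56.xml" yields just "utf-8". Falling back to a default charset rather than aborting is a design choice for this sketch, not a statement about how the connector behaves.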

          People

            Assignee: Karl Wright (kwright@metacarta.com)
            Reporter: Julien Massiera (julienFL)