NUTCH-1016: Strip UTF-8 non-character codepoints

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.3
    • Fix Version/s: 1.4
    • Component/s: indexer
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      During a very large crawl I found a few documents containing non-character code points. Indexing them to Solr yields the following exception:

      SEVERE: java.lang.RuntimeException: [was class java.io.CharConversionException] Invalid UTF-8 character 0xffff at char #1142033, byte #1155068)
              at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
              at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
      

      Quite annoying! Here's a quick fix for SolrWriter that passes the value of the content field to a method that strips non-characters. I'm not too sure about this implementation, but the tests I've run locally with a huge dataset now pass. Here's a list of code points to strip: http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:Noncharacter_Code_Point=True:]

      Please comment!
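
The stripping described above could look roughly like the following sketch. It is not the attached patch: the class and method names (`NoncharacterStripper`, `isNoncharacter`, `strip`) are hypothetical, and the only thing taken as given is the Unicode definition of the 66 noncharacters (U+FDD0..U+FDEF plus the last two code points of every plane). It iterates by code point so that supplementary noncharacters such as U+1FFFF are also caught.

```java
final class NoncharacterStripper {

    // True for the 66 Unicode noncharacters: U+FDD0..U+FDEF and every
    // code point whose low 16 bits are FFFE or FFFF (U+FFFE, U+FFFF,
    // U+1FFFE, U+1FFFF, ... U+10FFFE, U+10FFFF).
    public static boolean isNoncharacter(int codePoint) {
        return (codePoint >= 0xFDD0 && codePoint <= 0xFDEF)
            || (codePoint & 0xFFFE) == 0xFFFE;
    }

    // Returns a copy of the input with all noncharacters removed.
    // Everything else, including unpaired surrogates, is copied unchanged.
    public static String strip(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int codePoint = input.codePointAt(i);
            if (!isNoncharacter(codePoint)) {
                out.appendCodePoint(codePoint);
            }
            i += Character.charCount(codePoint);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for the value of the content field.
        String content = "ab\uFFFFcd";
        System.out.println(strip(content)); // prints "abcd"
    }
}
```

In an indexer, a method like this would presumably be applied to the field value just before the document is handed to Solr; where exactly the patch hooks in is not shown in this issue.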

      1. NUTCH-1016-1.4-4.patch
        3 kB
        Markus Jelsma
      2. NUTCH-1016-2.0.patch
        3 kB
        Markus Jelsma

          Activity

          Markus Jelsma created issue -
          Markus Jelsma added a comment -

          Patch for 1.4.

          Markus Jelsma made changes -
          Field Original Value New Value
          Attachment NUTCH-1016-1.4.patch [ 12483964 ]
          Markus Jelsma made changes -
          Attachment NUTCH-1016-1.4.patch [ 12483964 ]
          Markus Jelsma added a comment -

          Silly me again, the patch was wrong. I changed the ORs to ANDs!

          This patch also adds more verbose output to the SolrWriter class, which is handy for batches of many thousands of documents. It doesn't include a change to log4j.properties, though.

          Should I get rid of the logging, or keep it?

          Markus Jelsma made changes -
          Attachment NUTCH-1016-1.4-2.patch [ 12483966 ]
          Markus Jelsma made changes -
          Attachment NUTCH-1016-1.4-2.patch [ 12483966 ]
          Markus Jelsma added a comment -

          New patch also includes checking for non-printable control characters.
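
The issue does not say which control characters the patch checks for. As a sketch under that caveat, the following assumes the C0 controls below U+0020 are dropped except tab, line feed and carriage return, which are the only ones the XML 1.0 character rule permits. The class and method names (`ControlCharStripper`, `isDisallowedControl`, `strip`) are hypothetical.

```java
final class ControlCharStripper {

    // Assumption: a control character is "disallowed" when it is below
    // U+0020 and is not tab, line feed or carriage return.
    public static boolean isDisallowedControl(char ch) {
        return ch < 0x20 && ch != '\t' && ch != '\n' && ch != '\r';
    }

    // Returns a copy of the input without the disallowed control characters.
    public static String strip(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char ch = input.charAt(i);
            if (!isDisallowedControl(ch)) {
                out.append(ch);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // NUL and U+0001 are removed; the tab is kept.
        System.out.println(strip("a\u0000b\tc\u0001"));
    }
}
```

Iterating by `char` is sufficient here because every character being tested lies in the Basic Multilingual Plane.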

          Markus Jelsma made changes -
          Attachment NUTCH-1016-1.4-3.patch [ 12484441 ]
          Markus Jelsma made changes -
          Attachment NUTCH-1016-1.4-3.patch [ 12484441 ]
          Markus Jelsma added a comment -

          The previous patch included a debug line written to stdout. It has been removed now.

          Markus Jelsma made changes -
          Attachment NUTCH-1016-1.4-4.patch [ 12484442 ]
          Markus Jelsma added a comment -

          If there are no objections, I'd like to commit this patch tomorrow.

          Markus Jelsma added a comment -

          Committed for 1.4 in rev. 1141500.

          Markus Jelsma made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Markus Jelsma added a comment -

          Patch for 2.0.

          Markus Jelsma made changes -
          Attachment NUTCH-1016-2.0.patch [ 12484764 ]
          Markus Jelsma added a comment -

          Accidentally resolved. The issue stays open for 2.0 because the change is untested there.

          Markus Jelsma made changes -
          Resolution Fixed [ 1 ]
          Status Resolved [ 5 ] Reopened [ 4 ]
          Markus Jelsma made changes -
          Fix Version/s 1.4 [ 12316519 ]
          Markus Jelsma made changes -
          Fix Version/s 1.4 [ 12316519 ]
          Markus Jelsma made changes -
          Link This issue is superceded by NUTCH-1026 [ NUTCH-1026 ]
          Markus Jelsma added a comment -

          Resolved for 1.4.

          Markus Jelsma made changes -
          Status Reopened [ 4 ] Resolved [ 5 ]
          Fix Version/s 2.0 [ 12314893 ]
          Resolution Fixed [ 1 ]
          Markus Jelsma added a comment -

          Bulk close of resolved issues of 1.4. bulkclose-1.4-20111220

          Markus Jelsma made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Christian Johnsson added a comment -

          Does this apply to 1.5 RC1? (I've stumbled upon the error a couple of times.)

          Markus Jelsma added a comment -

          It is resolved for Nutch 1.4.

          Christian Johnsson added a comment -

          OK, I never got the error before with 1.5 RC1. It started this morning, after running for a week without errors.
          May 9, 2012 1:46:31 PM org.apache.solr.common.SolrException log
          SEVERE: java.lang.RuntimeException: [was class java.io.CharConversionException] Invalid UTF-8 character 0xffff at char #1427640, byte #1564649)
          at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
          at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
          at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3657)
          at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
          at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:301)
          at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:157)
          at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:79)
          at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:58)
          at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
          at org.apache.solr.core.SolrCore.execute(SolrCore.java:1372)
          at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
          at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
          at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
          at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
          at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
          at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
          at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
          at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
          at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
          at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
          at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:859)
          at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:602)
          at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
          at java.lang.Thread.run(Thread.java:636)
          Caused by: java.io.CharConversionException: Invalid UTF-8 character 0xffff at char #1427640, byte #1564649)
          at com.ctc.wstx.io.UTF8Reader.reportInvalid(UTF8Reader.java:335)
          at com.ctc.wstx.io.UTF8Reader.read(UTF8Reader.java:249)
          at com.ctc.wstx.io.MergedReader.read(MergedReader.java:101)
          at com.ctc.wstx.io.ReaderSource.readInto(ReaderSource.java:84)
          at com.ctc.wstx.io.BranchingReaderSource.readInto(BranchingReaderSource.java:57)
          at com.ctc.wstx.sr.StreamScanner.loadMore(StreamScanner.java:992)
          at com.ctc.wstx.sr.BasicStreamReader.readTextSecondary(BasicStreamReader.java:4628)
          at com.ctc.wstx.sr.BasicStreamReader.readCoalescedText(BasicStreamReader.java:4126)
          at com.ctc.wstx.sr.BasicStreamReader.finishToken(BasicStreamReader.java:3701)
          at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3649)
          ... 21 more

          and

          May 9, 2012 1:46:36 PM org.apache.solr.common.SolrException log
          SEVERE: java.lang.RuntimeException: [was class java.io.IOException] Invalid CRLF
          at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
          at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
          at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3657)
          at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
          at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:301)
          at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:157)
          at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:79)
          at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:58)
          at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
          at org.apache.solr.core.SolrCore.execute(SolrCore.java:1372)
          at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
          at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
          at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
          at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
          at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
          at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
          at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
          at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
          at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
          at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
          at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:859)
          at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:602)
          at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
          at java.lang.Thread.run(Thread.java:636)
          Caused by: java.io.IOException: Invalid CRLF
          at org.apache.coyote.http11.filters.ChunkedInputFilter.parseCRLF(ChunkedInputFilter.java:352)
          at org.apache.coyote.http11.filters.ChunkedInputFilter.doRead(ChunkedInputFilter.java:151)
          at org.apache.coyote.http11.InternalInputBuffer.doRead(InternalInputBuffer.java:710)
          at org.apache.coyote.Request.doRead(Request.java:427)
          at org.apache.catalina.connector.InputBuffer.realReadBytes(InputBuffer.java:304)
          at org.apache.tomcat.util.buf.ByteChunk.substract(ByteChunk.java:419)
          at org.apache.catalina.connector.InputBuffer.read(InputBuffer.java:327)
          at org.apache.catalina.connector.CoyoteInputStream.read(CoyoteInputStream.java:162)
          at com.ctc.wstx.io.UTF8Reader.loadMore(UTF8Reader.java:365)
          at com.ctc.wstx.io.UTF8Reader.read(UTF8Reader.java:110)
          at com.ctc.wstx.io.MergedReader.read(MergedReader.java:101)
          at com.ctc.wstx.io.ReaderSource.readInto(ReaderSource.java:84)
          at com.ctc.wstx.io.BranchingReaderSource.readInto(BranchingReaderSource.java:57)
          at com.ctc.wstx.sr.StreamScanner.loadMore(StreamScanner.java:992)
          at com.ctc.wstx.sr.BasicStreamReader.readTextSecondary(BasicStreamReader.java:4628)
          at com.ctc.wstx.sr.BasicStreamReader.readCoalescedText(BasicStreamReader.java:4126)
          at com.ctc.wstx.sr.BasicStreamReader.finishToken(BasicStreamReader.java:3701)
          at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3649)
          ... 21 more

          Christian Johnsson added a comment -

          Seems like the problem is on my end, as usual.
          I rebooted the entire cluster and replaced the crawldb with yesterday's. Then I retried the same segments that had failed, and now they work automagically.
          Thank you for the response and the great work!

          Transition             Time In Source Status   Execution Times   Last Executer   Last Execution Date
          Open -> Resolved       2d 20h 28m              1                 Markus Jelsma   30/Jun/11 13:14
          Resolved -> Reopened   9m 19s                  1                 Markus Jelsma   30/Jun/11 13:23
          Reopened -> Resolved   3h 55m                  1                 Markus Jelsma   30/Jun/11 17:19
          Resolved -> Closed     172d 19h 10m            1                 Markus Jelsma   20/Dec/11 11:30

            People

            • Assignee:
              Markus Jelsma
              Reporter:
              Markus Jelsma
            • Votes:
              0
              Watchers:
              0
