NUTCH-1026

Strip UTF-8 non-character codepoints

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: nutchgora
    • Fix Version/s: nutchgora
    • Component/s: indexer
    • Labels:
      None

      Description

      During a very large crawl I found a few documents that produce non-character codepoints. Indexing them to Solr yields the following exception:

      SEVERE: java.lang.RuntimeException: [was class java.io.CharConversionException] Invalid UTF-8 character 0xffff at char #1142033, byte #1155068)
              at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
              at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
      

      Quite annoying! Here's a quick fix for SolrWriter that passes the value of the content field to a method that strips away non-characters. I'm not too sure about this implementation, but the tests I've done locally with a huge dataset now pass. Here's a list of codepoints to strip away: http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:Noncharacter_Code_Point=True:]

      Please comment!
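
For illustration, the stripping step described above could look roughly like the sketch below. This is not the attached patch: the class and method names (NonCharacterStripper, isNonCharacter, stripNonCharacters) are hypothetical, and it assumes the set to remove is exactly the Unicode noncharacters (U+FDD0..U+FDEF plus the last two code points of every plane), as in the list linked above.

```java
public final class NonCharacterStripper {

    private NonCharacterStripper() {}

    /**
     * Returns true for the 66 Unicode noncharacters:
     * U+FDD0..U+FDEF, and U+xFFFE / U+xFFFF in each of the 17 planes.
     */
    public static boolean isNonCharacter(int codePoint) {
        return (codePoint >= 0xFDD0 && codePoint <= 0xFDEF)
            || (codePoint & 0xFFFE) == 0xFFFE;
    }

    /**
     * Copies the input code point by code point, dropping noncharacters.
     * Everything else (including supplementary characters) is kept as-is.
     */
    public static String stripNonCharacters(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            i += Character.charCount(cp); // advance 1 or 2 chars
            if (!isNonCharacter(cp)) {
                out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+FFFF is the codepoint named in the exception above.
        String dirty = "abc\uFFFFdef\uFDD0";
        System.out.println(stripNonCharacters(dirty)); // prints "abcdef"
    }
}
```

Iterating by code point rather than by char keeps surrogate pairs intact, so noncharacters outside the BMP (for example U+1FFFE) are also removed while ordinary supplementary characters pass through unchanged.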

        Issue Links

          Activity

          Markus Jelsma created issue -
          Markus Jelsma made changes -
          Field Original Value New Value
          Link This issue supersedes NUTCH-1016 [ NUTCH-1016 ]
          Markus Jelsma made changes -
          Assignee Markus Jelsma [ markus17 ]
          Lewis John McGibbney added a comment -

          Set and Classify

          Lewis John McGibbney made changes -
          Fix Version/s 2.1 [ 12321040 ]
          Fix Version/s nutchgora [ 12314893 ]
          Ferdy Galema added a comment -

          When indexing a huge dataset I ran into this issue too. The patch in NUTCH-1016 works fine. (Thanks Markus!) I verified and tested this. Committed at nutchgora.

          Minor note: The patch checks for invalid chars ONLY on the "content" field of the NutchDocument. But since the problem is most likely to only occur on this field, it is okay for now.

          Ferdy Galema made changes -
          Status Open [ 1 ] Closed [ 6 ]
          Fix Version/s nutchgora [ 12314893 ]
          Fix Version/s 2.1 [ 12321040 ]
          Resolution Fixed [ 1 ]
          Markus Jelsma added a comment -

          Great!

          Hudson added a comment -

          Integrated in Nutch-nutchgora #249 (See https://builds.apache.org/job/Nutch-nutchgora/249/)
          NUTCH-1026 Strip UTF-8 non-character codepoints (Revision 1336643)

          Result = SUCCESS
          ferdy :
          Files :

          • /nutch/branches/nutchgora/CHANGES.txt
          • /nutch/branches/nutchgora/conf/log4j.properties
          • /nutch/branches/nutchgora/src/java/org/apache/nutch/indexer/solr/SolrWriter.java

            People

            • Assignee: Unassigned
            • Reporter: Markus Jelsma
            • Votes: 0
            • Watchers: 1
