[NUTCH-1016] Strip UTF-8 non-character codepoints - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 1.3
Fix Version/s: 1.4
Component/s: indexer
Labels: None

Patch Info: Patch Available

Description

During a very large crawl I found a few documents producing non-character codepoints. When these documents are indexed to Solr, they trigger the following exception:

SEVERE: java.lang.RuntimeException: [was class java.io.CharConversionException] Invalid UTF-8 character 0xffff at char #1142033, byte #1155068)
        at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
        at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)

Quite annoying! Here's a quick fix for SolrWriter that passes the value of the content field to a method that strips away non-characters. I'm not too sure about this implementation, but the tests I've done locally with a huge dataset now pass. Here's a list of codepoints to strip away: http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:Noncharacter_Code_Point=True:]
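
For readers who want to see the idea in code, here is a minimal sketch of such a stripping method. It is not the attached patch: the class name `NonCharacterStripper` and the method names `isNonCharacter` and `strip` are hypothetical, and the sketch assumes the set to remove is exactly the 66 Unicode noncharacters (U+FDD0..U+FDEF, plus the last two code points of each of the 17 planes).

```java
// Hypothetical helper, not the code from the attached patch.
final class NonCharacterStripper {

    private NonCharacterStripper() {}

    // True for the 66 Unicode noncharacters:
    // U+FDD0..U+FDEF, and U+nFFFE / U+nFFFF for every plane n (0x0..0x10).
    static boolean isNonCharacter(int codePoint) {
        return (codePoint >= 0xFDD0 && codePoint <= 0xFDEF)
                || (codePoint & 0xFFFE) == 0xFFFE;
    }

    // Returns a copy of the input with all noncharacter code points removed.
    // Iterates by code point so that supplementary-plane noncharacters
    // (encoded as surrogate pairs) are removed as a whole.
    static String strip(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int codePoint = input.codePointAt(i);
            if (!isNonCharacter(codePoint)) {
                out.appendCodePoint(codePoint);
            }
            i += Character.charCount(codePoint);
        }
        return out.toString();
    }
}
```

Under these assumptions, the content field value would be passed through `NonCharacterStripper.strip(...)` before the document is sent to Solr. Walking the string by code point rather than by `char` is a design choice that keeps valid surrogate pairs intact.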

Please comment!

Attachments

- NUTCH-1016-1.4-4.patch, 28/Jun/11 13:59, 3 kB, Markus Jelsma
- NUTCH-1016-2.0.patch, 30/Jun/11 12:19, 3 kB, Markus Jelsma

Issue Links

is superseded by

NUTCH-1026 Strip UTF-8 non-character codepoints

Closed

People

Assignee: Markus Jelsma

Reporter: Markus Jelsma

Votes: 0

Watchers: 0

Dates

Created: 27/Jun/11 15:45

Updated: 09/May/12 14:06

Resolved: 30/Jun/11 16:19