[NUTCH-2254] Charset issues when using -addBinaryContent and -base64 options - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: 1.11
Fix Version/s: None
Component/s: indexer
Labels: None

Description

The bug can be reproduced with these steps:

1. Find a site with cp1252-encoded pages, such as "http://www.ilsole24ore.com/", that contain accented characters (byte values > 127, e.g. [àèéìòù]).
2. Crawl that site and index it into Solr with the options -addBinaryContent -base64.
3. Find a document in the newly indexed Solr collection that contains those accented characters.
4. Take the base64 representation of the binary content of that HTML page, decode it back to raw bytes, and save the result to a file.
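
Step 4 can be checked with a short script. The following is a minimal sketch, not part of the issue: it assumes the base64 string has already been copied out of the Solr document by hand, and the helper names `decodes_as` and `inspect_base64_field` are hypothetical.

```python
import base64


def decodes_as(data: bytes, encoding: str) -> bool:
    """Return True if the bytes decode without error in the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


def inspect_base64_field(b64_value: str) -> dict:
    """Decode a base64 string and report which encodings accept the raw bytes.

    b64_value is assumed to be the base64 text taken from the indexed document.
    """
    raw = base64.b64decode(b64_value)
    return {
        "length": len(raw),
        "utf-8": decodes_as(raw, "utf-8"),
        "cp1252": decodes_as(raw, "cp1252"),
    }
```

For an intact cp1252 page with accented characters, the report should show "cp1252" as valid and "utf-8" as invalid. To save the bytes for inspection in an editor, write `base64.b64decode(b64_value)` to a file opened in "wb" mode.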

The resulting file contains invalid characters: the bytes are neither the original cp1252 nor a correct UTF-8 encoding of the page.
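
The issue text does not state the root cause. One plausible mechanism, in line with the related NUTCH-1807 (reliance on a default charset), is that the raw bytes pass through a string conversion before being base64-encoded. The sketch below only illustrates that general failure mode under this assumption; `lossy_base64` and `safe_base64` are hypothetical names, and UTF-8 stands in for whatever default charset the system uses.

```python
import base64

# The accented characters from the report, as a cp1252 page would carry them.
RAW_CP1252 = "àèéìòù".encode("cp1252")


def lossy_base64(data: bytes) -> str:
    """Assumed buggy path: bytes -> text (wrong charset) -> bytes -> base64."""
    # Bytes that are not valid UTF-8 are replaced with U+FFFD here,
    # so the original byte values are lost before encoding.
    text = data.decode("utf-8", errors="replace")
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def safe_base64(data: bytes) -> str:
    """Charset-independent path: base64-encode the raw bytes directly."""
    return base64.b64encode(data).decode("ascii")


def round_trips(encode, data: bytes) -> bool:
    """Return True if decoding the base64 output yields the original bytes."""
    return base64.b64decode(encode(data)) == data
```

Under these assumptions, `round_trips(safe_base64, RAW_CP1252)` holds while `round_trips(lossy_base64, RAW_CP1252)` does not, which is why binary content is best encoded without any intermediate string conversion.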

Attachments

base64-nutch.patch
21/Apr/16 12:40
0.8 kB
Federico Bonelli

Issue Links

is related to

NUTCH-1807 avoid methods relying on system-specific default locale / charset (Open)

NUTCH-1785 Ability to index raw content (Closed)

People

Assignee: Sebastian Nagel

Reporter: Federico Bonelli

Votes: 0

Watchers: 4

Dates

Created: 21/Apr/16 12:36

Updated: 28/Jan/21 14:04

Resolved: 27/Apr/16 20:59