Details
- Bug
- Status: Closed
- Minor
- Resolution: Fixed
- 1.11
- None
- None
Description
The bug can be reproduced with these steps:
- find a site that serves cp1252-encoded pages containing accented characters (bytes above 127, such as [àèéìòù]), for example "http://www.ilsole24ore.com/"
- crawl that site and index it into Solr with the options -addBinaryContent -base64
- find a document in the newly indexed Solr collection that contains those accented characters
- take the base64 binary representation of that HTML page, decode it back to raw bytes, and save the result to a file
The resulting file contains invalid byte sequences that are valid neither as UTF-8 nor as cp1252.
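The last step can be checked mechanically. The following is a minimal sketch, not part of Nutch or Solr: the helper name `check_encodings` is hypothetical, and it assumes the base64 string has already been copied out of the Solr document. It decodes the base64 text and reports whether the raw bytes decode strictly as UTF-8 and as cp1252. Note that the cp1252 check is lenient, since only a handful of byte values (for example 0x81) are undefined in that code page.

```python
import base64


def check_encodings(b64_text):
    """Decode base64 text and test the raw bytes against two charsets.

    Hypothetical helper for illustration only. Returns the raw bytes and
    a dict mapping each charset name to True if the bytes decode cleanly.
    """
    raw = base64.b64decode(b64_text)
    report = {}
    for charset in ("utf-8", "cp1252"):
        try:
            # Strict decoding raises on any byte sequence invalid in the charset.
            raw.decode(charset)
            report[charset] = True
        except UnicodeDecodeError:
            report[charset] = False
    return raw, report


if __name__ == "__main__":
    # Stand-in for the value taken from the index: the accented characters
    # from the report, encoded as cp1252 and then base64-encoded.
    sample = base64.b64encode("àèéìòù".encode("cp1252")).decode("ascii")
    raw, report = check_encodings(sample)
    print(raw, report)
```

For an intact cp1252 page the report should show cp1252 as decodable and UTF-8 as not decodable. Content affected by this bug would presumably fail the UTF-8 check while also not yielding the original accented characters under cp1252.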
Attachments
Issue Links
- is related to: NUTCH-1807 avoid methods relying on system-specific default locale / charset (Open)
- is related to: NUTCH-1785 Ability to index raw content (Closed)