[NUTCH-2773] SegmentReader (-dump or -get): show HTML content as UTF-8 - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Implemented
Affects Version/s: 1.16
Fix Version/s: 1.17
Component/s: segment
Labels: None

Patch Info: Patch Available

Description

SegmentReader dumps, and the output shown by -get, are first converted to Java strings and then written using UTF-8 as the output encoding. The HTML page content is held by the container class "Content" as a byte[], so if the original page encoding is a charset other than UTF-8, the output of SegmentReader may look garbled. The reader could use the encoding already detected by the parser (if available) and try to properly recode the HTML page content to UTF-8.
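
The recoding idea can be illustrated with a minimal, standalone sketch using only java.nio.charset. This is not the code from the linked pull request: the class name ContentRecoder and both method names are hypothetical, and the detected charset name is assumed to be supplied by the caller as a plain string (or null if the parser detected nothing).

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;

// Hypothetical helper for illustration only; not part of Nutch.
final class ContentRecoder {

    /**
     * Decode raw page bytes with the detected charset.
     * Falls back to UTF-8 if the name is null, malformed, or unsupported.
     */
    public static String toJavaString(byte[] raw, String detectedCharset) {
        Charset charset = StandardCharsets.UTF_8;
        if (detectedCharset != null) {
            try {
                charset = Charset.forName(detectedCharset.trim());
            } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
                // Keep the UTF-8 fallback; malformed input bytes will then
                // be replaced by U+FFFD during decoding.
            }
        }
        return new String(raw, charset);
    }

    /** Recode raw page bytes from the detected charset to UTF-8 bytes. */
    public static byte[] recodeToUtf8(byte[] raw, String detectedCharset) {
        return toJavaString(raw, detectedCharset).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "café" encoded as ISO-8859-1: the last character is the single byte 0xE9.
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);

        // Decoding as UTF-8 without charset information garbles the last character.
        String naive = new String(latin1, StandardCharsets.UTF_8);

        // Decoding with the detected charset preserves it.
        String recoded = toJavaString(latin1, "ISO-8859-1");

        System.out.println("naive matches original:   " + naive.equals("caf\u00e9"));
        System.out.println("recoded matches original: " + recoded.equals("caf\u00e9"));
    }
}
```

Falling back to UTF-8 when no usable charset is known keeps the previous behaviour for pages without a detected encoding, which seems a reasonable default for a dump tool.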

Issue Links

links to

GitHub Pull Request #501

People

Assignee: Sebastian Nagel

Reporter: Sebastian Nagel

Votes: 0

Watchers: 3

Dates

Created: 28/Feb/20 10:07

Updated: 28/Jan/21 13:16

Resolved: 13/Mar/20 09:09