NUTCH-2773: SegmentReader (-dump or -get): show HTML content as UTF-8


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Implemented
    • Affects Version/s: 1.16
    • Fix Version/s: 1.17
    • Component/s: segment
    • Labels: None
    • Patch Info: Patch Available

    Description

      SegmentReader output (both dumps and the output shown by -get) is first converted to Java strings and then written using UTF-8 as the output encoding. The HTML page content is held by the container class "Content" as byte[], so if the page's original encoding is a charset other than UTF-8, the output of SegmentReader may look garbled. The reader could use the encoding already detected by the parser (if available) and try to properly recode the HTML page content to UTF-8.
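
      The recoding step proposed above could look roughly like the following. This is a minimal sketch, not the patch attached to this issue: the class RecodeSketch and the methods decode and toUtf8 are hypothetical names, and the detected charset is assumed to arrive as a plain string (how it is obtained from the parser is left out). Only java.nio.charset from the JDK is used.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;

// Hypothetical helper, for illustration only; not part of Nutch.
public class RecodeSketch {

  /**
   * Decode raw page bytes with the charset detected by the parser.
   * Falls back to UTF-8 when no charset is given or the name is unusable.
   */
  static String decode(byte[] content, String detectedCharset) {
    Charset charset = StandardCharsets.UTF_8;
    if (detectedCharset != null) {
      try {
        charset = Charset.forName(detectedCharset.trim());
      } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
        // Malformed or unknown charset name: keep the UTF-8 fallback.
      }
    }
    // Malformed input is replaced with U+FFFD rather than throwing.
    return new String(content, charset);
  }

  /** Recode raw page bytes from the detected charset to UTF-8 bytes. */
  static byte[] toUtf8(byte[] content, String detectedCharset) {
    return decode(content, detectedCharset).getBytes(StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    // "café" encoded as ISO-8859-1: the e-acute is the single byte 0xE9.
    byte[] latin1 = {'c', 'a', 'f', (byte) 0xE9};
    // In UTF-8 the e-acute takes two bytes, so the recoded length is 5.
    System.out.println(toUtf8(latin1, "ISO-8859-1").length);
  }
}
```

      Falling back to UTF-8 when no usable charset is available keeps the current behavior for pages where the parser detected nothing, which seems the least surprising default.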


            People

              Assignee: Sebastian Nagel (snagel)
              Reporter: Sebastian Nagel (snagel)
