Nutch / NUTCH-2254

Charset issues when using -addBinaryContent and -base64 options


Details

    • Bug
    • Status: Closed
    • Minor
    • Resolution: Fixed
    • 1.11
    • None
    • indexer
    • None

    Description

      The bug is reproducible with these steps:

      1. Find a site that serves cp1252-encoded pages containing accented characters (byte values above 127, such as [àèéìòù]), for example "http://www.ilsole24ore.com/".
      2. Crawl that site and index it into Solr with the options -addBinaryContent -base64.
      3. In the newly indexed Solr collection, find a document that contains those accented characters.
      4. Take the base64 binary representation of that HTML page, decode it back to raw bytes, and save the result to a file.
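
Step 4 can be sketched in Java roughly as follows. This is only an illustration: the class and method names are hypothetical, and it assumes the base64 field value has already been copied out of the Solr document as a plain string in standard (non-URL-safe) base64. The strict decoder reports malformed input instead of silently replacing it, which makes it easy to tell whether the decoded bytes are well-formed UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper, not part of Nutch or Solr.
final class DecodedContentCheck {

    // Decode the base64 field value (assumed standard base64) back to raw bytes.
    static byte[] decodeBase64(String base64Field) {
        return Base64.getDecoder().decode(base64Field);
    }

    // Return true only if the bytes form well-formed UTF-8.
    // REPORT makes the decoder throw instead of substituting U+FFFD.
    static boolean isStrictUtf8(byte[] raw) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

The decoded bytes can then be written out with `java.nio.file.Files.write` and inspected in a hex viewer.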

      The resulting file contains invalid characters that are valid in neither UTF-8 nor cp1252.
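
One way this kind of corruption can arise is when raw page bytes are converted to a Java String before being base64-encoded. The sketch below is an assumption for illustration only: it does not reproduce the Nutch indexer code or the attached patch, and the class and method names are hypothetical. It contrasts a lossy path, which decodes cp1252 bytes as UTF-8 and re-encodes them, with a byte-exact path that encodes the original bytes directly.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical demo class, not Nutch code.
final class Base64CharsetDemo {

    // Lossy path (assumed failure mode): bytes -> String -> bytes -> base64.
    // Bytes that are not well-formed UTF-8 are replaced during decoding,
    // so the original byte values cannot be recovered afterwards.
    static String encodeViaString(byte[] raw) {
        String text = new String(raw, StandardCharsets.UTF_8);
        return Base64.getEncoder().encodeToString(text.getBytes(StandardCharsets.UTF_8));
    }

    // Byte-exact path: base64-encode the original bytes without any charset step.
    static String encodeDirect(byte[] raw) {
        return Base64.getEncoder().encodeToString(raw);
    }
}
```

For example, with the cp1252 bytes `{'c', 'a', 'f', 'f', (byte) 0xE8}` ("caffè"), decoding the output of `encodeDirect` yields the same five bytes, while decoding the output of `encodeViaString` does not.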

      Attachments

        1. base64-nutch.patch
          0.8 kB
          Federico Bonelli


            People

              snagel Sebastian Nagel
              fedechicco Federico Bonelli
              Votes: 0
              Watchers: 4
