  1. Nutch
  2. NUTCH-2599

charset detection issue with parse-tika

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • None
    • 1.21
    • parser
    • None
    • plugin.includes: protocol-http|parse-tika

    Description

      Here is an example page that is displayed correctly in web browsers but is decoded with the wrong charset in Nutch: https://gerardbouchar.github.io/html-encoding-example/index.html

       

      This page's contents are encoded in UTF-8 and it is served with HTTP headers indicating UTF-8, but it contains a bogus HTML meta tag indicating that it is encoded in ISO-8859-1.

       

      This is a tricky case, but there is a W3C specification about how to handle it. It clearly states that the HTTP header (transport layer information) should take precedence over the HTML meta tag (obtained by prescanning the byte stream). Browsers respect the spec, but the Tika parser does not.
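
The precedence rule described above can be sketched roughly as follows. This is an illustration only, not Nutch or Tika code: the class `CharsetPrecedence`, its methods, and both regular expressions are hypothetical, the meta prescan is reduced to the `<meta charset="...">` form, byte order mark handling is left out, and the UTF-8 fallback is an assumption made for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Nutch or Tika.
class CharsetPrecedence {

    // Finds charset=... in a Content-Type header value (quotes optional).
    private static final Pattern HEADER_CHARSET = Pattern.compile(
        "charset\\s*=\\s*\"?([^\\s;\"]+)\"?", Pattern.CASE_INSENSITIVE);

    // Simplified stand-in for the HTML prescan: only <meta charset="..."> is handled.
    private static final Pattern META_CHARSET = Pattern.compile(
        "<meta\\s+charset\\s*=\\s*[\"']?([^\\s\"'/>]+)", Pattern.CASE_INSENSITIVE);

    // Returns the first charset label matched by the pattern, if it names a known charset.
    static Optional<Charset> find(Pattern pattern, String input) {
        if (input == null) {
            return Optional.empty();
        }
        Matcher m = pattern.matcher(input);
        if (!m.find()) {
            return Optional.empty();
        }
        try {
            return Optional.of(Charset.forName(m.group(1)));
        } catch (IllegalArgumentException e) {
            // Illegal or unsupported charset name: treat it as absent.
            return Optional.empty();
        }
    }

    // Transport-layer information wins; the meta tag is consulted only as a fallback.
    static Charset resolve(String contentTypeHeader, byte[] body) {
        Optional<Charset> fromHeader = find(HEADER_CHARSET, contentTypeHeader);
        if (fromHeader.isPresent()) {
            return fromHeader.get();
        }
        // Prescan the first 1024 bytes, read as ISO-8859-1 so every byte maps to one char.
        int limit = Math.min(body.length, 1024);
        String head = new String(body, 0, limit, StandardCharsets.ISO_8859_1);
        // Assumption for this sketch: fall back to UTF-8 when nothing is declared.
        return find(META_CHARSET, head).orElse(StandardCharsets.UTF_8);
    }
}
```

With the example page, `resolve("text/html; charset=utf-8", body)` would return UTF-8 even though the body declares ISO-8859-1; the meta tag is only used when the header carries no charset parameter.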

       

      From a look at the source code, the charset information does not even appear to be extracted from the HTTP headers.

       

      HTTP/1.1 200 OK
      Content-Type: text/html; charset=utf-8
      
      
      <!doctype html>
      <html>
        <head>
          <meta charset="iso-8859-1">
        </head>
        <body>
          <a href="/">français</a>
        </body>
      </html>
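
To make the symptom concrete, here is a small standalone sketch of what happens when the UTF-8 bytes of the link text are decoded as ISO-8859-1, which is what following the bogus meta tag amounts to. The class `MojibakeDemo` and its method are hypothetical names used only for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it only illustrates the decoding mismatch.
class MojibakeDemo {

    // Decodes raw bytes with the given charset.
    static String decodeAs(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // "français", written with a Unicode escape so the source file encoding does not matter.
        byte[] raw = "fran\u00e7ais".getBytes(StandardCharsets.UTF_8);

        // Correct: decode with the charset announced in the HTTP header.
        System.out.println(decodeAs(raw, StandardCharsets.UTF_8));

        // Wrong: decode with the charset from the meta tag. The two UTF-8 bytes
        // of U+00E7 (0xC3 0xA7) become two separate characters, U+00C3 and U+00A7.
        System.out.println(decodeAs(raw, StandardCharsets.ISO_8859_1));
    }
}
```

The second decoding yields the eight-character string "franÃ§ais" instead of "français", which matches the kind of corruption reported here.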
      

            People

              Assignee: Unassigned
              Reporter: Gerard Bouchar (gbouchar)