Nutch / NUTCH-2599

charset detection issue with parse-tika


    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 1.19
    • Component/s: parser
    • Labels:
      None
    • Environment:
      plugin.includes: protocol-http|parse-tika

      Description

      Here is an example page that is displayed correctly in web browsers but is decoded with the wrong charset in Nutch: https://gerardbouchar.github.io/html-encoding-example/index.html

       

      The page's content is encoded in UTF-8 and is served with an HTTP header declaring UTF-8, but it contains a bogus HTML meta tag declaring that it is encoded in ISO-8859-1.
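
To make the symptom concrete, here is a minimal sketch of what happens when the UTF-8 bytes of the link text are decoded as ISO-8859-1, as a parser trusting the meta tag would do. The class and method names are hypothetical and are not part of Nutch or Tika.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Encode the text as UTF-8 (as the server does), then decode the bytes
    // as ISO-8859-1 (as the bogus meta tag suggests).
    static String decodeAsLatin1(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.ISO_8859_1);
    }

    // Encode the text as UTF-8, then decode with the charset from the HTTP header.
    static String decodeAsUtf8(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "fran\u00e7ais"; // the link text of the example page
        System.out.println(decodeAsUtf8(original));
        System.out.println(decodeAsLatin1(original));
    }
}
```

The "ç" character is two bytes in UTF-8 (0xC3 0xA7), and each byte becomes its own character under ISO-8859-1, so the wrongly decoded text reads "franÃ§ais".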

       

      This is a tricky case, but a W3C specification covers it: the charset from the HTTP header (transport-layer information) takes precedence over the one from the HTML meta tag (obtained by prescanning the byte stream). Browsers respect the spec, but the Tika parser does not.
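
The precedence rule described above can be sketched as follows. This is an illustration only, not the Tika or Nutch implementation: the class and method names are hypothetical, byte order mark detection is left out, and charset labels are resolved with Java's own `Charset.forName` rather than the label table a browser would use.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetPrecedence {
    // Returns the charset named by the label, or null if the label is
    // missing, malformed, or not supported by this JVM.
    static Charset lookup(String label) {
        if (label == null || label.trim().isEmpty()) {
            return null;
        }
        try {
            return Charset.forName(label.trim());
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names.
            return null;
        }
    }

    // Transport-layer charset first, then the meta tag, then a fallback.
    static Charset resolve(String httpCharset, String metaCharset, Charset fallback) {
        Charset fromHeader = lookup(httpCharset);
        if (fromHeader != null) {
            return fromHeader;
        }
        Charset fromMeta = lookup(metaCharset);
        if (fromMeta != null) {
            return fromMeta;
        }
        return fallback;
    }

    public static void main(String[] args) {
        // The example page: header says utf-8, meta tag says iso-8859-1.
        Charset chosen = resolve("utf-8", "iso-8859-1", StandardCharsets.ISO_8859_1);
        System.out.println(chosen.name());
    }
}
```

With this ordering the meta tag is consulted only when the HTTP header carries no usable charset, which is the behavior the report expects for the example page.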

       

      From the source code, it appears that the charset information is not even extracted from the HTTP headers.

       

      HTTP/1.1 200 OK
      Content-Type: text/html; charset=utf-8
      
      
      <!doctype html>
      <html>
        <head>
          <meta charset="iso-8859-1">
        </head>
        <body>
          <a href="/">français</a>
        </body>
      </html>
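
For a response like the one above, the charset parameter could be read from the `Content-Type` header value along these lines. This is a minimal sketch with a deliberately lenient regular expression; `ContentTypeCharset` and `charsetOf` are hypothetical names and do not refer to existing Nutch or Tika code.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ContentTypeCharset {
    // Matches a "charset" parameter, with or without double quotes
    // around the value, case-insensitively.
    private static final Pattern CHARSET_PARAM =
        Pattern.compile("(?i);\\s*charset\\s*=\\s*\"?([^\\s;\"]+)\"?");

    // Returns the lowercased charset label, or null if none is present.
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        Matcher m = CHARSET_PARAM.matcher(contentType);
        return m.find() ? m.group(1).toLowerCase(Locale.ROOT) : null;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=utf-8"));
    }
}
```

A label obtained this way would then be given priority over any charset found in the document's meta tags.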
      

              People

              • Assignee: Unassigned
              • Reporter: Gerard Bouchar (gbouchar)
              • Votes: 0
              • Watchers: 1
