Nutch / NUTCH-2599

charset detection issue with parse-tika


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 1.20
    • Component/s: parser
    • Labels: None
    • Environment: plugin.includes: protocol-http|parse-tika

    Description

      Here is an example page that web browsers display correctly, but that Nutch decodes with the wrong charset: https://gerardbouchar.github.io/html-encoding-example/index.html

       

      This page's contents are encoded in UTF-8, and it is served with an HTTP header declaring UTF-8, but it contains a bogus HTML meta tag declaring that it is encoded in ISO-8859-1.
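
      To make the symptom concrete, here is a minimal sketch of what happens when UTF-8 bytes are decoded as ISO-8859-1. The class and method names are hypothetical and are not taken from Nutch or Tika; the non-ASCII characters are written as Unicode escapes so the result does not depend on the source file encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Nutch or Tika.
class MojibakeDemo {

    // Encodes the text with the charset the page really uses, then decodes
    // the resulting bytes with the charset the parser wrongly assumed.
    static String decodeAs(String text, Charset actual, Charset assumed) {
        return new String(text.getBytes(actual), assumed);
    }

    // "fran\u00e7ais" is the link text of the example page.
    static String brokenLinkText() {
        return decodeAs("fran\u00e7ais",
                StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
    }
}
```

      The two UTF-8 bytes of "ç" (0xC3 0xA7) are each read as a separate ISO-8859-1 character, so the link text comes out as "franÃ§ais".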

       

      This is a tricky case, but there is a W3C specification about how to handle it. It clearly states that the HTTP header (transport layer information) should take precedence over the HTML meta tag (obtained by prescanning the byte stream). Browsers respect the spec, but the Tika parser does not.

       

      Judging from the source code, the charset information does not even appear to be extracted from the HTTP headers.

       

      HTTP/1.1 200 OK
      Content-Type: text/html; charset=utf-8
      
      
      <!doctype html>
      <html>
        <head>
          <meta charset="iso-8859-1">
        </head>
        <body>
          <a href="/">français</a>
        </body>
      </html>
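
      The precedence described above can be sketched as follows. This is only an illustration under stated assumptions, not the parse-tika or Tika implementation: the class and method names are hypothetical, the meta prescan is a simplified regular expression over the first 1024 bytes, byte-order-mark sniffing and the HTML label aliases (such as treating iso-8859-1 as windows-1252) are omitted, and the final UTF-8 fallback is an arbitrary choice.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch or Tika.
class CharsetPrecedenceSketch {

    // Finds the charset parameter in a Content-Type header value.
    private static final Pattern HEADER_CHARSET = Pattern.compile(
            "charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    // Simplified stand-in for the meta prescan: covers <meta charset=...>
    // and <meta http-equiv=... content="...; charset=...">.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([^\\s\"'/>;]+)",
            Pattern.CASE_INSENSITIVE);

    // Returns the charset for a label, or null if it is unknown or invalid.
    static Charset fromLabel(String label) {
        try {
            return Charset.forName(label);
        } catch (IllegalArgumentException e) {
            return null;
        }
    }

    // Returns the charset named by the first match of the pattern, or null.
    static Charset firstMatch(Pattern pattern, String text) {
        if (text == null) {
            return null;
        }
        Matcher m = pattern.matcher(text);
        return m.find() ? fromLabel(m.group(1)) : null;
    }

    static Charset chooseCharset(String contentType, byte[] body) {
        // 1. Transport layer information wins when it names a usable charset.
        Charset fromHeader = firstMatch(HEADER_CHARSET, contentType);
        if (fromHeader != null) {
            return fromHeader;
        }
        // 2. Otherwise prescan the start of the body. ISO-8859-1 maps every
        //    byte to one char, so ASCII markup survives the decoding.
        String head = new String(body, 0, Math.min(body.length, 1024),
                StandardCharsets.ISO_8859_1);
        Charset fromMeta = firstMatch(META_CHARSET, head);
        if (fromMeta != null) {
            return fromMeta;
        }
        // 3. Assumed fallback for this sketch only.
        return StandardCharsets.UTF_8;
    }
}
```

      With the response shown above, chooseCharset("text/html; charset=utf-8", body) selects UTF-8 and the meta tag is never consulted; the meta tag only matters when the Content-Type header carries no usable charset parameter.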
      


          People

            Assignee: Unassigned
            Reporter: Gerard Bouchar (gbouchar)
