Details
Description
Here is an example page that is displayed correctly in web browsers, but is decoded with the wrong charset in Nutch: https://gerardbouchar.github.io/html-encoding-example/index.html
This page's contents are encoded in UTF-8, and it is served with HTTP headers indicating UTF-8, but it contains a bogus HTML meta tag claiming that it is encoded in ISO-8859-1.
This is a tricky case, but there is a W3C specification describing how to handle it. It clearly states that the HTTP header (transport-layer information) should take precedence over the HTML meta tag (obtained by prescanning the byte stream). Browsers respect the spec, but the Tika parser doesn't.
Judging from the source code, the charset information is apparently not even extracted from the HTTP headers.
HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8

<!doctype html>
<html>
  <head>
    <meta charset="iso-8859-1">
  </head>
  <body>
    <a href="/">français</a>
  </body>
</html>
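As a minimal sketch of what extracting the transport-layer charset could look like (the class name `CharsetFromHeader` and the regex-based parsing are illustrative assumptions, not Tika's actual implementation), the charset parameter of the Content-Type header can be pulled out before any byte-stream prescanning is consulted:

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extract the charset parameter from a
// Content-Type header value, so it can take precedence over
// any charset declared in an HTML meta tag.
public class CharsetFromHeader {

    // Matches e.g. "charset=utf-8" or charset="utf-8", case-insensitively.
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*[\"']?([^\"';\\s]+)",
                            Pattern.CASE_INSENSITIVE);

    /** Returns the declared charset in upper case, or null if absent. */
    public static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(contentType);
        return m.find() ? m.group(1).toUpperCase(Locale.ROOT) : null;
    }

    public static void main(String[] args) {
        // For the example page above, the header wins over the meta tag:
        System.out.println(charsetOf("text/html; charset=utf-8")); // UTF-8
        System.out.println(charsetOf("text/html"));                // null
    }
}
```

If such a value is present, a spec-compliant detector would use it directly and skip the meta-tag prescan entirely.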
Attachments
Issue Links
- depends upon
TIKA-2671 HtmlEncodingDetector doesnt take provided metadata into account
- Open