[NUTCH-2599] charset detection issue with parse-tika - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: 1.21
Component/s: parser
Labels: None

Environment:

plugin.includes: protocol-http|parse-tika

Description

Here is an example page that is displayed correctly in web browsers but is decoded with the wrong charset in Nutch: https://gerardbouchar.github.io/html-encoding-example/index.html

This page's contents are encoded in UTF-8, and it is served with HTTP headers indicating that it is in UTF-8, but it contains a bogus HTML meta tag indicating that it is encoded in ISO-8859-1.
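To illustrate the symptom, here is a minimal Python sketch (not Nutch or Tika code) of what happens when the UTF-8 bytes of the word used in the example page are decoded as ISO-8859-1. The variable names are purely illustrative.

```python
# The link text from the example page, encoded the way the server sends it.
raw = "français".encode("utf-8")

# Decoding with the charset announced in the HTTP header gives the right text.
as_utf8 = raw.decode("utf-8")

# Decoding with the charset from the bogus meta tag turns the two-byte
# UTF-8 sequence for "ç" (0xC3 0xA7) into two separate Latin-1 characters.
as_latin1 = raw.decode("iso-8859-1")

print(as_utf8)    # français
print(as_latin1)  # franÃ§ais
```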

This is a tricky case, but there is a W3C specification about how to handle it. It clearly states that the HTTP header (transport layer information) should take precedence over the HTML meta tag (obtained by prescanning the byte stream). Browsers respect the spec, but the Tika parser does not.

Judging from the source code, the charset information does not even appear to be extracted from the HTTP headers.

HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8


<!doctype html>
<html>
  <head>
    <meta charset="iso-8859-1">
  </head>
  <body>
    <a href="/">français</a>
  </body>
</html>
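
The precedence described above can be sketched as follows. This is a simplified Python illustration under stated assumptions, not the Tika or Nutch implementation: the function names, the 1024-byte prescan limit, and the `windows-1252` fallback are hypothetical choices made for the example, and other steps of the specification (such as byte-order-mark detection) are omitted.

```python
import re
from typing import Optional


def charset_from_content_type(header: Optional[str]) -> Optional[str]:
    """Return the charset parameter of a Content-Type header value, if any."""
    if not header:
        return None
    match = re.search(r'charset\s*=\s*"?([^\s;"]+)', header, re.IGNORECASE)
    return match.group(1).lower() if match else None


def charset_from_meta(body: bytes, limit: int = 1024) -> Optional[str]:
    """Very simplified prescan: find a meta charset in the first bytes."""
    head = body[:limit].decode("ascii", errors="replace")
    match = re.search(r'<meta[^>]*charset\s*=\s*["\']?([\w.:-]+)',
                      head, re.IGNORECASE)
    return match.group(1).lower() if match else None


def resolve_charset(content_type: Optional[str], body: bytes,
                    default: str = "windows-1252") -> str:
    """Prefer the HTTP header, then the meta tag, then a default (assumed)."""
    return (charset_from_content_type(content_type)
            or charset_from_meta(body)
            or default)


page = ('<!doctype html><html><head><meta charset="iso-8859-1"></head>'
        '<body><a href="/">français</a></body></html>').encode("utf-8")

charset = resolve_charset("text/html; charset=utf-8", page)
print(charset)                               # utf-8
print("français" in page.decode(charset))    # True
```

With no Content-Type charset, the same function falls back to the meta tag, which is the only case where the bogus `iso-8859-1` declaration would be used.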

Issue Links

depends upon

TIKA-2671 HtmlEncodingDetector doesnt take provided metadata into account

Open

People

Assignee: Unassigned

Reporter: Gerard Bouchar

Votes: 0

Watchers: 1

Dates

Created: 15/Jun/18 13:52

Updated: 30/Mar/24 17:19