[ANY23-411] Use Content-Type to help determine encoding - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.3
Fix Version/s: 2.3
Component/s: encoding
Labels: None

Description

Incredibly enough, it seems that our encoding detector does not take the Content-Type header into account at all when trying to guess a document's charset encoding!

This has caused a problem for me with the page: http://w3c.github.io/microdata-rdf/tests/0065.html

Even though the Content-Type header is set to "text/html; charset=utf-8", we guess the charset to be "IBM500", which turns the page into complete gibberish.

This must be a bug in Tika, because even when I set the declared encoding of the charset detector to UTF-8, IBM500 is still the most confident result.

Cf. https://issues.apache.org/jira/browse/TIKA-2771
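For illustration, here is a minimal sketch of how a declared charset could be pulled out of the Content-Type header and preferred over the detector's guess. This is not the actual Any23 or Tika code: the class `ContentTypeCharset` and the methods `declaredCharset` and `resolve` are hypothetical names, and the "declared charset wins outright" policy is an assumption made to keep the example short.

```java
import java.nio.charset.Charset;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Any23 or Tika.
final class ContentTypeCharset {

    // Matches a charset parameter such as `; charset=utf-8` or `; charset="utf-8"`.
    private static final Pattern CHARSET_PARAM =
            Pattern.compile("(?i);\\s*charset\\s*=\\s*\"?([^\\s;\"]+)\"?");

    private ContentTypeCharset() {
    }

    // Returns the charset declared in a Content-Type header value, if any.
    static Optional<Charset> declaredCharset(String contentType) {
        if (contentType == null) {
            return Optional.empty();
        }
        Matcher m = CHARSET_PARAM.matcher(contentType);
        if (!m.find()) {
            return Optional.empty();
        }
        try {
            return Optional.of(Charset.forName(m.group(1)));
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names.
            return Optional.empty();
        }
    }

    // Assumed policy: trust the declared charset, else fall back to the detector's guess.
    static Charset resolve(String contentType, Charset detectorGuess) {
        return declaredCharset(contentType).orElse(detectorGuess);
    }
}
```

With this sketch, `ContentTypeCharset.resolve("text/html; charset=utf-8", guess)` yields UTF-8 whatever the guess is, while a header without a usable charset parameter falls through to the guess. A real fix might instead feed the declared charset to the detector as a strong hint, so that a wrong header can still be overridden.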

People

Assignee: Hans Brende

Reporter: Hans Brende

Votes: 0

Watchers: 2

Dates

Created: 25/Oct/18 21:11

Updated: 01/Nov/18 17:37

Resolved: 25/Oct/18 22:43