Apache Any23 (Retired): ANY23-385

Improve charset detection for (x)html documents


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version: 2.3
    • Fix Version: 2.3
    • Component: encoding
    • Labels: None

    Description

      When attempting to detect a document's encoding, our TikaEncodingDetector does not take into account the following elements, which may occur in HTML/XHTML documents:

      HTML:
      <meta http-equiv="content-type" content="text/html; charset=xyz"/>

      HTML5:
      <meta charset="xyz">

      XHTML:
      <?xml encoding='xyz'?>
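A minimal, self-contained sketch of what scanning a document prefix for these three declaration forms could look like (the class and method names here are hypothetical, not Any23's or Tika's actual API; real-world attribute order and quoting vary more than these regexes cover):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetaCharsetSniffer {
    // Patterns for the three declaration forms listed above (case-insensitive).
    private static final Pattern HTML_META = Pattern.compile(
        "<meta[^>]+http-equiv\\s*=\\s*[\"']?content-type[\"']?[^>]*charset\\s*=\\s*[\"']?([\\w-]+)",
        Pattern.CASE_INSENSITIVE);
    private static final Pattern HTML5_META = Pattern.compile(
        "<meta\\s+charset\\s*=\\s*[\"']?([\\w-]+)", Pattern.CASE_INSENSITIVE);
    private static final Pattern XML_DECL = Pattern.compile(
        "<\\?xml[^>]*encoding\\s*=\\s*[\"']([\\w-]+)[\"']", Pattern.CASE_INSENSITIVE);

    /** Returns the declared charset, if any of the three forms is present and valid. */
    public static Optional<Charset> sniff(byte[] prefix) {
        // Decode the prefix as ISO-8859-1: it maps every byte value, so the
        // ASCII markup we are matching against survives intact.
        String head = new String(prefix, StandardCharsets.ISO_8859_1);
        for (Pattern p : new Pattern[] {XML_DECL, HTML5_META, HTML_META}) {
            Matcher m = p.matcher(head);
            if (m.find()) {
                try {
                    return Optional.of(Charset.forName(m.group(1)));
                } catch (IllegalArgumentException e) {
                    // Unknown or malformed charset name: keep looking.
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        byte[] doc = "<html><head><meta charset=\"utf-8\"></head></html>"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(sniff(doc).map(Charset::name).orElse("unknown")); // prints "UTF-8"
    }
}
```

A declaration-aware pass like this can be consulted before (or alongside) byte-statistics detection, so that an explicitly declared charset is not overridden by a statistical guess.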

      In addition, the TikaEncodingDetector sniffs only the first 12000 bytes of the document. If, for example, the first UTF-8 encoded character occurs later than that, the detector may misidentify the encoding as ISO-8859-1 or Windows-1252 instead of UTF-8, even when UTF-8 is specified in the page's meta charset element.
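This failure mode is easy to construct: if every byte inside the sniffed window is plain ASCII, a byte-statistics detector has no evidence to distinguish UTF-8 from ISO-8859-1 or Windows-1252, regardless of what the page declares. A small demonstration (the 12000-byte window size is the limit described above):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class SniffWindowDemo {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream doc = new ByteArrayOutputStream();
        doc.write("<meta charset=\"utf-8\">".getBytes(StandardCharsets.US_ASCII));
        // Pad with ASCII so the first non-ASCII byte lands past the sniff window.
        for (int i = doc.size(); i < 12000; i++) doc.write('a');
        doc.write("é".getBytes(StandardCharsets.UTF_8)); // first multi-byte sequence
        byte[] bytes = doc.toByteArray();

        boolean windowAllAscii = true;
        for (int i = 0; i < 12000; i++) {
            if ((bytes[i] & 0x80) != 0) windowAllAscii = false;
        }
        // A detector that only sees the first 12000 bytes finds no UTF-8
        // byte sequences here, despite the declaration at the top of the page.
        System.out.println(windowAllAscii); // prints "true"
    }
}
```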

      I have seen this problem occur with, e.g., the webpage http://losangeles.eventful.com/events/september, where the UTF-8 charset was properly specified at the top of the page, but the first UTF-8 encoded characters occurred far past the 12000 byte mark, in JSON-LD content towards the bottom of the page. The TikaEncodingDetector therefore misidentified the encoding as ISO-8859-1, and certain JSON-LD strings came out looking like gibberish.
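The "gibberish" effect reproduces in one line: each byte of a multi-byte UTF-8 sequence, decoded under the misidentified single-byte charset, becomes a separate Latin-1 character:

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        String original = "café"; // 'é' is two bytes in UTF-8: 0xC3 0xA9
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        // Misdetected encoding: decode the UTF-8 bytes as ISO-8859-1.
        String garbled = new String(utf8, StandardCharsets.ISO_8859_1);
        System.out.println(garbled); // prints "cafÃ©"
    }
}
```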


            People

              Assignee: hansbrende (Hans Brende)
              Reporter: hansbrende (Hans Brende)
              Votes: 0
              Watchers: 3
