Description
When attempting to detect a document's encoding, our TikaEncodingDetector does not take into account the following declarations, which may occur in HTML/XHTML documents (a sketch of how these declarations could be extracted follows the list):
HTML:
<meta http-equiv="content-type" content="text/html; charset=xyz"/>
HTML5:
<meta charset="xyz">
XHTML:
<?xml version="1.0" encoding='xyz'?>
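
As a sketch of what honoring these declarations could look like, the following hypothetical helper (not part of Tika's API; the class and pattern names are illustrative assumptions) decodes the leading bytes as ISO-8859-1, which preserves all ASCII markup regardless of the document's true encoding, and matches the declaration forms listed above:

import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not Tika API: pulls a declared charset out of the
// leading bytes of an (X)HTML document.
public class MetaCharsetSniffer {

    // Matches charset=xyz in both meta forms:
    //   <meta http-equiv="content-type" content="text/html; charset=xyz"/>
    //   <meta charset="xyz">
    private static final Pattern META_CHARSET = Pattern.compile(
            "charset\\s*=\\s*['\"]?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);

    // Matches the encoding pseudo-attribute of an XML declaration:
    //   <?xml version="1.0" encoding='xyz'?>
    private static final Pattern XML_ENCODING = Pattern.compile(
            "<\\?xml[^>]*encoding\\s*=\\s*['\"]([\\w.:-]+)['\"]",
            Pattern.CASE_INSENSITIVE);

    /** Returns the declared charset name, or null if none is found. */
    public static String sniff(byte[] prefix) {
        // ISO-8859-1 maps every byte to a character, so the ASCII markup
        // survives decoding even if the true encoding is something else.
        String head = new String(prefix, StandardCharsets.ISO_8859_1);
        Matcher m = XML_ENCODING.matcher(head);
        if (m.find()) {
            return m.group(1);
        }
        m = META_CHARSET.matcher(head);
        if (m.find()) {
            return m.group(1);
        }
        return null;
    }
}

In practice the matching would need to be restricted to real meta tags within the document head, but this illustrates the information the detector currently ignores.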
In addition, the TikaEncodingDetector only sniffs the first 12000 bytes of the document. If, for example, the first multi-byte UTF-8 sequence occurs after that point, the detector may misidentify the encoding as ISO-8859-1 or Windows-1252 instead of UTF-8, even when UTF-8 is declared in the page's meta charset element.
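
To make the truncation failure concrete, here is a minimal, self-contained sketch (not Tika code; the class, the helper method, and the window constant are illustrative assumptions). It builds a UTF-8 document whose only multi-byte sequence lies past the sniffed window, and shows that the window contains no byte that could distinguish UTF-8 from ISO-8859-1 or Windows-1252; that is exactly the situation in which a statistical detector falls back to a single-byte charset:

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class TruncationDemo {

    private static final int SNIFF_WINDOW = 12000; // bytes actually examined

    public static void main(String[] args) {
        // ~20000 bytes of pure ASCII followed by a single multi-byte
        // character, mimicking late JSON-LD content at the bottom of a page.
        StringBuilder sb = new StringBuilder();
        while (sb.length() < 20000) {
            sb.append("plain ASCII filler text. ");
        }
        sb.append("caf\u00e9"); // the 'é' becomes the two bytes 0xC3 0xA9 in UTF-8
        byte[] document = sb.toString().getBytes(StandardCharsets.UTF_8);

        byte[] window = Arrays.copyOf(document, SNIFF_WINDOW);
        System.out.println("evidence in full document: " + hasNonAsciiByte(document)); // true
        System.out.println("evidence in sniffed window: " + hasNonAsciiByte(window));  // false
    }

    /**
     * A buffer with no byte above 0x7F is valid ASCII, and therefore equally
     * valid UTF-8, ISO-8859-1, and Windows-1252; a statistical detector given
     * only such a window has nothing to base a UTF-8 guess on.
     */
    static boolean hasNonAsciiByte(byte[] bytes) {
        for (byte b : bytes) {
            if ((b & 0x80) != 0) {
                return true;
            }
        }
        return false;
    }
}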
I have seen this problem occur with, e.g., the webpage http://losangeles.eventful.com/events/september: the UTF-8 charset was properly declared at the top of the page, but the first UTF-8 encoded characters occurred far past the 12000-byte mark, in JSON-LD content toward the bottom of the page. The TikaEncodingDetector therefore misidentified the encoding as ISO-8859-1, and certain JSON-LD strings came out looking like gibberish.