[TIKA-3612] Update StandardHtmlEncodingDetector to follow the living standard - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 2.1.0
Fix Version/s: None
Component/s: detector
Labels: None

Description

StandardHtmlEncodingDetector uses three heuristics to detect the encoding of an HTML document:

BOM
Content-Type HTTP header
HTML <meta> tag
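
For illustration, the three heuristics can be sketched in plain Java as below. This is not Tika's implementation: the class name, method names, and regular expressions are hypothetical, the <meta> scan is a rough regex over the first 1024 bytes rather than the standard's prescan algorithm, and labels are resolved with Java's Charset.forName, which does not follow the label mapping of the Encoding standard (for example, the standard treats the label iso-8859-1 as windows-1252).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not the Tika class: BOM, then Content-Type header, then <meta> tag.
public class EncodingSniffSketch {

    // Assumed pattern: finds a charset=... parameter in a header value or inside a <meta> tag.
    private static final Pattern CHARSET_PARAM =
            Pattern.compile("charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)", Pattern.CASE_INSENSITIVE);
    // Assumed pattern: a very rough match for <meta ...> tags.
    private static final Pattern META_TAG =
            Pattern.compile("<meta\\s[^>]*>", Pattern.CASE_INSENSITIVE);

    // Heuristic 1: byte order mark.
    static Optional<Charset> fromBom(byte[] b) {
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF) {
            return Optional.of(StandardCharsets.UTF_8);
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return Optional.of(StandardCharsets.UTF_16BE);
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            return Optional.of(StandardCharsets.UTF_16LE);
        }
        return Optional.empty();
    }

    // Shared by heuristics 2 and 3: extract and resolve a charset label.
    static Optional<Charset> fromLabel(String text) {
        if (text == null) {
            return Optional.empty();
        }
        Matcher m = CHARSET_PARAM.matcher(text);
        if (!m.find()) {
            return Optional.empty();
        }
        try {
            return Optional.of(Charset.forName(m.group(1)));
        } catch (IllegalArgumentException e) {
            // Illegal or unsupported charset name: treat as "no information".
            return Optional.empty();
        }
    }

    // Heuristic 3: look for a charset declaration in <meta> tags near the start.
    static Optional<Charset> fromMeta(byte[] b) {
        // ISO-8859-1 maps every byte to one char, so ASCII markup survives decoding.
        String head = new String(b, 0, Math.min(b.length, 1024), StandardCharsets.ISO_8859_1);
        Matcher tags = META_TAG.matcher(head);
        while (tags.find()) {
            Optional<Charset> cs = fromLabel(tags.group());
            if (cs.isPresent()) {
                return cs;
            }
        }
        return Optional.empty();
    }

    // Apply the heuristics in order; the first one that yields a charset wins.
    static Optional<Charset> detect(byte[] body, String contentType) {
        Optional<Charset> cs = fromBom(body);
        if (cs.isPresent()) {
            return cs;
        }
        cs = fromLabel(contentType);
        if (cs.isPresent()) {
            return cs;
        }
        return fromMeta(body);
    }
}
```

When none of the three heuristics fires, detect returns an empty Optional, which is the point where the further steps of the living standard would apply.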

The "living standard" (section 13.2.3.2, "Determining the character encoding") has since evolved and is now based on a longer chain of steps. These include character/byte statistics ("The user agent may attempt to autodetect the character encoding from applying frequency analysis or other algorithms to the data stream.") and, definitely useful, a list of fallback encodings, selected by content language, for documents that are not encoded in one of the UTF encodings.
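
As a rough illustration of the language-based fallback, the lookup could look like the sketch below. The class and method names are hypothetical, and the table holds only a handful of entries as examples; the authoritative list is the one in the living standard and should be checked there. Encoding labels are returned as strings because not every JVM ships all legacy charsets.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: pick a legacy fallback encoding from a content-language tag.
public class FallbackEncodingSketch {

    // Illustrative subset only; verify against the table in the living standard.
    private static final Map<String, String> FALLBACK_BY_LANGUAGE = Map.of(
            "ru", "windows-1251",
            "ja", "Shift_JIS",
            "ko", "EUC-KR",
            "tr", "windows-1254");

    // Assumed default for languages without a dedicated entry.
    static final String DEFAULT_FALLBACK = "windows-1252";

    // languageTag is a BCP 47 tag such as "ru-RU"; null or unknown tags get the default.
    static String fallbackFor(String languageTag) {
        if (languageTag == null || languageTag.isEmpty()) {
            return DEFAULT_FALLBACK;
        }
        // Reduce "ru-RU" to its primary language subtag "ru".
        String language = Locale.forLanguageTag(languageTag).getLanguage();
        return FALLBACK_BY_LANGUAGE.getOrDefault(language, DEFAULT_FALLBACK);
    }
}
```

Such a lookup would only be consulted after the BOM, header, <meta>, and any statistical detection have failed to produce an encoding.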

People

Assignee: Unassigned

Reporter: Sebastian Nagel

Votes: 0

Watchers: 1

Dates

Created: 05/Dec/21 22:08

Updated: 05/Dec/21 22:08