Details
Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 2.1.0
None
None
Description
StandardHtmlEncodingDetector uses three heuristics to detect the encoding of an HTML document:
- BOM
- Content-Type HTTP header
- HTML <meta> tag
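For illustration, the three heuristics can be sketched as a simple chain. This is not the code of StandardHtmlEncodingDetector: the class name, the method names, the regular expressions, the 1024-byte scan window, and the order (BOM, then header, then meta tag, as listed above) are all assumptions made for this sketch.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not the Tika implementation.
final class EncodingSniffSketch {

    // Assumed pattern: picks the label out of "charset=..." in a header value or a tag.
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*[\"']?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);
    private static final Pattern META_TAG =
            Pattern.compile("<meta\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // Heuristic 1: byte order mark.
    static Optional<Charset> fromBom(byte[] b) {
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF) {
            return Optional.of(StandardCharsets.UTF_8);
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return Optional.of(StandardCharsets.UTF_16BE);
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            return Optional.of(StandardCharsets.UTF_16LE);
        }
        return Optional.empty();
    }

    // Shared helper: first charset label in the text that this JVM supports.
    static Optional<Charset> fromLabel(String text) {
        if (text == null) {
            return Optional.empty();
        }
        Matcher m = CHARSET.matcher(text);
        while (m.find()) {
            try {
                return Optional.of(Charset.forName(m.group(1)));
            } catch (IllegalArgumentException e) {
                // Unknown or malformed label: keep looking.
            }
        }
        return Optional.empty();
    }

    // Heuristic 3: <meta> tags in the first bytes, read as ISO-8859-1 so every byte maps to a char.
    static Optional<Charset> fromMeta(byte[] b) {
        String head = new String(b, 0, Math.min(b.length, 1024), StandardCharsets.ISO_8859_1);
        Matcher tag = META_TAG.matcher(head);
        while (tag.find()) {
            Optional<Charset> c = fromLabel(tag.group());
            if (c.isPresent()) {
                return c;
            }
        }
        return Optional.empty();
    }

    // Chain: BOM, then Content-Type header (heuristic 2), then <meta> tag.
    static Optional<Charset> detect(byte[] body, String contentTypeHeader) {
        Optional<Charset> c = fromBom(body);
        if (c.isPresent()) {
            return c;
        }
        c = fromLabel(contentTypeHeader);
        if (c.isPresent()) {
            return c;
        }
        return fromMeta(body);
    }

    public static void main(String[] args) {
        byte[] html = "<html><head><meta charset=\"utf-8\"></head></html>"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(html, "text/html"));
    }
}
```

When none of the three sources yields a usable label, the sketch returns an empty result. That is the point where further steps would have to take over.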
The HTML "living standard" (section 13.2.3.2, "Determining the character encoding") has evolved since then and is now based on a longer chain of steps, including character/byte statistics ("The user agent may attempt to autodetect the character encoding from applying frequency analysis or other algorithms to the data stream.") and, definitely useful, a list of fallback encodings chosen by content language for documents that are not encoded in one of the UTF encodings.
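A minimal sketch of such a language-based fallback might look as follows. The class and method names are hypothetical. The table holds only a handful of entries that I believe match the standard's suggested defaults and should be checked against the current text. Treating "valid UTF-8" as "encoded using one of the UTF encodings" is a simplification made for this example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of a last-resort fallback keyed on the content language.
final class FallbackEncodingSketch {

    // Assumed default for languages that are not in the table.
    static final String DEFAULT_LABEL = "windows-1252";

    // Partial, illustrative table: primary language subtag -> encoding label.
    private static final Map<String, String> BY_LANGUAGE = new HashMap<>();
    static {
        BY_LANGUAGE.put("ru", "windows-1251");
        BY_LANGUAGE.put("pl", "ISO-8859-2");
        BY_LANGUAGE.put("tr", "windows-1254");
        BY_LANGUAGE.put("ja", "Shift_JIS");
        BY_LANGUAGE.put("ko", "EUC-KR");
    }

    // True if the bytes decode as UTF-8 without malformed sequences.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            // A new decoder reports malformed input by default, so this throws on bad bytes.
            StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Returns an encoding label; the language table is consulted only for non-UTF-8 content.
    static String fallbackLabel(byte[] body, String contentLanguage) {
        if (isValidUtf8(body)) {
            return "UTF-8";
        }
        if (contentLanguage == null) {
            return DEFAULT_LABEL;
        }
        // Reduce "ru-RU" or "ru_RU" to the primary subtag "ru".
        String primary = contentLanguage.trim().toLowerCase(Locale.ROOT).split("[-_]")[0];
        return BY_LANGUAGE.getOrDefault(primary, DEFAULT_LABEL);
    }

    public static void main(String[] args) {
        byte[] legacyBytes = {(byte) 0xCF, (byte) 0xF0, (byte) 0xE8}; // not valid UTF-8
        System.out.println(fallbackLabel(legacyBytes, "ru-RU"));
    }
}
```

The sketch returns labels as strings, not Charset objects, because not every JVM ships every legacy charset. A caller would still need to resolve the label and handle the unsupported case.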