Details
Type: Improvement
Status: Open
Priority: Minor
Resolution: Unresolved
Description
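Our newer standards-based detector follows the standard and automatically changes a charset header of "UTF-16" (and some variants) to "UTF-8". This appears to help quite a bit: in the rows below, where the legacy detector reported a UTF-16 variant and the standard detector reported UTF-8, the number of common tokens is far higher with the standard detector.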
LegacyHTMLDetector | StandardHTMLDetector | NumCommonTokensLegacy | NumCommonTokensStandard | Delta |
---|---|---|---|---|
UTF-16LE | UTF-8 | 10 | 27351 | 27341 |
UTF-16BE | UTF-8 | 14 | 43565 | 43551 |
UTF-16 | UTF-8 | 6099 | 748024 | 741925 |
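
For illustration, the override described above can be sketched roughly as follows. This is a minimal, hypothetical sketch and not the actual detector code: the class name `MetaCharsetOverride`, the method `effectiveCharset`, and the exact set of labels treated as UTF-16 variants are all assumptions made for the example. The idea is that a charset declaration that could be read from the raw bytes as ASCII-compatible text cannot really belong to a UTF-16 document, so a declared UTF-16 label is replaced with UTF-8.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of the charset override; names and label set are assumptions.
public class MetaCharsetOverride {

    // Assumed set of declared labels that get overridden (lowercased for comparison).
    private static final Set<String> UTF16_LABELS =
            Set.of("utf-16", "utf-16le", "utf-16be");

    // Returns the charset to use for a declared charset label.
    // A declared UTF-16 variant is replaced with UTF-8; anything else is
    // returned trimmed but otherwise unchanged.
    static String effectiveCharset(String declared) {
        if (declared == null) {
            return null;
        }
        String trimmed = declared.trim();
        String label = trimmed.toLowerCase(Locale.ROOT);
        if (UTF16_LABELS.contains(label)) {
            return "UTF-8";
        }
        return trimmed;
    }

    public static void main(String[] args) {
        for (String declared : new String[] {"UTF-16LE", "UTF-16BE", "UTF-16", "windows-1252"}) {
            System.out.println(declared + " -> " + effectiveCharset(declared));
        }
    }
}
```

A real detector would also need to decide how byte order marks and HTTP-level charset headers interact with this rule; the sketch only covers the label rewrite itself.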