Details
Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Description
Revisit "replacement" encoding mappings in StandardHtmlEncodingDetector.
I'm finally getting around to running the comparisons between our legacy HtmlEncodingDetector and the newer StandardHtmlEncodingDetector. More analysis is required, but the newer one is generally better*. One area for improvement/explanation, though, is the "replacement" encoding.
- There are 1 million more "common words" in text extracted from files with the StandardHtmlEncodingDetector than with our legacy detector alone. There are 133M common words in our legacy extracts, so that's an improvement of less than 1% (roughly 0.75%).
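
For context on the "replacement" encoding mentioned above: in the WHATWG Encoding Standard, a handful of legacy labels resolve to a special encoding named "replacement" rather than to a real charset. The sketch below only illustrates that label lookup. The class and method names are hypothetical, and it is not meant to describe how StandardHtmlEncodingDetector is actually implemented.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, for illustration only; not part of Tika.
public class ReplacementLabels {

    // Labels that the WHATWG Encoding Standard maps to the "replacement" encoding.
    private static final Set<String> REPLACEMENT_LABELS = Set.of(
            "csiso2022kr",
            "hz-gb-2312",
            "iso-2022-cn",
            "iso-2022-cn-ext",
            "iso-2022-kr",
            "replacement");

    // Returns true if the declared label would resolve to "replacement".
    // Labels are compared case-insensitively after trimming surrounding whitespace
    // (an approximation of the standard's ASCII-whitespace stripping).
    public static boolean isReplacement(String label) {
        if (label == null) {
            return false;
        }
        return REPLACEMENT_LABELS.contains(label.trim().toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(isReplacement(" ISO-2022-KR ")); // prints "true"
        System.out.println(isReplacement("UTF-8"));         // prints "false"
    }
}
```

A check like this could be used when comparing detector output, for example to count how many files in a corpus declare one of these labels before deciding how they should be mapped.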