Details
- Type: Improvement
- Status: Closed
- Priority: Minor
- Resolution: Not A Problem
- Affects Version/s: 0.5
- Fix Version/s: None
- Component/s: None
- Labels: None
Description
Charset detection for HTML pages is currently unreliable, because so much of the text at the beginning of a document is HTML markup rather than content.
A simple solution would be to skip over the first 2K (assuming the document is long enough) before passing bytes to ICU4J.
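As a rough illustration of the simple approach, the sketch below drops a fixed-size prefix before detection. The class and method names are hypothetical, and it assumes "long enough" means longer than the skip size; the ticket does not define that threshold. The returned bytes would then be handed to the detector (for ICU4J, `CharsetDetector.setText(byte[])`), which is left out here so the example runs without the library.

```java
import java.util.Arrays;

public class PrefixSkipper {
    /** Skip size taken from the description: 2K. */
    static final int SKIP_BYTES = 2048;

    /**
     * Returns the bytes to pass to the charset detector.
     * Assumption: a document counts as "long enough" when it is longer than
     * skip bytes; shorter documents are returned unchanged.
     */
    static byte[] sampleForDetection(byte[] doc, int skip) {
        if (doc.length <= skip) {
            return doc;
        }
        return Arrays.copyOfRange(doc, skip, doc.length);
    }

    public static void main(String[] args) {
        byte[] shortDoc = new byte[100];
        byte[] longDoc = new byte[5000];
        System.out.println(sampleForDetection(shortDoc, SKIP_BYTES).length); // 100
        System.out.println(sampleForDetection(longDoc, SKIP_BYTES).length);  // 2952
    }
}
```

A fixed offset is cheap but blind: it can cut into the middle of a multi-byte character or still land inside markup on pages with a large head section.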
A more complex solution would be to scan for the title and body tags, and pass only the bytes found inside each.
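One possible shape for the tag-scanning approach is sketched below. All names are hypothetical, and it assumes the page uses an ASCII-compatible encoding so that tag names can be found at the byte level (this would not work for UTF-16). It only strips what lies outside the title and body elements; markup nested inside the body is still passed through.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class TagContentSampler {

    /** Collects the raw bytes inside the first title and body elements. */
    static byte[] titleAndBodyBytes(byte[] doc) {
        // ISO-8859-1 maps each byte to exactly one char, so string indexes
        // are also byte offsets. Assumption: tag names are ASCII-encoded.
        String view = new String(doc, StandardCharsets.ISO_8859_1)
                .toLowerCase(Locale.ROOT);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        appendElementContent(doc, view, "title", out);
        appendElementContent(doc, view, "body", out);
        return out.toByteArray();
    }

    private static void appendElementContent(byte[] doc, String view,
                                             String tag, ByteArrayOutputStream out) {
        int open = view.indexOf("<" + tag);
        if (open < 0) {
            return; // element not present
        }
        int start = view.indexOf('>', open);
        if (start < 0) {
            return; // opening tag is never closed
        }
        start++; // first byte after the opening tag
        int end = view.indexOf("</" + tag, start);
        if (end < 0) {
            end = doc.length; // no closing tag: take the rest of the document
        }
        out.write(doc, start, end - start);
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Hi</title></head>"
                + "<body class=\"x\">caf\u00e9</body></html>";
        byte[] sample = titleAndBodyBytes(html.getBytes(StandardCharsets.ISO_8859_1));
        System.out.println(sample.length); // 6
    }
}
```

The resulting bytes would be passed to the detector in place of the full document. A real implementation would likely need a fallback to the whole document when neither tag is found.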