Details
Type: New Feature
Status: Closed
Priority: Major
Resolution: Fixed
Description
transferred from:
http://sourceforge.net/tracker/index.php?func=detail&aid=995730&group_id=59548&atid=491356
submitted by:
Jungshik Shin
This is a follow-up to bug 993380 (figure out 'charset'
from the meta tag).
Although we can cover a lot of ground using the
Content-Type ('C-T') field in the HTTP header and the
corresponding meta tag in HTML documents (XML needs
similar but separate parsing), there are a lot of
documents in the wild without any information about
the character encoding used. Browsers like Mozilla and
search engines like Google use character encoding
detectors to deal with these 'unlabelled' documents.
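To make the 'labelled' case concrete, here is a minimal
sketch of reading the charset from a Content-Type header
value and, failing that, from a meta tag. The class and
method names (DeclaredCharset, fromContentType,
fromMetaTag, declared) are made up for illustration, and
the regular expressions are a simplification, not a real
HTML parser. An empty result is the 'unlabelled' case
that needs a detector.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
public class DeclaredCharset {
    // Matches charset=value, with optional quotes, case-insensitively.
    private static final Pattern CHARSET = Pattern.compile(
        "charset\\s*=\\s*[\"']?([^\\s\"';>/]+)", Pattern.CASE_INSENSITIVE);
    // Matches a whole <meta ...> tag so charset is only searched inside it.
    private static final Pattern META_TAG = Pattern.compile(
        "<meta\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    /** Charset named in a Content-Type header value, if any. */
    public static Optional<String> fromContentType(String headerValue) {
        if (headerValue == null) {
            return Optional.empty();
        }
        Matcher m = CHARSET.matcher(headerValue);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    /** Charset named in the first meta tag that declares one, if any. */
    public static Optional<String> fromMetaTag(String htmlPrefix) {
        Matcher tags = META_TAG.matcher(htmlPrefix);
        while (tags.find()) {
            Matcher m = CHARSET.matcher(tags.group());
            if (m.find()) {
                return Optional.of(m.group(1));
            }
        }
        return Optional.empty();
    }

    /** Header first, then meta tag; empty means the document is unlabelled. */
    public static Optional<String> declared(String headerValue, String htmlPrefix) {
        Optional<String> fromHeader = fromContentType(headerValue);
        return fromHeader.isPresent() ? fromHeader : fromMetaTag(htmlPrefix);
    }
}
```

The header is consulted before the meta tag here; that
ordering is an assumption of this sketch, not something
this report specifies.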
Mozilla's character encoding detector is licensed under
the GPL/MPL and we might be able to port it to Java.
Unfortunately, it's not foolproof. However, combined
with some other heuristics used by Mozilla and
elsewhere, it should be possible to achieve a high
detection rate.
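As an illustration of the kind of cheap heuristic that
could run before a full detector, here is a minimal
sketch that checks for a byte order mark and then tests
whether the bytes decode as strict UTF-8. This is NOT
Mozilla's detector or a port of it; the class name
EncodingGuess and the caller-supplied fallback are
assumptions of the sketch, and it ignores UTF-32.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only; not Mozilla's algorithm.
public class EncodingGuess {
    /** Guess an encoding name for unlabelled bytes, or return the fallback. */
    public static String guess(byte[] data, String fallback) {
        // 1. A byte order mark is the strongest signal available.
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) {
            return "UTF-8";
        }
        if (startsWith(data, 0xFE, 0xFF)) {
            return "UTF-16BE";
        }
        if (startsWith(data, 0xFF, 0xFE)) {
            return "UTF-16LE";
        }
        // 2. Text in legacy multi-byte encodings rarely forms valid UTF-8,
        //    so a clean strict decode is good evidence. Pure ASCII also
        //    passes this check and is reported as UTF-8.
        if (isValidUtf8(data)) {
            return "UTF-8";
        }
        // 3. Otherwise defer to the caller (or to a statistical detector).
        return fallback;
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    private static boolean isValidUtf8(byte[] data) {
        // REPORT makes the decoder throw instead of substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

A check like this only separates UTF-8 from everything
else; telling legacy encodings apart from each other
still needs a real detector.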
The following page has links to some other related pages.
http://trainedmonkey.com/week/2004/26
In addition to character encoding detection, we also
need to detect the language of a document, which is
even harder and should be a separate bug (although
it's related).