[NUTCH-25] needs 'character encoding' detector - ASF JIRA

Voters

Watch issue

Watchers

Link

Clone

Update Comment Author

Replace String in Comment

Update Comment Visibility

Delete Comments

XML

Word

Printable

JSON

Details

Type: New Feature
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.0.0
Component/s: None
Labels:
None

Description

transferred from:
http://sourceforge.net/tracker/index.php?func=detail&aid=995730&group_id=59548&atid=491356
submitted by:
Jungshik Shin

this is a follow-up to bug 993380 (figure out 'charset'
from the meta tag).

Although we can cover a lot of ground using the 'C-T'
header field in in the HTTP header and the
corresponding meta tag in html documents (and in case
of XML, we have to use a similar but a different
'parsing'), in the wild, there are a lot of documents
without any information about the character encoding
used. Browsers like Mozilla and search engines like
Google use character encoding detectors to deal with
these 'unlabelled' documents.

Mozilla's character encoding detector is GPL/MPL'd and
we might be able to port it to Java. Unfortunately,
it's not fool-proof. However, along with some other
heuristic used by Mozilla and elsewhere, it'll be
possible to achieve a high rate of the detection.

The following page has links to some other related pages.

http://trainedmonkey.com/week/2004/26

In addition to the character encoding detection, we
also need to detect the language of a document, which
is even harder and should be a separate bug (although
it's related).

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending
- Manage Attachments

EncodingDetector_additive.java
02/Aug/07 09:27
13 kB
Dogacan Guney
EncodingDetector.java
24/Jul/07 18:31
11 kB
Doug Cook
NUTCH-25_draft.patch
21/May/07 20:47
7 kB
Dogacan Guney
NUTCH-25_v2.patch
01/Aug/07 09:22
26 kB
Dogacan Guney
NUTCH-25_v3.patch
08/Aug/07 12:31
27 kB
Dogacan Guney
NUTCH-25_v4.patch
23/Aug/07 07:25
27 kB
Dogacan Guney
NUTCH-25.patch
21/Jul/07 16:01
9 kB
Dogacan Guney
patch
24/Jul/07 17:02
11 kB
Doug Cook

Activity

Comment

This comment will be Viewable by All Users Viewable by All Users

Cancel

People

Assignee:: Dogacan Guney

Reporter:: Stefan Groschupf

Votes:: 1 Vote for this issue

Watchers:: 3 Start watching this issue

Dates

Created:: 26/Mar/05 23:53

Updated:: 10/Apr/09 12:29

Resolved:: 26/Sep/07 14:05

Agile

View on Board

needs 'character encoding' detector

Details

Description

Attachments

Attachments

Activity

People

Dates

Agile

Slack

Issue deployment