Nutch / NUTCH-25

Needs a 'character encoding' detector

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.0.0
    • Component/s: None
    • Labels: None

      Description

      transferred from:
      http://sourceforge.net/tracker/index.php?func=detail&aid=995730&group_id=59548&atid=491356
      submitted by:
      Jungshik Shin

      This is a follow-up to bug 993380 (figure out 'charset'
      from the meta tag).

      We can cover a lot of ground using the Content-Type
      header field in the HTTP response and the
      corresponding meta tag in HTML documents (XML needs
      similar but separate parsing). In the wild, however,
      there are many documents with no information about the
      character encoding used. Browsers like Mozilla and
      search engines like Google use character encoding
      detectors to deal with these 'unlabelled' documents.
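
To make the 'labelled' case concrete, here is a minimal sketch of reading the charset from the Content-Type header first and falling back to the meta tag. It is not the code from any of the attached patches: the class name `CharsetHints`, its method names, and the 2048-byte scan window are assumptions made for illustration only.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: looks for an explicit charset label.
final class CharsetHints {
    // Matches a charset=... parameter, optionally quoted.
    private static final Pattern CHARSET_PARAM = Pattern.compile(
        "charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)", Pattern.CASE_INSENSITIVE);
    // Matches a whole <meta ...> tag so its attributes can be scanned.
    private static final Pattern META_TAG = Pattern.compile(
        "<meta\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    // Assumed scan window: only the start of the document is inspected.
    private static final int SCAN_LIMIT = 2048;

    // Charset named in a Content-Type header value, if any.
    static Optional<Charset> fromContentType(String headerValue) {
        if (headerValue == null) {
            return Optional.empty();
        }
        Matcher m = CHARSET_PARAM.matcher(headerValue);
        return m.find() ? lookup(m.group(1)) : Optional.empty();
    }

    // Charset named in a <meta> tag near the start of the raw content.
    static Optional<Charset> fromMetaTag(byte[] content) {
        int len = Math.min(content.length, SCAN_LIMIT);
        // ISO-8859-1 maps every byte to a char, so ASCII markup survives decoding.
        String head = new String(content, 0, len, StandardCharsets.ISO_8859_1);
        Matcher tag = META_TAG.matcher(head);
        while (tag.find()) {
            Matcher m = CHARSET_PARAM.matcher(tag.group());
            if (m.find()) {
                Optional<Charset> cs = lookup(m.group(1));
                if (cs.isPresent()) {
                    return cs;
                }
            }
        }
        return Optional.empty();
    }

    // Header first, then meta tag; empty means the document is unlabelled.
    static Optional<Charset> resolve(String contentTypeHeader, byte[] content) {
        Optional<Charset> cs = fromContentType(contentTypeHeader);
        return cs.isPresent() ? cs : fromMetaTag(content);
    }

    // Unknown or malformed charset names are treated as "no label".
    private static Optional<Charset> lookup(String name) {
        try {
            return Optional.of(Charset.forName(name));
        } catch (IllegalArgumentException e) {
            return Optional.empty();
        }
    }
}
```

An empty result from `resolve` is exactly the 'unlabelled' case that a detector would have to handle.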

      Mozilla's character encoding detector is licensed under
      the GPL/MPL, and we might be able to port it to Java.
      Unfortunately, it's not foolproof. However, combined
      with other heuristics used by Mozilla and elsewhere, it
      should be possible to achieve a high detection rate.
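
As a rough illustration of the kind of cheap heuristic that could sit alongside a ported detector, the sketch below checks for a byte order mark and then tests whether the bytes decode as strict UTF-8. This is not Mozilla's statistical detector and not the code from the attached patches; the class name `EncodingGuess` and the caller-supplied fallback charset are assumptions.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Nutch: guesses an encoding for unlabelled bytes.
final class EncodingGuess {

    static Charset guess(byte[] data, Charset fallback) {
        // Step 1: a byte order mark is the strongest clue available.
        // (UTF-32 BOMs are not handled in this sketch.)
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) {
            return StandardCharsets.UTF_8;
        }
        if (startsWith(data, 0xFE, 0xFF)) {
            return StandardCharsets.UTF_16BE;
        }
        if (startsWith(data, 0xFF, 0xFE)) {
            return StandardCharsets.UTF_16LE;
        }
        // Step 2: non-ASCII bytes that form valid UTF-8 are rarely anything else.
        if (hasNonAscii(data) && decodesStrictly(data, StandardCharsets.UTF_8)) {
            return StandardCharsets.UTF_8;
        }
        // Step 3: no evidence either way, so defer to the caller's default.
        return fallback;
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    private static boolean hasNonAscii(byte[] data) {
        for (byte b : data) {
            if ((b & 0x80) != 0) {
                return true;
            }
        }
        return false;
    }

    // True if the bytes decode with no malformed or unmappable sequences.
    private static boolean decodesStrictly(byte[] data, Charset cs) {
        try {
            cs.newDecoder()
              .onMalformedInput(CodingErrorAction.REPORT)
              .onUnmappableCharacter(CodingErrorAction.REPORT)
              .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

A check like this cannot tell legacy single-byte or East Asian encodings apart, which is where a statistical detector is needed. It would also reject UTF-8 content truncated in the middle of a multi-byte sequence.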

      The following page has links to some other related pages.

      http://trainedmonkey.com/week/2004/26

      In addition to the character encoding detection, we
      also need to detect the language of a document, which
      is even harder and should be a separate bug (although
      it's related).

      1. NUTCH-25_v4.patch
        27 kB
        Doğacan Güney
      2. NUTCH-25_v3.patch
        27 kB
        Doğacan Güney
      3. EncodingDetector_additive.java
        13 kB
        Doğacan Güney
      4. NUTCH-25_v2.patch
        26 kB
        Doğacan Güney
      5. EncodingDetector.java
        11 kB
        Doug Cook
      6. patch
        11 kB
        Doug Cook
      7. NUTCH-25.patch
        9 kB
        Doğacan Güney
      8. NUTCH-25_draft.patch
        7 kB
        Doğacan Güney

          People

          • Assignee:
            Doğacan Güney
            Reporter:
            Stefan Groschupf
          • Votes:
            1
            Watchers:
            3
