  1. Nutch
  2. NUTCH-25

needs 'character encoding' detector

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.0.0
    • Component/s: None
    • Labels:
      None

      Description

      transferred from:
      http://sourceforge.net/tracker/index.php?func=detail&aid=995730&group_id=59548&atid=491356
      submitted by:
      Jungshik Shin

      This is a follow-up to bug 993380 (figure out 'charset'
      from the meta tag).

      Although we can cover a lot of ground using the
      Content-Type field in the HTTP header and the
      corresponding meta tag in HTML documents (XML needs
      similar but separate parsing of its own encoding
      declaration), in the wild there are a lot of documents
      without any information about the character encoding
      used. Browsers like Mozilla and search engines like
      Google use character encoding detectors to deal with
      these 'unlabelled' documents.
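
      The lookup order described above (HTTP header first, then the meta tag, then a fallback for unlabelled documents) could be sketched roughly as follows. This is only an illustration, not the code from any attached patch: the class name `CharsetSniffer`, its method names, the 2048-byte scan window, and the caller-supplied fallback are all assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: resolves a charset label from explicit metadata only.
final class CharsetSniffer {
    // Matches "charset=..." in a Content-Type value, e.g. "text/html; charset=EUC-KR".
    private static final Pattern HEADER_CHARSET = Pattern.compile(
        "charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)", Pattern.CASE_INSENSITIVE);

    // Matches both <meta charset="..."> and
    // <meta http-equiv="Content-Type" content="text/html; charset=...">.
    private static final Pattern META_CHARSET = Pattern.compile(
        "<meta[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)", Pattern.CASE_INSENSITIVE);

    // Assumed scan window: only the start of the document is searched for a meta tag.
    private static final int SCAN_BYTES = 2048;

    static String fromContentType(String contentType) {
        if (contentType == null) {
            return null;
        }
        Matcher m = HEADER_CHARSET.matcher(contentType);
        return m.find() ? m.group(1).toLowerCase(Locale.ROOT) : null;
    }

    static String fromMeta(byte[] content) {
        int len = Math.min(content.length, SCAN_BYTES);
        // ISO-8859-1 maps every byte to a char, so the ASCII markup stays searchable
        // regardless of the document's real encoding (UTF-16 documents excepted).
        String head = new String(content, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        return m.find() ? m.group(1).toLowerCase(Locale.ROOT) : null;
    }

    // Header wins over meta tag; the fallback stands in for a real detector.
    static String resolve(String contentType, byte[] content, String fallback) {
        String label = fromContentType(contentType);
        if (label == null) {
            label = fromMeta(content);
        }
        return label != null ? label : fallback;
    }
}
```

      When `resolve` has to return the fallback, the document is one of the 'unlabelled' cases this issue is about, and that is where a real detector would take over.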

      Mozilla's character encoding detector is GPL/MPL'd and
      we might be able to port it to Java. Unfortunately,
      it's not fool-proof. However, combined with some other
      heuristics used by Mozilla and elsewhere, it should be
      possible to achieve a high detection rate.
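
      As an example of the kind of simple heuristic that can sit alongside a statistical detector, the sketch below checks for a byte order mark and then tests whether the bytes decode as strict UTF-8. It is not a port of Mozilla's detector and is not taken from the attached patches; the class name `UnlabelledGuess` and the caller-supplied fallback are assumptions made for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: cheap checks to run on documents with no charset label.
final class UnlabelledGuess {
    // A byte order mark at the very start is strong evidence.
    // (FF FE could also begin a UTF-32LE BOM; that case is ignored in this sketch.)
    static String fromBom(byte[] b) {
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB
                && (b[2] & 0xFF) == 0xBF) {
            return "utf-8";
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return "utf-16be";
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            return "utf-16le";
        }
        return null;
    }

    // Strict decode: any malformed sequence makes the decoder throw.
    static boolean isValidUtf8(byte[] b) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(b));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    static String guess(byte[] b, String fallback) {
        String bom = fromBom(b);
        if (bom != null) {
            return bom;
        }
        // Pure ASCII says nothing about the encoding, so require a non-ASCII byte.
        boolean hasHighByte = false;
        for (byte x : b) {
            if (x < 0) {
                hasHighByte = true;
                break;
            }
        }
        if (hasHighByte && isValidUtf8(b)) {
            return "utf-8";
        }
        return fallback;
    }
}
```

      Text in legacy multi-byte encodings rarely happens to form valid UTF-8, so the strict decode is a useful first filter. Telling legacy encodings apart from one another still needs a statistical detector of the kind discussed here. Note also that content truncated in the middle of a multi-byte sequence fails the strict check.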

      The following page has links to some other related pages.

      http://trainedmonkey.com/week/2004/26

      In addition to character encoding detection, we also
      need to detect the language of a document, which is
      even harder and should be a separate bug (although
      it's related).

        Attachments

        1. patch
          11 kB
          Doug Cook
        2. NUTCH-25.patch
          9 kB
          Doğacan Güney
        3. NUTCH-25_v4.patch
          27 kB
          Doğacan Güney
        4. NUTCH-25_v3.patch
          27 kB
          Doğacan Güney
        5. NUTCH-25_v2.patch
          26 kB
          Doğacan Güney
        6. NUTCH-25_draft.patch
          7 kB
          Doğacan Güney
        7. EncodingDetector.java
          11 kB
          Doug Cook
        8. EncodingDetector_additive.java
          13 kB
          Doğacan Güney

            People

            • Assignee:
              Doğacan Güney (dogacan)
            • Reporter:
              Stefan Groschupf (joa23)
            • Votes:
              1
            • Watchers:
              3