Nutch / NUTCH-25

needs 'character encoding' detector


Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.0.0
    • Component/s: None
    • Labels: None

    Description

      transferred from:
      http://sourceforge.net/tracker/index.php?func=detail&aid=995730&group_id=59548&atid=491356
      submitted by:
      Jungshik Shin

      This is a follow-up to bug 993380 (figure out 'charset'
      from the meta tag).

      Although we can cover a lot of ground using the
      Content-Type ('C-T') field in the HTTP header and the
      corresponding meta tag in HTML documents (XML needs
      similar but separate parsing), there are many documents
      in the wild that carry no information about the character
      encoding used. Browsers like Mozilla and search engines
      like Google use character encoding detectors to deal with
      these 'unlabelled' documents.
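
As a rough illustration of the "labelled" case described above, the sketch below reads a declared charset first from a Content-Type header value and then from a meta tag near the top of the page. The class name `CharsetHints`, its method names, and the 4096-byte sniff limit are all assumptions made for this example and are not taken from Nutch or from any attached patch. XML declarations are not handled here. An empty result corresponds to an 'unlabelled' document that would need a detector.

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Nutch): collects *declared* charset labels only.
final class CharsetHints {
    // Matches charset=... with optional quotes, as found in a Content-Type value.
    private static final Pattern CHARSET_PARAM =
        Pattern.compile("charset\\s*=\\s*[\"']?([^\\s;\"'>]+)", Pattern.CASE_INSENSITIVE);
    // Matches a whole <meta ...> tag so the charset is only read from inside one.
    private static final Pattern META_TAG =
        Pattern.compile("<meta\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    // Assumed limit: only the first bytes of the page are scanned for a meta tag.
    private static final int SNIFF_LIMIT = 4096;

    // Extracts the charset parameter from a Content-Type header value, if any.
    static Optional<String> fromContentType(String headerValue) {
        if (headerValue == null) {
            return Optional.empty();
        }
        Matcher m = CHARSET_PARAM.matcher(headerValue);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    // Scans the start of the raw page for a meta tag that declares a charset.
    static Optional<String> fromMetaTag(byte[] content) {
        int len = Math.min(content.length, SNIFF_LIMIT);
        // ISO-8859-1 maps every byte to one char, so ASCII markup survives intact.
        String head = new String(content, 0, len, StandardCharsets.ISO_8859_1);
        Matcher tags = META_TAG.matcher(head);
        while (tags.find()) {
            Matcher m = CHARSET_PARAM.matcher(tags.group());
            if (m.find()) {
                return Optional.of(m.group(1));
            }
        }
        return Optional.empty();
    }

    // Header first, then meta tag; empty means the document is unlabelled.
    static Optional<String> declaredCharset(String contentTypeHeader, byte[] content) {
        Optional<String> fromHeader = fromContentType(contentTypeHeader);
        return fromHeader.isPresent() ? fromHeader : fromMetaTag(content);
    }
}
```

Giving the header priority over the meta tag is a design choice of this sketch; a real crawler may want to weigh the two differently, since either label can be wrong.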

      Mozilla's character encoding detector is licensed under
      the GPL/MPL, and we might be able to port it to Java.
      Unfortunately, it's not foolproof. However, combined with
      some other heuristics used by Mozilla and elsewhere, it
      should be possible to achieve a high detection rate.
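
To give a feel for what a heuristic on unlabelled bytes can look like, here is a deliberately minimal sketch. It is not Mozilla's detector and not the code in the attached patches: it only checks for a byte order mark and then tests whether the bytes decode as strict UTF-8, returning a caller-supplied fallback otherwise. The class name `EncodingGuess` and the `fallback` parameter are assumptions made for this example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical, minimal heuristic: BOM check, then strict UTF-8 validation.
final class EncodingGuess {
    // Returns a charset name guessed from the bytes, or the fallback if unsure.
    static String guess(byte[] data, String fallback) {
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) {
            return "UTF-8";
        }
        if (startsWith(data, 0xFE, 0xFF)) {
            return "UTF-16BE";
        }
        // Simplification: FF FE is treated as UTF-16LE; UTF-32 BOMs are ignored.
        if (startsWith(data, 0xFF, 0xFE)) {
            return "UTF-16LE";
        }
        // Pure ASCII also passes this test, which is harmless for decoding.
        if (isValidUtf8(data)) {
            return "UTF-8";
        }
        return fallback;
    }

    // True if the whole array decodes as UTF-8 with no malformed sequences.
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Compares the leading bytes of data against the given unsigned values.
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A check like this cannot tell legacy single-byte or multi-byte encodings apart, which is where a statistical detector would be needed. It also rejects input that was cut off in the middle of a multi-byte sequence, so it should be given complete content or a buffer trimmed at a safe boundary.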

      The following page has links to some other related pages.

      http://trainedmonkey.com/week/2004/26

      In addition to character encoding detection, we
      also need to detect the language of a document, which
      is even harder and should be a separate bug (although
      it's related).

      Attachments

        1. EncodingDetector_additive.java
          13 kB
          Dogacan Guney
        2. EncodingDetector.java
          11 kB
          Doug Cook
        3. NUTCH-25_draft.patch
          7 kB
          Dogacan Guney
        4. NUTCH-25_v2.patch
          26 kB
          Dogacan Guney
        5. NUTCH-25_v3.patch
          27 kB
          Dogacan Guney
        6. NUTCH-25_v4.patch
          27 kB
          Dogacan Guney
        7. NUTCH-25.patch
          9 kB
          Dogacan Guney
        8. patch
          11 kB
          Doug Cook


          People

            Assignee: Dogacan Guney (dogacan)
            Reporter: Stefan Groschupf (joa23)
            Votes: 1
            Watchers: 3

