  1. Nutch
  2. NUTCH-25

needs 'character encoding' detector



    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.0.0
    • Component/s: None
    • Labels:


      transferred from:
      submitted by:
      Jungshik Shin

      This is a follow-up to bug 993380 (figure out 'charset'
      from the meta tag).

      Although we can cover a lot of ground using the
      Content-Type ('C-T') field in the HTTP header and the
      corresponding meta tag in HTML documents (XML needs
      similar but separate parsing), there are a lot of
      documents in the wild without any information about
      the character encoding used. Browsers like Mozilla and
      search engines like Google use character encoding
      detectors to deal with these 'unlabelled' documents.
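
To make the "labelled" case concrete, here is a minimal sketch of reading the charset from the Content-Type header first and from an HTML meta tag second. The class name `DeclaredCharset` and its methods are hypothetical and not part of Nutch; the regular expressions are a simplification and do not implement a full HTML parser. An empty result marks a document as 'unlabelled'.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not an existing Nutch class.
final class DeclaredCharset {
    // Matches a charset parameter, e.g. "text/html; charset=EUC-KR" or charset="utf-8".
    private static final Pattern CHARSET_PARAM = Pattern.compile(
        "charset\\s*=\\s*[\"']?([^\\s\"';,/>]+)", Pattern.CASE_INSENSITIVE);
    // Matches a whole <meta ...> tag so we only look for charset= inside meta tags.
    private static final Pattern META_TAG = Pattern.compile(
        "<meta\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // Charset named in a Content-Type value, if any.
    static Optional<String> fromContentType(String contentType) {
        if (contentType == null) {
            return Optional.empty();
        }
        Matcher m = CHARSET_PARAM.matcher(contentType);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    // Charset named in the first meta tag that carries one, if any.
    static Optional<String> fromMetaTag(String htmlPrefix) {
        Matcher tag = META_TAG.matcher(htmlPrefix);
        while (tag.find()) {
            Optional<String> cs = fromContentType(tag.group());
            if (cs.isPresent()) {
                return cs;
            }
        }
        return Optional.empty();
    }

    // HTTP header wins; the meta tag is the fallback. Empty means 'unlabelled'.
    static Optional<String> resolve(String contentType, String htmlPrefix) {
        Optional<String> cs = fromContentType(contentType);
        return cs.isPresent() ? cs : fromMetaTag(htmlPrefix);
    }
}
```

The header-before-meta order is an assumption of this sketch; a real crawler may prefer a different precedence when the two disagree.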

      Mozilla's character encoding detector is licensed under
      the GPL/MPL and we might be able to port it to Java.
      Unfortunately, it's not foolproof. However, combined
      with some other heuristics used by Mozilla and
      elsewhere, it should be possible to achieve a high
      detection rate.
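
As a rough illustration of what a fallback heuristic for unlabelled documents can look like, the sketch below checks for a byte order mark and then tests whether the bytes decode as strict UTF-8. This is not Mozilla's detector, which is statistical and far more capable; it is a much simpler stand-in. The class name `SimpleEncodingGuess` and the `fallback` parameter are assumptions made for this example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not Mozilla's detector and not a Nutch class.
final class SimpleEncodingGuess {

    // Returns a charset name guessed from the raw bytes, or fallback if nothing matches.
    static String guess(byte[] content, String fallback) {
        // 1. A byte order mark is the strongest hint (UTF-32 BOMs are ignored here).
        if (startsWith(content, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(content, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(content, 0xFF, 0xFE)) return "UTF-16LE";
        // 2. Non-ASCII bytes that form valid UTF-8 are very unlikely to be a legacy encoding.
        if (hasNonAscii(content) && isValidUtf8(content)) return "UTF-8";
        // 3. Pure ASCII or invalid UTF-8: leave the decision to the caller's default.
        return fallback;
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    private static boolean hasNonAscii(byte[] data) {
        for (byte b : data) {
            if ((b & 0x80) != 0) return true;
        }
        return false;
    }

    // Strict decode: any malformed sequence raises an exception instead of being replaced.
    private static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

Note that this sketch expects the full document bytes; a prefix cut in the middle of a multi-byte sequence would be rejected as invalid UTF-8. Telling legacy encodings such as EUC-KR and Shift_JIS apart still needs a statistical detector of the kind described above.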

      The following page has links to some other related pages.


      In addition to the character encoding detection, we
      also need to detect the language of a document, which
      is even harder and should be a separate bug (although
      it's related).


        1. patch
          11 kB
          Doug Cook
        2. NUTCH-25.patch
          9 kB
          Dogacan Guney
        3. NUTCH-25_v4.patch
          27 kB
          Dogacan Guney
        4. NUTCH-25_v3.patch
          27 kB
          Dogacan Guney
        5. NUTCH-25_v2.patch
          26 kB
          Dogacan Guney
        6. NUTCH-25_draft.patch
          7 kB
          Dogacan Guney
        7. EncodingDetector.java
          11 kB
          Doug Cook
        8. EncodingDetector_additive.java
          13 kB
          Dogacan Guney



            • Assignee:
              dogacan Dogacan Guney
              joa23 Stefan Groschupf
            • Votes:
              1
            • Watchers:
              3


              • Created: