NUTCH-25: needs 'character encoding' detector

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.0.0
    • Component/s: None
    • Labels: None

      Description

      transferred from:
      http://sourceforge.net/tracker/index.php?func=detail&aid=995730&group_id=59548&atid=491356
      submitted by:
      Jungshik Shin

      this is a follow-up to bug 993380 (figure out 'charset'
      from the meta tag).

      Although we can cover a lot of ground using the
      Content-Type field in the HTTP header and the
      corresponding meta tag in HTML documents (and in the
      case of XML, a similar but slightly different kind of
      'parsing'), in the wild there are a lot of documents
      without any information about the character encoding
      used. Browsers like Mozilla and search engines like
      Google use character encoding detectors to deal with
      these 'unlabelled' documents.

      Mozilla's character encoding detector is GPL/MPL'd and
      we might be able to port it to Java. Unfortunately,
      it's not fool-proof. However, combined with other
      heuristics used by Mozilla and elsewhere, it should be
      possible to achieve a high detection rate.

      The following page has links to some other related pages.

      http://trainedmonkey.com/week/2004/26

      In addition to the character encoding detection, we
      also need to detect the language of a document, which
      is even harder and should be a separate bug (although
      it's related).
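
      As an illustration only (not part of the original report), a library
      detector such as ICU4J's CharsetDetector, which the patches attached to
      this issue ended up using, can be driven roughly like this; the fallback
      logic around the detector is a hypothetical sketch:

      import com.ibm.icu.text.CharsetDetector;
      import com.ibm.icu.text.CharsetMatch;

      public class CharsetGuess {
        /** Guess the charset of raw bytes, preferring an explicit header value. */
        public static String guess(byte[] content, String headerCharset) {
          if (headerCharset != null) {
            return headerCharset;         // trust the Content-Type header when present
          }
          CharsetDetector detector = new CharsetDetector();
          detector.setText(content);      // feed the raw bytes to the detector
          CharsetMatch match = detector.detect();
          // fall back to Latin-1 when the detector has no answer
          return (match != null) ? match.getName() : "ISO-8859-1";
        }
      }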

      Attachments

      1. patch
        11 kB
        Doug Cook
      2. NUTCH-25.patch
        9 kB
        Doğacan Güney
      3. NUTCH-25_v4.patch
        27 kB
        Doğacan Güney
      4. NUTCH-25_v3.patch
        27 kB
        Doğacan Güney
      5. NUTCH-25_v2.patch
        26 kB
        Doğacan Güney
      6. NUTCH-25_draft.patch
        7 kB
        Doğacan Güney
      7. EncodingDetector.java
        11 kB
        Doug Cook
      8. EncodingDetector_additive.java
        13 kB
        Doğacan Güney

        Activity

        Sami Siren made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Sami Siren added a comment -

        closing issues for released version

        Hudson added a comment -

        Integrated in Nutch-Nightly #222 (See http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/222/ )
        Hudson added a comment -

        Integrated in Nutch-Nightly #219 (See http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/219/ )
        Doğacan Güney made changes -
        Resolution Fixed [ 1 ]
        Status Open [ 1 ] Resolved [ 5 ]
        Doğacan Güney added a comment -

        I am committing the latest patch with some changes:

        • Added a unit test case
        • Removed the thread-local stuff. Instead, an EncodingDetector is instantiated for every Parser.getParse call.
        • Removed per-charset confidence values. We don't use them right now. Doug, I assume you may not like this one. I removed them to simplify the patch a bit. If you feel that they are useful, we can add them (and other features) later on.

        As I mentioned before, this may not be the perfect encoding detection system but it is definitely better than what we have now.

        Note that encoding auto-detection is disabled by default; see the encodingdetector.charset.min.confidence property.

        Committed in rev. 579656.
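
        For reference, a hedged sketch of how that switch might be read from
        the Hadoop Configuration inside a plugin; the property name is the one
        quoted above, while the -1 "disabled" default is an assumption:

        import org.apache.hadoop.conf.Configuration;

        public class DetectorSwitch {
          /** A negative value is taken to mean "auto-detection off" (assumed default). */
          public static int minConfidence(Configuration conf) {
            return conf.getInt("encodingdetector.charset.min.confidence", -1);
          }

          public static boolean autoDetectEnabled(Configuration conf) {
            return minConfidence(conf) >= 0;
          }
        }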

        Doğacan Güney made changes -
        Attachment NUTCH-25_v4.patch [ 12364391 ]
        Doğacan Güney added a comment -

        New version, I am going to commit this one after a couple of days if there are no objections.

        • Don't merge confidences in EncodingDetector as it seems merging them
          gives worse results.

        OK, I thought this would result in better matches, but it seems I was mistaken.

        • Read from a file in main() instead of stdin.

        There is still some stuff not completely discussed (such as when to add/not add different confidence values), but I don't see those blocking this from going in. Even though this latest patch may not be optimal, it is still a big improvement over what we have now. So unless there is a big bad bug somewhere or there is an easy obvious improvement that can be done, I am going to commit this patch (note that current javadocs are wrong, I am going to update them before commit). We can then discuss how to improve encoding detection on different issues.

        Doğacan Güney made changes -
        Attachment NUTCH-25_v3.patch [ 12363405 ]
        Doğacan Güney added a comment -

        Here is a new version.

        • Code style cleanups (use '} else {' instead of else on next line).
        • Add confidences from different clues pointing to same encoding.
        • Check if encoding passes the threshold.
        • Add a utility getThreshold(String charset) method that returns the
          global threshold if charset-specific threshold is unavailable.
        • Make clues a ThreadLocal variable for thread-safety.
        Doğacan Güney made changes -
        Attachment EncodingDetector_additive.java [ 12363030 ]
        Doğacan Güney added a comment -

        Here is an EncodingDetector that merges confidences (adds confidences from same charsets) in guessEncoding.
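
        (Illustration, not the attached file: one way to sum the confidences
        of clues that name the same charset, with a toy Clue type standing in
        for the detector's private clue class.)

        import java.util.HashMap;
        import java.util.List;
        import java.util.Map;

        class Clue {
          final String charset;
          final int confidence;
          Clue(String charset, int confidence) {
            this.charset = charset;
            this.confidence = confidence;
          }
        }

        class AdditiveGuesser {
          /** Sum confidences of clues pointing to the same charset and return the winner. */
          static String guess(List<Clue> clues) {
            Map<String, Integer> summed = new HashMap<String, Integer>();
            String best = null;
            int bestScore = Integer.MIN_VALUE;
            for (Clue clue : clues) {
              Integer previous = summed.get(clue.charset);
              int total = (previous == null ? 0 : previous.intValue()) + clue.confidence;
              summed.put(clue.charset, total);
              if (total > bestScore) {
                bestScore = total;
                best = clue.charset;
              }
            }
            return best;
          }
        }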

        Doğacan Güney added a comment - - edited

        > At a very quick look, one potential drawback of the private EncodingClue + addClue/clearClues interface is that because
        > EncodingDetector now keeps internal state, it is no longer safe to call the same EncodingDetector from different threads
        > (though I'm not sure if ICU4J's CharsetDetector is thread-safe anyway, so this may already have been a potential problem). Not
        > sure if this is an issue with the parsers or not, but will take a look.

        Good point. It may be an issue if parsing during fetching is enabled (I think multiple threads parse content if fetcher is run in parsing mode). It should be enough to change 'clues' (and CharsetDetector if need be) to be a ThreadLocal, right?
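
        (A minimal sketch of that ThreadLocal idea; EncodingClue is a
        placeholder for the detector's private clue class, and the actual
        field layout in the patch may differ.)

        // inside EncodingDetector: each parsing thread gets its own clue list
        private final ThreadLocal<List<EncodingClue>> clues =
            new ThreadLocal<List<EncodingClue>>() {
              @Override
              protected List<EncodingClue> initialValue() {
                return new ArrayList<EncodingClue>();   // java.util.ArrayList/List
              }
            };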

        Doug Cook added a comment -

        Cool – will take a look at the new patch (and will try to make stripGarbage more robust as I get some bandwidth to work on it; it definitely helped in my tests).

        At a very quick look, one potential drawback of the private EncodingClue + addClue/clearClues interface is that because EncodingDetector now keeps internal state, it is no longer safe to call the same EncodingDetector from different threads (though I'm not sure if ICU4J's CharsetDetector is thread-safe anyway, so this may already have been a potential problem). Not sure if this is an issue with the parsers or not, but will take a look.

        Doğacan Güney made changes -
        Attachment NUTCH-25_v2.patch [ 12362962 ]
        Doğacan Güney added a comment - - edited

        I cleaned up your latest patch and updated it for the latest trunk (also added some changes):

        • Uses Java 5 generics.
        • Respects 80 char boundary (for EncodingDetector).
        • Moves parseCharacterEncoding and resolveEncodingAlias from StringUtil to EncodingDetector. I think they make more sense in EncodingDetector.
        • EncodingClue class is no longer public.
        • Adds EncodingDetector.addClue methods instead. EncodingDetector.addClue eliminates null values, calls resolveEncodingAlias, and stores the 'resolved' alias.
        • Clients now must call EncodingDetector.clearClues before asking EncodingDetector to detect the encoding of new content; otherwise older clues may affect EncodingDetector's judgement (see the sketch after this list).
        • I also moved 'header' detection to EncodingDetector.autoDetectClues. Extracting the charset from the header is needed in a couple of plugins, so this eliminates some code duplication.
        • I removed stripGarbage method for now. As I said before, I am not sure how it will behave when given UTF-16 (or other non-byte oriented encodings) documents. So I changed EncodingDetector to use icu4j's own filtering function. However, Doug, if your tests are showing that stripGarbage performs better, feel free to add it back.
        • Update parse-html, feed and parse-text plugins to use EncodingDetector.
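
        A hedged sketch of the calling protocol described above (clearClues,
        addClue, then a guess), using the method names from this thread; exact
        signatures in the patch may differ, and conf, content,
        sniffedMetaCharset and defaultEncoding stand for the plugin's own
        Configuration, fetched content, meta-tag sniff result and configured
        default:

        EncodingDetector detector = new EncodingDetector(conf);
        detector.clearClues();                    // drop clues from any previous document
        detector.autoDetectClues(content, true);  // ICU4J detection plus the HTTP header clue
        detector.addClue(sniffedMetaCharset, "sniffed");
        String encoding = detector.guessEncoding(content, defaultEncoding);
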
        Doug Cook added a comment -

        > Can you provide a link on icu4j's language detection?

        http://www.icu-project.org/apiref/icu4j/

        It's still part of CharsetDetector. The CharsetMatch object(s) returned by detect() or detectAll() provide a getLanguage() method. I was wondering why my return set had a number of different CharsetMatch objects returned, all with the same encoding guess, all with different confidences; then I realized it's because these are guesses for different languages. For example, for a page in German, you might see:

        2007-07-25 15:16:16,536 DEBUG parse.EncodingDetector (EncodingDetector.java:autoDetectClues(204)) - enc=windows-1252,de (81% confidence)
        2007-07-25 15:16:16,542 DEBUG parse.EncodingDetector (EncodingDetector.java:autoDetectClues(204)) - enc=windows-1252,nl (50% confidence)
        2007-07-25 15:16:16,544 DEBUG parse.EncodingDetector (EncodingDetector.java:autoDetectClues(204)) - enc=windows-1252,da (41% confidence)
        2007-07-25 15:16:16,544 DEBUG parse.EncodingDetector (EncodingDetector.java:autoDetectClues(204)) - enc=windows-1252,fr (38% confidence)
        (etc)

        I'm not sure how good the guesses are, but for the few examples I looked at, it was spot on.

        Still thinking about all the other stuff-

        d
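
        (Illustration only of iterating detectAll() as described above; the
        ICU4J CharsetDetector/CharsetMatch calls are real, while the class and
        the printing around them are a sketch.)

        import com.ibm.icu.text.CharsetDetector;
        import com.ibm.icu.text.CharsetMatch;

        public class LanguageGuesses {
          public static void dump(byte[] content) {
            CharsetDetector detector = new CharsetDetector();
            detector.enableInputFilter(true);   // strip markup before detection
            detector.setText(content);
            // one CharsetMatch per (charset, language) hypothesis
            for (CharsetMatch match : detector.detectAll()) {
              System.out.println("enc=" + match.getName()
                  + "," + match.getLanguage()
                  + " (" + match.getConfidence() + "% confidence)");
            }
          }
        }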

        Doğacan Güney added a comment -

        [snip snip]

        > Internal to guessEncoding, we could certainly add the clue values if it turns out that helps us make a better guess.

        > Combining clues prior to guessEncoding is throwing away information – clues might be additive, but they might not (two highly correlated pieces of data won't
        > be additive, and inversely correlated features will even be "subtractive"). [...]

        This is what I was talking about. We can allow users to specify the 'additiveness' of clues, but that may make the API unnecessarily complex. I think for now just adding confidence values in guessEncoding should be good enough.

        > [...] Ideally someone could make a large-ish test set, judge the "real" encoding for all the examples, do the statistics, and find out how all the (detected encoding, header value, metatags) interact. A guessEncoding based on statistical modeling would be pretty sweet. When I was working for a certain
        > large search company, this is how we would typically tackle a problem like this. [snip snip]

        This is one of the things that would benefit Nutch enormously. Unfortunately, I don't think we have nearly enough resources for it.

        > It's worth adding that CharsetDetector also detects languages, and a few examples I looked at seemed pretty good. It seems a shame to throw away that
        > information, especially when I know Nutch's built-in language detection makes a fair number of mistakes (though in part because it trusts the page
        > metatags, which are often wrong). Another bit of food for thought.

        Sami Siren suggested this a while ago, but I didn't see where icu4j does the language detection (sorry Sami!). Can you provide a link on icu4j's language detection?

        I agree with you that most of the mistakes language detection makes come from its 'trusting' nature. I would actually go a bit further and say this: any code (at least for Nutch) that trusts input without validating it is inherently wrong, because we are dealing with the Web here and that's just the way things are on the WWW. This includes, off the top of my head, encoding detection, language detection, and content-type (mime-type) detection.

        Btw, I forgot to say this in my previous comment, so here it is:

        • The stripGarbage method won't work for non-byte-oriented encodings (such as UTF-16). UTF-16 uses at least two bytes for a single character, and it is possible that the first or second byte of a character is '<' even though the represented character is something else.

        Mozilla has some code used for detecting byte orders (there is a link somewhere in parse-html). I actually ported that code to java but never got to test it. If I can find the patch, it may be useful to add it to EncodingDetector.

        Also, I am not an expert on charsets, but I think for all byte-oriented encodings the first 127 (or so) characters are the same, so you can 'cast' the given byte array to ASCII safely (I am not suggesting that you should, just saying that it is doable).
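
        (For illustration, a trivial byte-order-mark check along those lines;
        it only covers documents that actually begin with a BOM, unlike
        Mozilla's fuller byte-order heuristics.)

        class BomSniffer {
          /** Returns a charset name if the content starts with a known BOM, else null. */
          static String charsetFromBom(byte[] b) {
            if (b.length >= 3 && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF) {
              return "UTF-8";
            }
            if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
              return "UTF-16BE";
            }
            if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
              return "UTF-16LE";
            }
            return null;
          }
        }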

        Doug Cook added a comment -

        Doğacan,

        Thanks for the quick feedback.

        > * EncodingDetector api is way too open. IMO, EncodingClue should be a private static
        > class (users can pass a clue like detector.addClue(value, source, confidence)), EncodingDetector
        > should not expose clues ever (for example, autoDetectClues should return void [or perhaps a
        > boolean indicating the success of autodetect]) and store clues internally.

        Good point. I had thought that callers might want to manipulate the list, but this is probably unlikely, and my current approach certainly allows more for caller screwup through playing with the passed List. It's an easy fix to make, and it cleans up the calling code a little bit, too. I'll fix that.

        If in the future, the callers need to manipulate the list, we can just add an interface for that.

        > * code:
        >
        > public boolean meetsThreshold() {
        >   Integer mt = (Integer) thresholds.get(value);
        >   // use global value if no encoding-specific value found
        >   int myThreshold = (mt != null) ? mt.intValue() : minConfidence;
        >
        >   return (confidence < 0 || (minConfidence >= 0 && confidence >= myThreshold));
        > }
        >
        > Why does meetsThreshold return true if confidence < 0?

        Negative confidence values have special semantics. It means "use me if you get to me in the list, and ignore the threshold." These semantics are necessary to emulate the prior behavior (where, for example, header values would always be used, if present, in preference to 'sniffed' meta-tags). Not that the prior behavior was perfect, but I think it's a useful construct: a value which, if present, should be used regardless of confidence thresholds.

        > * If users specify an encoding clue with no confidence then we should give it a default
        > positive confidence instead of -1. Of course, confidence value needs to be very very small, maybe just +1.

        Hopefully this design choice makes more sense in light of the previous comment. The -1 has special semantics, meaning "I don't have a threshold."

        > * It would be nice to "stack" clues. Assume that autodetection returned 2 possible encodings:
        > ISO-8859-1 with 50 confidence and UTF-8 with 45 confidence. If I add a new clue (say, coming from
        > http header) for UTF-8 with +6 confidence, overall confidence for UTF-8 should now be 51.

        I'm not sure if you mean actually combine the clues in the list, or just add the values in guessEncoding.

        Architecturally I think it's better to keep all the clues intact until the final "guess" is made. I've tried to make guessEncoding the place where all the policy decisions are made, the method to be overridden if someone has a better guessing algorithm. Internal to guessEncoding, we could certainly add the clue values if it turns out that helps us make a better guess.

        Combining clues prior to guessEncoding is throwing away information – clues might be additive, but they might not (two highly correlated pieces of data won't be additive, and inversely correlated features will even be "subtractive"). Ideally someone could make a large-ish test set, judge the "real" encoding for all the examples, do the statistics, and find out how all the (detected encoding, header value, metatags) interact. A guessEncoding based on statistical modeling would be pretty sweet. When I was working for a certain large search company, this is how we would typically tackle a problem like this. I'm certain that's how CharsetDetector was created in the first place.

        In the mean time, the simple algorithm provided seems to do reasonably well (it does very nearly what your version did, which seems like a fine place to start).

        It's worth adding that CharsetDetector also detects languages, and a few examples I looked at seemed pretty good. It seems a shame to throw away that information, especially when I know Nutch's built-in language detection makes a fair number of mistakes (though in part because it trusts the page metatags, which are often wrong). Another bit of food for thought.

        > * This is mostly my personal nit, but Java 5 style generics would be nice.

        Ah, you caught me. I'm still working in a 1.4-ish environment.

        > About contributing stuff back: [...]

        Many thanks. This is pretty much what I'd assumed; unfortunately it will be a while before I have time and can afford the risk of bringing 0.9 changes into my local installation. But of course, the longer I wait, the more difficult the merge will be. Oh well, I'll get there! There are a couple of important bugfixes for which I'll try to make the time to port earlier.

        D

        Doğacan Güney added a comment -

        Overall I think the idea behind EncodingDetector is very solid. I will take a better look at your patch, but here are a couple of comments after a quick review:

        • EncodingDetector api is way too open. IMO, EncodingClue should be a private static class (users can pass a clue like detector.addClue(value, source, confidence)), EncodingDetector should not expose clues ever (for example, autoDetectClues should return void [or perhaps a boolean indicating the success of autodetect]) and store clues internally.
        • code:

        public boolean meetsThreshold() {
          Integer mt = (Integer) thresholds.get(value);
          // use global value if no encoding-specific value found
          int myThreshold = (mt != null) ? mt.intValue() : minConfidence;

          return (confidence < 0 || (minConfidence >= 0 && confidence >= myThreshold));
        }

        Why does meetsThreshold return true if confidence < 0?

        • If users specify an encoding clue with no confidence then we should give it a default positive confidence instead of -1. Of course, confidence value needs to be very very small, maybe just +1.
        • It would be nice to "stack" clues. Assume that autodetection returned 2 possible encodings: ISO-8859-1 with 50 confidence and UTF-8 with 45 confidence. If I add a new clue (say, coming from http header) for UTF-8 with +6 confidence, overall confidence for UTF-8 should now be 51.
        • This is mostly my personal nit, but Java 5 style generics would be nice.

        About contributing stuff back: The article at http://wiki.apache.org/nutch/HowToContribute is a good starting point, but it assumes that you will be working on trunk. I am not sure how you can 'forward-port' your changes from an older version besides doing it manually. One approach may be to first backport a part of the trunk to your local installation, change the code, then do a "diff -pu" (against the backported version). Since trunk contains newer features and bug fixes, you will also be getting them for free this way.

        Doug Cook made changes -
        Attachment EncodingDetector.java [ 12362459 ]
        Doug Cook added a comment -

        I cleaned up EncodingDetector a little; here's a functionally identical, but cleaner, version.

        Doug Cook made changes -
        Attachment EncodingDetector.java [ 12362456 ]
        Doug Cook made changes -
        Attachment patch [ 12362455 ]
        Attachment EncodingDetector.java [ 12362456 ]
        Doug Cook added a comment -

        OK, I've got more data, and a proposed solution.

        I created a test set with a number of problem cases and their correct answers. In digging through the "mistakes" the encoding detector made, I found a few different root causes. Most of these fell into the following 3 categories.

        1) Mixed encodings in the document itself (a "mistake" on the part of the author, though there may still be a "right" encoding guess that gets most of the document).
        Ex: http://www.franz-keller.de/8860.html (mostly in UTF-8 with one ISO-8859-1 "copyright" character in the footer).
        Ex: http://www.vinography.com/archives/2006/05/the_rejudgement_of_paris_resul.html (mostly UTF-8 with a couple iso-8859-1 arrows in the header)

        2) CSS and/or javascript (and maybe HTML tags) throwing off the detector.
        Ex: http://www.systembolaget.se/Uppslagsbok/Kartbok/Italien/NorraItalien/NorraItalien.htm
        Ex: http://www.buscamaniban.com/fr/patrimoine/coeur-armagnac.php

        3) The detector having problems with short documents or ones that contain few multibyte characters (being statistical, the less data it has, the more mistakes it will make).
        Ex: http://forum.winereport.com/ita/index.php?showtopic=1924&st=90 (detector thinks this is big5 @ 100% confidence)

        Solutions:

        I've attached a class, EncodingDetector, that seems to solve most of these problems. It also moves the detection code out of the Content class.

        Problem 2) The detector has a simple filter for HTML tags, but the CharsetDetector documentation strongly recommends writing one's own. So I did this; see the stripGarbage function in the EncodingDetector class. It's quick & dirty, and clears out much of the garbage that causes detection problems. I'm sure it's not perfect, but it seems to do the job.

        Problem 3) Detection is inherently imprecise; there will always be errors. But I've tried to make it easier to work around them or to build a better heuristic "guesser" based upon all the clues we have (not just the text, but the headers & metatags). One key is to use detectAll and look at all the possible encodings rather than just the first one returned. For example, with the big5 problem noted above, the detector got big5@100% and also utf-8@100%. (According to the authors, when multiple detectors tie, they are returned in alphabetical order!) EncodingDetector allows different confidence thresholds for different encodings (no reason to assume that they all work equally well). So one simple workaround is to set the threshold for big5 to 101 (meaning use only when there are no other alternatives), and now EncodingDetector returns utf-8@100% for this doc; I don't have much big5 in my collection.
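
        (A tiny sketch of that per-encoding threshold lookup; the class name
        and map contents are illustrative, not the attached code verbatim.)

        import java.util.HashMap;
        import java.util.Map;

        class Thresholds {
          private final Map<String, Integer> perCharset = new HashMap<String, Integer>();
          private final int globalMin;

          Thresholds(int globalMin) {
            this.globalMin = globalMin;
            perCharset.put("big5", 101);   // 101 = never accept on detector confidence alone
          }

          int thresholdFor(String charset) {
            Integer t = perCharset.get(charset);
            return (t != null) ? t.intValue() : globalMin;
          }
        }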

        Long-term there are more sophisticated solutions, but I think the high-level architecture is right, at any rate: get all the data from CharsetDetector, get all the other "clues" (HTTP header, HTML metatags), and combine them flexibly to make an overall guess for the doc. This way we're not throwing out any data early; we have everything available to the final guessing algorithm (simple though the provided one be).

        Problem 1) I don't think there's an easy solution to this. But fixing problem (2) seemed to improve the performance on problem (1), presumably because the detector is getting cleaner input.

        The small test shows significant improvement with the changes. I'm running a full test now.

        Not sure what the best way to provide this is. I'm attaching a patch for TextParser and HtmlParser to use EncodingDetector, though you will likely have to apply these by hand, since my local tree is (roughly) 0.8.1 plus a ton of local changes. I'll also attach EncodingDetector as a separate file. If this doesn't work, or there is an easier way, please let me know; I'm relatively new to contributing stuff back, so I may need some coaching. (Also, if there is an easy-ish way, that would be good, since I have lots of other local mods that are probably generally useful, and I can start contributing those back as I have time).

        Doug Cook added a comment -

        As far as the problem cases, I'm running a test now on my test DB (the ~60K doc one), and I'm going to take a random sample of the discrepancies between detected/reported/sniffed, look at the correct value for each, and see if there is a heuristic we can use to combine all 3 and do a little better than just using the detection on its own. Perhaps this is what Mozilla does.

        I'll also play with setDeclaredEncoding and see if that helps at all on the larger data set. (I didn't know there was one; thanks for pointing that out! That's what I get for not looking at the icu4j docs.)

        I've integrated detection into the TextParser as well, and rewritten the choosing logic in HtmlParser (both using unsurprisingly similar code, which suggests a utility class, as you suggest as well). Testing those now.

        It's not a bad idea to move detection out of the Content class; this could be part of the proposed utility class for character detection. Thus, this class could encapsulate (a) running charset detection, and (b) choosing the most likely "correct" charset for a document given a number of inputs (detected, reported, etc. depending on content type). Then the code duplication across different parsers would be minimal; in fact, their current code might get shorter, if we have the right abstraction.

        d

        Doğacan Güney added a comment - - edited

        Doug, thanks for the (very) detailed feedback! This is incredibly helpful.

        > I did find a small number of cases where high-ish (>50%) confidence detection was wrong:
        > http://viniform.typepad.fr/dn/2006/10/mise_jour_du_cl.html
        > http://www.buscamaniban.com/fr/patrimoine/coeur-armagnac.php
        > http://www.lafite.com/en/html/Corporate/1.html
        > http://www.franz-keller.de/8860.html
        > http://www.vinesnwines.org/?m=200605

        Unfortunately, it seems there is not much we can do about these. I tried adding a detector.setDeclaredEncoding("UTF-8") before detection and it didn't help (UTF-8 confidence is surprisingly low, around 25). I also tried jchardet ( http://jchardet.sourceforge.net/ ) with these pages and it doesn't detect them as UTF-8 either, which is strange considering that Mozilla does detect them correctly.

        > Architecturally I think we should store the detected encoding AND the confidence in all cases (even when low),
        > instead of storing it only when the confidence meets some threshold. That way the decision of which value to use
        > can be made later, in the parser, which can make a "smart"
        > decision based upon all the data that's available (detected, sniffed, reported, plus confidence value on
        > detection). Then, for example, if there is no sniffed or reported value, we could use the detected value, even
        > if the confidence is low (especially useful in the TextParser). We could also make decisions like "the confidence
        > is medium, but the same value is both sniffed and reported, so let's trust that instead," which might fix some of
        > the detection problem cases.

        Good idea but implementation-wise I would suggest that we rip out the detection code from Content.java and move it into parse-html (and whatever else wants to detect encoding). There will be some code duplication but this way parse-html can get all the possible matches (via detector.detectAll) and then use sniffed and reported to make a decision. What do you think?

        Edit: I realized I was unnecessarily repeating a part of what you were saying.

        Doug Cook added a comment -

        Not sure where this belongs architecturally and aesthetically – will think about that.

        The relevance test results look good – overall at least as good as prior.

        The histogram of confidence values from ICU4J on a ~60K doc test DB looks something like:
        confidence   docs
        0-9             6
        10-19         440
        20-29        2466
        30-39        7724
        40-49       11372
        50-59       10791
        60-69        9583
        70-79        4519
        80-89        4479
        90-99         386

        I did find a small number of cases where high-ish (>50%) confidence detection was wrong:
        http://viniform.typepad.fr/dn/2006/10/mise_jour_du_cl.html
        http://www.buscamaniban.com/fr/patrimoine/coeur-armagnac.php
        http://www.lafite.com/en/html/Corporate/1.html
        http://www.franz-keller.de/8860.html
        http://www.vinesnwines.org/?m=200605

        In all these cases, ICU4J guessed Latin-1, while the page was (correctly) reported or sniffed to be UTF-8. That said, overall ICU4J seems to perform quite well. In addition to the overall relevance tests, I used a search for the word fragment "teau," which occurs frequently when the word Château is parsed with the wrong encoding (making Ch + garbage + teau). Prior to the patch I saw 102 occurrences; afterwards I saw 69 occurrences. And many of these 69 seemed to be on pages where the page had mixed encodings, or had typos, so it shows up that way even in the browser. Also, many of the remaining pages were text files or RSS feeds (parsed by TextParser, which I haven't yet adapted to use the encoding detection; doing that now).

        Architecturally I think we should store the detected encoding AND the confidence in all cases (even when low), instead of storing it only when the confidence meets some threshold. That way the decision of which value to use can be made later, in the parser, which can make a "smart" decision based upon all the data that's available (detected, sniffed, reported, plus confidence value on detection). Then, for example, if there is no sniffed or reported value, we could use the detected value, even if the confidence is low (especially useful in the TextParser). We could also make decisions like "the confidence is medium, but the same value is both sniffed and reported, so let's trust that instead," which might fix some of the detection problem cases.

        Hope this all makes sense. I'll keep plugging away at this today and report back on what I find. Thanks for all the help and quick responses.

        Doug

        By "reported," I mean in the HTTP header, and by "sniffed," I mean specified in the page metatags (since this is the term used in the code).

        Doğacan Güney made changes -
        Issue Type Wish [ 5 ] New Feature [ 2 ]
        Priority Trivial [ 5 ] Major [ 3 ]
        Assignee Doğacan Güney [ dogacan ]
        Fix Version/s 1.0.0 [ 12312443 ]
        Hide
        Doğacan Güney added a comment -

        This should be something that we fix before 1.0.

        Doğacan Güney made changes -
        Attachment NUTCH-25.patch [ 12362290 ]
        Hide
        Doğacan Güney added a comment -

        New version of the patch.

        • Catch icu4j exceptions and ignore them so that it doesn't bring down the whole crawl.
        • Add logging to parse-html to indicate how it detected the encoding.
        • Clean up parse-html to remove a couple of warnings.

        Btw, looking at this patch now, I am not sure Content.java is the right place to detect encoding. The reasons that I gave in my earlier comment are still valid but it is weird (from a design point of view) for content to detect its own encoding.

        We can pull out encoding detection to a utility class and change plugins to use it. But there would be unnecessary code duplication.

        Any suggestions?
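
        One possible shape for such a shared utility, sketched against ICU4J's CharsetDetector API (the class and method names below are made up for illustration; only the ICU4J calls are real):

        import com.ibm.icu.text.CharsetDetector;
        import com.ibm.icu.text.CharsetMatch;

        /** Sketch of a shared detection utility that Content.java and plugins could both call. */
        public class CharsetDetectionUtil {

          /** Returns the detected charset name if confidence >= minConfidence, otherwise null. */
          public static String detect(byte[] content, int minConfidence) {
            if (minConfidence < 0 || content == null || content.length <= 4) {
              return null;  // detection disabled, or input too short for ICU4J
            }
            try {
              CharsetDetector detector = new CharsetDetector();
              detector.enableInputFilter(true);  // strip HTML markup before detection
              detector.setText(content);
              CharsetMatch match = detector.detect();
              if (match != null && match.getConfidence() >= minConfidence) {
                return match.getName();
              }
            } catch (Exception e) {
              // ICU4J can throw on odd input; treat that as "no detection" rather than failing
            }
            return null;
          }
        }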

        Hide
        Doug Cook added a comment -

        Oops, spoke too soon. On running a more extensive test, I saw quite a few ArrayIndexOutOfBounds errors coming from ICU4J. Most were for index 0, some were not.

        The index 0 ones seem explainable by passing in content that is too short (see: http://bugs.icu-project.org/trac/ticket/5596). This was easily fixed. Then there were problems from non-zero indices; I don't understand why these happen, but in any case, they should not cause the entire fetch to fail, so I added a try/catch around the call to ICU4J; failures will now fall back to the previous methods (the response header or sniffing, as appropriate).

        The new check follows. When this crawl finishes I will look for any more subtle errors in my relevance tests.

        String encoding = null;
        if (minConfidence >= 0 && DETECTABLES.contains(getContentType()) && content.length > 4) {
          detector.enableInputFilter(true);
          detector.setText(content);
          CharsetMatch match = null;
          try {
            match = detector.detect();
          } catch (Exception e) {
            // ignore detector failures; fall back to the response header or sniffing
          }
          // match may be null if detection failed, so guard before using it
          if (match != null) {
            if (LOG.isTraceEnabled()) {
              LOG.trace("Detected: confidence=" + match.getConfidence());
            }
            if (match.getConfidence() >= minConfidence) {
              encoding = match.getName();
            }
          }
        }

        if (encoding != null) {
          metadata.set(Metadata.DETECTED_ENCODING, encoding);
        }

        Hide
        Doug Cook added a comment -

        I should also add that a significant number of the URLs seem to have been fixed by the inherent inclusion of Renaud's patch for NUTCH-369 – this seems very useful. (Thanks, Renaud!) Between the charset detection and telling Neko to ignore the specified character set, things are MUCH better. Here are some good test cases:

        http://www.just-drinks.com/blogdetail.aspx?ID=1230
        http://www.boissetamerica.com/products/ProductDetails.aspx?PrdId=104
        http://www.austincc.edu/bhay/Regionalitaly.doc
        http://www.cnr.it/istituti/Istituto_Articoli_conv.html?cds=106&id=18158
        http://www.winereviewonline.com/wine_reviews.cfm?nCountryID=2&archives=1
        http://www.ngr.ucdavis.edu/varietyview.cfm?varietynum=2942&setdisclaimer=yes
        http://www.info.wien.at/article.asp?IDArticle=3811
        http://www.iniap.min-agricultura.pt/projectos_detail.aspx?uni=7&id_projecto=872
        http://www.finewinepress.com/digital/addfav.php?pid=5&ref=displayimage.php%3Falbum%3Dtopn%26cat%3D0%26pos%3D62
        Hide
        Doug Cook added a comment -

        Hi, Doğacan.

        My sincere apologies for the slow response, especially given the alacrity with which you whipped up that patch.

        I had to back-port the patch to my 0.81 environment for testing, so I can't 100% guarantee that your patch works as-is on 0.9.

        At any rate, in my environment, it seems to work pretty well, at least in my limited testing, and I didn't see any obvious problems on code review. I was using a 50% confidence threshold and most of the time the detection code kicked in (with the correct answer). All of the documents I was having problems with were fine.

        There seemed to be a typo in the patch; there's a try statement missing here, if I read correctly, but I just put in a try and took out the funky isTraceEnabled(), and all was well:

        -       true);
        -     } catch (SAXException e) {}
        +       LOG.isTraceEnabled());
        +     parser.setProperty("http://cyberneko.org/html/properties/default-encoding", defaultCharEncoding);
        +     parser.setFeature("http://cyberneko.org/html/features/scanner/ignore-specified-charset", true);
        +     } catch (SAXException e) {
        +       LOG.trace(e);
        +     }

        My only (minor) suggestion would be to change the LOG.trace statements in HtmlParser to note how they determined the encoding, e.g.:

        if (LOG.isTraceEnabled()) {
          LOG.trace(base + ": setting encoding to (DETECTED) " + encoding);
        }

        That way one can look at the logs and see how often each of the 3 methods (detection, response header, sniffing) is used.

        Thanks again for the patch; it's good stuff, and useful.

        Hide
        Doğacan Güney added a comment -

        Doug, have you been able to look at my patch?

        Hide
        Doug Cook added a comment -

        Thanks! I'll take a look at your proposed patch... (that was fast! ask and ye shall receive...)

        Doğacan Güney made changes -
        Field Original Value New Value
        Attachment NUTCH-25_draft.patch [ 12357801 ]
        Hide
        Doğacan Güney added a comment -

        Well, something like this should work...

        + Adds a new configuration property, parser.charset.autodetect.min.confidence; Nutch will set the encoding to the detected encoding if the detection confidence is greater than this value. Auto-detection is disabled if the value is negative.

        + Adds charset auto-detection logic to Content.java. Uses ICU4J (so you need to put ICU4J's jar under lib to try this).

        + If auto-detection is confident enough, it puts the detected encoding into Content's Metadata. The parse-html plugin is updated to check for this and set the encoding accordingly.

        + Uses some code from NUTCH-487 and NUTCH-369 (thanks, Renaud Richardet and Marcin Okraszewski). There is a bug in the current parse-html code: if an HTML page specifies an encoding, Neko ignores the auto-detected encoding and assumes that the encoding specified in the page is correct.

        I didn't want to do auto-detection in parse-html because other plugins (like xml feed parsing plugins) may also need this. Also, IMHO, doing it in ParseSegment or ParseUtil wouldn't work, because I may not use those.
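
        For reference, reading the proposed property would presumably look something like this (only the property name comes from the comment above; the class name and the -1 "disabled" default are assumptions):

        import org.apache.hadoop.conf.Configuration;

        /** Sketch only: reading the proposed auto-detection threshold. */
        public class AutodetectConfig {

          public static final String MIN_CONFIDENCE_KEY =
              "parser.charset.autodetect.min.confidence";

          /** Returns the configured threshold; a negative value means auto-detection is disabled. */
          public static int minConfidence(Configuration conf) {
            return conf.getInt(MIN_CONFIDENCE_KEY, -1);
          }

          public static boolean autodetectEnabled(Configuration conf) {
            return minConfidence(conf) >= 0;
          }
        }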

        Hide
        Ken Krugler added a comment -

        I use ICU for most issues like this. They have a charset detector - see http://krugle.com/kse/files/cvs/source.icu-project.org/icu/icu4j/src/com/ibm/icu/text/CharsetDetector.java. I don't know how well it compares to jchardet, though.
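
        For anyone who wants to try it, a minimal usage sketch of that detector (the sample text and printed output are illustrative; very short inputs like this tend to give low-confidence guesses):

        import com.ibm.icu.text.CharsetDetector;
        import com.ibm.icu.text.CharsetMatch;

        public class Icu4jDetectExample {
          public static void main(String[] args) throws Exception {
            byte[] bytes = "Château Margaux".getBytes("UTF-8");

            CharsetDetector detector = new CharsetDetector();
            detector.setText(bytes);

            // detect() returns the best match; detectAll() returns all candidates, ranked.
            CharsetMatch best = detector.detect();
            if (best != null) {
              System.out.println(best.getName() + " (confidence " + best.getConfidence() + ")");
            }
            for (CharsetMatch m : detector.detectAll()) {
              System.out.println("  candidate: " + m.getName() + " / " + m.getConfidence());
            }
          }
        }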

        Hide
        Doug Cook added a comment -

        We might want to think about raising the priority of this. I've seen encoding problems affect quite a few documents. Sometimes this is obvious, because it shows up in the abstract, but often it is subtle, and simply affects recall.

        Here's an example.

        I have indexed the document:
        http://www.winereviewonline.com/wine_reviews.cfm?nCountryID=2&archives=1

        This document is in UTF-8, but the header says it is in iso-8859-1 (this seems fairly common!). Because of this, a few characters get screwed up, and if I search for "Les Vignes du Soir", I won't find it, because it is being indexed as “Les Vignes du Soir”, since it uses curly quotes.

        I've seen enough instances of problems like this to make me worry that it is causing significant recall problems.

        If anyone has a ready solution for this, please let me know. If not, I'll try to get to it (and contribute back the changes once I get the chance...). Is jchardet still the best Java option out there?
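
        The failure mode is easy to reproduce in isolation. The sketch below decodes the page's UTF-8 bytes with windows-1252, since that is how an iso-8859-1 label is commonly interpreted in practice (Java's strict ISO-8859-1 decoder would map two of the bytes to control characters instead):

        public class MojibakeDemo {
          public static void main(String[] args) throws Exception {
            String original = "\u201CLes Vignes du Soir";   // left curly quote, as authored
            byte[] utf8 = original.getBytes("UTF-8");

            // What gets indexed if the (wrong) header charset is believed:
            String misread = new String(utf8, "windows-1252");
            System.out.println(misread);   // prints: “Les Vignes du Soir
          }
        }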

        Hide
        Chris Fellows added a comment -

        This was last updated May '05. Has this charset and language detection been integrated into Nutch yet?

        If not, at what point should the detection happen? Fetching, parsing, etc.? If this hasn't been fixed, any leads on where to insert the detection would be helpful.

        Hide
        Benedict added a comment -

        There exists a java port of the Mozilla algorithm already:

        http://jchardet.sourceforge.net/

        Hide
        Nick Lothian added a comment -

        ROME (http://rome.dev.java.net) has an XmlReader which encapsulates most of the detection code required. See http://wiki.java.net/bin/view/Javawsxml/Rome05CharsetEncoding.

        ROME is under the Apache licence.
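
        A rough usage sketch, assuming ROME's com.sun.syndication.io.XmlReader API (the constructor and method names are from memory of ROME 0.x and may differ between versions):

        import java.io.ByteArrayInputStream;
        import java.io.InputStream;
        import com.sun.syndication.io.XmlReader;

        public class RomeXmlReaderExample {
          public static void main(String[] args) throws Exception {
            byte[] feed = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><rss/>".getBytes("UTF-8");
            InputStream in = new ByteArrayInputStream(feed);

            // The second argument is the HTTP Content-Type header, which XmlReader
            // weighs against the BOM and the XML declaration.
            XmlReader reader = new XmlReader(in, "application/xml");
            System.out.println("resolved encoding: " + reader.getEncoding());
            reader.close();
          }
        }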

        Stefan Groschupf created issue -

          People

          • Assignee:
            Doğacan Güney
            Reporter:
            Stefan Groschupf
          • Votes:
            1
            Watchers:
            3
