Thanks for the quick feedback.
> * EncodingDetector api is way too open. IMO, EncodingClue should be a private static
> class (users can pass a clue like detector.addClue(value, source, confidence)), EncodingDetector
> should not expose clues ever (for example, autoDetectClues should return void [or perhaps a
> boolean indicating the success of autodetect]) and store clues internally.
Good point. I had thought that callers might want to manipulate the list, but that is probably unlikely, and my current approach certainly leaves more room for callers to screw things up by playing with the passed List. It's an easy fix, and it cleans up the calling code a little, too. I'll make that change.
If callers ever do need to manipulate the list, we can just add an interface for that later.
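For concreteness, here is roughly the shape I have in mind. This is only a sketch, not the final patch; the method names, the byte[] parameter, and the return value of autoDetectClues are just what I'm imagining, and I've kept it 1.4-style for now:

import java.util.ArrayList;
import java.util.List;

public class EncodingDetector {

  // EncodingClue becomes an implementation detail; callers never see the list.
  private static class EncodingClue {
    final String value;
    final String source;
    final int confidence;

    EncodingClue(String value, String source, int confidence) {
      this.value = value;
      this.source = source;
      this.confidence = confidence;
    }
  }

  private final List clues = new ArrayList();

  // Callers add clues one at a time instead of passing a List around.
  public void addClue(String value, String source, int confidence) {
    clues.add(new EncodingClue(value, source, confidence));
  }

  // Runs autodetection and stores the resulting clues internally.
  public boolean autoDetectClues(byte[] data) {
    // ... run CharsetDetector over data, calling addClue(...) for each match ...
    // return false if nothing usable was found
    return true;
  }
}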
> * code:
> public boolean meetsThreshold() {
>   Integer mt = (Integer) thresholds.get(value);
>   // use global value if no encoding-specific value found
>   int myThreshold = (mt != null) ? mt.intValue() : minConfidence;
>   return (confidence < 0 || (minConfidence >= 0 && confidence >= myThreshold));
> }
> Why does meetsThreshold return true if confidence < 0?
Negative confidence values have special semantics: they mean "use me if you get to me in the list, and ignore the threshold." These semantics are needed to emulate the prior behavior (where, for example, header values would always be used, if present, in preference to 'sniffed' meta-tags). Not that the prior behavior was perfect, but I think it's a useful construct: a value which, if present, should be used regardless of confidence thresholds.
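In other words, the clue walk behaves roughly like this. This is a simplified sketch of the selection policy, not the literal patch code, and the signature is trimmed down for illustration:

  public String guessEncoding(String defaultEncoding) {
    // Clues are visited in the order they were added, e.g. the HTTP header
    // clue before the sniffed meta-tag clue.
    for (java.util.Iterator i = clues.iterator(); i.hasNext();) {
      EncodingClue clue = (EncodingClue) i.next();
      // meetsThreshold() returns true for a negative confidence, so such a
      // clue is taken as soon as it is reached, regardless of any threshold.
      if (clue.meetsThreshold()) {
        return clue.value;
      }
    }
    return defaultEncoding;  // nothing convincing was found
  }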
> * If users specify an encoding clue with no confidence then we should give it a default
> positive confidence instead of -1. Of course, confidence value needs to be very very small, maybe just +1.
Hopefully this design choice makes more sense in light of the previous comment. The -1 has special semantics, meaning "I don't have a threshold."
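Concretely, the no-confidence case would just record that sentinel. Sketch only; NO_THRESHOLD is simply a name I'm giving the -1 here:

  // "No threshold applies: use this clue if it is reached in the list."
  private static final int NO_THRESHOLD = -1;

  // Convenience overload for clues that carry no confidence of their own,
  // e.g. a charset taken straight from the HTTP Content-Type header.
  public void addClue(String value, String source) {
    addClue(value, source, NO_THRESHOLD);
  }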
> * It would be nice to "stack" clues. Assume that autodetection returned 2 possible encodings:
> ISO-8859-1 with 50 confidence and UTF-8 with 45 confidence. If I add a new clue (say, coming from
> http header) for UTF-8 with +6 confidence, overall confidence for UTF-8 should now be 51.
I'm not sure whether you mean actually combining the clues in the list, or just adding the values together inside guessEncoding.
Architecturally I think it's better to keep all the clues intact until the final "guess" is made. I've tried to make guessEncoding the place where all the policy decisions are made, the method to be overridden if someone has a better guessing algorithm. Internal to guessEncoding, we could certainly add the clue values if it turns out that helps us make a better guess.
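If summing turns out to help, it could live entirely inside guessEncoding without touching the stored clues, something along these lines. Sketch only (it assumes java.util.Map/HashMap/Iterator are imported and that it sits inside guessEncoding); charset-alias normalization and the final threshold check are omitted:

  // Sum the confidences per charset, leaving the clue list itself intact.
  Map totals = new HashMap();  // charset name -> summed confidence
  for (Iterator i = clues.iterator(); i.hasNext();) {
    EncodingClue clue = (EncodingClue) i.next();
    if (clue.confidence < 0) {
      continue;  // unconditional clues are handled by the ordered walk instead
    }
    Integer prev = (Integer) totals.get(clue.value);
    int sum = (prev == null) ? clue.confidence : prev.intValue() + clue.confidence;
    totals.put(clue.value, new Integer(sum));
  }
  // ...then pick the charset with the highest total that still meets its threshold.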
Combining clues prior to guessEncoding throws away information: clues might be additive, but they might not be (two highly correlated pieces of data won't be additive, and inversely correlated features will even be "subtractive"). Ideally someone would build a large-ish test set, judge the "real" encoding for all the examples, do the statistics, and find out how all the signals (detected encoding, header value, meta-tags) interact. A guessEncoding based on statistical modeling would be pretty sweet. When I was working for a certain large search company, this is how we would typically tackle a problem like this; I'm certain that's how CharsetDetector was created in the first place.
In the meantime, the simple algorithm provided seems to do reasonably well (it does very nearly what your version did, which seems like a fine place to start).
It's worth adding that CharsetDetector also detects languages, and the few examples I looked at seemed pretty good. It seems a shame to throw away that information, especially since I know Nutch's built-in language detection makes a fair number of mistakes (partly because it trusts the page meta-tags, which are often wrong). Another bit of food for thought.
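For reference, ICU's match object already carries the language guess alongside the charset, so keeping it would be nearly free. A self-contained sketch, assuming icu4j is on the classpath; the class name and sample text are just for illustration:

import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

public class CharsetAndLanguageSketch {
  public static void main(String[] args) throws Exception {
    byte[] contentBytes = "Un exemple de texte en français.".getBytes("ISO-8859-1");
    CharsetDetector detector = new CharsetDetector();
    detector.setText(contentBytes);                 // raw page bytes
    CharsetMatch match = detector.detect();
    if (match != null) {
      System.out.println(match.getName());          // detected charset, e.g. "ISO-8859-1"
      System.out.println(match.getConfidence());    // 0-100
      System.out.println(match.getLanguage());      // e.g. "fr"; may be null
      // the language could be recorded alongside the encoding clue
    }
  }
}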
> * This is mostly my personal nit, but Java 5 style generics would be nice.
Ah, you caught me. I'm still working in a 1.4-ish environment.
> About contributing stuff back: [...]
Many thanks. This is pretty much what I'd assumed; unfortunately it will be a while before I have the time and can afford the risk of bringing 0.9 changes into my local installation. Of course, the longer I wait, the more difficult the merge will be. Oh well, I'll get there! There are a couple of important bugfixes that I'll try to make time to port sooner.