|
Thomas Morton made changes - 30/May/08 06:56 PM
Otis Gospodnetic made changes - 31/May/08 03:18 AM
I think the dice coefficient would be nice to have. I'm not sure the jaccard index makes sense in the context of spelling correction since order isn't captured. I implemented JaroWinkler since I'm suggesting proper names and it does a good job with those.
With the StringDistance interface defined, anyone can implement the distance measure however they want. What I think would be very useful is weighted version of edit distance with the weights tuned to your target language/domain. Also with support in solr for specifying this parameter in the SpellCheckRequestHandler, changing this just becomes a config change. Thomas - any chance you can write a simple unit test that exercises JaroWinkler?
Thomas Morton made changes - 05/Jun/08 05:31 PM
Updated to include additional unit tests.
Thomas Morton made changes - 05/Jun/08 05:32 PM
Hi Thomas,
This patch doesn't apply for me from the contrib/spellchecker directory. Looks like there was a minor change to the testing code since I created the patch. Updated and re-created patch.
Thomas Morton made changes - 11/Jun/08 03:19 AM
Thomas Morton made changes - 11/Jun/08 03:20 AM
Tom, note the bit about naming patches and reusing patch names on the HowToContribute wiki page.
I see JaroWinklerDistance.java doesn't have ASL on top. Oh, there is something funky about this patch. You created a new class (LevenshteinDistance), but your patch shows it as an edit of TRStringDistance. It should show it as a brand new file. Could you please provide a clean patch? This is why the patch fails to apply. Thanks. I didn't see anything about re-using patch names on the wiki. please advise.
In svn the LevenshteinDistance class is a re-name and edit of the TRStringDistance class. Perhaps the patch doesn't know how to deal with that. I'll change the name back though I think given that there are now going to be more than one of these a more descriptive name makes sense. Added ASL to Jaro class.
Thomas Morton made changes - 11/Jun/08 02:04 PM
Thomas Morton made changes - 11/Jun/08 02:04 PM
I think Otis is just referring to naming patches as something like Patch applies cleanly and the tests pass.
Ideally, there would be standalone tests for each of the distance measures that test them outside the context of spell checking. I think the Jaro-Winkler threshold should be configurable via a setter/constructor. A getter would make sense too, so that one can see what the threshold is. Also, the TRStringDistance explicitly states that it is not thread safe. I believe it is now being used in a non thread-safe manner. FWIW, I see no reason why it can't be made thread-safe. All of those member variables are being allocated in the getDistance method, so no reason not to just make them local variables, I think. Tom, I agree with Grant and I'll assume you'll update the patch.
As for that TRStringDistance -> LevensteinDistance, I'll just commit it as is once the patch is fully ready, and then I'll rename classes in a separate commit. Added tests for JaroWinkler and Levenshtein distances directly.
Added getter/setter for JaroWinker threshold and javadoc. Moved class variables in Levenshtein into method to make it thread-safe. Named patch appropriately.
Thomas Morton made changes - 12/Jun/08 05:11 PM
Thomas Morton made changes - 12/Jun/08 05:11 PM
Attaching a new version (only added ASL 2.0 to StringDistance + typo fix)
Question (why - what does it do?) about this TRStringDistance change:
Otis Gospodnetic made changes - 15/Jun/08 03:57 AM
Hi,
This code used to be in the SpellChecker itself. It just converts the int from the Levenshtein into a value between 0 and 1. 1 being identical, 0 being maximally different. This return value is part of the StringDistance interface and different methods compute this value differently so it's necessary to compute it on a per distance measure basis. Thanks...Tom +1 on committing this. I downloaded the latest and applied, ran the tests, etc. and it looks good.
Committed, thanks Tom.
svn ci CHANGES.txt contrib/spellchecker Also: $ svn ci contrib/spellchecker
Otis Gospodnetic made changes - 18/Jun/08 05:10 AM
Michael McCandless made changes - 11/Oct/08 12:49 PM
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Would it be appropriate to add and try Jaccard index and Dice coefficient, too, then?