Issue Details (XML | Word | Printable)

Key: LUCENE-1297
Type: New Feature New Feature
Status: Closed Closed
Resolution: Fixed
Priority: Minor Minor
Assignee: Otis Gospodnetic
Reporter: Thomas Morton
Votes: 1
Watchers: 1
Operations

If you were logged in you would be able to see more operations.
Lucene - Java

Allow other string distance measures in spellchecker

Created: 30/May/08 06:55 PM   Updated: 11/Oct/08 12:49 PM
Return to search
Component/s: contrib/spellchecker
Affects Version/s: 2.4
Fix Version/s: 2.4

Time Tracking:
Not Specified

File Attachments:
  Size
Text File Licensed for inclusion in ASF works LUCENE-1297.patch 2008-06-15 03:57 AM Otis Gospodnetic 17 kB
Text File Licensed for inclusion in ASF works LUCENE-1297.patch 2008-06-12 05:11 PM Thomas Morton 17 kB
Environment: n/a

Lucene Fields: Patch Available, New
Resolution Date: 18/Jun/08 05:10 AM


 Description  « Hide
Updated spelling code to allow for other string distance measures to be used.

Created StringDistance interface.
Modified existing Levenshtein distance measure to implement interface (and renamed class).
Verified that change to Levenshtein distance didn't impact runtime performance.
Implemented Jaro/Winkler distance metric
Modified SpellChecker to take distacne measure as in constructor or in set method and to use interface when calling.



 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Thomas Morton made changes - 30/May/08 06:56 PM
Field Original Value New Value
Attachment string_distance.patch [ 12383131 ]
Otis Gospodnetic made changes - 31/May/08 03:18 AM
Assignee Otis Gospodnetic [ otis ]
Otis Gospodnetic added a comment - 31/May/08 03:29 AM
You read my mind, Thomas.
Would it be appropriate to add and try Jaccard index and Dice coefficient, too, then?

Thomas Morton added a comment - 31/May/08 03:44 PM
I think the dice coefficient would be nice to have. I'm not sure the jaccard index makes sense in the context of spelling correction since order isn't captured. I implemented JaroWinkler since I'm suggesting proper names and it does a good job with those.

With the StringDistance interface defined, anyone can implement the distance measure however they want. What I think would be very useful is weighted version of edit distance with the weights tuned to your target language/domain. Also with support in solr for specifying this parameter in the SpellCheckRequestHandler, changing this just becomes a config change.


Otis Gospodnetic added a comment - 04/Jun/08 03:11 AM
Thomas - any chance you can write a simple unit test that exercises JaroWinkler?

Thomas Morton made changes - 05/Jun/08 05:31 PM
Attachment string_distance.patch [ 12383131 ]
Thomas Morton added a comment - 05/Jun/08 05:32 PM
Updated to include additional unit tests.

Thomas Morton made changes - 05/Jun/08 05:32 PM
Attachment string_distance.patch [ 12383474 ]
Grant Ingersoll added a comment - 11/Jun/08 01:37 AM
Hi Thomas,

This patch doesn't apply for me from the contrib/spellchecker directory.


Thomas Morton added a comment - 11/Jun/08 03:19 AM
Looks like there was a minor change to the testing code since I created the patch. Updated and re-created patch.

Thomas Morton made changes - 11/Jun/08 03:19 AM
Attachment string_distance.patch2 [ 12383801 ]
Thomas Morton made changes - 11/Jun/08 03:20 AM
Attachment string_distance.patch [ 12383474 ]
Otis Gospodnetic added a comment - 11/Jun/08 05:38 AM
Tom, note the bit about naming patches and reusing patch names on the HowToContribute wiki page.

I see JaroWinklerDistance.java doesn't have ASL on top.

Oh, there is something funky about this patch. You created a new class (LevenshteinDistance), but your patch shows it as an edit of TRStringDistance. It should show it as a brand new file. Could you please provide a clean patch? This is why the patch fails to apply.

Thanks.


Thomas Morton added a comment - 11/Jun/08 02:04 PM
I didn't see anything about re-using patch names on the wiki. please advise.

In svn the LevenshteinDistance class is a re-name and edit of the TRStringDistance class. Perhaps the patch doesn't know how to deal with that. I'll change the name back though I think given that there are now going to be more than one of these a more descriptive name makes sense.

Added ASL to Jaro class.


Thomas Morton made changes - 11/Jun/08 02:04 PM
Attachment string_distance3.patch [ 12383831 ]
Thomas Morton made changes - 11/Jun/08 02:04 PM
Attachment string_distance.patch2 [ 12383801 ]
Grant Ingersoll added a comment - 11/Jun/08 02:26 PM - edited

I didn't see anything about re-using patch names on the wiki. please advise.

I think Otis is just referring to naming patches as something like LUCENE-1297.patch and then you just always keep that name, then JIRA takes care of the versioning and it is always clear which patch is the latest. As for the Wiki, I think it is on the Solr wiki, but should be added to the Lucene one, too.


Grant Ingersoll added a comment - 12/Jun/08 01:25 PM
Patch applies cleanly and the tests pass.

Ideally, there would be standalone tests for each of the distance measures that test them outside the context of spell checking.

I think the Jaro-Winkler threshold should be configurable via a setter/constructor. A getter would make sense too, so that one can see what the threshold is.

Also, the TRStringDistance explicitly states that it is not thread safe. I believe it is now being used in a non thread-safe manner. FWIW, I see no reason why it can't be made thread-safe. All of those member variables are being allocated in the getDistance method, so no reason not to just make them local variables, I think.


Otis Gospodnetic added a comment - 12/Jun/08 03:27 PM
Tom, I agree with Grant and I'll assume you'll update the patch.

As for that TRStringDistance -> LevensteinDistance, I'll just commit it as is once the patch is fully ready, and then I'll rename classes in a separate commit.


Thomas Morton added a comment - 12/Jun/08 05:11 PM
Added tests for JaroWinkler and Levenshtein distances directly.
Added getter/setter for JaroWinker threshold and javadoc.
Moved class variables in Levenshtein into method to make it thread-safe.
Named patch appropriately.

Thomas Morton made changes - 12/Jun/08 05:11 PM
Attachment LUCENE-1297.patch [ 12383923 ]
Thomas Morton made changes - 12/Jun/08 05:11 PM
Attachment string_distance3.patch [ 12383831 ]
Otis Gospodnetic added a comment - 15/Jun/08 03:57 AM
Attaching a new version (only added ASL 2.0 to StringDistance + typo fix)

Question (why - what does it do?) about this TRStringDistance change:

  • return p[n];
    + return 1.0f - ((float) p[n] / Math.min(other.length(), sa.length));

Otis Gospodnetic made changes - 15/Jun/08 03:57 AM
Attachment LUCENE-1297.patch [ 12384021 ]
Thomas Morton added a comment - 15/Jun/08 12:46 PM
Hi,
This code used to be in the SpellChecker itself. It just converts the int from the Levenshtein into a value between 0 and 1. 1 being identical, 0 being maximally different. This return value is part of the StringDistance interface and different methods compute this value differently so it's necessary to compute it on a per distance measure basis.

Thanks...Tom


Grant Ingersoll added a comment - 16/Jun/08 01:56 PM
+1 on committing this. I downloaded the latest and applied, ran the tests, etc. and it looks good.

Repository Revision Date User Message
ASF #669085 Wed Jun 18 05:01:57 UTC 2008 otis LUCENE-1297 - Allow other string distance measures for the SpellChecker
Files Changed
ADD /lucene/java/trunk/contrib/spellchecker/src/java/org/apache/lucene/search/spell/StringDistance.java
ADD /lucene/java/trunk/contrib/spellchecker/src/test/org/apache/lucene/search/spell/TestLevenshteinDistance.java
MODIFY /lucene/java/trunk/contrib/spellchecker/src/java/org/apache/lucene/search/spell/TRStringDistance.java
MODIFY /lucene/java/trunk/contrib/spellchecker/src/test/org/apache/lucene/search/spell/TestSpellChecker.java
MODIFY /lucene/java/trunk/CHANGES.txt
ADD /lucene/java/trunk/contrib/spellchecker/src/java/org/apache/lucene/search/spell/JaroWinklerDistance.java
MODIFY /lucene/java/trunk/contrib/spellchecker/src/java/org/apache/lucene/search/spell/SpellChecker.java
ADD /lucene/java/trunk/contrib/spellchecker/src/test/org/apache/lucene/search/spell/TestJaroWinklerDistance.java

Repository Revision Date User Message
ASF #669086 Wed Jun 18 05:09:11 UTC 2008 otis - Renamed TRStringDistance to LevensteinDistance (related to LUCENE-1297)
Files Changed
ADD /lucene/java/trunk/contrib/spellchecker/src/java/org/apache/lucene/search/spell/LevensteinDistance.java (from /lucene/java/trunk/contrib/spellchecker/src/java/org/apache/lucene/search/spell/TRStringDistance.java)
MODIFY /lucene/java/trunk/contrib/spellchecker/src/test/org/apache/lucene/search/spell/TestLevenshteinDistance.java
DEL /lucene/java/trunk/contrib/spellchecker/src/java/org/apache/lucene/search/spell/TRStringDistance.java
MODIFY /lucene/java/trunk/contrib/spellchecker/src/java/org/apache/lucene/search/spell/SpellChecker.java

Otis Gospodnetic added a comment - 18/Jun/08 05:10 AM
Committed, thanks Tom.

svn ci CHANGES.txt contrib/spellchecker
Sending CHANGES.txt
Adding contrib/spellchecker/src/java/org/apache/lucene/search/spell/JaroWinklerDistance.java
Sending contrib/spellchecker/src/java/org/apache/lucene/search/spell/SpellChecker.java
Adding contrib/spellchecker/src/java/org/apache/lucene/search/spell/StringDistance.java
Sending contrib/spellchecker/src/java/org/apache/lucene/search/spell/TRStringDistance.java
Adding contrib/spellchecker/src/test/org/apache/lucene/search/spell/TestJaroWinklerDistance.java
Adding contrib/spellchecker/src/test/org/apache/lucene/search/spell/TestLevenshteinDistance.java
Sending contrib/spellchecker/src/test/org/apache/lucene/search/spell/TestSpellChecker.java
Transmitting file data ........
Committed revision 669085.

Also:

$ svn ci contrib/spellchecker
Adding contrib/spellchecker/src/java/org/apache/lucene/search/spell/LevensteinDistance.java
Sending contrib/spellchecker/src/java/org/apache/lucene/search/spell/SpellChecker.java
Deleting contrib/spellchecker/src/java/org/apache/lucene/search/spell/TRStringDistance.java
Sending contrib/spellchecker/src/test/org/apache/lucene/search/spell/TestLevenshteinDistance.java
Transmitting file data ...
Committed revision 669086.


Otis Gospodnetic made changes - 18/Jun/08 05:10 AM
Resolution Fixed [ 1 ]
Status Open [ 1 ] Resolved [ 5 ]
Lucene Fields [Patch Available, New] [New, Patch Available]
Michael McCandless made changes - 11/Oct/08 12:49 PM
Status Resolved [ 5 ] Closed [ 6 ]