Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-1297

Allow other string distance measures in spellchecker

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.4
    • Fix Version/s: 2.4
    • Component/s: modules/spellchecker
    • Labels:
      None
    • Environment:

      n/a

    • Lucene Fields:
      New, Patch Available

      Description

      Updated spelling code to allow for other string distance measures to be used.

      Created StringDistance interface.
      Modified existing Levenshtein distance measure to implement interface (and renamed class).
      Verified that change to Levenshtein distance didn't impact runtime performance.
      Implemented Jaro/Winkler distance metric
      Modified SpellChecker to take distacne measure as in constructor or in set method and to use interface when calling.

      1. LUCENE-1297.patch
        17 kB
        Otis Gospodnetic
      2. LUCENE-1297.patch
        17 kB
        Thomas Morton

        Activity

        Hide
        otis Otis Gospodnetic added a comment -

        You read my mind, Thomas.
        Would it be appropriate to add and try Jaccard index and Dice coefficient, too, then?

        Show
        otis Otis Gospodnetic added a comment - You read my mind, Thomas. Would it be appropriate to add and try Jaccard index and Dice coefficient, too, then?
        Hide
        tsmorton Thomas Morton added a comment -

        I think the dice coefficient would be nice to have. I'm not sure the jaccard index makes sense in the context of spelling correction since order isn't captured. I implemented JaroWinkler since I'm suggesting proper names and it does a good job with those.

        With the StringDistance interface defined, anyone can implement the distance measure however they want. What I think would be very useful is weighted version of edit distance with the weights tuned to your target language/domain. Also with support in solr for specifying this parameter in the SpellCheckRequestHandler, changing this just becomes a config change.

        Show
        tsmorton Thomas Morton added a comment - I think the dice coefficient would be nice to have. I'm not sure the jaccard index makes sense in the context of spelling correction since order isn't captured. I implemented JaroWinkler since I'm suggesting proper names and it does a good job with those. With the StringDistance interface defined, anyone can implement the distance measure however they want. What I think would be very useful is weighted version of edit distance with the weights tuned to your target language/domain. Also with support in solr for specifying this parameter in the SpellCheckRequestHandler, changing this just becomes a config change.
        Hide
        otis Otis Gospodnetic added a comment -

        Thomas - any chance you can write a simple unit test that exercises JaroWinkler?

        Show
        otis Otis Gospodnetic added a comment - Thomas - any chance you can write a simple unit test that exercises JaroWinkler?
        Hide
        tsmorton Thomas Morton added a comment -

        Updated to include additional unit tests.

        Show
        tsmorton Thomas Morton added a comment - Updated to include additional unit tests.
        Hide
        gsingers Grant Ingersoll added a comment -

        Hi Thomas,

        This patch doesn't apply for me from the contrib/spellchecker directory.

        Show
        gsingers Grant Ingersoll added a comment - Hi Thomas, This patch doesn't apply for me from the contrib/spellchecker directory.
        Hide
        tsmorton Thomas Morton added a comment -

        Looks like there was a minor change to the testing code since I created the patch. Updated and re-created patch.

        Show
        tsmorton Thomas Morton added a comment - Looks like there was a minor change to the testing code since I created the patch. Updated and re-created patch.
        Hide
        otis Otis Gospodnetic added a comment -

        Tom, note the bit about naming patches and reusing patch names on the HowToContribute wiki page.

        I see JaroWinklerDistance.java doesn't have ASL on top.

        Oh, there is something funky about this patch. You created a new class (LevenshteinDistance), but your patch shows it as an edit of TRStringDistance. It should show it as a brand new file. Could you please provide a clean patch? This is why the patch fails to apply.

        Thanks.

        Show
        otis Otis Gospodnetic added a comment - Tom, note the bit about naming patches and reusing patch names on the HowToContribute wiki page. I see JaroWinklerDistance.java doesn't have ASL on top. Oh, there is something funky about this patch. You created a new class (LevenshteinDistance), but your patch shows it as an edit of TRStringDistance. It should show it as a brand new file. Could you please provide a clean patch? This is why the patch fails to apply. Thanks.
        Hide
        tsmorton Thomas Morton added a comment -

        I didn't see anything about re-using patch names on the wiki. please advise.

        In svn the LevenshteinDistance class is a re-name and edit of the TRStringDistance class. Perhaps the patch doesn't know how to deal with that. I'll change the name back though I think given that there are now going to be more than one of these a more descriptive name makes sense.

        Added ASL to Jaro class.

        Show
        tsmorton Thomas Morton added a comment - I didn't see anything about re-using patch names on the wiki. please advise. In svn the LevenshteinDistance class is a re-name and edit of the TRStringDistance class. Perhaps the patch doesn't know how to deal with that. I'll change the name back though I think given that there are now going to be more than one of these a more descriptive name makes sense. Added ASL to Jaro class.
        Hide
        gsingers Grant Ingersoll added a comment - - edited

        I didn't see anything about re-using patch names on the wiki. please advise.

        I think Otis is just referring to naming patches as something like LUCENE-1297.patch and then you just always keep that name, then JIRA takes care of the versioning and it is always clear which patch is the latest. As for the Wiki, I think it is on the Solr wiki, but should be added to the Lucene one, too.

        Show
        gsingers Grant Ingersoll added a comment - - edited I didn't see anything about re-using patch names on the wiki. please advise. I think Otis is just referring to naming patches as something like LUCENE-1297 .patch and then you just always keep that name, then JIRA takes care of the versioning and it is always clear which patch is the latest. As for the Wiki, I think it is on the Solr wiki, but should be added to the Lucene one, too.
        Hide
        gsingers Grant Ingersoll added a comment -

        Patch applies cleanly and the tests pass.

        Ideally, there would be standalone tests for each of the distance measures that test them outside the context of spell checking.

        I think the Jaro-Winkler threshold should be configurable via a setter/constructor. A getter would make sense too, so that one can see what the threshold is.

        Also, the TRStringDistance explicitly states that it is not thread safe. I believe it is now being used in a non thread-safe manner. FWIW, I see no reason why it can't be made thread-safe. All of those member variables are being allocated in the getDistance method, so no reason not to just make them local variables, I think.

        Show
        gsingers Grant Ingersoll added a comment - Patch applies cleanly and the tests pass. Ideally, there would be standalone tests for each of the distance measures that test them outside the context of spell checking. I think the Jaro-Winkler threshold should be configurable via a setter/constructor. A getter would make sense too, so that one can see what the threshold is. Also, the TRStringDistance explicitly states that it is not thread safe. I believe it is now being used in a non thread-safe manner. FWIW, I see no reason why it can't be made thread-safe. All of those member variables are being allocated in the getDistance method, so no reason not to just make them local variables, I think.
        Hide
        otis Otis Gospodnetic added a comment -

        Tom, I agree with Grant and I'll assume you'll update the patch.

        As for that TRStringDistance -> LevensteinDistance, I'll just commit it as is once the patch is fully ready, and then I'll rename classes in a separate commit.

        Show
        otis Otis Gospodnetic added a comment - Tom, I agree with Grant and I'll assume you'll update the patch. As for that TRStringDistance -> LevensteinDistance, I'll just commit it as is once the patch is fully ready, and then I'll rename classes in a separate commit.
        Hide
        tsmorton Thomas Morton added a comment -

        Added tests for JaroWinkler and Levenshtein distances directly.
        Added getter/setter for JaroWinker threshold and javadoc.
        Moved class variables in Levenshtein into method to make it thread-safe.
        Named patch appropriately.

        Show
        tsmorton Thomas Morton added a comment - Added tests for JaroWinkler and Levenshtein distances directly. Added getter/setter for JaroWinker threshold and javadoc. Moved class variables in Levenshtein into method to make it thread-safe. Named patch appropriately.
        Hide
        otis Otis Gospodnetic added a comment -

        Attaching a new version (only added ASL 2.0 to StringDistance + typo fix)

        Question (why - what does it do?) about this TRStringDistance change:

        • return p[n];
          + return 1.0f - ((float) p[n] / Math.min(other.length(), sa.length));
        Show
        otis Otis Gospodnetic added a comment - Attaching a new version (only added ASL 2.0 to StringDistance + typo fix) Question (why - what does it do?) about this TRStringDistance change: return p [n] ; + return 1.0f - ((float) p [n] / Math.min(other.length(), sa.length));
        Hide
        tsmorton Thomas Morton added a comment -

        Hi,
        This code used to be in the SpellChecker itself. It just converts the int from the Levenshtein into a value between 0 and 1. 1 being identical, 0 being maximally different. This return value is part of the StringDistance interface and different methods compute this value differently so it's necessary to compute it on a per distance measure basis.

        Thanks...Tom

        Show
        tsmorton Thomas Morton added a comment - Hi, This code used to be in the SpellChecker itself. It just converts the int from the Levenshtein into a value between 0 and 1. 1 being identical, 0 being maximally different. This return value is part of the StringDistance interface and different methods compute this value differently so it's necessary to compute it on a per distance measure basis. Thanks...Tom
        Hide
        gsingers Grant Ingersoll added a comment -

        +1 on committing this. I downloaded the latest and applied, ran the tests, etc. and it looks good.

        Show
        gsingers Grant Ingersoll added a comment - +1 on committing this. I downloaded the latest and applied, ran the tests, etc. and it looks good.
        Hide
        otis Otis Gospodnetic added a comment -

        Committed, thanks Tom.

        svn ci CHANGES.txt contrib/spellchecker
        Sending CHANGES.txt
        Adding contrib/spellchecker/src/java/org/apache/lucene/search/spell/JaroWinklerDistance.java
        Sending contrib/spellchecker/src/java/org/apache/lucene/search/spell/SpellChecker.java
        Adding contrib/spellchecker/src/java/org/apache/lucene/search/spell/StringDistance.java
        Sending contrib/spellchecker/src/java/org/apache/lucene/search/spell/TRStringDistance.java
        Adding contrib/spellchecker/src/test/org/apache/lucene/search/spell/TestJaroWinklerDistance.java
        Adding contrib/spellchecker/src/test/org/apache/lucene/search/spell/TestLevenshteinDistance.java
        Sending contrib/spellchecker/src/test/org/apache/lucene/search/spell/TestSpellChecker.java
        Transmitting file data ........
        Committed revision 669085.

        Also:

        $ svn ci contrib/spellchecker
        Adding contrib/spellchecker/src/java/org/apache/lucene/search/spell/LevensteinDistance.java
        Sending contrib/spellchecker/src/java/org/apache/lucene/search/spell/SpellChecker.java
        Deleting contrib/spellchecker/src/java/org/apache/lucene/search/spell/TRStringDistance.java
        Sending contrib/spellchecker/src/test/org/apache/lucene/search/spell/TestLevenshteinDistance.java
        Transmitting file data ...
        Committed revision 669086.

        Show
        otis Otis Gospodnetic added a comment - Committed, thanks Tom. svn ci CHANGES.txt contrib/spellchecker Sending CHANGES.txt Adding contrib/spellchecker/src/java/org/apache/lucene/search/spell/JaroWinklerDistance.java Sending contrib/spellchecker/src/java/org/apache/lucene/search/spell/SpellChecker.java Adding contrib/spellchecker/src/java/org/apache/lucene/search/spell/StringDistance.java Sending contrib/spellchecker/src/java/org/apache/lucene/search/spell/TRStringDistance.java Adding contrib/spellchecker/src/test/org/apache/lucene/search/spell/TestJaroWinklerDistance.java Adding contrib/spellchecker/src/test/org/apache/lucene/search/spell/TestLevenshteinDistance.java Sending contrib/spellchecker/src/test/org/apache/lucene/search/spell/TestSpellChecker.java Transmitting file data ........ Committed revision 669085. Also: $ svn ci contrib/spellchecker Adding contrib/spellchecker/src/java/org/apache/lucene/search/spell/LevensteinDistance.java Sending contrib/spellchecker/src/java/org/apache/lucene/search/spell/SpellChecker.java Deleting contrib/spellchecker/src/java/org/apache/lucene/search/spell/TRStringDistance.java Sending contrib/spellchecker/src/test/org/apache/lucene/search/spell/TestLevenshteinDistance.java Transmitting file data ... Committed revision 669086.

          People

          • Assignee:
            otis Otis Gospodnetic
            Reporter:
            tsmorton Thomas Morton
          • Votes:
            1 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development