Lucene - Core
  1. Lucene - Core
  2. LUCENE-1550

Add N-Gram String Matching for Spell Checking

    Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 2.9
    • Fix Version/s: 2.9
    • Component/s: modules/spellchecker
    • Labels:
      None

      Description

      N-Gram version of edit distance based on paper by Grzegorz Kondrak, "N-gram similarity and distance". Proceedings of the Twelfth International Conference on String Processing and Information Retrieval (SPIRE 2005), pp. 115-126, Buenos Aires, Argentina, November 2005.
      http://www.cs.ualberta.ca/~kondrak/papers/spire05.pdf

      1. LUCENE-1550.patch
        14 kB
        Thomas Morton
      2. LUCENE-1550.patch
        13 kB
        Thomas Morton
      3. LUCENE-1550.patch
        11 kB
        Thomas Morton

        Activity

        Hide
        Thomas Morton added a comment -

        Patch includes implementation of n-gram string matching.

        This implementation uses the position-based optimization to compute partial matches of n-gram sub-strings and adds a null-character prefix of size n-1 so that the first character is contained in the same number of n-grams as a middle character. Null-character prefix matches are discounted so that strings with no matching characters will return a distance of 0.

        Includes test cases.

        Fixes javadoc description of StringDistance to reflect that 1 is used for the same string and 0 for non-matching.

        Show
        Thomas Morton added a comment - Patch includes implementation of n-gram string matching. This implementation uses the position-based optimization to compute partial matches of n-gram sub-strings and adds a null-character prefix of size n-1 so that the first character is contained in the same number of n-grams as a middle character. Null-character prefix matches are discounted so that strings with no matching characters will return a distance of 0. Includes test cases. Fixes javadoc description of StringDistance to reflect that 1 is used for the same string and 0 for non-matching.
        Hide
        Grant Ingersoll added a comment -

        not major

        Show
        Grant Ingersoll added a comment - not major
        Hide
        Grant Ingersoll added a comment -

        Hey Tom,

        Few questions:

        1. Do you have recommendations on picking n?
        2. On line 78 or so, can't that be moved up? s1/t1 are calculated on line 47 and not assigned to. Seems like it would be an optimization to return out if they are 0. Also, can it just be:
          if (s1 == 0 || t1 == 0){return 1;};
          

          In fact, all tests still pass when this is moved up to the top. However, I must not be understanding something, as why should:

          public void testEmpty() throws Exception {
              StringDistance nsd = new NGramDistance(1);
              float d = nsd.getDistance("", "al");
              assertEquals(d,1.0f,0.001);
            }
          

          pass?

        Show
        Grant Ingersoll added a comment - Hey Tom, Few questions: Do you have recommendations on picking n? On line 78 or so, can't that be moved up? s1/t1 are calculated on line 47 and not assigned to. Seems like it would be an optimization to return out if they are 0. Also, can it just be: if (s1 == 0 || t1 == 0){ return 1;}; In fact, all tests still pass when this is moved up to the top. However, I must not be understanding something, as why should: public void testEmpty() throws Exception { StringDistance nsd = new NGramDistance(1); float d = nsd.getDistance( "", " al"); assertEquals(d,1.0f,0.001); } pass?
        Hide
        Thomas Morton added a comment -

        2 seems a reasonable default. Experiments in paper should comparable results for bi-grams and tri-grams. Made an empty constructor which sets n=2.

        Yes that can be moved up without penalty.

        That's a bug in the empty case. Should return 0 unless both strings are empty. I ported this bug form the Levenstein Distance code. It's now fixed in both and has unit tests in both. New patch attached.

        Technically NGramDistance(1) is the same thing as LevensteinDistance but LevensteinDistance code is more straight forward and may be slightly faster.

        Show
        Thomas Morton added a comment - 2 seems a reasonable default. Experiments in paper should comparable results for bi-grams and tri-grams. Made an empty constructor which sets n=2. Yes that can be moved up without penalty. That's a bug in the empty case. Should return 0 unless both strings are empty. I ported this bug form the Levenstein Distance code. It's now fixed in both and has unit tests in both. New patch attached. Technically NGramDistance(1) is the same thing as LevensteinDistance but LevensteinDistance code is more straight forward and may be slightly faster.
        Hide
        Grant Ingersoll added a comment -

        Hey Tom,

        The empty string test still fails, namely b/c it is expecting 0.0 instead of -1. Seems like we should just return 0 for the case where one is empty. Although, even that is a bit weird, right? For instance, in the Edit distance case, there still is a notion of the number of edits one needs to make to get from a string to an empty string, right? That is, the distance from "a" to "" seems less than the distance of "abcdef" to "", no?

        I guess I'm fine with 0, but what does the literature suggest for this edge case?

        Show
        Grant Ingersoll added a comment - Hey Tom, The empty string test still fails, namely b/c it is expecting 0.0 instead of -1. Seems like we should just return 0 for the case where one is empty. Although, even that is a bit weird, right? For instance, in the Edit distance case, there still is a notion of the number of edits one needs to make to get from a string to an empty string, right? That is, the distance from "a" to "" seems less than the distance of "abcdef" to "", no? I guess I'm fine with 0, but what does the literature suggest for this edge case?
        Hide
        Thomas Morton added a comment -

        Fixes the empty string case.
        Adds some additional unit tests.

        Show
        Thomas Morton added a comment - Fixes the empty string case. Adds some additional unit tests.
        Hide
        Thomas Morton added a comment -

        The implementations returns a normalized edit distance (normalized by string length) and specifically 1 if the strings are the same and 0 if that are maximally different. 0 in that case makes sense as the number of edits is equal to the number of characters in the longest string, so:

        1- (2 edits /2 length) = 0

        Show
        Thomas Morton added a comment - The implementations returns a normalized edit distance (normalized by string length) and specifically 1 if the strings are the same and 0 if that are maximally different. 0 in that case makes sense as the number of edits is equal to the number of characters in the longest string, so: 1- (2 edits /2 length) = 0
        Hide
        Grant Ingersoll added a comment -

        Committed revision 776704.

        Thanks Tom!

        Show
        Grant Ingersoll added a comment - Committed revision 776704. Thanks Tom!

          People

          • Assignee:
            Grant Ingersoll
            Reporter:
            Thomas Morton
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development