Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0-ALPHA
    • Component/s: modules/spellchecker
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      The current spellchecker makes an n-gram index of your terms, and queries this for spellchecking.
      The terms that come back from the n-gram query are then re-ranked by an algorithm such as Levenshtein.

      Alternatively, we could just do a levenshtein query directly against the index, then we wouldn't need
      a separate index to rebuild.

      1. LUCENE-2507.patch
        38 kB
        Robert Muir
      2. LUCENE-2507.patch
        32 kB
        Robert Muir
      3. LUCENE-2507.patch
        10 kB
        Robert Muir
      4. LUCENE-2507.patch
        9 kB
        Robert Muir
      5. LUCENE-2507.patch
        9 kB
        Robert Muir

        Activity

        Hide
        Robert Muir added a comment -

        prototype patch that adds 'DirectSpellChecker', with some tests showing use with IndexWriter.getReader()

        For now I only implemented Levenshtein, since its for free But it would be good to support re-ranking against the existing StringDistance metrics, etc.

        The larger problem is that the existing APIs are really built around this idea of a separate index, so I'm open to suggestions as to how this can be better integrated.

        Show
        Robert Muir added a comment - prototype patch that adds 'DirectSpellChecker', with some tests showing use with IndexWriter.getReader() For now I only implemented Levenshtein, since its for free But it would be good to support re-ranking against the existing StringDistance metrics, etc. The larger problem is that the existing APIs are really built around this idea of a separate index, so I'm open to suggestions as to how this can be better integrated.
        Hide
        Robert Muir added a comment -

        there was a bug in conversion between fuzzy term enum's scaling.

        i ran some simple perf tests, this is essentially just as fast as the existing code
        with setMaxEdits(1). but with setMaxEdits(2) is much slower.

        i'll try to think of ways to speed it up... one idea would be to add lev automata with transposition
        support instead of using higher distances, etc.

        Show
        Robert Muir added a comment - there was a bug in conversion between fuzzy term enum's scaling. i ran some simple perf tests, this is essentially just as fast as the existing code with setMaxEdits(1). but with setMaxEdits(2) is much slower. i'll try to think of ways to speed it up... one idea would be to add lev automata with transposition support instead of using higher distances, etc.
        Hide
        Robert Muir added a comment -

        we have sped up this seeking a lot recently, and i improved this patch some:

        • avoid calling docfreq on the suggestions, by using the TermsEnum's docfreq
        • Mike had the idea that we should actually try lower edit distances first. The
          general use case here is a small number of suggestions (e.g. 1), so
          we actually try edit distance=1 first... only if this doesn't give enough suggestions
          do we then try higher distances.

        I think this is a good approach here, because we are getting levenshtein directly,
        so we don't have the problem the n-gram based spellchecker has... (for reference below)

           * <p>As the Lucene similarity that is used to fetch the most relevant n-grammed terms
           * is not the same as the edit distance strategy used to calculate the best
           * matching spell-checked word from the hits that Lucene found, one usually has
           * to retrieve a couple of numSug's in order to get the true best match.
           *
           * <p>I.e. if numSug == 1, don't count on that suggestion being the best one.
           * Thus, you should set this value to <b>at least</b> 5 for a good suggestion.
        

        Since we are actually doing levenshtein, you can safely use lower values for numSug,
        such as numSug=1

        Show
        Robert Muir added a comment - we have sped up this seeking a lot recently, and i improved this patch some: avoid calling docfreq on the suggestions, by using the TermsEnum's docfreq Mike had the idea that we should actually try lower edit distances first. The general use case here is a small number of suggestions (e.g. 1), so we actually try edit distance=1 first... only if this doesn't give enough suggestions do we then try higher distances. I think this is a good approach here, because we are getting levenshtein directly, so we don't have the problem the n-gram based spellchecker has... (for reference below) * <p>As the Lucene similarity that is used to fetch the most relevant n-grammed terms * is not the same as the edit distance strategy used to calculate the best * matching spell-checked word from the hits that Lucene found, one usually has * to retrieve a couple of numSug's in order to get the true best match. * * <p>I.e. if numSug == 1, don't count on that suggestion being the best one. * Thus, you should set this value to <b>at least</b> 5 for a good suggestion. Since we are actually doing levenshtein, you can safely use lower values for numSug, such as numSug=1
        Hide
        Robert Muir added a comment -

        I improved the quality and performance of this spellchecker, integrated it with the other spellchecker APIs,
        and did the Solr side. I think minus some more tests and docs (especially on the various options) its good to go.

        Show
        Robert Muir added a comment - I improved the quality and performance of this spellchecker, integrated it with the other spellchecker APIs, and did the Solr side. I think minus some more tests and docs (especially on the various options) its good to go.
        Hide
        Chris Male added a comment -

        Hey Robert,

        Do you have any benchmarks for this spellchecker? I notice you mention a few times that you improved the performance. Do you know how it compares against the separate index approach?

        Equally, is this spellchecker a conceptual drop in replacement? By that I mean, are the suggestions it generates radically different to the separate index spellcheckers or are they along the same lines?

        Show
        Chris Male added a comment - Hey Robert, Do you have any benchmarks for this spellchecker? I notice you mention a few times that you improved the performance. Do you know how it compares against the separate index approach? Equally, is this spellchecker a conceptual drop in replacement? By that I mean, are the suggestions it generates radically different to the separate index spellcheckers or are they along the same lines?
        Hide
        Robert Muir added a comment -

        Do you have any benchmarks for this spellchecker? I notice you mention a few times that you improved the performance. Do you know how it compares against the separate index approach?

        In general I think the performance is fine. I did a lot of testing against the geonames database (> 2 million unique terms).
        But, it completely depends upon the parameters you set. Here are some that can affect performance and quality:

        • avoid doing work if the query term is already spelled correctly:
          • minQueryLength (example: 4), query words of 3 characters or less are not checked.
            In general, with any metric, the candidates here will mostly be nonsense anyway.
          • maxQueryFrequency (example: 1% or 1): if the query word is high frequency (e.g. appears in more
            than 1% of the documents its assumed to be correct, and no suggestions are given.
            You can also set this to something like 1, if say you have a small product database
            and you feel all your products are spelled completely correct in your index, and you
            don't want to ever suggest anything if the query term is in your products database.
        • avoid doing work examining potentially bad suggestions:
          • maxEdits (example: 1), the majority of misspellings are only 1 distance away.
            So if you lower this from the default "2" to 1, its faster and more "lightweight" in the sense you get less a chance of getting a bad suggestion.
          • minPrefix (example: 1), most misspellings don't occur in the first character.
            For the solr example, i set this to zero (the wiki has an example correcting "hell" with "dell"), but in practice I think 1 is a good value.
            Additionally this has a practical use for solr users: you need a rather "raw" (e.g. not stemmed) analyzed field for spellchecking,
            if you set this to 1 you can re-use your reverse-wildcard field for spellchecking too, and it will never suggest reversed terms.
          • thresholdFrequency (example: 1% or 1): this plays the role of Solr's "HighFrequencyDictionary".
            In other words, you could set this to 1 to never suggest words that only appear in a single document... in many cases these are misspellings.
          • maxInspections (example: 5), the existing spellchecker uses a hardcoded 10 here.
            A lower value can work well here, since the algorithm used to draw candidates is actually levenshtein.
            However, I set the default to 5 (instead of 1), because its good to gather a few candidates for docFreq-sorting....
            but if you increase thresholdFrequency you can probably lower this.

        Equally, is this spellchecker a conceptual drop in replacement? By that I mean, are the suggestions it generates radically different to the separate index spellcheckers or are they along the same lines?

        I think they are better, e.g. if you are ranking by an edit-distance like function such as Levenshtein or Jaro-Winkler, it makes more sense to get your candidates via the same or similar algorithm! The existing spellchecker gets candidates with n-grams... I think this causes a mismatch... (Of course the inverse is true, if you use NGramDistance, use the existing spellchecker!)

        Again I did a lot of testing with various corpora, and I'm not a spellchecking expert but i didn't get particularly good results from the existing spellchecker.
        And for some corpora such as geonames, it didnt seem to have the configurability I needed to tune the spellchecker to that domain.

        For example, i queried on "zeeland" and the existing spellchecker returned freeland, leland, ireland, deland, and cleland as suggestions.
        Whats worse, is that it created a 240MB auxiliary index when my original index was only 300MB, and it took it 141 seconds to do this.

        The idea here isn't to solve the world's spellchecking problems, its mainly to get rid of the extra index. I think its trivial to
        set this one up to beat SpellChecker's suggestions, because I don't think SpellChecker's suggestions are very good.

        Show
        Robert Muir added a comment - Do you have any benchmarks for this spellchecker? I notice you mention a few times that you improved the performance. Do you know how it compares against the separate index approach? In general I think the performance is fine. I did a lot of testing against the geonames database (> 2 million unique terms). But, it completely depends upon the parameters you set. Here are some that can affect performance and quality: avoid doing work if the query term is already spelled correctly: minQueryLength (example: 4), query words of 3 characters or less are not checked. In general, with any metric, the candidates here will mostly be nonsense anyway. maxQueryFrequency (example: 1% or 1): if the query word is high frequency (e.g. appears in more than 1% of the documents its assumed to be correct, and no suggestions are given. You can also set this to something like 1, if say you have a small product database and you feel all your products are spelled completely correct in your index, and you don't want to ever suggest anything if the query term is in your products database. avoid doing work examining potentially bad suggestions: maxEdits (example: 1), the majority of misspellings are only 1 distance away. So if you lower this from the default "2" to 1, its faster and more "lightweight" in the sense you get less a chance of getting a bad suggestion. minPrefix (example: 1), most misspellings don't occur in the first character. For the solr example, i set this to zero (the wiki has an example correcting "hell" with "dell"), but in practice I think 1 is a good value. Additionally this has a practical use for solr users: you need a rather "raw" (e.g. not stemmed) analyzed field for spellchecking, if you set this to 1 you can re-use your reverse-wildcard field for spellchecking too, and it will never suggest reversed terms. thresholdFrequency (example: 1% or 1): this plays the role of Solr's "HighFrequencyDictionary". In other words, you could set this to 1 to never suggest words that only appear in a single document... in many cases these are misspellings. maxInspections (example: 5), the existing spellchecker uses a hardcoded 10 here. A lower value can work well here, since the algorithm used to draw candidates is actually levenshtein. However, I set the default to 5 (instead of 1), because its good to gather a few candidates for docFreq-sorting.... but if you increase thresholdFrequency you can probably lower this. Equally, is this spellchecker a conceptual drop in replacement? By that I mean, are the suggestions it generates radically different to the separate index spellcheckers or are they along the same lines? I think they are better, e.g. if you are ranking by an edit-distance like function such as Levenshtein or Jaro-Winkler, it makes more sense to get your candidates via the same or similar algorithm! The existing spellchecker gets candidates with n-grams... I think this causes a mismatch... (Of course the inverse is true, if you use NGramDistance, use the existing spellchecker!) Again I did a lot of testing with various corpora, and I'm not a spellchecking expert but i didn't get particularly good results from the existing spellchecker. And for some corpora such as geonames, it didnt seem to have the configurability I needed to tune the spellchecker to that domain. For example, i queried on "zeeland" and the existing spellchecker returned freeland, leland, ireland, deland, and cleland as suggestions. Whats worse, is that it created a 240MB auxiliary index when my original index was only 300MB, and it took it 141 seconds to do this. The idea here isn't to solve the world's spellchecking problems, its mainly to get rid of the extra index. I think its trivial to set this one up to beat SpellChecker's suggestions, because I don't think SpellChecker's suggestions are very good.
        Hide
        Chris Male added a comment -

        Hi,

        Thanks for that. Covers my questions nicely.

        The idea here isn't to solve the world's spellchecking problems, its mainly to get rid of the extra index.

        Yes definitely. I was just checking that we weren't doing that at a cost of reasonable suggestions. But your argument makes clear sense.

        This really is a great feature.

        Show
        Chris Male added a comment - Hi, Thanks for that. Covers my questions nicely. The idea here isn't to solve the world's spellchecking problems, its mainly to get rid of the extra index. Yes definitely. I was just checking that we weren't doing that at a cost of reasonable suggestions. But your argument makes clear sense. This really is a great feature.
        Hide
        Robert Muir added a comment -

        Yes definitely. I was just checking that we weren't doing that at a cost of reasonable suggestions. But your argument makes clear sense.

        Well, aspell has some test data here: http://aspell.net/test/cur/batch0.tab
        I could index some wikipedia, and run both spellcheckers?

        Additionally I suppose it would be fair to run the correct answers from this set, and see the results across both spellcheckers as far as spell-correcting already correct words (and what they suggest if they do!)

        Show
        Robert Muir added a comment - Yes definitely. I was just checking that we weren't doing that at a cost of reasonable suggestions. But your argument makes clear sense. Well, aspell has some test data here: http://aspell.net/test/cur/batch0.tab I could index some wikipedia, and run both spellcheckers? Additionally I suppose it would be fair to run the correct answers from this set, and see the results across both spellcheckers as far as spell-correcting already correct words (and what they suggest if they do!)
        Hide
        Chris Male added a comment -

        That is a very good idea yes, but I don't think its necessary to do that before this is committed. We can do that afterwards, get an idea of where the spellcheckers are, and improve them through other issues if needs be.

        Show
        Chris Male added a comment - That is a very good idea yes, but I don't think its necessary to do that before this is committed. We can do that afterwards, get an idea of where the spellcheckers are, and improve them through other issues if needs be.
        Hide
        Robert Muir added a comment -

        That is a very good idea yes, but I don't think its necessary to do that before this is committed.

        Here's some very rough numbers from that batch0.tab, against the FIRE english corpus (sorry i'm still downloading wikipedia, its quite large!)
        Note, this is only relative, e.g. i dont even know if these terms all exist in that corpus.
        additionally, some contain punctuation etc, i only lowercased them for consistency.

        for reference, there are 547 incorrect/correct term pairs in this aspell spelling correction test.
        My corpus has ~150,000 docs, with 304,000 unique terms in the body field.
        for both spellcheckers I used all defaults, e.g. spellchecker.suggestSimilar(words[1].toLowerCase(), 1, reader, "body", true);

        impl Number correct[1] (out of 547) Number correct, inverted[2] (out of 547) Avg time in ms[3]
        SpellChecker 214 218 1.47ms
        DirectSpellChecker 242 303 4.53ms

        1. using the misspelling as a query term, does the spellchecker return the correct spelling?
        2. using the correct spelling as a query term, does the spellchecker return nothing at all?
        3. this is the average time to perform an actual correction, both spellcheckers have some way to do no work at all for the common (correctly spelled) case.

        So although the benchmark itself isnt for search engine benchmarking (e.g. contains stopwords/punctuation), this basically shows what I've been seeing, that I think this spellchecker outperforms the existing one, and the perf cost is reasonable.

        Show
        Robert Muir added a comment - That is a very good idea yes, but I don't think its necessary to do that before this is committed. Here's some very rough numbers from that batch0.tab, against the FIRE english corpus (sorry i'm still downloading wikipedia, its quite large!) Note, this is only relative, e.g. i dont even know if these terms all exist in that corpus. additionally, some contain punctuation etc, i only lowercased them for consistency. for reference, there are 547 incorrect/correct term pairs in this aspell spelling correction test. My corpus has ~150,000 docs, with 304,000 unique terms in the body field. for both spellcheckers I used all defaults, e.g. spellchecker.suggestSimilar(words [1] .toLowerCase(), 1, reader, "body", true); impl Number correct [1] (out of 547) Number correct, inverted [2] (out of 547) Avg time in ms [3] SpellChecker 214 218 1.47ms DirectSpellChecker 242 303 4.53ms 1. using the misspelling as a query term, does the spellchecker return the correct spelling? 2. using the correct spelling as a query term, does the spellchecker return nothing at all? 3. this is the average time to perform an actual correction, both spellcheckers have some way to do no work at all for the common (correctly spelled) case. So although the benchmark itself isnt for search engine benchmarking (e.g. contains stopwords/punctuation), this basically shows what I've been seeing, that I think this spellchecker outperforms the existing one, and the perf cost is reasonable.
        Hide
        Robert Muir added a comment -

        By the way, out of curiousity i tested an alternative configuration, DirectSpellChecker with .setMaxEdits(1)

        With this "lighter" configuration:

        impl Number correct (out of 547) Number correct, inverted (out of 547) Avg time in ms
        DirectSpellChecker(n=1) 165 432 1.83ms

        So here, you have the flexibility to have essentially the same performance as the existing spellchecker,
        and the false positive rate is hugely reduced (in this contrived test). You trade off only being able to
        catch 77% of the suggestions relative to the old spellchecker... but this might be good for setups
        that feel the n=2 default is too aggressive.

        And again, like the original configuration, you have no index to rebuild at all.

        Show
        Robert Muir added a comment - By the way, out of curiousity i tested an alternative configuration, DirectSpellChecker with .setMaxEdits(1) With this "lighter" configuration: impl Number correct (out of 547) Number correct, inverted (out of 547) Avg time in ms DirectSpellChecker(n=1) 165 432 1.83ms So here, you have the flexibility to have essentially the same performance as the existing spellchecker, and the false positive rate is hugely reduced (in this contrived test). You trade off only being able to catch 77% of the suggestions relative to the old spellchecker... but this might be good for setups that feel the n=2 default is too aggressive. And again, like the original configuration, you have no index to rebuild at all.
        Hide
        Chris Male added a comment -

        They're both very fast and you get the flexibility of not having an additional index. +1 to committing.

        Show
        Chris Male added a comment - They're both very fast and you get the flexibility of not having an additional index. +1 to committing.
        Hide
        Michael McCandless added a comment -

        This is an awesome step forward!

        It requires no parallel index, and, it gets better accuracy (if your metric is edit distance like) at a negligible perf hit.

        It's great that it leverages the absurd speedups we've made to FuzzyQuery in 4.0.

        Show
        Michael McCandless added a comment - This is an awesome step forward! It requires no parallel index, and, it gets better accuracy (if your metric is edit distance like) at a negligible perf hit. It's great that it leverages the absurd speedups we've made to FuzzyQuery in 4.0.
        Hide
        Robert Muir added a comment -

        I'll work on cleaning up tests and doc, i think we can then commit this with the functionality it has.

        It's great that it leverages the absurd speedups we've made to FuzzyQuery in 4.0.

        Yes, if you read that scary fuzzy paper it seems thats its original use-case all along (we just did FuzzyQuery first, and re-used it here):
        http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652

        Along the same lines, I think we can then later improve both spellcheckers in easy ways. For example,
        it would be good to add a concept of "SpellCheckFilter" that can return true/false if a word is correctly spelled.

        Docfreq-based stuff helps, but if you know the language, something like hunspell could go a long way here
        to both preventing either spellchecker from trying to correct an already-correctly-spelled word or preventing
        it from suggesting misspellings.

        Show
        Robert Muir added a comment - I'll work on cleaning up tests and doc, i think we can then commit this with the functionality it has. It's great that it leverages the absurd speedups we've made to FuzzyQuery in 4.0. Yes, if you read that scary fuzzy paper it seems thats its original use-case all along (we just did FuzzyQuery first, and re-used it here): http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652 Along the same lines, I think we can then later improve both spellcheckers in easy ways. For example, it would be good to add a concept of "SpellCheckFilter" that can return true/false if a word is correctly spelled. Docfreq-based stuff helps, but if you know the language, something like hunspell could go a long way here to both preventing either spellchecker from trying to correct an already-correctly-spelled word or preventing it from suggesting misspellings.
        Hide
        Robert Muir added a comment -

        here's the improved docs and tests.

        I'd like to commit this one and we can iterate as discussed, hopefully improve both spellcheckers.

        Show
        Robert Muir added a comment - here's the improved docs and tests. I'd like to commit this one and we can iterate as discussed, hopefully improve both spellcheckers.
        Hide
        Robert Muir added a comment -

        Committed revision 1003642.

        Show
        Robert Muir added a comment - Committed revision 1003642.

          People

          • Assignee:
            Robert Muir
            Reporter:
            Robert Muir
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development