Solr
  1. Solr
  2. SOLR-2585

Context-Sensitive Spelling Suggestions & Collations

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: 4.0-ALPHA
    • Component/s: spellchecker
    • Labels:
      None

      Description

      Solr currently cannot offer what I'm calling here a "context-sensitive" spelling suggestion. That is, if a user enters one or more words that have docFrequency > 0, but nevertheless are misspelled, then no suggestions are offered. Currently, Solr will always consider a word "correctly spelled" if it is in the index and/or dictionary, regardless of context. This issue & patch add support for context-sensitive spelling suggestions.

      See SpellCheckCollatorTest.testContextSensitiveCollate() for a the typical use case for this functionality. This tests both using IndexBasedSepllChecker and DirectSolrSpellChecker.

      Two new Spelling Parameters were added:

      • spellcheck.alternativeTermCount - The count of suggestions to return for each query term existing in the index and/or dictionary. Presumably, users will want fewer suggestions for words with docFrequency>0. Also setting this value turns "on" context-sensitive spell suggestions.
      • spellcheck.maxResultsForSuggest - The maximum number of hits the request can return in order to both generate spelling suggestions and set the "correctlySpelled" element to "false". For example, if this is set to 5 and the user's query returns 5 or fewer results, the spellchecker will report "correctlySpelled=false" and also offer suggestions (and collations if requested). Setting this greater than zero is useful for creating "did-you-mean" suggestions for queries that return a low number of hits.

      I have also included a test using shards. See additions to DistributedSpellCheckComponentTest.

      In Lucene, SpellChecker.java can already support this functionality (by passing a null IndexReader and field-name). The DirectSpellChecker, however, needs a minor enhancement. This gives the option to allow DirectSpellChecker to return suggestions for all query terms regardless of frequency.

      1. SOLR-2585.patch
        47 kB
        James Dyer
      2. SOLR-2585.patch
        48 kB
        James Dyer
      3. SOLR-2585.patch
        41 kB
        James Dyer
      4. SOLR-2585.patch
        42 kB
        James Dyer
      5. SOLR-2585.patch
        41 kB
        James Dyer
      6. SOLR-2585.patch
        41 kB
        James Dyer
      7. SOLR-2585.patch
        41 kB
        David Radunz
      8. SOLR-2585.patch
        39 kB
        James Dyer
      9. SOLR-2585.patch
        41 kB
        James Dyer
      10. SOLR-2585.patch
        41 kB
        James Dyer

        Issue Links

          Activity

          Hide
          James Dyer added a comment -

          Committed to Trunk r1341894. I will also add this to the wiki.

          Show
          James Dyer added a comment - Committed to Trunk r1341894. I will also add this to the wiki.
          Hide
          James Dyer added a comment -

          fixes a distributed bug. This might be ready to commit.

          Show
          James Dyer added a comment - fixes a distributed bug. This might be ready to commit.
          Hide
          James Dyer added a comment -

          Updated patch also fixes SOLR-3457 bug.

          Show
          James Dyer added a comment - Updated patch also fixes SOLR-3457 bug.
          Hide
          James Dyer added a comment -

          Here's an updated version of this, synced to trunk. I would like to commit this new feature soon, unless there is a feeling out there that it needs improvement first.

          Show
          James Dyer added a comment - Here's an updated version of this, synced to trunk. I would like to commit this new feature soon, unless there is a feeling out there that it needs improvement first.
          Hide
          David Radunz added a comment -

          Hey, This patch only solves integration into the current trunk. I had to modify 'solr/core/src/test/org/apache/solr/spelling/SpellCheckCollatorTest.java' in order to apply the previous latest patch. This was due to a new set of test cases involving 'hyphenated word'.

          Show
          David Radunz added a comment - Hey, This patch only solves integration into the current trunk. I had to modify 'solr/core/src/test/org/apache/solr/spelling/SpellCheckCollatorTest.java' in order to apply the previous latest patch. This was due to a new set of test cases involving 'hyphenated word'.
          Hide
          James Dyer added a comment -

          This version is sync'ed with trunk (particularly the refactorings from SOLR-2848). Also a few minor changes as previously suggested.

          Show
          James Dyer added a comment - This version is sync'ed with trunk (particularly the refactorings from SOLR-2848 ). Also a few minor changes as previously suggested.
          Hide
          James Dyer added a comment -

          Robert,

          I was thinking of maybe eventually submitting some refactorings as a follow-up to this issue. But if you want, we could do some things first then come back to this. Here were my initial thoughts, none of which are very well-though-out at this point...

          1. Maybe move "FileBasedSpellChecker" to Lucene for consistency (each spell checker in Solr refers to a Spell checker in Lucene). Also, this makes it available to Lucene users.

          2. Perhaps SpellingOptions could somehow be deleted.

          3. If the Lucene Spell Checkers all inherited a common interface and/or Abstract Class, all of the *SolrSpellChecker classes could probably be reduced to 1 class (or 1 parent class with just a few overrides here and there...) (I know you feel we're not ready for this, but we could annotate the Lucene parent (class and/or interface) like this for now "@lucene.internal - external users should use the appropriate subclass directly / @lucene.experimental - this [class|interface] may change or be removed in a future version").

          4. Clarify the code in SpellCheckComponent. Wasn't thinking about this now, but I do see where you're coming from, especially with the distributed code in "finishStage". I think there is some code duplication between "finishStage" (distributed) and "process" (non-dist / 1st stage dist) that can maybe be eliminated. Probably some good code comments would help de-mystify this too. Maybe rename a method or two for additional clarity.

          5. Now that you point out that "instanceof" check in "finishStage", we probably should write a test case with DirectSpellChecker in a distributed environment. Possibly a revamped (set of) *SolrSpellChecker class(es) could eliminate the need for such checks?

          6. I think SpellingParams should be for parameters the user can put in their query. I'm not sure you can do this with "accuracy". This one should probably be somewhere else as this is a SearchComponent config param, not a request param. Maybe there are others like this.

          Show
          James Dyer added a comment - Robert, I was thinking of maybe eventually submitting some refactorings as a follow-up to this issue. But if you want, we could do some things first then come back to this. Here were my initial thoughts, none of which are very well-though-out at this point... 1. Maybe move "FileBasedSpellChecker" to Lucene for consistency (each spell checker in Solr refers to a Spell checker in Lucene). Also, this makes it available to Lucene users. 2. Perhaps SpellingOptions could somehow be deleted. 3. If the Lucene Spell Checkers all inherited a common interface and/or Abstract Class, all of the *SolrSpellChecker classes could probably be reduced to 1 class (or 1 parent class with just a few overrides here and there...) (I know you feel we're not ready for this, but we could annotate the Lucene parent (class and/or interface) like this for now "@lucene.internal - external users should use the appropriate subclass directly / @lucene.experimental - this [class|interface] may change or be removed in a future version"). 4. Clarify the code in SpellCheckComponent. Wasn't thinking about this now, but I do see where you're coming from, especially with the distributed code in "finishStage". I think there is some code duplication between "finishStage" (distributed) and "process" (non-dist / 1st stage dist) that can maybe be eliminated. Probably some good code comments would help de-mystify this too. Maybe rename a method or two for additional clarity. 5. Now that you point out that "instanceof" check in "finishStage", we probably should write a test case with DirectSpellChecker in a distributed environment. Possibly a revamped (set of) *SolrSpellChecker class(es) could eliminate the need for such checks? 6. I think SpellingParams should be for parameters the user can put in their query. I'm not sure you can do this with "accuracy". This one should probably be somewhere else as this is a SearchComponent config param, not a request param. Maybe there are others like this.
          Hide
          Robert Muir added a comment -

          Here is what i'm thinking with this discussion, I think we want to make it as easy as possible to add more spellchecker impls to Solr:

          I think the SpellCheckComponent has a lot of bad coupling and configuration seems to be all over the place (from finishStage):

              int numSug = Math.max(count, AbstractLuceneSpellChecker.DEFAULT_SUGGESTION_COUNT);
              SolrSpellChecker checker = getSpellChecker(rb.req.getParams());
              if (checker instanceof AbstractLuceneSpellChecker) {
                AbstractLuceneSpellChecker spellChecker = (AbstractLuceneSpellChecker) checker;
                min = spellChecker.getAccuracy();
                sd = spellChecker.getStringDistance();
              }
          

          Does spellcheckcomponent work correctly with DirectSpellChecker? its hard to tell when I look at this... at the same time even splitting up this enormous method into several private methods (e.g. addCollations or something) wouldn't remove any flexibility, and would make it a lot easier to see what is going on.

          Another idea: maybe the new SuggestMode should instead be explicitly passed to the implementations instead of 'inferred' from a bunch of other parameters?

          I think ultimately we want to look at the spellchecker impls as simple factories, e.g. in case we want to add Hunspell or something like that.

          At the same time I said before, I don't think we should make a single base class for spellcheckers and solve the problem that way, but maybe a good step is just 'rote refactoring' of the code to try to clean it up a bit.

          Show
          Robert Muir added a comment - Here is what i'm thinking with this discussion, I think we want to make it as easy as possible to add more spellchecker impls to Solr: I think the SpellCheckComponent has a lot of bad coupling and configuration seems to be all over the place (from finishStage): int numSug = Math.max(count, AbstractLuceneSpellChecker.DEFAULT_SUGGESTION_COUNT); SolrSpellChecker checker = getSpellChecker(rb.req.getParams()); if (checker instanceof AbstractLuceneSpellChecker) { AbstractLuceneSpellChecker spellChecker = (AbstractLuceneSpellChecker) checker; min = spellChecker.getAccuracy(); sd = spellChecker.getStringDistance(); } Does spellcheckcomponent work correctly with DirectSpellChecker? its hard to tell when I look at this... at the same time even splitting up this enormous method into several private methods (e.g. addCollations or something) wouldn't remove any flexibility, and would make it a lot easier to see what is going on. Another idea: maybe the new SuggestMode should instead be explicitly passed to the implementations instead of 'inferred' from a bunch of other parameters? I think ultimately we want to look at the spellchecker impls as simple factories, e.g. in case we want to add Hunspell or something like that. At the same time I said before, I don't think we should make a single base class for spellcheckers and solve the problem that way, but maybe a good step is just 'rote refactoring' of the code to try to clean it up a bit.
          Hide
          Robert Muir added a comment -

          Just glancing over the patch, this adds a lot of logic to the specific implementations (e.g. DirectSpellChecker, AbstractLuceneSpellChecker), that doesn't really seem to be implementation-specific.

          Separately, (not your fault, looks like a pre-existing condition but its visible trying to review the patch), it seems like SpellCheckComponent is in need of serious refactoring, the long pages of code make it almost impossible to review changes like this.

          Do you have any ideas on any refactorings we can do (maybe we could spin off issues?) to try to resolve some of this? It seems like we want to try to improve the spellchecking APIs to make it easier to add features like this, though I am glad you are somehow managing with what exists

          Show
          Robert Muir added a comment - Just glancing over the patch, this adds a lot of logic to the specific implementations (e.g. DirectSpellChecker, AbstractLuceneSpellChecker), that doesn't really seem to be implementation-specific. Separately, (not your fault, looks like a pre-existing condition but its visible trying to review the patch), it seems like SpellCheckComponent is in need of serious refactoring, the long pages of code make it almost impossible to review changes like this. Do you have any ideas on any refactorings we can do (maybe we could spin off issues?) to try to resolve some of this? It seems like we want to try to improve the spellchecking APIs to make it easier to add features like this, though I am glad you are somehow managing with what exists
          Hide
          James Dyer added a comment -

          Sync'ed to Trunk. Now that LUCENE-3436 is committed, this is the only patch needed to use/evaluate/test all the functionality described here.

          Show
          James Dyer added a comment - Sync'ed to Trunk. Now that LUCENE-3436 is committed, this is the only patch needed to use/evaluate/test all the functionality described here.
          Hide
          James Dyer added a comment -

          fixed code formatting

          Show
          James Dyer added a comment - fixed code formatting
          Hide
          James Dyer added a comment -

          As previously discussed, the Lucene portion of this issue has been spun off to LUCENE-3436.

          This is a new patch with just the Solr piece. Also, the new "Suggest Mode" enum is used both for the Original Lucene Spell Checker and DirectSpellChecker.

          Show
          James Dyer added a comment - As previously discussed, the Lucene portion of this issue has been spun off to LUCENE-3436 . This is a new patch with just the Solr piece. Also, the new "Suggest Mode" enum is used both for the Original Lucene Spell Checker and DirectSpellChecker.
          Hide
          Robert Muir added a comment -

          Do you agree with this division? I could split #2 off into a separate "LUCENE-" issue, and this issue can be about #1. (#3 solves itself when then other 2 are worked out)

          +1, Lets open an issue for the suggest modes.

          I was actually thinking that creating a common interface (or abstract class) the spellcheckers all could implement (or extend) would be a nice follow-up to this

          I'm not sure about this... when writing directspellchecker i felt pretty limited by the existing Lucene spellchecker APIs. In my opinion its as if they were already prematurely refactored, if you want to use the APIs you are then boxed in by their constraints... but our spellchecking really needs a lot of help and maturity first.

          Show
          Robert Muir added a comment - Do you agree with this division? I could split #2 off into a separate "LUCENE-" issue, and this issue can be about #1. (#3 solves itself when then other 2 are worked out) +1, Lets open an issue for the suggest modes. I was actually thinking that creating a common interface (or abstract class) the spellcheckers all could implement (or extend) would be a nice follow-up to this I'm not sure about this... when writing directspellchecker i felt pretty limited by the existing Lucene spellchecker APIs. In my opinion its as if they were already prematurely refactored, if you want to use the APIs you are then boxed in by their constraints... but our spellchecking really needs a lot of help and maturity first.
          Hide
          James Dyer added a comment -

          Robert,

          Thank you for your comments. You're right that "onlyMorePopular" will return suggestions for terms in the index. This issue is all about coming up with an alternative to the "onlyMorePopular" feature, which I feel is flawed (for what I've tried to use it for). There are three related problems I'm trying to solve here:

          1. If you specify "onlyMorePopular=true" in Solr, any Collations it tries to build will exclude all of the user's original terms generating suggestions. In other words, it assumes that any words with more-popular alternatives must be misspelled. This makes the "onlyMorePopular" option less-than-useful when used with Solr. For example, in my index of book titles, we have a few titles with the words "mist" and "life" and a whole bunch with "most" and "life". But if I fat-finger my query and search for (mist AND lifr), "onlyMorePopular=false" returns no collations while "onlyMorePopular=true" tries correcting all my words and suggests (most AND life). It would be better if we could get either or both of these collations returned. What is even more maddening is that sometimes the collation possibilities with "onlyMorePopular" all return 0 hits because it is throwing out the valid terms in the user's query in favor of "more popular" terms.

          2. Sometimes a less-frequent correction is desirable. If a Lucene user is trying to find the book titled "..in the midst of divorce" but queries on "mist divorce", with my data we won't get "midst" as a suggestion at all even when using "onlyMorePopular". The problem, of course, is that "mist" occurs with a higher frequency than "midst".

          3. When doing dismax queries, it would be nice to have a "master" dictionary that contains a conglomeration of all the terms in all the fields that dismax is set to search across. But when the spellchecker limits itself to terms that either are not in the dictonary or to terms that are more-popular, it sometimes misses the terms it needs to get the appropriate correction.

          I am all in favor in breaking this up into 2 issues as #1 is a solr-only problem and #2 involves Lucene also. (#3 would be solved if we did #1 & #2.) I also fully agree that if creating 3 "suggest modes" then it should be for all the spellcheckers, not just for DirectSpellChecker. I was actually thinking that creating a common interface (or abstract class) the spellcheckers all could implement (or extend) would be a nice follow-up to this, should this ever be committed. (Its difficult for me to maintain multiple uncommitted patches with dependencies on each other so I try to do these sorts of things one issue at a time...)

          Do you agree with this division? I could split #2 off into a separate "LUCENE-" issue, and this issue can be about #1. (#3 solves itself when then other 2 are worked out)

          Show
          James Dyer added a comment - Robert, Thank you for your comments. You're right that "onlyMorePopular" will return suggestions for terms in the index. This issue is all about coming up with an alternative to the "onlyMorePopular" feature, which I feel is flawed (for what I've tried to use it for). There are three related problems I'm trying to solve here: 1. If you specify "onlyMorePopular=true" in Solr, any Collations it tries to build will exclude all of the user's original terms generating suggestions. In other words, it assumes that any words with more-popular alternatives must be misspelled. This makes the "onlyMorePopular" option less-than-useful when used with Solr. For example, in my index of book titles, we have a few titles with the words "mist" and "life" and a whole bunch with "most" and "life". But if I fat-finger my query and search for (mist AND lifr), "onlyMorePopular=false" returns no collations while "onlyMorePopular=true" tries correcting all my words and suggests (most AND life). It would be better if we could get either or both of these collations returned. What is even more maddening is that sometimes the collation possibilities with "onlyMorePopular" all return 0 hits because it is throwing out the valid terms in the user's query in favor of "more popular" terms. 2. Sometimes a less-frequent correction is desirable. If a Lucene user is trying to find the book titled "..in the midst of divorce" but queries on "mist divorce", with my data we won't get "midst" as a suggestion at all even when using "onlyMorePopular". The problem, of course, is that "mist" occurs with a higher frequency than "midst". 3. When doing dismax queries, it would be nice to have a "master" dictionary that contains a conglomeration of all the terms in all the fields that dismax is set to search across. But when the spellchecker limits itself to terms that either are not in the dictonary or to terms that are more-popular, it sometimes misses the terms it needs to get the appropriate correction. I am all in favor in breaking this up into 2 issues as #1 is a solr-only problem and #2 involves Lucene also. (#3 would be solved if we did #1 & #2.) I also fully agree that if creating 3 "suggest modes" then it should be for all the spellcheckers, not just for DirectSpellChecker. I was actually thinking that creating a common interface (or abstract class) the spellcheckers all could implement (or extend) would be a nice follow-up to this, should this ever be committed. (Its difficult for me to maintain multiple uncommitted patches with dependencies on each other so I try to do these sorts of things one issue at a time...) Do you agree with this division? I could split #2 off into a separate "LUCENE-" issue, and this issue can be about #1. (#3 solves itself when then other 2 are worked out)
          Hide
          Robert Muir added a comment -

          That is, if a user enters one or more words that have docFrequency > 0, but nevertheless are misspelled, then no suggestions are offered.

          Well this isn't exactly true: it depends on how you set 'onlyMorePopular'.

          This change allows us to offer any term within the specified maximum edit distance as a Suggestion, regardless of the original term's frequency in the inde

          Right, I think I see what you are asking for here: you want suggestions for the term always regardless of whether they are more popular than the original term or not? I'm just having trouble understanding this, because in general suggesting terms with a lower docfreq than the original term doesn't sound like it would work well...

          But, if we want to do it, can we split this issue up? there are a lot of changes mixed in the patch:

          • the directspellchecker behaves the same way as the ordinary spellchecker here, with this logic:
                  if (!morePopular && freq > 0) {
                    return new String[] { word };
                  }
            
          • if we want to add a third "mode", I think we should add this to the ordinary spellchecker too, instead of bypassing this logic by passing a null reader to it, I think this isn't very intuitive!
          Show
          Robert Muir added a comment - That is, if a user enters one or more words that have docFrequency > 0, but nevertheless are misspelled, then no suggestions are offered. Well this isn't exactly true: it depends on how you set 'onlyMorePopular'. This change allows us to offer any term within the specified maximum edit distance as a Suggestion, regardless of the original term's frequency in the inde Right, I think I see what you are asking for here: you want suggestions for the term always regardless of whether they are more popular than the original term or not? I'm just having trouble understanding this, because in general suggesting terms with a lower docfreq than the original term doesn't sound like it would work well... But, if we want to do it, can we split this issue up? there are a lot of changes mixed in the patch: the directspellchecker behaves the same way as the ordinary spellchecker here, with this logic: if (!morePopular && freq > 0) { return new String[] { word }; } if we want to add a third "mode", I think we should add this to the ordinary spellchecker too, instead of bypassing this logic by passing a null reader to it, I think this isn't very intuitive!
          Hide
          James Dyer added a comment -

          As a clarification, the necessary change I mention to DirectSpellChecker is included in this patch (this question came up recently on the user-list).

          This change allows us to offer any term within the specified maximum edit distance as a Suggestion, regardless of the original term's frequency in the index.

          Show
          James Dyer added a comment - As a clarification, the necessary change I mention to DirectSpellChecker is included in this patch (this question came up recently on the user-list). This change allows us to offer any term within the specified maximum edit distance as a Suggestion, regardless of the original term's frequency in the index.
          Hide
          James Dyer added a comment -

          sync with current Trunk.

          Show
          James Dyer added a comment - sync with current Trunk.

            People

            • Assignee:
              James Dyer
              Reporter:
              James Dyer
            • Votes:
              2 Vote for this issue
              Watchers:
              11 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development