Solr / SOLR-2993

Integrate WordBreakSpellChecker with Solr

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: 4.0-ALPHA, 5.0
    • Component/s: spellchecker
    • Labels:
      None

      Description

      A SpellCheckComponent enhancement, leveraging the WordBreakSpellChecker from LUCENE-3523:

      • Detect spelling errors resulting from misplaced whitespace without the use of shingle-based dictionaries.
      • Seamlessly integrate word-break suggestions with single-word spelling corrections from the existing FileBased-, IndexBased- or Direct- spell checkers.
      • Provide collation support for word-break errors including cases where the user has a mix of single-word spelling errors and word-break errors in the same query.
      • Provide shard support.

      Attachments

      1. SOLR-2993-fixes.patch (19 kB, James Dyer)
      2. SOLR-2993.patch (100 kB, James Dyer)
      3. SOLR-2993.patch (89 kB, James Dyer)
      4. SOLR-2993.patch (83 kB, James Dyer)
      5. SOLR-2993.patch (77 kB, James Dyer)

          Activity

          James Dyer added a comment -

          Patch adds features described in this issue. Users can create a Dictionary configuration in solrconfig.xml like this:

          <lst name="spellchecker">
           <str name="name">wordbreak</str>
           <str name="classname">solr.WordBreakSolrSpellChecker</str>      
           <str name="field">lowerfilt</str>
           <str name="combineWords">true</str>
           <str name="breakWords">true</str>
           <int name="maxChanges">10</int>
          </lst>
          

          Users can also specify multiple "spellcheck.dictionary" parameters. All specified dictionaries are consulted and their results are interleaved (this is handled by the new ConjunctionSolrSpellChecker). Collations are created with combinations from the different spellcheckers, with care taken that multiple overlapping corrections do not occur in the same collation.

          <requestHandler name="spellCheckWithWordbreak" class="org.apache.solr.handler.component.SearchHandler">
           <lst name="defaults">
            <str name="spellcheck.dictionary">default</str>
            <str name="spellcheck.dictionary">wordbreak</str>
            <str name="spellcheck.count">20</str>
           </lst>
           <arr name="last-components">
            <str>spellcheck</str>
           </arr>
          </requestHandler>
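
          Both dictionary names in the handler must match spellcheckers defined in the SpellCheckComponent. A minimal sketch of that component follows. It assumes DirectSolrSpellChecker as the "default" single-word checker and reuses the "lowerfilt" field for both dictionaries, neither of which is stated in this issue, so adjust both to your own schema:

```xml
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
 <!-- single-word corrections; the class here is an assumption for illustration -->
 <lst name="spellchecker">
  <str name="name">default</str>
  <str name="classname">solr.DirectSolrSpellChecker</str>
  <str name="field">lowerfilt</str>
 </lst>
 <!-- word-break corrections, same settings as the dictionary shown earlier -->
 <lst name="spellchecker">
  <str name="name">wordbreak</str>
  <str name="classname">solr.WordBreakSolrSpellChecker</str>
  <str name="field">lowerfilt</str>
  <str name="combineWords">true</str>
  <str name="breakWords">true</str>
  <int name="maxChanges">10</int>
 </lst>
</searchComponent>
```

          The component name "spellcheck" is what the handler's last-components entry refers to.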
          

          A future enhancement (outside the scope of this issue) would be to extend ConjunctionSolrSpellChecker to allow arbitrary dictionary combinations. For instance, a user might want to query two fields and have a separate dictionary consulted for each field. With this patch, however, ConjunctionSolrSpellChecker is intended only for combining word-break suggestions with single-word suggestions.

          James Dyer added a comment -

          Also included with the patch are several new unit tests, including one distributed/shard test scenario.

          Okke Klein added a comment -

          I'm having some trouble combining this patch with your other patch in https://issues.apache.org/jira/browse/SOLR-2585. Could you make a patch with both features if possible?

          James Dyer added a comment -

          Okke,

          Thanks for your interest. For now you may need to evaluate the features separately. Possibly you could vote for your favorite one. Should either issue get committed, I will sync the other issue to the updated state of Trunk. Then we can have both at the same time. If there isn't any movement on these 2 for a long time maybe I'd consider merging the patches but that seems like an unnecessary step. It would be nice if one of the first 4.x releases included both of these features...

          Okke Klein added a comment -

          If I am not mistaken the functionality from https://issues.apache.org/jira/browse/SOLR-2585 can also be achieved in DirectSolrSpellChecker with thresholdTokenFrequency parameter. So I patched trunk with this patch and the corresponding Lucene patch and did some experimenting.

          The misplaced whitespaces were fixed and proper suggestions were returned. However if both word parts resulted in suggestions, the collation made no sense.

          Hypothetical example:
          "spe llcheck" would give suggestions "spa" and "spellcheck" and collate this into "spa spellcheck"

          In my use case I never got any results back when one of the parts had a typo. So "spe llchek" would not give any suggestions.

          For my use case it would also be handy if "spell check" would result in the suggestion "spellcheck".

          Or is this already possible?

          James Dyer added a comment -

          Okke,

          Thanks for looking at this patch. Here are a few comments:

          if both word parts resulted in suggestions, the collation made no sense.

          This is a problem with collations in general: By default, it simply mashes the top corrections together, often resulting in nonsense. The solution is to set "spellcheck.maxCollationTries" to a non-zero value. Doing so will cause the spellchecker to vet the collation possibilities against the index, resulting in collations that are guaranteed to generate hits.
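
          As an illustration only, collation vetting could be switched on in the handler defaults like this. The parameter names are standard SpellCheckComponent request parameters. The values 10 and 3 are arbitrary and would need tuning:

```xml
<lst name="defaults">
 <str name="spellcheck.collate">true</str>
 <!-- non-zero: try up to this many candidate collations against the index -->
 <str name="spellcheck.maxCollationTries">10</str>
 <!-- return at most this many verified collations -->
 <str name="spellcheck.maxCollations">3</str>
</lst>
```

          Each try runs a query against the index, so larger values trade response time for a better chance of finding a collation that returns hits.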

          "spe llcheck" would give suggestions "spa" and "spellcheck" and collate this into "spa spellcheck"

          This is surprising to me and might indicate a bug. This patch is designed to carefully ensure that when building collations, the corrections do not overlap one another. For instance if "q=spe llcheck" and it gives corrections of "spe>spa" and "spe llcheck>spellcheck", it should not collate these to "q=spa spellcheck" because "spe" overlaps with "spe llcheck". So if you can describe in detail what you're indexing and querying (maybe paste the resulting xml), it would help me figure out what's going on. Better yet, if you can write a failing unit test and post a patch...

          I never got any results back when one of the parts had a typo. So "spe llchek" would not give any suggestions.

          This patch does not have the ability to first correct a word fragment and then combine it with another fragment to make a corrected word. Possibly this would be a good next step after what we've got here already gets worked out.

          it would also be handy if "spell check" would result in the suggestion "spellcheck". Or is this already possible?

          This is the core of what this issue (really LUCENE-3523) is all about, provided that "spellcheck" is in the dictionary&index you're using.

          Okke Klein added a comment -

          This is a problem with collations in general: By default, it simply mashes the top corrections together, often resulting in nonsense. The solution is to set "spellcheck.maxCollationTries" to a non-zero value. Doing so will cause the spellchecker to vet the collation possibilities against the index, resulting in collations that are guaranteed to generate hits.

          If wordbreak gives back a suggestion of a combined word, a suggestion with a word fragment with more hits is still ranked higher in the collation.

          So "spa llcheck" is preferred over "spellcheck" if "spa" has more hits than "spellcheck".

          it would also be handy if "spell check" would result in the suggestion "spellcheck". Or is this already possible?

          This is the core of what this issue (really LUCENE-3523) is all about, provided that "spellcheck" is in the dictionary&index you're using.

          I never got this working: no suggestions were given when both word fragments were spelled correctly and the combined word was in the index. (When I made a typo in the combined word, the correct word was returned as a suggestion.)

          James Dyer added a comment -

          So "spa llcheck" is preferred over "spellcheck" if spa has more hits than spellcheck.

          I honestly didn't try this much with queries having all optional terms. I see what you mean, though: you might prefer that it just leave the misspelled word in there if it's an optional term anyhow. But wouldn't the query, in addition to giving spelling suggestions, also return some results because it would ignore the optional and misspelled query terms? If that's the case, your app can look at the results you got back, compare them to the collation options, and determine what to do from there.

          no suggestions were given when both word fragments were spelled correctly

          As discussed in SOLR-2585, you can't get suggestions for terms that are in the index, unless you specify "spellcheck.onlyMorePopular=true". Of course "onlyMorePopular" can have its own unintended consequences. Hopefully someday in the not too distant future we'll be in a state where we can have both this issue and SOLR-2585 working together.
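
          For reference, the flag can be passed on the request or set as a handler default. A sketch of the latter, assuming the two dictionary names configured earlier in this issue:

```xml
<lst name="defaults">
 <str name="spellcheck.dictionary">default</str>
 <str name="spellcheck.dictionary">wordbreak</str>
 <!-- per the comment above: allows suggestions for terms that are already in the index -->
 <str name="spellcheck.onlyMorePopular">true</str>
</lst>
```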

          Okke Klein added a comment -

          I honestly didn't try this much with queries having all optional terms.

          Setting mm to 100% gave me the result I expected.
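
          For readers unfamiliar with "mm": it is the minimum-should-match parameter of the dismax-style query parsers. A sketch of setting it as a handler default, assuming edismax is the parser in use (the comment does not say which parser was used):

```xml
<lst name="defaults">
 <!-- assumption: a dismax-style parser, since "mm" only applies to those -->
 <str name="defType">edismax</str>
 <!-- 100%: every optional clause must match -->
 <str name="mm">100%</str>
</lst>
```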

          I'm confused:

          "This is the core of what this issue (really LUCENE-3523) is all about, provided that "spellcheck" is in the dictionary&index you're using".

          and then

          As discussed in SOLR-2585, you can't get suggestions for terms that are in the index, unless you specify "spellcheck.onlyMorePopular=true". Of course "onlyMorePopular" can have its own unintended consequences. Hopefully someday in the not too distant future we'll be in a state where we can have both this issue and SOLR-2585 working together.

          So should it be possible to get the suggestion "spellcheck" from "spell check", or not?

          Note: I do get suggestions for terms that are in the index.

          James Dyer added a comment -

          So should it be possible to get the suggestion "spellcheck" from "spell check", or not? Note: I do get suggestions for terms that are in the index.

          When combining words, it will require that at least one of the original terms be not in the index.

          So to use your example, WordBreakSpellChecker will combine "spell check" to "spellcheck" provided that:
          1. "spellcheck" is in the index.
          2. either:

          • "spell" is NOT in the index.
            OR
          • "check" is NOT in the index.
            OR
          • both "spell" and "check" are NOT in the index.

          But if both "spell" and "check" are in the index, then you won't get "spellcheck" as a suggestion. You can override this behavior if:
          1. You specify "onlyMorePopular". This works if "spellcheck" has a document frequency greater than or equal to the higher of the document frequencies of "spell" and "check".
          2. You apply SOLR-2585 (theoretically...not possible yet) and set "spellcheck.alternativeTermCount" greater than zero. This would tell it to generate alternative term suggestions for indexed terms.
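
          The second override could look like this once SOLR-2585 is available. This is a sketch only: the parameter name comes from the item above, and the value 5 is arbitrary:

```xml
<lst name="defaults">
 <str name="spellcheck.dictionary">default</str>
 <str name="spellcheck.dictionary">wordbreak</str>
 <!-- requires SOLR-2585; a value greater than zero requests alternatives for indexed terms -->
 <str name="spellcheck.alternativeTermCount">5</str>
</lst>
```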

          If this is not consistent with what you're experiencing then there is a possible bug in the WordBreakSpellChecker. In that case, please provide as many details as possible (or write a failing unit test) and I can look into it further.

          Okke Klein added a comment -

          Thanks for the explanation. I experimented with onlyMorePopular and it worked a few times. Unfortunately, as expected, it also showed unwanted behavior. So https://issues.apache.org/jira/browse/SOLR-2585 would be the next step to see if it provides the behavior I'm looking for.

          For the English language this feature might not be very important, but for languages like Dutch and German that have a lot of compounded words, a spellchecker that also combines word parts even if one of them has a typo (like Google does) would be extremely useful.

          Unfortunately I'm not a programmer, but I'll gladly test anything you throw at me

          James Dyer added a comment -

          Updated patch. Still some TODOs, but for the most part this works well.

          James Dyer added a comment -

          New patch. Clean things up, fix bugs, etc. This is getting close...

          James Dyer added a comment -

          Here is a new patch that can better handle collations involving mixed required/prohibited/optional terms and also boolean operators (AND/OR/NOT).

          When combining words, we do not want to combine an optional term with a prohibited one, etc. We also do not want to combine words that belong to different boolean clauses or those that were "NOT"ed to one another.

          Likewise, when splitting a term into multiples, we want to ensure all the resulting terms are required if the original one was required, etc. Also, if the query contains boolean operators (AND/OR/NOT), this version ANDs the split terms together.

          In the case of Boolean operators, SpellingQueryConverter can only make a guess as to the best action. It doesn't know the actual query parser used, the default "q.op" or "mm" setting, etc. All this does is make a reasonable guess as to the best way to re-write the query if corrections involved combining and/or splitting words.

          See WordBreakSpellCheckerTest#testCollate and SpellingQueryConverterTest#testRequiredOrProhibitedFlags for examples of how this works.

          Unless there are other issues, I plan to commit this in a few days.

          James Dyer added a comment -

          Committed...Trunk r1346058, branch_4x r1346069

          This commit includes updates to the Solr Example spellcheck config to use some of the newer SpellCheckComponent features, including this one.

          Yonik Seeley added a comment -

          Some of the spellcheck related tests just started failing:
          https://builds.apache.org/job/Lucene-Solr-tests-only-trunk/14566/
          Prob related to this issue?

          James Dyer added a comment -

          This was my mistake. I ran tests, then changed the Solr Example config, forgetting that some tests depend on the Example config. I committed a quick test fix that hopefully will stop the failures for now. But one of the failures might be an actual problem. I am looking into it now.

          James Dyer added a comment -

          Re-opening to figure out whether the "testSpellCheckResponse" failure that occurs with WordBreakSolrSpellChecker added is a valid failure. My original fix for this caused DistributedSpellCheckComponentTest to fail, so I'll need to investigate more thoroughly tomorrow. For now the offending tests are disabled. (Sorry for the stormy weather on Jenkins!)

          James Dyer added a comment -

          Here is a patch that re-activates the previously-failing tests and fixes all the problems. All tests pass and I checked the solr example also. Here's a summary of the problems:

          • TestSpellCheckResponse had a test bug: data wasn't being cleaned from the index between tests. The bug did not manifest until I made the Solr example changes.
          • Some asserts in TestSpellCheckResponse needed modifying to conform to changes in the Solr example (the test relies on the example config).
          • ConjunctionSolrSpellChecker was not preserving the original token document frequencies from the child spellcheckers. The bug wasn't being properly tested for before, but showed up once TestSpellCheckResponse was fixed.
          • WordBreakSolrSpellChecker was not generating original token document frequencies. The bug wasn't being properly tested for before, but showed up once TestSpellCheckResponse was fixed.

          I will commit shortly.

          James Dyer added a comment -

          Committed fixes...Trunk: r1346489, branch_4x: r1346499


            People

            • Assignee: James Dyer
            • Reporter: James Dyer
            • Votes: 1
            • Watchers: 6
