SOLR-606

spellcheck.collate doesn't handle multiple tokens properly

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.3
    • Fix Version/s: 1.3
    • Component/s: spellchecker
    • Labels:
      None
    • Environment:

      tomcat

      Description

      originally posted as part of SOLR-572:

      https://issues.apache.org/jira/browse/SOLR-572?focusedCommentId=12608487#action_12608487

      the new spellcheck.collate feature seems to exhibit some strange behaviors when handed a query with multiple tokens.

      {
       "responseHeader":{
        "params":{
      	"q":"redbull air show"}},
        "spellcheck":{
         "suggestions":[
      	"redbull",[
      	 "suggestion",["redbelly"]],
      	"show",[
      	 "suggestion",["shot"]],
      	"collation","redbelly airshotw"]}}
      

      in this case, note the fields are incorrectly concatenated (no space between tokens, left over 'w' from input string)

      {
       "responseHeader":{
        "params":{
      	"q":"redbull air show",
      	"spellcheck.q":"redbull air show"}},
       "spellcheck":{
        "suggestions":[
      	"redbull air show",[
      	 "suggestion",["redbull singers"]],
      	"collation","redbull singersredbull air show"]}}
      

      this is slightly different - the suggestions are still concatenated without a space, but the collation is way off.

      --Geoff

      1. handler.component.SpellCheckComponent-collate-patch.txt
        1 kB
        Stefan Oestreicher
      2. SOLR-606.patch
        4 kB
        Grant Ingersoll

        Activity

        Grant Ingersoll added a comment -

        Can you try this patch and post the results? It doesn't fix the problem, but I'm having a hard time reproducing it and it adds some more output to the spellcheck.extendedResults=true option.

        Thus, you will need to add extendedResults to your flags.

        Grant Ingersoll added a comment -

        Also, can you post your spell check configuration?

        Geoffrey Young added a comment -

        I'm not in charge of any of the environments, so it might take me some time to apply the patch. hopefully I'll be able to report back tomorrow.

        if it matters, my spelling field is defined like so:

        <fieldType name="spell" class="solr.TextField" positionIncrementGap="100">
          <analyzer type="index">
            <tokenizer class="solr.KeywordTokenizerFactory"/>
            <filter class="solr.LowerCaseFilterFactory"/>
          </analyzer>
          <analyzer type="query">
            <tokenizer class="solr.KeywordTokenizerFactory"/>
            <filter class="solr.LowerCaseFilterFactory"/>
          </analyzer>
        </fieldType>

        my spellcheck component configuration was straight from the docs, save changing the queryAnalyzerFieldType to match the above.

        Grant Ingersoll added a comment -

        Hmmm, I suspect the issue is in the type of tokens created. Let me try that out.

        Geoffrey Young added a comment -

        results with your patch applied:

        {
         "responseHeader":{
          "status":0,
          "QTime":24283},
         "command":"build",
         "response":{"numFound":0,"start":0,"docs":[]
         },
         "spellcheck":{
          "suggestions":[
        	"queryConversion",[
        	 "token",[
        	  "text","redbull",
        	  "start",0,
        	  "end",7],
        	 "token",[
        	  "text","air",
        	  "start",8,
        	  "end",11],
        	 "token",[
        	  "text","show",
        	  "start",12,
        	  "end",16]],
        	"redbull",[
        	 "numFound",1,
        	 "startOffset",0,
        	 "endOffset",7,
        	 "origFreq",0,
        	 "suggestion",{
        	  "frequency":1,
        	  "word":"redbelly"}],
        	"show",[
        	 "numFound",1,
        	 "startOffset",12,
        	 "endOffset",16,
        	 "origFreq",0,
        	 "suggestion",{
        	  "frequency":1,
        	  "word":"shot"}],
        	"correctlySpelled",false,
        	"collation","redbelly airshotw"]}}
        

        and with spellcheck.q defined it's

        {
         "responseHeader":{
          "status":0,
          "QTime":20,
          "params":{
        	"echoParams":"all",
        	"indent":"on",
        	"echoParams":"all",
        	"indent":"on",
        	"spellcheck.extendedResults":"true",
        	"q":"redbull air show",
        	"spellcheck.q":"redbull air show",
        	"spellcheck.collate":"true",
        	"spellcheck":"true",
        	"wt":"json"}},
         "response":{"numFound":0,"start":0,"docs":[]
         },
         "spellcheck":{
          "suggestions":[
        	"queryConversion",[
        	 "token",[
        	  "text","redbull air show",
        	  "start",0,
        	  "end",0]],
        	"redbull air show",[
        	 "numFound",1,
        	 "startOffset",0,
        	 "endOffset",0,
        	 "origFreq",0,
        	 "suggestion",{
        	  "frequency":1,
        	  "word":"redbull singers"}],
        	"correctlySpelled",false,
        	"collation","redbull singersredbull air show"]}}
        
        Grant Ingersoll added a comment -

        Hi Geoff,

        Can you comment on the use of the KeywordTokenizer for spelling? I'm not saying it's not a bug, but my guess is it is why I'm not seeing the issue w/ my setup. http://wiki.apache.org/solr/SpellCheckerRequestHandler has some recommendations on setup of the spell field that are still applicable.

        I'll try to figure something out for KeywordTokenizer at some point this week or next.

        Geoffrey Young added a comment -

        sure

        the choice of keywords is intentional. I don't want word suggestions but rather phrase suggestions.

        I'm searching almost exclusively over proper names - band names ("celine dion"), event names ("wicked: a new musical"), venue names ("staples center"), etc.

        in my case, it does me zero good to suggest a phrase that doesn't exist, even if the word parts do exist independently in my data.

        for example...

        • "hannah montana" is an "artist"
        • a user mis-types "hanna montanna"
        • spellchecker thinks "hanna" is spelled correctly (based on the presence of "Jake Hanna" among other artists), and suggests "montana" (based on "Montana Rangers", etc)
        • spellchecker gives me "hanna montana" as a suggestion... which then also misses since it doesn't exist (and the stemmer doesn't seem to catch the trailing 'h', but even if it did, there are other examples I can give)

        not surprisingly, using keywords instead of raw tokens for the dictionary gives me back only "things" that have exact matches, like "hannah montana", or "aerosmith" for "arrow smith", "boston red sox" for "boston red socks", etc.

        I know I'm not doing what most people are interested in, but it's very important for us to match phrases instead of raw words due to the crazy kinds of ways bands name themselves.

        fwiw, I found this bug as I was playing around with the new component - for the reasons mentioned above I'm not at all interested in the collation feature, so I don't consider this a priority for me. others may stumble upon it, though, which is why I reported it.

        HTH, and thanks for working out the spelling component in general - it's most excellent.

        Stefan Oestreicher added a comment (edited) -

        I recently ran into this exact issue and I found the problem.
        The collation is created by replacing the misspelled tokens with the suggestions using a StringBuilder:

        for (Iterator<Map.Entry<Token, String>> bestIter = best.entrySet().iterator(); bestIter.hasNext();) {
                Map.Entry<Token, String> entry = bestIter.next();
                Token tok = entry.getKey();
                collation.replace(tok.startOffset(), tok.endOffset(), entry.getValue());
        }
        

        As you can see, it's just replacing the relevant tokens in the original query. However, if the length of a suggestion doesn't equal the length of the original token, all offsets used after that replacement are no longer valid, so later replacements can land in the wrong place and yield incorrect results.
        I fixed that by keeping track of the accumulated length difference and adding it to the token offsets. For this to work I had to change the HashMap to a LinkedHashMap, since the fix depends on the Tokens being iterated in the order in which they occur in the string.
        I attached a patch reflecting those changes: handler.component.SpellCheckComponent-collate-patch.txt
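
The offset-tracking idea described above can be sketched roughly as follows. This is a minimal illustration, not the attached patch: `CollationSketch`, `Span`, `collate`, and `collateUnadjusted` are hypothetical names, and `Span` is an assumed stand-in for Lucene's `Token` that carries only the start and end offsets. It also assumes the spans do not overlap and are inserted in order of occurrence.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class CollationSketch {
    /** Hypothetical stand-in for Lucene's Token: only the offsets matter here. */
    static final class Span {
        final int start;
        final int end;
        Span(int start, int end) { this.start = start; this.end = end; }
    }

    /** Mirrors the loop quoted above: offsets are used as-is after each replacement. */
    static String collateUnadjusted(String query, LinkedHashMap<Span, String> best) {
        StringBuilder collation = new StringBuilder(query);
        for (Map.Entry<Span, String> entry : best.entrySet()) {
            Span tok = entry.getKey();
            collation.replace(tok.start, tok.end, entry.getValue());
        }
        return collation.toString();
    }

    /** Shifts each span by the length difference accumulated from earlier replacements. */
    static String collate(String query, LinkedHashMap<Span, String> best) {
        StringBuilder collation = new StringBuilder(query);
        int offset = 0; // total growth or shrinkage of the string so far
        for (Map.Entry<Span, String> entry : best.entrySet()) {
            Span tok = entry.getKey();
            String suggestion = entry.getValue();
            collation.replace(tok.start + offset, tok.end + offset, suggestion);
            offset += suggestion.length() - (tok.end - tok.start);
        }
        return collation.toString();
    }

    public static void main(String[] args) {
        // Offsets taken from the extendedResults output earlier in this issue.
        LinkedHashMap<Span, String> best = new LinkedHashMap<>();
        best.put(new Span(0, 7), "redbelly");  // "redbull" -> "redbelly"
        best.put(new Span(12, 16), "shot");    // "show" -> "shot"

        System.out.println(collateUnadjusted("redbull air show", best)); // redbelly airshotw
        System.out.println(collate("redbull air show", best));           // redbelly air shot
    }
}
```

With the offsets from the report, the unadjusted loop reproduces the "redbelly airshotw" collation, because "redbelly" is one character longer than "redbull" and the second replacement therefore starts one position too early.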

        Grant Ingersoll added a comment -

        Committed revision 685983. Also added a unit test that fails without the patch.


          People

          • Assignee:
            Grant Ingersoll
            Reporter:
            Geoffrey Young
          • Votes:
            1
            Watchers:
            3
