Solr / SOLR-395

Spell-check should return frequencies of word and suggestions

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.3
    • Fix Version/s: 1.3
    • Component/s: spellchecker
    • Labels:
      None

      Description

      When issuing a spell-check, the word being searched for might be present in the index with a very low frequency (e.g. a misspelling that made its way into the index). It might therefore be helpful if the client receives the frequency of the word plus the frequency of each of the suggestions.
      This feature should be optional (using a URL param).

      1. extended_results.diff
        43 kB
        Mike Krimerman
      2. returnFrequencies.patch
        3 kB
        Mike Krimerman

          Activity

          Mike Krimerman added a comment -

          Patch for returning frequencies for the word and its suggestions.
          Lucene's suggestions are sorted by distance first and frequency second (if applicable).

          The patch adds two fields:

          • a frequency field for the word
          • a list of frequencies (same length as the suggestion list).
          Mike Klaas added a comment -

          These two issues should probably be combined into one patch

          Mike Klaas added a comment -

          Might it be better to rename the fields "queryFreq"/"suggestionFreqs"? (Or something more distinct than "frequency" + "frequencies".)

          Scott Tabar added a comment -

          I will be making changes to SOLR-375 to display the frequency of the word being checked instead of the boolean exists. This should not be conditional on a parameter, but should be part of the default results, as the exists modification is currently implemented.

          It would not be a problem to incorporate these changes into SOLR-375, and for me to add additional unit tests to cover the frequency modifications.

          Mike (both), do you have any other suggestions to enhance the SpellCheckerRequestHandler?

          I have not run this code, but from reviewing the patch it appears that the frequency list is parallel to, and separate from, the suggestion list. This is great from the perspective of backwards compatibility, but would it make more sense to alter the suggestion list's data structure to tie each frequency more strongly to the word being suggested? Right now only the frequency is of interest, but if additional information can be provided, say the value of "distance", then there would be a logical place for it; otherwise we would have yet another "list" of "values". An organized data structure could be more conducive to using Java's "for each" or Prototype's "each" construct without needing to track index values into one array or the other. I realize this may be more a matter of stylistic preference, but now is the time to make a change if it is desired.

          One idea for integrating the frequency of the suggestion is to make the frequency an attribute on the <str> tag, such as <str frequency="1283">happy</str>. This may help with backwards compatibility, but there is not much support for attributes within Solr, so that could prevent its use.
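To make the trade-off above concrete, here is a minimal Python sketch contrasting the two shapes. The word "happy" and its frequency come from the comment above; the other words, their frequencies, and the key names "word" and "frequency" are illustrative assumptions, not taken from the patch.

```python
# Illustrative data only: "happy"/1283 is from the comment above,
# the remaining entries are invented for the sketch.
suggestions = ["happy", "harpy", "hippy"]
frequencies = [1283, 12, 47]

# Parallel lists: the caller has to keep both lists aligned by index.
paired = list(zip(suggestions, frequencies))

# Structured records: each suggestion carries its own attributes, so a
# later field such as "distance" would not need a third parallel list.
records = [{"word": word, "frequency": freq} for word, freq in paired]

for record in records:
    print(record["word"], record["frequency"])
```

With records, a client can iterate once over the suggestions without tracking an index into a second array.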

          Mike Krimerman added a comment -

          The separate list of frequencies is indeed for backwards compatibility. It seems preferable to do as you suggested and add a frequency for each suggestion if backwards compatibility is not an issue.
          If the distance can be added, it would be a nice addition. Lucene sorts the suggestion list by distance first and frequency second.

          Regarding the XML formatting, that would be a nice addition. However, I was under the impression that Solr uses only tag-elements (and not attributes) for responses. How would the frequency be returned if a JSON or Python response is requested?

          Another nice addition might be to have the handler decide on the most prominent suggestion; however, that might require some heuristics and not be generic.
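The ordering mentioned in this comment (distance first, frequency second) can be sketched in Python as a two-part sort key. This is not Lucene's code: the Suggestion type, the integer distance, and the assumption that a smaller distance means a closer match are all illustrative. The frequencies for hats, hans and haul are taken from the Python example later in this thread; the distances and the "house" entry are invented.

```python
from typing import NamedTuple


class Suggestion(NamedTuple):
    word: str
    distance: int   # assumed: smaller means closer to the query word
    frequency: int  # number of occurrences in the index


def order_suggestions(items):
    """Sort by distance ascending, then by frequency descending."""
    return sorted(items, key=lambda s: (s.distance, -s.frequency))


candidates = [
    Suggestion("house", 2, 50000),  # invented: frequent but farther away
    Suggestion("haul", 1, 3152),
    Suggestion("hats", 1, 6794),
    Suggestion("hans", 1, 5986),
]

ordered = [s.word for s in order_suggestions(candidates)]
print(ordered)
```

Note that under this ordering a very frequent but slightly more distant word still sorts after every closer word, which is why a client may want the frequencies returned as well.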

          Mike Klaas added a comment -

          If the extra data is only present when some parameter is present, backward compatibility is not affected.

          Mike Krimerman added a comment -

          The attached patch combines patches for issues 375, 395, 401 and some more:

          1. (375) Adds the exist property for a single-word spell-check: whether the word exists in the dictionary
          2. Adds the sp.query.onlyMorePopular option for returning only suggestions that are more popular than the query word(s)
          3. The sp.query.extendedResults option implies a multi-word query, and returns frequencies for each word in the query and for each suggestion.
          4. (401) A minimum threshold for adding words to the spell-check dictionary, expressed as a fraction (percent/100) of the documents in which the word must appear.
          5. Arguments are now prefixed with 'sp'; backwards compatibility is retained.
            1. sp.dictionary.indexDir - backwards compatible with spellcheckerIndexDir
            2. sp.dictionary.termSourceField - backwards compatible with termSourceField
            3. sp.dictionary.threshold - threshold for words to enter dictionary
            4. sp.query.suggestionCount - backwards compatible with suggestionCount
            5. sp.query.accuracy - backwards compatible with accuracy
            6. sp.query.onlyMorePopular - only more popular suggestions
            7. sp.query.extendedResults - multi-word query and a response with frequencies
          6. (375) A unit-test file, extended and modified to test 401
          7. Formatted extended-results to be more friendly for Python/Ruby
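As a rough illustration of how a client might pass the parameters listed above, here is a small Python sketch that builds a request URL. The parameter names starting with "sp." come from the list; the base URL, the "q" parameter name, and the helper build_spellcheck_url are assumptions made for the example and should be checked against the actual handler configuration.

```python
from urllib.parse import urlencode


def build_spellcheck_url(base_url, words, extended=True,
                         only_more_popular=False, suggestion_count=5):
    """Build a spell-check request URL (hypothetical helper).

    The "q" parameter name and the base URL are assumptions; the
    "sp.query.*" names are the ones listed in the comment above.
    """
    params = {
        "q": " ".join(words),
        "sp.query.suggestionCount": suggestion_count,
        "sp.query.extendedResults": str(extended).lower(),
        "sp.query.onlyMorePopular": str(only_more_popular).lower(),
    }
    # urlencode escapes the space between words as "+".
    return base_url + "?" + urlencode(params)


url = build_spellcheck_url(
    "http://localhost:8983/solr/spellchecker",  # assumed endpoint
    ["pithon", "progremming"],
)
print(url)
```

Keeping the extended output behind sp.query.extendedResults means that clients which never send the parameter keep receiving the old response shape.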
          Mike Klaas added a comment -

          Since this patch essentially subsumes SOLR-401 and SOLR-375, I'll mark them as closed to move discussion here.

          Nice patch! (here my bias is showing given that I helped Mike develop it off-line).

          Do any of the original spellcheck contributors have comments about this new direction? I like that:

          • spellcheck parameters share a common prefix, and
          • the new format is extensible: new data can be added to the suggestions without breaking compatibility.

          If not, I'll commit in a day or so.

          Yonik Seeley added a comment -

          > the new format is extensible: new data can be added to the suggestions without breaking compatibility.

          That's always a good thing... could you give an example of the new format for those of us too lazy to try it out ourselves?

          Mike Krimerman added a comment -

          The new format produces the following output (querying for pithon+progremming with extendedResults=true):

           
          <response>
              <lst name="responseHeader">
                  <int name="status">0</int>
                  <int name="QTime">173</int>
              </lst>
              <lst name="result">
                  <lst name="pithon">
                      <int name="frequency">5</int>
                      <lst name="suggestions">
                          <lst name="python">
                              <int name="frequency">18785</int>
                          </lst>
                      </lst>
                  </lst>
                  <lst name="progremming">
                      <int name="frequency">0</int>
                      <lst name="suggestions">
                          <lst name="programming">
                              <int name="frequency">70997</int>
                          </lst>
                          <lst name="progressing">
                              <int name="frequency">1930</int>
                          </lst>
                          <lst name="programing">
                              <int name="frequency">597</int>
                          </lst>
                          <lst name="progamming">
                              <int name="frequency">113</int>
                          </lst>
                          <lst name="reprogramming">
                              <int name="frequency">344</int>
                          </lst>
                      </lst>
                  </lst>
              </lst>
          </response>
          

          In this example the best suggestions are the first ones. Some queries may return a suggestion which is very close to the query word, but with relatively low frequency (Lucene sorts results by distance first). In that case suggestions that are somewhat farther but with a much higher frequency should be chosen.
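A client-side version of that choice could look like the following Python sketch. It parses an abridged copy of the XML above and prefers a later suggestion only when its frequency is much higher than the current pick. The function names and the factor of 10 are assumptions for illustration, not part of the patch, and the second input at the bottom uses an invented ordering to exercise the heuristic.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the response shown above.
SAMPLE_XML = """
<response>
  <lst name="result">
    <lst name="progremming">
      <int name="frequency">0</int>
      <lst name="suggestions">
        <lst name="programming"><int name="frequency">70997</int></lst>
        <lst name="progressing"><int name="frequency">1930</int></lst>
      </lst>
    </lst>
  </lst>
</response>
"""


def parse_extended_results(xml_text):
    """Return {query word: (frequency, [(suggestion, frequency), ...])}."""
    root = ET.fromstring(xml_text)
    parsed = {}
    for word_el in root.find("lst[@name='result']").findall("lst"):
        freq = int(word_el.find("int[@name='frequency']").text)
        sugg_el = word_el.find("lst[@name='suggestions']")
        entries = sugg_el.findall("lst") if sugg_el is not None else []
        suggestions = [
            (s.get("name"), int(s.find("int[@name='frequency']").text))
            for s in entries
        ]
        parsed[word_el.get("name")] = (freq, suggestions)
    return parsed


def pick_suggestion(suggestions, factor=10):
    """Keep the first (closest) suggestion unless a later one is at
    least `factor` times more frequent. The factor is an arbitrary
    assumption and would need tuning per index."""
    if not suggestions:
        return None
    best_word, best_freq = suggestions[0]
    for word, freq in suggestions[1:]:
        if freq >= factor * max(best_freq, 1):
            best_word, best_freq = word, freq
    return best_word


results = parse_extended_results(SAMPLE_XML)
choice = pick_suggestion(results["progremming"][1])

# Invented ordering: a close but rare word followed by a frequent one.
override = pick_suggestion([("hays", 533), ("hats", 6794)])
```

In the sample, the first suggestion is kept because no later one is ten times more frequent; in the invented second input, the more frequent word wins.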

          Mike Klaas added a comment - - edited

          A Python example:

          {
            'responseHeader': {
              'status':0,
              'QTime':16
            },
            'result':{
              'pithon':{
                'frequency':5,
                'suggestions':['python',{'frequency':18785}]
              },
              'haus':{
                'frequency':482,
                'suggestions':['hats',{'frequency':6794},
                               'hans',{'frequency':5986},
                               'haul',{'frequency':3152},
                               'haas',{'frequency':1054},
                               'hays',{'frequency':533}]
              },
              'endication':{
                'frequency':0,
                'suggestions':['indication',{'frequency':9634},
                               'syndication',{'frequency':17777},
                               'dedication',{'frequency':4470},
                               'medication',{'frequency':3746},
                               'indications',{'frequency':2783}]
              }
            }
          }
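In this output each 'suggestions' value is a flat list that alternates a word with a dict of its attributes. A client would probably want to pair them up; the following is a minimal sketch, where pair_suggestions is a hypothetical helper name and the input is copied from the 'haus' entry above.

```python
def pair_suggestions(flat):
    """Turn [word, {attrs}, word, {attrs}, ...] into (word, frequency) pairs."""
    words = flat[0::2]    # even positions hold the suggested words
    details = flat[1::2]  # odd positions hold the attribute dicts
    return [(word, detail["frequency"]) for word, detail in zip(words, details)]


haus_suggestions = ['hats', {'frequency': 6794}, 'hans', {'frequency': 5986},
                    'haul', {'frequency': 3152}]

pairs = pair_suggestions(haus_suggestions)
for word, frequency in pairs:
    print(word, frequency)
```

Because the attributes live in a dict, extra keys added later would not change how the words and dicts are paired.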
          
          Mike Klaas added a comment -

          Committed! Thanks Mike and Scott.


            People

            • Assignee:
              Mike Klaas
              Reporter:
              Mike Krimerman
            • Votes:
              1
              Watchers:
              0
