SOLR-4277

Spellchecker sometimes falsely reports a spelling error and correction

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 4.0
    • Fix Version/s: None
    • Component/s: spellchecker
    • Labels: None

    Description

      In some cases, the Solr spell checker improperly reports query terms as being misspelled.

      Using the Solr example for 4.0, I added these mini documents:

      curl "http://localhost:8983/solr/update?commit=true" -H 'Content-type:application/csv' -d '
      id,name
      spel-1,aardvark abacus ball bill cat cello
      spel-2,abate accord band bell cattle check
      spel-3,adorn border clean clock'
      

      I then issued this request:

      curl "http://localhost:8983/solr/spell/?q=check&indent=true"
      

      The spell checker falsely concluded that "check" was misspelled and improperly corrected it to "clock":

      <lst name="spellcheck">
        <lst name="suggestions">
          <lst name="check">
            <int name="numFound">1</int>
            <int name="startOffset">0</int>
            <int name="endOffset">5</int>
            <int name="origFreq">1</int>
            <arr name="suggestion">
              <lst>
                <str name="word">clock</str>
                <int name="freq">1</int>
              </lst>
            </arr>
          </lst>
          <bool name="correctlySpelled">false</bool>
          <lst name="collation">
            <str name="collationQuery">clock</str>
            <int name="hits">1</int>
            <lst name="misspellingsAndCorrections">
              <str name="check">clock</str>
            </lst>
          </lst>
        </lst>
      </lst>
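
      The symptom can be checked mechanically: a term whose "origFreq" is greater than zero exists in the index, so any suggestion reported for it is suspect. The following is a minimal client-side sketch of that check, not part of Solr; the function name suspicious_corrections is hypothetical, and the sample is the response shown above.

```python
import xml.etree.ElementTree as ET

# The spellcheck section returned for q=check, copied from this report.
SAMPLE = """
<lst name="spellcheck">
  <lst name="suggestions">
    <lst name="check">
      <int name="numFound">1</int>
      <int name="startOffset">0</int>
      <int name="endOffset">5</int>
      <int name="origFreq">1</int>
      <arr name="suggestion">
        <lst>
          <str name="word">clock</str>
          <int name="freq">1</int>
        </lst>
      </arr>
    </lst>
    <bool name="correctlySpelled">false</bool>
    <lst name="collation">
      <str name="collationQuery">clock</str>
      <int name="hits">1</int>
      <lst name="misspellingsAndCorrections">
        <str name="check">clock</str>
      </lst>
    </lst>
  </lst>
</lst>
"""


def suspicious_corrections(spellcheck_xml):
    """Return {term: [suggested words]} for terms that exist in the index
    (origFreq > 0) but were still given suggestions. Hypothetical helper."""
    root = ET.fromstring(spellcheck_xml)
    # Accept either the <lst name="spellcheck"> wrapper or the bare
    # <lst name="suggestions"> element.
    if root.get("name") == "suggestions":
        suggestions = root
    else:
        suggestions = root.find("lst[@name='suggestions']")

    flagged = {}
    for entry in suggestions.findall("lst"):
        orig_freq = entry.find("int[@name='origFreq']")
        # Collation entries carry no origFreq, so they are skipped here.
        if orig_freq is None or int(orig_freq.text) == 0:
            continue
        words = [
            w.text
            for w in entry.findall("arr[@name='suggestion']/lst/str[@name='word']")
        ]
        if words:
            flagged[entry.get("name")] = words
    return flagged


print(suspicious_corrections(SAMPLE))
```

      For the response above, the check flags "check" with the single suggestion "clock".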
      

      And if I query for "clock", it gets corrected to "check"!

      curl "http://localhost:8983/solr/spell/?q=clock&indent=true"
      
        <lst name="suggestions">
          <lst name="clock">
            <int name="numFound">1</int>
            <int name="startOffset">0</int>
            <int name="endOffset">5</int>
            <int name="origFreq">1</int>
            <arr name="suggestion">
              <lst>
                <str name="word">check</str>
                <int name="freq">1</int>
              </lst>
            </arr>
          </lst>
          <bool name="correctlySpelled">false</bool>
          <lst name="collation">
            <str name="collationQuery">check</str>
            <int name="hits">1</int>
            <lst name="misspellingsAndCorrections">
              <str name="clock">check</str>
            </lst>
          </lst>
        </lst>
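
      The mutual correction is consistent with the two terms being only two single-character substitutions apart. The sketch below computes a plain Levenshtein distance to illustrate this; it is not Solr's own distance implementation, and the threshold of 2 is an assumption based on the maxEdits value I believe the example configuration uses.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


MAX_EDITS = 2  # assumed threshold, see the note above

print(levenshtein("check", "clock"))  # 2
# "cattle" is farther than the assumed threshold, so it is not a candidate.
print(levenshtein("check", "cattle") <= MAX_EDITS)  # False
```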
      

      Note: This appears to happen only because "clock" is so close to "check". Other terms do not trigger the problem; in a multi-term query only "check" gets a suggestion:

      curl "http://localhost:8983/solr/spell/?q=cattle+abate+check&indent=true"
      
        <lst name="suggestions">
          <lst name="check">
            <int name="numFound">1</int>
            <int name="startOffset">13</int>
            <int name="endOffset">18</int>
            <int name="origFreq">1</int>
            <arr name="suggestion">
              <lst>
                <str name="word">clock</str>
                <int name="freq">1</int>
              </lst>
            </arr>
          </lst>
          <bool name="correctlySpelled">false</bool>
          <lst name="collation">
            <str name="collationQuery">cattle abate clock</str>
            <int name="hits">2</int>
            <lst name="misspellingsAndCorrections">
              <str name="cattle">cattle</str>
              <str name="abate">abate</str>
              <str name="check">clock</str>
            </lst>
          </lst>
        </lst>
      

      However, the response inappropriately lists "cattle" and "abate" in the "misspellingsAndCorrections" section even though no suggestions were offered for them.
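
      Until that is addressed, a client can ignore the entries that map a term to itself. This is a minimal sketch of such a filter, not a fix in Solr; the function name actual_corrections is hypothetical, and the sample is the collation from the response above.

```python
import xml.etree.ElementTree as ET

# The collation entry returned for q=cattle+abate+check, copied from this report.
COLLATION = """
<lst name="collation">
  <str name="collationQuery">cattle abate clock</str>
  <int name="hits">2</int>
  <lst name="misspellingsAndCorrections">
    <str name="cattle">cattle</str>
    <str name="abate">abate</str>
    <str name="check">clock</str>
  </lst>
</lst>
"""


def actual_corrections(collation_xml):
    """Return only the pairs where the correction differs from the original
    term. Hypothetical client-side helper."""
    root = ET.fromstring(collation_xml)
    pairs = root.find("lst[@name='misspellingsAndCorrections']")
    return {
        entry.get("name"): entry.text
        for entry in pairs.findall("str")
        if entry.get("name") != entry.text
    }


print(actual_corrections(COLLATION))
```

      For the collation above, only the "check" to "clock" pair remains.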

      Finally, I can work around this issue by removing the following line from solrconfig.xml:

            <str name="spellcheck.alternativeTermCount">5</str>
      

      With that line removed, the previous request returns:

        <lst name="suggestions">
          <bool name="correctlySpelled">false</bool>
        </lst>
      

      This makes the original problem go away, although it raises the question of why my 100% correct query is still reported with "correctlySpelled" = "false". That is a separate Jira issue.

          People

            Assignee: Unassigned
            Reporter: Jack Krupansky (jkrupan)
            Votes: 1
            Watchers: 5