SOLR-10314

Spellcheck with SnowballPorterFilterFactory and Synonyms doesn't work well

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Information Provided
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: spellchecker
    • Labels: None

    Description

      As noted in SOLR-10252, the default spellcheck configuration in the data_driven_schema_configs (and basic_configs) uses the _text_ field as the default field for spellcheck. This field uses the text_general field type.

      If I keep this default spellcheck configuration but modify the text_general field type to use the SnowballPorterFilterFactory (with language=German in this case) and keep synonyms in the analysis chain, queries to the /spell request handler fail when they contain 2 or more terms that are each preceded by a + operator.

      Note that the default spellcheck configuration also enables spellcheck.collate. If I disable it, I do not get any error. I also do not get an error if I use only 1 term, even if it is spelled "correctly", or if at least one of the terms is spelled incorrectly.

      So, in summary, there's a pretty specific list of variables at work here:

      1. /spell request handler
      2. 2 or more terms, all spelled correctly (that is, all terms exist in the index)
      3. all terms required with +
      4. synonyms (the list is large in this case and I cannot share it; see SOLR-10252 for an example of a parsed query that shows how large the list can get)
      5. SnowballPorterFilter
      6. spellcheck.collate=true
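
      As a rough illustration of the conditions listed above, the following Python sketch builds the kind of request URL that would exercise them. The host, port and core name come from the error message in this report; the function name build_spell_url and the terms "haus" and "garten" are hypothetical placeholders, and the sketch only constructs the URL without sending it.

```python
from urllib.parse import urlencode

# Host, port and core name are taken from the error message in this report.
BASE_URL = "http://localhost:7574/solr/spelltest3_shard1_replica2"


def build_spell_url(terms, collate=True):
    """Return a /spell request URL in which every term is required with '+'.

    This helper is a hypothetical illustration, not part of Solr or SolrJ.
    """
    query = " ".join("+" + term for term in terms)
    params = {
        "q": query,
        "spellcheck.collate": "true" if collate else "false",
    }
    # urlencode escapes each literal '+' as %2B, so it reaches Solr as the
    # "required" operator instead of being decoded as a space.
    return BASE_URL + "/spell?" + urlencode(params)


if __name__ == "__main__":
    # "haus" and "garten" are placeholders for two terms that exist in the index.
    print(build_spell_url(["haus", "garten"]))
    # Same query with collation disabled, the case reported as not failing.
    print(build_spell_url(["haus", "garten"], collate=False))
```

      Comparing the responses for the two URLs against a core configured as described should show whether collation is the deciding factor in a given setup.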

      The error returned is:

      org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:7574/solr/spelltest3_shard1_replica2: String index out of range: -1
      

      I ran several experiments and found that if synonyms are removed from the field type (and thus from the query analysis chain), the query succeeds with collations enabled. So the problem is not SnowballPorterFilter by itself, but its combination with +, synonyms, and collation.

      The field type definition is:

        <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
          <analyzer type="index">
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
            <filter class="solr.SnowballPorterFilterFactory" language="German"/>
            <filter class="solr.LowerCaseFilterFactory"/>
          </analyzer>
          <analyzer type="query">
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
            <filter class="solr.SynonymFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.SnowballPorterFilterFactory" language="German"/>
          </analyzer>
        </fieldType>
      

      This problem was found with 5.5.2, but I verified it still exists in 6.4 and 6.5.

          People

            Assignee: Unassigned
            Reporter: Cassandra Targett (ctargett)