Solr / SOLR-16436

DirectSolrSpellChecker: maxQueryFrequency bug in multi-shard

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: main (10.0), 9.2
    • Component/s: spellchecker
    • Labels: None

    Description

      DirectSolrSpellChecker has some very confusing/unexpected behavior when:

      • maxQueryFrequency is configured
      • In a multi-shard collection
      • Using thresholdTokenFrequency or spellcheck.onlyMorePopular=true or spellcheck.alternativeTermCount
        • (ie: anything that causes SuggestMode != SUGGEST_WHEN_NOT_IN_INDEX, so suggestions are possible even for terms in the index)

      The nature of the unexpected behavior varies depending on whether maxQueryFrequency is configured as a float less than 1 (ie: a percentage relative to the maxDocs in the index) or an integer greater than 1 (ie: an absolute max frequency):

      • When maxQueryFrequency < 1 (ie: "percentage of maxDocs")
        • It's possible to get "false negative" suggestions
          • ie: a term that should generate suggestions (and would in an equivalent single-shard deployment) does not
        • A term from the original query may exist in a smaller share of the collection's total documents than the configured maxQueryFrequency percentage, yet still not return suggestions
        • This can happen if a term exists in more than the configured maxQueryFrequency percentage of docs on one (or more) individual shards
          • As long as at least one shard says the term is "correctly spelled" (which is what DirectSolrSpellChecker decides when the maxQueryFrequency threshold is met) then the merge logic ignores any suggestions that might come from other shards
      • When 1 < maxQueryFrequency (ie "absolute value")
        • It's possible to get "false positive" suggestions
          • ie: a term that should not generate suggestions (and would not in an equivalent single-shard deployment) does
        • A term from the original query may exist in more total documents in the collection than the configured maxQueryFrequency but will still return suggestions
        • This can happen if a term exists in fewer than the configured maxQueryFrequency number of docs on every individual shard
          • Since no shard says the term is "correctly spelled", the suggestions are merged and returned
          • No aspect of the code considers the possibility that the sum of the origFreq returned by all shards might be higher than the specified maxQueryFrequency
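
      Both cases above can be reproduced with a small model. The following is only an illustrative sketch in Python, not the Solr or Lucene code: the function names, the shard data, and the rounding of the percentage threshold are all assumptions made for the example, and it assumes that alternative terms exist whenever a shard does not consider the term "correctly spelled". A value of exactly 1, which the description does not cover, is treated here as an absolute count.

```python
import math

def shard_says_correct(doc_freq, max_doc, max_query_frequency):
    """Hypothetical per-index check: is the term frequent enough to count as correctly spelled?"""
    if max_query_frequency >= 1:
        # Absolute value: compare the raw document frequency.
        threshold = max_query_frequency
    else:
        # Percentage: scale by the number of docs in this index (rounding is an assumption).
        threshold = math.ceil(max_query_frequency * max_doc)
    return doc_freq > threshold

def merged_has_suggestions(shards, max_query_frequency):
    """Model of the merge described above: one 'correctly spelled' shard suppresses everything."""
    return not any(shard_says_correct(df, md, max_query_frequency) for df, md in shards)

def single_index_has_suggestions(shards, max_query_frequency):
    """The same documents in one index: the check sees the summed frequencies."""
    total_freq = sum(df for df, _ in shards)
    total_docs = sum(md for _, md in shards)
    return not shard_says_correct(total_freq, total_docs, max_query_frequency)

# Each shard is (doc_freq of the query term, number of docs on that shard).

# False negative: maxQueryFrequency=0.25, term concentrated on one small shard.
skewed = [(30, 100), (0, 900)]
false_negative = (single_index_has_suggestions(skewed, 0.25),
                  merged_has_suggestions(skewed, 0.25))

# False positive: maxQueryFrequency=5, term spread thinly over every shard.
even = [(4, 100), (4, 100), (4, 100)]
false_positive = (single_index_has_suggestions(even, 5),
                  merged_has_suggestions(even, 5))

print("percentage case (single index, multi-shard):", false_negative)
print("absolute case   (single index, multi-shard):", false_positive)
```

      In the first scenario the term is in 30 of 1000 docs overall (3%), but 30% of the first shard, so that shard alone suppresses the suggestions. In the second, no shard sees more than 4 docs with the term, although the collection-wide total of 12 exceeds the limit of 5.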

      Attachments

        1. SOLR-16436.patch
          12 kB
          Chris M. Hostetter
        2. SOLR-16436-1.patch
          15 kB
          Chris M. Hostetter

          People

            Assignee: Chris M. Hostetter (hossman)
            Reporter: Chris M. Hostetter (hossman)