Solr / SOLR-15449

edismax sow causes issues with minimum should match in multi-field search with different analysis


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 8.8.2
    • Fix Version/s: 9.0
    • Component/s: None
    • Labels: None

    Description

      Intro

      In a multi-field search where the text analysis of each field produces a different number of tokens:

      sow=true causes the minimum should match (mm) to be applied "per document",
      i.e. to match, a document must contain all the query terms required by mm at least once, anywhere across the queried fields.

      sow=false causes the minimum should match to be applied "per field",
      i.e. to match, a document must contain all the query terms required by mm at least once within a single field.
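
      The two interpretations of mm can be sketched in a few lines of Python. This is only an illustrative model, not Solr's implementation: the function names are hypothetical, both fields are assumed to be tokenised on whitespace, and the field name "subjects" stands in for the analysed field used in the queries.

```python
from typing import Dict, List

# A document maps a field name to its tokens (whitespace tokenisation is an assumption).
Doc = Dict[str, List[str]]


def matches_per_document(doc: Doc, terms: List[str], fields: List[str], mm: int) -> bool:
    """Term-centric: a term counts if it is found in any of the fields."""
    matched = sum(
        1 for term in terms
        if any(term in doc.get(field, []) for field in fields)
    )
    return matched >= mm


def matches_per_field(doc: Doc, terms: List[str], fields: List[str], mm: int) -> bool:
    """Field-centric: at least one single field must contain mm of the terms."""
    return any(
        sum(1 for term in terms if term in doc.get(field, [])) >= mm
        for field in fields
    )


docs: Dict[str, Doc] = {
    "888888": {"author": ["united"], "subjects": ["kingdom"]},
    "77777": {"author": ["united", "kingdom"]},
}
terms = ["united", "kingdom"]
fields = ["author", "subjects"]

per_document = [i for i, d in docs.items() if matches_per_document(d, terms, fields, 2)]
per_field = [i for i, d in docs.items() if matches_per_field(d, terms, fields, 2)]

print(per_document)  # both documents satisfy mm=2 across fields
print(per_field)     # only the document with both terms in one field
```

      Under these assumptions, the document whose terms are spread over two fields is accepted only by the per-document interpretation.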

      When the parsed query moves from being term-centric (sow=true) to field-centric (sow=false and different text analysis per field), mm means two different things:

      sow = true
      mm=2
      qf = author subjects_as_same_term
      q = united kingdom
      defType = edismax
      "parsedquery_toString":
      "+(((author:united | subjects_as_same_term:united) (author:kingdom | subjects_as_same_term:kingdom))~2)"
      
      "response":{"numFound":2,"start":0,"maxScore":7.757958,"numFoundExact":true,"docs":[
            {
              "id":"888888",
              "author":"united",
              "subjects":["kingdom"],
              "score":7.757958},
            {
              "id":"77777",
              "author":"united kingdom",
              "score":5.874222}]
        },
      

      Minimum number of query terms matched within the same field (i.e. all the required query terms must be found in one of the fields)
      “PER FIELD”

      sow = false
      mm=2
      qf = author subjects_as_same_term
      q = united kingdom
      defType = edismax
      "parsedquery_toString":
      "+(((author:united author:kingdom)~2) | 
      (((subjects_as_same_term:uk subjects_as_same_term:"united kingdom" subjects_as_same_term:england subjects_as_same_term:london subjects_as_same_term:british subjects_as_same_term:britain))~1))"
      

      Here (author:united author:kingdom)~2 means that both clauses must match for a document to be a candidate. This is in disjunction with
      ((subjects_as_same_term:uk subjects_as_same_term:"united kingdom" subjects_as_same_term:england subjects_as_same_term:london subjects_as_same_term:british subjects_as_same_term:britain))~1, which means that at least one clause must match (because synonym expansion turned the two original terms into a single one).

      "response":{"numFound":1,"start":0,"maxScore":5.874222,"numFoundExact":true,"docs":[
            {
              "id":"77777",
              "author":"united kingdom",
              "score":5.874222}]
        }
      
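
      To reproduce the comparison, the two requests differ only in the sow parameter. The following sketch builds both request URLs with the parameters listed above. The host, port and collection name are placeholders (assumptions, not taken from this issue), and debugQuery=true is added so that the response includes the parsed query.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: host, port and collection name are placeholders.
SELECT_URL = "http://localhost:8983/solr/example_collection/select"


def build_query_url(sow: bool) -> str:
    """Build the edismax request used in the examples, varying only sow."""
    params = {
        "defType": "edismax",
        "q": "united kingdom",
        "qf": "author subjects_as_same_term",
        "mm": "2",
        "sow": "true" if sow else "false",
        "debugQuery": "true",  # asks Solr to return debug output, including the parsed query
    }
    return SELECT_URL + "?" + urlencode(params)


term_centric_url = build_query_url(sow=True)
field_centric_url = build_query_url(sow=False)

print(term_centric_url)
print(field_centric_url)
```

      Comparing the parsedquery_toString values of the two responses should show whether the query was built term-centric or field-centric.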

      Problem

      When the text analysis of a field is incompatible with part of the query text, mm is not fully respected:

      sow = false
      mm=100%
      qf = text numeric_i
      q = terminator 100
      defType = edismax
      "parsedquery_toString":
      "+(((text:terminator text:100)~2) | 
      (numeric_i:100)~1))"
      

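      A document that contains only '100' in the field numeric_i is returned as a match, although it does not satisfy mm=100%: 'terminator' is not matched in any field. The numeric_i clause contains a single term, so mm is computed on one clause for that field instead of on the two terms of the original query.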
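
      The behaviour can be modelled with a small Python sketch. It is a simplified illustration under stated assumptions, not Solr's code: the helper names are hypothetical, the text field is assumed to accept any token, numeric_i is assumed to accept only integer tokens, and mm is assumed to be evaluated separately on the clauses built for each field.

```python
from typing import Callable, Dict, List


def accepts_any(token: str) -> bool:
    return True


def accepts_integer(token: str) -> bool:
    # Assumption: an integer field can only build a clause from an integer token.
    return token.lstrip("-").isdigit()


FIELD_ACCEPTS: Dict[str, Callable[[str], bool]] = {
    "text": accepts_any,
    "numeric_i": accepts_integer,
}


def build_field_clauses(terms: List[str], fields: List[str]) -> Dict[str, List[str]]:
    """Field-centric query: each field keeps only the terms it can analyse."""
    return {f: [t for t in terms if FIELD_ACCEPTS[f](t)] for f in fields}


def required_clauses(clause_count: int, mm_percent: int) -> int:
    """Number of clauses that mm requires, computed on this field's clauses only."""
    return clause_count * mm_percent // 100


def matches(doc: Dict[str, List[str]], clauses_by_field: Dict[str, List[str]], mm_percent: int) -> bool:
    for field, clauses in clauses_by_field.items():
        if not clauses:
            continue
        hits = sum(1 for term in clauses if term in doc.get(field, []))
        if hits >= required_clauses(len(clauses), mm_percent):
            return True
    return False


terms = ["terminator", "100"]
clauses = build_field_clauses(terms, ["text", "numeric_i"])
doc = {"numeric_i": ["100"]}  # contains '100' but not 'terminator'

print(clauses)
print(matches(doc, clauses, mm_percent=100))
```

      In this model the numeric_i clause list holds only '100', so 100% of one clause is satisfied and the document matches even though half of the original query terms are missing.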

      Reference: https://sease.io/2021/05/apache-solr-sow-parameter-split-on-whitespace-and-multi-field-full-text-search.html

            People

              Assignee: Unassigned
              Reporter: Alessandro Benedetti (abenedetti)

                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 4h