SOLR-3085

Fix the dismax/edismax stopwords mm issue

    Details

      Description

      As discussed here http://search-lucene.com/m/Wr7iz1a95jx, here http://search-lucene.com/m/Yne042qEyCq1 and here http://search-lucene.com/m/RfAp82nSsla, DisMax has an issue with stopwords if the fields used in qf do not all have exactly the same stopword lists.

      The typical workarounds are to not use stopwords, to harmonize stopword lists across all fields in your qf, or to relax mm to a lower percentage. Sometimes none of these is acceptable, and we should find a better solution.

      1. SOLR-3085.patch
        9 kB
        Jan Høydahl
      2. SOLR-3085.patch
        10 kB
        Jan Høydahl
      3. SOLR-3085.patch
        5 kB
        Jan Høydahl

          Activity

          Jan Høydahl added a comment -

          In this thread http://search-lucene.com/m/Tzktd1a95jx James Dyer suggests:

          I do wonder...what if (e)dismax had a flag you could set that would tell it that if any analyzers removed a term, then that term would become optional for any fields for which it remained? I'm not sure what the development effort would be, but perhaps it would be a nice way to circumvent this problem in a future release...

          I like the suggestion. Would this be possible?

          Take as an example this parsed query (q=the contract&qf=alltags title_en&mm=100%&defType=edismax):

          +((DisjunctionMaxQuery((alltags:the)~0.01) DisjunctionMaxQuery((title_en:contract | alltags:contract)~0.01))~2)
          

          The field "alltags" does not use stopwords, but "title_en" does. So we get a required DisMax Query for alltags:the which does not match any docs. Is it possible in the (e)DisMax code to detect this and make the first DisMax query optional?

          Hoss Man added a comment - - edited

          So we get a required DisMax Query for alltags:the which does not match any docs.

          I think you are misreading that output...

          +( ( DisjunctionMaxQuery((alltags:the)~0.01) 
               DisjunctionMaxQuery((title_en:contract | alltags:contract)~0.01)
             )~2
           )
          

          The "DisjunctionMaxQuery((alltags:the)~0.01)" clause is not required in that query. it is one of two SHOULD clauses in a boolean query, and becomes subject to the same "mm" rule. both clauses in that BooleanQuery are already SHOULD clauses, so i don't know what it would mean to make then more "optional".

          Jan Høydahl added a comment -

          You're right that technically it's not marked as required, but in the context of this "feature" we're discussing, the reason people get 0 hits is that mm=100% is counted across all (SHOULD) clauses, which effectively means that alltags:the is required.

          What James suggested, and what most people tricked by this "feature" expect, is that if "the" is a stopword for one of the qf fields, it becomes optional in some way.

          So how can we get that end result? First we need a way to safely detect that we're in this scenario, perhaps by inspecting whether each DisMax clause contains a field query for every field listed in QF. If one or more is missing, we can assume that the query term is a stopword in one or more of the fields. Then, one way may be to subtract the MM count accordingly, so that in our case above, when we detect that the DisMax clause for "the" does not contain "title_en", we do mm=mm-1 which will give us an MM of 1 instead of 2 and we'll get hits. This is probably the easiest solution.
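
          As a rough sketch of that detection-and-decrement arithmetic (a hypothetical helper, not parser code; it assumes we already know, per query term, how many qf fields the term survived analysis in):

          import java.util.List;

          class MmDecrementSketch {
            // disjunctCounts holds, for each query term, the number of qf fields in
            // which that term survived analysis. Subtract one from the computed mm
            // for every term that lost at least one field.
            static int decrementedMm(int computedMm, List<Integer> disjunctCounts, int numQfFields) {
              int mm = computedMm;
              for (int count : disjunctCounts) {
                if (count < numQfFields) {
                  mm--;                 // e.g. q=the contract over 2 qf fields: mm goes from 2 to 1
                }
              }
              return Math.max(mm, 0);
            }
          }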

          Another way would be to keep mm as is, and move the affected clause out of the BooleanQuery and add it as a BoostQuery instead?

          This behavior should be parameter driven, e.g. &mm.sw=false reading "Minimum should match does not require Stop Words"

          Hoss Man added a comment -

          Then, one way may be to subtract the MM count accordingly, so that in our case above, when we detect that the DisMax clause for "the" does not contain "title_en", we do mm=mm-1 which will give us an MM of 1 instead of 2 and we'll get hits. This is probably the easiest solution.

          That wouldn't make any sense ... in your example that would result in the query matching every doc containing "alltags:the" (or "title_en:contract", or "alltags:contract"), which hardly seems like what the user is likely to expect if they used mm=100% (with or without a "mm.sw=false" param).

          Another way would be to keep mm as is, and move the affected clause out of the BooleanQuery and add it as a BoostQuery instead?

          something like that might work .. but i haven't thought it through very hard ... i have a nagging feeling that there are non-stopword cases that would be indistinguishable (to the parser) from this type of stopword case, and thus would also trigger this logic undesirably, but i can't articulate what they might be off the top of my head.

          Jan Høydahl added a comment -

          i have a nagging feeling that there are non-stopword cases that would be indistinguishable (to the parser) from this type of stopword case, and thus would also trigger this logic undesirably, but i can't articulate what they might be off the top of my head.

          A potentially difficult one is this multi-language example: &qf=title_no title_en tags. Each of these fields may have its own stopword list; say title_no has the stopword "men" (Norwegian for "but") and title_en has the stopword "the". Then we query q=the men. The user expectation would be that it returns ENGLISH docs matching "men", since "the" is an English stopword.

          Today we'd get:

          +((DisjunctionMaxQuery((title_no:the | tags:the)~0.01) DisjunctionMaxQuery((title_en:men | tags:men)~0.01))~2)
          

          In this case with mm=100% we'd likely get 0 hits, given that "the" is not common in either of title_no or tags. However, the parser cannot know whether the user's real information need is "the" or "men" - since both are stopwords for different fields.

          Now, all DisMax clauses in this example have had at least one stopword pruned, so using the "mm decrement" strategy would change mm from 2 to 0, which would turn this into an OR query - and of course return results. This is a compromise, so a better option in this special case would probably be to use eDisMax's "smart" conditional stopword removal [1], but that requires a change of fieldType.

          The "convert to boost query" approach would only work when we have at least one clause without stop words, since we cannot query ONLY with bq. Say two of my four query terms q=the best cheap holiday are stop words, and mm=100%. So we remove the two stop clauses from the BooleanQuery and reduce mm accordingly from 4 (100%) to 2, and add the two stop clauses as BQs. This approach would also work for mm<100% cases, since we only count mm clauses from the non-stop clauses.


          [1] For the special case of all clauses being stop clauses, eDisMax's existing "smart" conditional stopword handling could perhaps be another solution? For those unfamiliar with it, you can specify &stopwords=true (which is the default) and eDisMax will remove stopwords for you instead of letting Analysis do it. It requires that you don't have StopFilterFactory in your Analysis. Now, if ALL query terms are stopwords, eDisMax will not remove them, to support queries like "Who is the who?". (Q: How does edismax pick up which stopword dictionary(ies) to use?). It's of no use to those removing stopwords in their "index" analysis though.

          Jan Høydahl added a comment -

          How about we add a new fieldType to the example schema.xml, text_general_smartstopwords, and in it document how to use eDisMax to conditionally remove stopwords on the query side only?

          Bill Bell added a comment -

          We also found another loophole. If we send [* TO *] to edismax we can also bring down the server.

          Some chars are not being escaped before being sent to Solr. E.g. I can send queries like this to Solr by searching on ([* TO *] OR [* TO *] OR [* TO *]) in the search box - it took 72 seconds to return:

          webapp=/solr path=/select params=

          {d=160.9344&start=0&q=([*+TO+*]+OR+[*+TO+*]+OR+[*+TO+*])&pt=40.7146,-74.0071&qt=providersearchdist&wt=json&qq=city_state_lower:"new+york,+ny"&rows=20}

          hits=276442 status=0 QTime=72458

          Jan Høydahl added a comment -

          @Bill, since this is a bit off topic, I moved your "loophole" to SOLR-3243. It is certainly dangerous and I cannot see a single use case for allowing an un-fielded range! Good catch.

          Hoss Man added a comment -

          Bulk changing fixVersion 3.6 to 4.0 for any open issues that are unassigned and have not been updated since March 19.

          Email spam suppressed for this bulk edit; search for hoss20120323nofix36 to identify all issues edited

          Jonathan Rochkind added a comment -

          Hoss says: i have a nagging feeling that there are non-stopword cases that would be indistinguishable (to the parser) from this type of stopword case, and thus would also trigger this logic undesirably, but i can't articulate what they might be off the top of my head.

          Indeed there are: pretty much anything where analysis differs between two fields in a way that can affect the number of tokens produced. Punctuation stripping can sometimes do this, and I ran into such a case in my real-world use. More info: http://bibwild.wordpress.com/2011/06/15/more-dismax-gotchas-varying-field-analysis-and-mm/

          This is a difficult problem to fix in the general case. At one point I think there was a Solr mailing list discussion where I tried to brainstorm general-case solutions, but they were all shot down by people who knew more about Solr than me. I can't find the archive of that discussion now, though.

          Jan Høydahl added a comment -

          Good article, Jonathan. I agree that it may be very hard to fix this 100%, but an option to at least avoid the most common frustrations around this would be welcome, even if it only fixes the symptoms. I.e. having a configuration parameter relaxMmHack=true which relaxes mm if one of the fields yields fewer tokens than the others would fix the felt effect of the problem for many people, simply by adding a param.

          Hoss Man added a comment -

          Bulk fixing the version info for 4.0-ALPHA and 4.0; all affected issues have "hoss20120711-bulk-40-change" in a comment.

          Robert Muir added a comment -

          rmuir20120906-bulk-40-change

          Hoss Man added a comment -

          Removing fixVersion=4.0 since there is no evidence that anyone is currently working on this issue. (This can certainly be revisited if volunteers step forward.)

          Markus Jelsma added a comment -

          Any progress with this one? Any smart ideas to share?

          Naomi Dushay added a comment -

          We avoided this by adding stopwords to our string fields (and simultaneously dealing with whitespace around punctuation marks). It's dumb, but it worked fine in dismax. We no longer use stopwords in general.

          <!-- single token with punctuation terms removed so dismax doesn't look for punctuation terms in these fields -->
          <!-- On the client side, the Lucene query parser breaks things up by whitespace before field analysis for dismax, -->
          <!-- so punctuation terms (&, :) are stopwords, to allow results from other fields when these chars are surrounded by spaces in the query -->
          <fieldType name="string_punct_stop" class="solr.TextField" omitNorms="true">
            <analyzer type="index">
              <tokenizer class="solr.KeywordTokenizerFactory" />
              <filter class="solr.ICUNormalizer2FilterFactory" name="nfkc" mode="compose" />
            </analyzer>
            <analyzer type="query">
              <tokenizer class="solr.KeywordTokenizerFactory" />
              <filter class="solr.ICUNormalizer2FilterFactory" name="nfkc" mode="compose" />
              <!-- removing punctuation for Lucene query parser issues -->
              <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_punctuation.txt" enablePositionIncrements="true" />
            </analyzer>
          </fieldType>

          Jan Høydahl added a comment -

          This issue is still with us... How do people feel about a super-simple fix through a new optional param:

          mm.autoRelax=true|false : Automatically relax the mm (minimum should match) requirement when tokens are removed from some fields but not all

          It would count the number of fields remaining for each clause and then adjust mm accordingly. I can attempt a patch.

          Markus Jelsma added a comment -

          I think that would be certainly better than the current situation. But there may be another issue; if you don't remove stopwords at all, like we do, there is a problem with mm and stop words too. For example: q=train from amsterdam to rotterdam&mm=2<-1 5<80%; ideally you would not want documents with only terms `from`, `to` and another non-stop word to match. In this case we would need mm to apply only on non-stop words but also need a query time stopwordfilter that doesn't remove them but marks them as stop words.

          Jan Høydahl added a comment -

          ideally you would not want documents with only terms `from`, `to` and another non-stop word to match. In this case we would need mm to apply only on non-stop words but also need a query time stopwordfilter that doesn't remove them but marks them as stop words.

          What exactly would "marks them as stop words" mean if they are not to be removed?

          Markus Jelsma added a comment -

          As in not removed, we would still want to be able to query for stop words, but have mm only apply to non-stop words.

          Jan Høydahl added a comment -

          This patch is a first shot at a new param mm.autoRelax

          When set as &mm.autoRelax=true on the request, it will adjust minShouldMatch to require at most the number of clauses that have the maximum number of disjuncts.

          I have tested manually for some common cases, and it seems to work as expected; i.e. if you had a query q=A horse in a stable that gave problems due to mm=100% => minShouldMatch=5, applying autoRelax will adjust it to 2. I still need to add some JUnit tests.
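
          For readers skimming the patch, the adjustment boils down to roughly this (a simplified sketch of the idea, not the actual patch code):

          import java.util.List;

          class AutoRelaxSketch {
            // disjunctCounts holds, per query term, the number of qf fields in which
            // the term survived analysis. minShouldMatch is capped at the number of
            // clauses that have the maximum number of disjuncts.
            static int relaxedMm(int computedMm, List<Integer> disjunctCounts) {
              int maxDisjuncts = 0;
              for (int c : disjunctCounts) {
                maxDisjuncts = Math.max(maxDisjuncts, c);
              }
              int fullClauses = 0;
              for (int c : disjunctCounts) {
                if (c == maxDisjuncts) {
                  fullClauses++;        // clauses untouched by token removal
                }
              }
              // q=A horse in a stable with mm=100%: computedMm=5, fullClauses=2 -> returns 2
              return Math.min(computedMm, fullClauses);
            }
          }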

          Jan Høydahl added a comment -

          New patch

          • Adds tests
          • Now also works with the old dismax
          • No longer breaks mm for non-dismax boolean clauses
          Jan Høydahl added a comment -

          Last patch had a bad import. This one now passes ant precommit.

          Markus Jelsma added a comment - - edited

          Hi Jan - the SolrCore.java modification shouldn't be in the patch. Anyway, it looks like this fix does what it advertises. The problem I reported above, perhaps another issue, is still real. Environments without stopwords still have a problem with mm. Consider your q=A horse in a stable. With mm=2 we get all kinds of documents, usually all documents in the corpus (in and a). Ideally this or another parameter would only require horse and stable.

          Edit: you already removed the import.

          Jan Høydahl added a comment -

          Environments without stopwords still have a problem with mm. Consider your q=A horse in a stable. With mm=2 we get all kinds of documents, usually all documents in the corpus (in and a). Ideally this or another parameter would only require horse and stable.

          The mm.autoRelax param is designed to tackle one of the most common situations, where your qf includes a bunch of "text" fields with stopword removal plus one or more "string" fields like "id" or "tags" etc. Take the example of qf=title body tags, where title and body remove stopwords but tags does not. This would translate to something like

          (DMQ(tags:a) DMQ(title:horse | body:horse | tags:horse) DMQ(tags:in) DMQ(tags:a) DMQ(title:stable | body:stable | tags:stable))~5
          

          Very often in these cases the "tags" field does not contain free-text, so tags:a, tags:in would not match, and we always get 0 hits – thus mm=2 would help here.

          But for cases where you query multiple English-analyzed text fields with different stopword lists, relaxation of mm is not the cure. The cure is rather to add the same stopword handling to all those text fieldTypes.

          Clearly mm.autoRelax is not a complete solution for all mm issues. For other cases we may need other cures. One idea I thought of the other day is a param mergeStopwords=true, which modifies the analysis chain for each field in qf to include all StopFilters on the "query" analysis of each field. I.e. if my field A has stopwords="a.txt" and field B has stopwords="b.txt", then edismax would add those two stopword filters in a row for both fields, much the same way that edismax removes the StopFilter when doing smart stopword handling.
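
          A very rough sketch of what that merging could look like at the Lucene level (hypothetical; it uses the Lucene 4.x StopFilter/CharArraySet APIs, and the in-line word lists merely stand in for a.txt and b.txt):

          import java.io.IOException;
          import java.io.StringReader;
          import org.apache.lucene.analysis.Analyzer;
          import org.apache.lucene.analysis.TokenStream;
          import org.apache.lucene.analysis.core.StopFilter;
          import org.apache.lucene.analysis.util.CharArraySet;
          import org.apache.lucene.util.Version;

          class MergeStopwordsSketch {
            // Union the stopword sets of all qf fields, then apply the merged set on
            // top of each field's own query-time analysis.
            static TokenStream mergedQueryStream(Analyzer fieldAnalyzer, String field, String text) throws IOException {
              CharArraySet merged = new CharArraySet(Version.LUCENE_40, 16, true);
              merged.addAll(StopFilter.makeStopSet(Version.LUCENE_40, "a", "the"));    // stand-in for a.txt
              merged.addAll(StopFilter.makeStopSet(Version.LUCENE_40, "men", "og"));   // stand-in for b.txt

              TokenStream ts = fieldAnalyzer.tokenStream(field, new StringReader(text));
              return new StopFilter(Version.LUCENE_40, ts, merged);                    // stop against the union
            }
          }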

          Jan Høydahl added a comment -

          As it seems there is no silver bullet for all kinds of mm problems, I suggest we chop up the elephant, starting with mm.autoRelax as the first tool, and then try to tackle other needs later. Thoughts?

          Jan Høydahl added a comment -

          This keeps coming up on the users list. Any objections to adding the params mm.autoRelax and mergeStopwords to start with, perhaps as experimental, and then if they prove useful, promote them to permanent edismax citizens?

          Uwe Schindler added a comment -

          Move issue to Solr 4.9.


  People

    Assignee: Unassigned
    Reporter: Jan Høydahl
    Votes: 7
    Watchers: 14