Solr / SOLR-3589

Edismax parser does not honor mm parameter if analyzer splits a token

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.6, 4.0-BETA
    • Fix Version/s: 3.6.2, 4.1, 5.0
    • Component/s: search
    • Labels:
      None

      Description

      With edismax mm set to 100%, if one of the tokens is split into two tokens by the analyzer chain (e.g. "fire-fly" => fire fly), the mm parameter is ignored and the equivalent of an OR query ("fire OR fly") is produced.
      This is particularly a problem for languages that do not use whitespace to separate words, such as Chinese or Japanese.
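
      As a concrete sketch of the reported behavior (the field name and analyzer chain here are illustrative assumptions, not taken from a specific configuration): with defType=edismax and mm=100%, a query whose single "word" is split by the analyzer produces only optional clauses:

          q=fire-fly&defType=edismax&mm=100%25&debugQuery=on

          observed parsedquery (simplified):  +(text:fire text:fly)      i.e. fire OR fly
          expected parsedquery (simplified):  +(+text:fire +text:fly)    i.e. fire AND fly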

      See these messages for more discussion:
      http://lucene.472066.n3.nabble.com/edismax-parser-ignores-mm-parameter-when-tokenizer-splits-tokens-hypenated-words-WDF-splitting-etc-tc3991911.html

      http://lucene.472066.n3.nabble.com/edismax-parser-ignores-mm-parameter-when-tokenizer-splits-tokens-i-e-CJK-tc3991438.html

      http://lucene.472066.n3.nabble.com/Why-won-t-dismax-create-multiple-DisjunctionMaxQueries-when-autoGeneratePhraseQueries-is-false-tc3992109.html

      Attachments

      1. SOLR-3589_test.patch
        1 kB
        Robert Muir
      2. SOLR-3589.patch
        10 kB
        Tom Burton-West
      3. SOLR-3589.patch
        8 kB
        Robert Muir
      4. SOLR-3589.patch
        8 kB
        Robert Muir
      5. SOLR-3589.patch
        5 kB
        Robert Muir
      6. SOLR-3589.patch
        3 kB
        Robert Muir
      7. SOLR-3589-3.6.PATCH
        11 kB
        Tom Burton-West
      8. testSolr3589.xml.gz
        1 kB
        Tom Burton-West
      9. testSolr3589.xml.gz
        1 kB
        Tom Burton-West

          Issue Links

            This issue is related to SOLR-2368
            This issue is related to SOLR-3739

          Activity

          Tom Burton-West created issue -
          Joel Rosen added a comment - - edited

          A user in this thread reports this is a bug introduced in 3.6:

          http://lucene.472066.n3.nabble.com/Dismax-Question-td3992446.html

          He says they reverted to 3.5 and it went away.

          However, I just tried the same setup with Solr versions 3.5, 3.4, and 3.1, and I can verify that the behavior is the same in each, so now I doubt it was a bug introduced in 3.6. Could it be something in the default configuration that changed between versions?

          Tom Burton-West added a comment -

          I didn't see enough configuration information in that thread to determine whether they were reporting the same bug or some different bug or configuration issue. After reading that thread I also verified that the problem reported here occurs with version 3.4. So I think that the thread you cite may refer to a different issue. If you do happen to find some configuration change that fixes the problem, please let me know.

          Tom

          Jack Krupansky added a comment -

          The root problem is that with automatic phrase query generation turned off, by default and for the text_general field in particular, the core Lucene query parser is generating a query for the tuple of sub-terms using the default query operator, which is "OR" by default. There is no notion of an "mm" or min-match parameter down at that level in Lucene, which knows nothing about Solr or edismax or request parameters.

          As things stand, the only option is to set the default query operator, "q.op", to "AND".

          You can of course also turn on autoGeneratePhraseQueries or select an analyzer that doesn't split terms.
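
          To make those workarounds concrete, here is a schema.xml sketch of the autoGeneratePhraseQueries option (the field type and analyzer shown are assumptions; the attribute itself is a standard solr.TextField attribute):

              <fieldType name="text_general" class="solr.TextField"
                         autoGeneratePhraseQueries="true" positionIncrementGap="100">
                <analyzer>
                  <tokenizer class="solr.StandardTokenizerFactory"/>
                  <filter class="solr.LowerCaseFilterFactory"/>
                </analyzer>
              </fieldType>

          The request-time alternative is simply ...&q=fire-fly&q.op=AND.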

          At this point, I would advise resolving this issue as "Won't Fix", although it could also be spun off into a Lucene issue to add support for min-match down at that level, which edismax can then also communicate with.

          Joel Rosen added a comment -

          It's not just mm. You set q.op to AND and it does the same thing.

          The issue is that the query parser should treat the split tokens as separate tokens just as if they were separated by whitespace, but it doesn't. If I use a smart Chinese tokenizer to split up a Chinese sentence into words, why can't the query parser treat those words exactly the same way it treats words from an English sentence?

          Jack Krupansky added a comment -

          It's not just mm. You set q.op to AND and it does the same thing.

          Joel, you're right. Upon closer inspection of the code, I see that the reason is that edismax never sets the Lucene default operator directly. Instead, it sets the default value of the "mm" parameter to 100% if "q.op" is "AND", and sets BooleanQuery.minNrShouldMatch to the number of optional terms. That is equivalent to setting the default Lucene query operator at the top-level boolean level, but has no effect for terms that get split down at the analyzer level. Oh well. Scratch that suggestion.

          I think I'm back to wanting to suggest that edismax should actually set the Lucene-level default query operator if "mm" is 100%. I think that would fix the original problem and allow the user to choose whether to use "mm" or "q.op" to control AND/OR.
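
          As a minimal sketch of the mechanism being described (Lucene 3.x/4.x-era API; illustrative, not the actual edismax source):

              import org.apache.lucene.index.Term;
              import org.apache.lucene.search.BooleanClause.Occur;
              import org.apache.lucene.search.BooleanQuery;
              import org.apache.lucene.search.TermQuery;

              // mm=100% over three optional top-level clauses:
              BooleanQuery bq = new BooleanQuery();
              bq.add(new TermQuery(new Term("text", "a")), Occur.SHOULD);
              bq.add(new TermQuery(new Term("text", "b")), Occur.SHOULD);
              bq.add(new TermQuery(new Term("text", "c")), Occur.SHOULD);
              bq.setMinimumNumberShouldMatch(3); // all three SHOULD clauses must now match

          Note that minNrShouldMatch constrains only this BooleanQuery's own SHOULD clauses. If "fire-fly" is analyzed into a nested (text:fire text:fly), that nested query counts as a single clause here, which is exactly why split terms escape mm.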

          Jack Krupansky added a comment -

          If I use a smart Chinese tokenizer to split up a Chinese sentence into words, why can't the query parser treat those words exactly the same way it treats words from an English sentence?

          Indexing of whole documents can in fact treat text as if it were words from an English sentence, and split tokens do in fact behave as such in that context, but a query is not an English sentence or a sentence in any natural language. Rather, a query is a structured expression composed of terms and operators, typically separated by whitespace or special operators such as parentheses. Portions of queries may look like natural language phrases or even whole sentences, but in reality they are sequences of terms and operators.

          In addition to being parsed according to the syntax of queries, as opposed to natural language processing or the raw token stream processing of an indexer, each of the query terms must be "analyzed" before the final form of the term can be generated into a Lucene Query structure. That analysis is performed separately from the "parsing" of the structured user query expression. That means that the processing of sub-terms that result from analysis is handled at a different level than source-level query terms that happen to "look" like English words. In other words, the sub-terms are processed by the "query generator" while the source terms were processed by the "query parser". We loosely refer to the combination of (user) query parsing and (Lucene) query generation as "the query parser", but it is important to distinguish (user query) "parsing" from (Lucene Query) "generation".

          The query parser does its best to handle sub-terms reasonably, but expecting that they will magically be handled in exactly the same way as source terms is somewhat impractical. That doesn't mean that there can't be improvement, but simply that a dose of realism is needed when considering the potential, challenges, and limits of query parsing/processing/generation.
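
          As a plain illustration of the two stages (the field names and the hyphen-splitting analyzer are assumptions for the example):

              user query:        fire-fly title:abc
              parser output:     [default-field term "fire-fly"]  [field "title", term "abc"]
              analysis:          "fire-fly" -> fire, fly
              generated Query:   (text:fire text:fly) title:abc

          mm and q.op operate on the top-level clauses produced in the parsing step; the split into fire/fly happens later, during analysis and query generation, below the level that mm can see.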

          Joel Rosen added a comment -

          Sounds to me like this is an English-centric design flaw with dismax. The point of dismax is to intelligently process simple user-entered phrases, right? If I understand correctly, it does this by looking at the terms entered and making some decisions about how to join them with AND or OR. But it assumes that a term is a whitespace-delimited string, yes? This is an incorrect assumption for Chinese. If instead of making this assumption, dismax ran the analyzers first to determine what is and isn't a term, then I imagine you would get more predictable behavior across both whitespace delimited and non-whitespace delimited languages, and you wouldn't need any "magical" handling for different languages.

          Jack Krupansky added a comment -

          Be careful not to confuse dismax and edismax. They are two different query parsers, with different goals.

          One of edismax's goals was to support "fielded queries" (e.g., "title:abc AND date:123") and the full Lucene query syntax. No typical analyzer will be able to tell you that title and date are field names.

          Not "English-centric", but European/Latin-centric for sure. The edismax and classic Lucene query parsers share that heritage, based on whitespace, but the dismax query parser doesn't "suffer" from that same need to parse field names and operators.

          There is no question that better query parser support is needed for non-European/Latin languages, but that requires careful, high-level, overall design, which is a tall order for a fast-paced open source community where features tend to be looked at in isolation.

          One clarification...

          assumes that a term is a whitespace-delimited string

          Yes and no. We need to be careful about distinguishing a "source term" - what the parser recognizes, which is whitespace-delimited - from "analyzed terms", which are recognized and output by the field type analyzers. There is no requirement that the output terms be whitespace-delimited or that the input to an analyzer be whitespace-delimited. So, the theory has been that even a whitespace-centric complex-structure query parser can also handle, for example, Chinese text. Obviously that hasn't worked out as cleanly as desired and more work is needed.

          Jack Krupansky added a comment -

          My proposal is for edismax to set the Lucene default query operator to "AND" if either: 1) "q.op" is "AND", or 2) "mm" is "100%".

          I think that will address the stated problem.

          Any objection?

          I'll try to come up with a patch, but a committer will be needed to apply it.

          Yonik Seeley added a comment -

          My proposal is for edismax to set the Lucene default query operator to "AND"

          Hmmm, I dunno. mm=100% is really only meant to apply to top level query terms, not structured lucene queries.

          For example, in (foo:x foo:(a b c)), it doesn't seem like a, b, and c should all be mandatory just because there happens to be a default mm of 100% (and they are not today).

          Jack Krupansky added a comment -

          I could back off and simply say that edismax should set the Lucene default query operator to "AND" if "q.op" is "AND", but that would not address this particular issue, which is complaining that mm won't force the split terms to be ANDed.

          If we really want to say that mm CAN'T be used to force split terms to be ANDed, then we should really resolve this issue as Invalid/Won't Fix.

          I should probably file a separate issue for the fact that q.op is not obeyed for any but the top-level query.

          And, the wiki makes no mention of "mm" being intended only for the top level query.

          Yonik Seeley added a comment -

          I was not saying that this issue shouldn't be fixed, but merely commenting on the negative consequences of one proposed solution.

          Tom Burton-West made changes -
          Link This issue is related to SOLR-2368 [ SOLR-2368 ]
          Lance Norskog added a comment - - edited

          [
          See SOLR-3636, it's the same problem space but with synonym expansion. If "Monkeyhouse" expands to "monkey house", then a dismax or edismax query matches either term ("monkey" OR "house"). Must-match defaults to 100%, so you would expect this to mean "monkey" AND "house".

          This seems to be a multi-part problem.
          ] retracted as per below. Yes, synonyms are another box'o'fun.

          Bernd Fehling added a comment -

          I would not mix synonyms into this because they need special, separate treatment.
          It might work for "monkeyhouse => monkey house" but what if you have synonyms like "nuclear fission, kernspaltung, fissione nucleare"?
          You would expect to get a search like (nuclear AND fission) OR (kernspaltung) OR (fissione AND nucleare).
          This is a simplified example just to show that if you include synonyms into this issue you also have to detect/parse/obey the kind of synonym mapping.

          Tom Burton-West made changes -
          Affects Version/s 4.0-BETA [ 12322455 ]
          Tom Burton-West added a comment -

          Just repeated the tests in Solr 4.0-BETA and the bug behaves the same.

          Tom Burton-West added a comment -

          File is gzipped. Unix line endings. Put document in solr/example/exampledocs. Queries listed in file.

          Tom Burton-West made changes -
          Attachment testSolr3589.xml.gz [ 12542189 ]
          Tom Burton-West added a comment -

          I'm not at the point where I understand the test cases for Edismax enough to write unit tests. If someone can point me to an example unit test somewhere that I could use to model a test, please do.
          In the meantime, attached is a file which can be put in the Solr exampledocs directory and indexed (see the commands below). Sample queries demonstrating the problem with English hyphenated words and with CJK are included.
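
          For example, with the stock example distribution (start the example Solr first; post.jar ships in exampledocs):

              cd solr/example/exampledocs
              gunzip testSolr3589.xml.gz
              java -jar post.jar testSolr3589.xml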

          Tom Burton-West added a comment -

          See above note

          Tom Burton-West made changes -
          Attachment testSolr3589.xml.gz [ 12542190 ]
          Naomi Dushay added a comment - - edited

          (comment redacted! – I can't repeat the results today. Perhaps I was missing a Solr commit ... since the index is behaving differently, and I didn't change the index, though I did restart Solr a few times.)

          (below left for historical purposes)

          I may have stumbled into something. Try setting q.op explicitly.

          (baseurl)/select?q=fire-fly

          gives me a lot more results than

          (baseurl)/select?q=fire-fly&q.op=AND

          oddly, q.op=OR gives me the same results as setting it to AND.

          Why did I stumble into this?

          from http://wiki.apache.org/solr/DisMaxQParserPlugin#mm_.28Minimum_.27Should.27_Match.29

          "In Solr 1.4 and prior, you should basically set mm=0 if you want the equivilent of q.op=OR, and mm=100% if you want the equivilent of q.op=AND. In 3.x and trunk the default value of mm is dictated by the q.op param (q.op=AND => mm=100%; q.op=OR => mm=0%). Keep in mind the default operator is effected by your schema.xml <solrQueryParser defaultOperator="xxx"/> entry. In older versions of Solr the default value is 100% (all clauses must match)"

          I have q.op set in my schema, thus:

          <solrQueryParser defaultOperator="AND" />

          but when I use the q.op parameter, I experience something different. Wild!

          Does this give us any insights?

          Naomi Dushay added a comment -

          Would this bug be addressed if this one is addressed? https://issues.apache.org/jira/browse/LUCENE-3833 (add operator to lucene queryparser for term quorum)

          Robert Muir added a comment -

          Simple unit test, based on Tom's example.

          Robert Muir made changes -
          Attachment SOLR-3589_test.patch [ 12551648 ]
          Robert Muir added a comment -

          Here's my hack patch.

          No idea if it's right... in particular I don't really know what is going on with this query parser in general.

          The high-level idea is that we can detect whether the "multiple tokens" are synonyms versus CJK by the coord setting of the BooleanQuery, since coord will be enabled in the CJK case (same as if the tokens were whitespace-separated) but disabled for synonyms.

          It could be totally wrong, but at least tests pass, and it might give someone else some ideas or be a useful workaround.
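
          A rough sketch of the detection idea (Lucene 3.x/4.x-era API; illustrative, not the actual patch code):

              import org.apache.lucene.search.BooleanClause;
              import org.apache.lucene.search.BooleanQuery;
              import org.apache.lucene.search.Query;

              // If the analyzer turned one whitespace-delimited chunk into a
              // BooleanQuery of SHOULD clauses, coord tells the two cases apart:
              // synonyms are emitted at the same position (coord disabled),
              // while CJK/word-delimiter splits occupy successive positions
              // (coord enabled, same as whitespace-separated terms).
              static boolean isTokenSplit(Query q) {
                if (!(q instanceof BooleanQuery)) return false;
                BooleanQuery bq = (BooleanQuery) q;
                if (bq.isCoordDisabled()) return false; // synonym case: leave as-is
                for (BooleanClause c : bq.clauses()) {
                  if (c.getOccur() != BooleanClause.Occur.SHOULD) return false;
                }
                return true; // split case: clauses can count toward mm individually
              }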

          Robert Muir made changes -
          Attachment SOLR-3589.patch [ 12551654 ]
          Robert Muir added a comment -

          I traced through the logic here, and added additional tests (e.g. multi-field aliasing for this CJK case).

          Actually I needed the logic in a different place; anyway, I think this patch is significantly more baked.

          Robert Muir made changes -
          Attachment SOLR-3589.patch [ 12551666 ]
          Robert Muir added a comment -

          More tests: I think this patch is ready actually.

          It's well contained: we only apply this when the analyzer splits a token (e.g. CJK), not to any structured queries and not for the synonyms case.

          Robert Muir made changes -
          Attachment SOLR-3589.patch [ 12551672 ]
          Robert Muir made changes -
          Assignee Robert Muir [ rcmuir ]
          Robert Muir added a comment -

          I pinged hossman on IRC for some feedback; I'll update the tests to show we aren't changing behavior with synonyms: this isn't tested today.

          Robert Muir added a comment -

          Patch with the added synonyms test.

          Robert Muir made changes -
          Attachment SOLR-3589.patch [ 12551756 ]
          Tom Burton-West added a comment -

          Back-port to 3.6 branch

          Tom Burton-West made changes -
          Attachment SOLR-3589.patch [ 12552358 ]
          Tom Burton-West added a comment -

          I back-ported to the 3.6 branch. Forgot to change the name from SOLR-3589.patch, so the 6/Nov/12 patch is the 3.6 patch against yesterday's svn version of 3.6.

          The main difference I saw between 3.6 and 4.0 is that Solr 4.0 uses DisMaxQParser.parseMinShouldMatch() to set the default to 0% if q.op=OR and 100% if q.op=AND.

          I just kept the 3.6 behavior, which uses the 3.6 default of 100% (if mm is not set).

          I'll test the 3.6 patch against a production index tomorrow.
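
          A simplified sketch of that 4.0 defaulting rule (not the actual DisMaxQParser source; "params" is an assumed map of request parameters):

              // 4.0: if mm is absent, derive it from the effective q.op;
              // 3.6 instead always defaulted mm to 100% when unset.
              String qop = params.get("q.op");
              String mm  = params.get("mm");
              if (mm == null) {
                mm = "AND".equals(qop) ? "100%" : "0%";
              }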

          Robert Muir added a comment -

          Hi Tom: thanks for working on the 3.6 backport!

          I'll commit the trunk/4.x patch first, and wait for your testing and review your patch before looking at 3.6!

          Robert Muir added a comment -

          Committed to trunk/4.x.

          Will look tomorrow at 3.6.

          Tom Burton-West added a comment -

          Forgot to work from your latest patch with the synonyms test. I'll post a new backport of the patch, with the synonyms test and against the latest 3.6.x in svn, shortly.

          Tom Burton-West added a comment -

          Backport to 3.6 r1406713. Includes synonyms test.

          Will test it against production later today.

          Tom Burton-West made changes -
          Attachment SOLR-3589-3.6.PATCH [ 12552500 ]
          Tom Burton-West added a comment -

          Hi Robert,

          I just put the backport to 3.6 up on our test server and pointed it at one of our production shards. The improvement for Chinese queries is dramatic (especially for longer queries like the TREC 5 queries; see examples below).

          When you have time, please look over the backport of the patch. I think it is fine but I would appreciate you looking it over. My understanding of your patch is that it just affects a small portion of the edismax logic, but I don't understand the edismax parser well enough to be sure there isn't some difference between 3.6 and 4.0 that I didn't account for in the patch.

          Thanks for working on this. Naomi and I are both very excited about this bug finally being fixed and want to put the fix into production soon.

          Example TREC 5 Chinese queries:

          <num> Number: CH4
          <E-title> The newly discovered oil fields in China.
          <C-title> 中国大陆新发现的油田
          40,135 items found for 中国大陆新发现的油田 with current implementation (due to dismax bug)
          78 items found for 中国大陆新发现的油田 with patch

          <num> Number: CH10
          <E-title> Border Trade in Xinjiang
          <C-title> 新疆的边境贸易
          20,249 items found for 新疆的边境贸易 current implementation (with bug)
          243 items found for 新疆的边境贸易 with patch.

          Robert Muir added a comment -

          Backported to 3.6 branch in case we do a 3.6.2

          Thanks Tom!

          Robert Muir made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Fix Version/s 3.6.2 [ 12322485 ]
          Fix Version/s 4.1 [ 12321141 ]
          Fix Version/s 5.0 [ 12321664 ]
          Resolution Fixed [ 1 ]
          Naomi Dushay added a comment -

          Any chance this fix can be applied to dismax as well?

          Steve Rowe made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Commit Tag Bot added a comment -

          [branch_4x commit] Robert Muir
          http://svn.apache.org/viewvc?view=revision&revision=1406439

          SOLR-3589: Edismax parser does not honor mm parameter if analyzer splits a token

          Naomi Dushay made changes -
          Link This issue is related to SOLR-3739 [ SOLR-3739 ]

            People

            • Assignee:
              Robert Muir
              Reporter:
              Tom Burton-West
             • Votes:
               4
               Watchers:
               12
