Solr
  1. Solr
  2. SOLR-3823

Parentheses in a boost query cause errors

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 4.0-BETA
    • Fix Version/s: 4.0, 6.0
    • Component/s: query parsers
    • Labels:
      None
    • Environment:

      Mac, jdk 1.6, Chrome

      Description

      When using a boost query (bq) that contains a parentheses (like this example from the Relevancy Cookbook section of the wiki):

       ? defType = dismax 
          & q = foo bar 
          & bq = (*:* -xxx)^999 
      

      You get the following error:

      org.apache.lucene.queryparser.classic.ParseException: Cannot parse 'xxx)': Encountered " ")" ") "" at line 1, column 12. Was expecting one of: <EOF> <AND> ... <OR> ... <NOT> ... "+" ... "" ... <BAREOPER> ... "(" ... "*" ... "^" ... <QUOTED> ... <TERM> ... <FUZZY_SLOP> ... <PREFIXTERM> ... <WILDTERM> ... <REGEXPTERM> ... "[" ... "{" ... <NUMBER> ...

        Issue Links

          Activity

          Hide
          Erick Erickson added a comment -

          The error is because of the colon character, it has meaning in a query and must be escaped. See: http://lucene.apache.org/core/3_6_1/queryparsersyntax.html.

          So I'll close this as invalid, if you disagree please let us know.

          BTW, it's better to raise this kind of question on the user's list rather than open a JIRA, at least until you're sure it's really a bug.

          Show
          Erick Erickson added a comment - The error is because of the colon character, it has meaning in a query and must be escaped. See: http://lucene.apache.org/core/3_6_1/queryparsersyntax.html . So I'll close this as invalid, if you disagree please let us know. BTW, it's better to raise this kind of question on the user's list rather than open a JIRA, at least until you're sure it's really a bug.
          Hide
          Hoss Man added a comment -

          1) editing the issue description to include "noformat" tags – i think Erick was getting confused by the "*:*" showing up as just ":"

          2) i can't reproduce the described problem. When i tried using the solr example data, this request worked just fine...

          http://localhost:8983/solr/select?q=ipod&defType=dismax&bq=%28*:*%20-id:IW-02%29^999

          Mathos: please follow up on the solr-user@lucene mailing list with more details about the problems you are you having and your actual (specific) configs/queries

          Show
          Hoss Man added a comment - 1) editing the issue description to include "noformat" tags – i think Erick was getting confused by the "*:*" showing up as just ":" 2) i can't reproduce the described problem. When i tried using the solr example data, this request worked just fine... http://localhost:8983/solr/select?q=ipod&defType=dismax&bq=%28*:*%20-id:IW-02%29 ^999 Mathos: please follow up on the solr-user@lucene mailing list with more details about the problems you are you having and your actual (specific) configs/queries
          Hide
          Mathos Marcer added a comment -

          The problem seems to be when I specify defType=edismax, under defType=dismax it is working like a champ.

          Show
          Mathos Marcer added a comment - The problem seems to be when I specify defType=edismax, under defType=dismax it is working like a champ.
          Hide
          Erick Erickson added a comment -

          Thanks, Hoss, you're right...

          But I can get this to fail both with BETA and today's trunk with the example data.

          http://localhost:8983/solr/select?q=foo&defType=edismax&bq=(name:nonsense -xxx)^999
          

          Interestingly this works: (note the space after "bq"),

          http://localhost:8983/solr/select?q=foo&defType=edismax&bq =(name:nonsense -xxx)^999
          

          This fails (spaces around parens, there was an issue with non-space parens lately, but apparently it's unrelated.)

          http://localhost:8983/solr/select?q=foo&defType=edismax&bq= ( name:nonsense -xxx ) ^999
          

          Stack trace from log:

          Caused by: org.apache.lucene.queryparser.classic.ParseException: Encountered "<EOF>" at line 1, column 1.
          Was expecting one of:
          <NOT> ...
          "+" ...
          "-" ...
          <BAREOPER> ...
          "(" ...
          "*" ...
          <QUOTED> ...
          <TERM> ...
          <PREFIXTERM> ...
          <WILDTERM> ...
          <REGEXPTERM> ...
          "[" ...
          "{" ...
          <NUMBER> ...
          <TERM> ...
          "*" ...

          at org.apache.lucene.queryparser.classic.QueryParser.generateParseException(QueryParser.java:708)
          at org.apache.lucene.queryparser.classic.QueryParser.jj_consume_token(QueryParser.java:590)
          at org.apache.lucene.queryparser.classic.QueryParser.Clause(QueryParser.java:275)
          at org.apache.lucene.queryparser.classic.QueryParser.Query(QueryParser.java:181)
          at org.apache.lucene.queryparser.classic.QueryParser.Clause(QueryParser.java:261)
          at org.apache.lucene.queryparser.classic.QueryParser.Query(QueryParser.java:181)
          at org.apache.lucene.queryparser.classic.QueryParser.TopLevelQuery(QueryParser.java:170)
          at org.apache.lucene.queryparser.classic.QueryParserBase.parse(QueryParserBase.java:120)
          ... 35 more

          Sep 11, 2012 12:37:58 PM org.apache.solr.core.SolrCore execute
          INFO: [collection1] webapp=/solr path=/select params=

          {q=foo&defType=edismax&bq=+(+name:nonsense+-xxx+)+^999}

          status=400 QTime=2

          Show
          Erick Erickson added a comment - Thanks, Hoss, you're right... But I can get this to fail both with BETA and today's trunk with the example data. http://localhost:8983/solr/select?q=foo&defType=edismax&bq=(name:nonsense -xxx)^999 Interestingly this works: (note the space after "bq"), http://localhost:8983/solr/select?q=foo&defType=edismax&bq =(name:nonsense -xxx)^999 This fails (spaces around parens, there was an issue with non-space parens lately, but apparently it's unrelated.) http://localhost:8983/solr/select?q=foo&defType=edismax&bq= ( name:nonsense -xxx ) ^999 Stack trace from log: Caused by: org.apache.lucene.queryparser.classic.ParseException: Encountered "<EOF>" at line 1, column 1. Was expecting one of: <NOT> ... "+" ... "-" ... <BAREOPER> ... "(" ... "*" ... <QUOTED> ... <TERM> ... <PREFIXTERM> ... <WILDTERM> ... <REGEXPTERM> ... "[" ... "{" ... <NUMBER> ... <TERM> ... "*" ... at org.apache.lucene.queryparser.classic.QueryParser.generateParseException(QueryParser.java:708) at org.apache.lucene.queryparser.classic.QueryParser.jj_consume_token(QueryParser.java:590) at org.apache.lucene.queryparser.classic.QueryParser.Clause(QueryParser.java:275) at org.apache.lucene.queryparser.classic.QueryParser.Query(QueryParser.java:181) at org.apache.lucene.queryparser.classic.QueryParser.Clause(QueryParser.java:261) at org.apache.lucene.queryparser.classic.QueryParser.Query(QueryParser.java:181) at org.apache.lucene.queryparser.classic.QueryParser.TopLevelQuery(QueryParser.java:170) at org.apache.lucene.queryparser.classic.QueryParserBase.parse(QueryParserBase.java:120) ... 35 more Sep 11, 2012 12:37:58 PM org.apache.solr.core.SolrCore execute INFO: [collection1] webapp=/solr path=/select params= {q=foo&defType=edismax&bq=+(+name:nonsense+-xxx+)+^999} status=400 QTime=2
          Hide
          Mathos Marcer added a comment -

          I'm glad I'm not just going crazy

          I did notice while the space before the equal sign (ie "bq =(name:nonsense -xxx)^999") doesn't produce a parsing error, comparing results between 3.6 and 4.0 BETA, it doesn't appear to be applying the boost. In fact I get the same results as if I didn't have the "bq" option there at all.

          Show
          Mathos Marcer added a comment - I'm glad I'm not just going crazy I did notice while the space before the equal sign (ie "bq =(name:nonsense -xxx)^999") doesn't produce a parsing error, comparing results between 3.6 and 4.0 BETA, it doesn't appear to be applying the boost. In fact I get the same results as if I didn't have the "bq" option there at all.
          Hide
          Mathos Marcer added a comment - - edited

          Actually looking at it closer, it is probably because with adding the space after bq is it doesn't register it as "bq" but as "bq " looking at the params section of the query:

          <str name="bq ">(*:* -replacement)^99500000</str>

          Show
          Mathos Marcer added a comment - - edited Actually looking at it closer, it is probably because with adding the space after bq is it doesn't register it as "bq" but as "bq " looking at the params section of the query: <str name="bq ">(*:* -replacement)^99500000</str>
          Hide
          Erick Erickson added a comment -

          FWIW, I'm on a Mac (Lion) too, although I doubt that matters.

          Show
          Erick Erickson added a comment - FWIW, I'm on a Mac (Lion) too, although I doubt that matters.
          Hide
          Jack Krupansky added a comment -

          It turns out that it is the white space that causes the problem.

          Looking at the code, I see that edismax calls SolrPluginUtils.parseFieldBoosts to "parse" the "bq" value, which is absurd since it simply breaks up the param value based on whitespace, even in the middle of the query. Then the code calls QParser.subQuery to parse the portions of the query, which of course makes no sense unless the original query contained no whitespace or was a sequence of non-white space queries separated with whitespace.

          Dismax, OTOH, treats "bq" as an array of query strings (multiple "bq" params) and parses each as a separate sub-query.

          Edismax in 3.6/3.6.1 has the old dismax code.

          The culprit? SOLR-3278: "edismax support for negative boost with "bq" parameter" introduced the error, back in March.

          Show
          Jack Krupansky added a comment - It turns out that it is the white space that causes the problem. Looking at the code, I see that edismax calls SolrPluginUtils.parseFieldBoosts to "parse" the "bq" value, which is absurd since it simply breaks up the param value based on whitespace, even in the middle of the query. Then the code calls QParser.subQuery to parse the portions of the query, which of course makes no sense unless the original query contained no whitespace or was a sequence of non-white space queries separated with whitespace. Dismax, OTOH, treats "bq" as an array of query strings (multiple "bq" params) and parses each as a separate sub-query. Edismax in 3.6/3.6.1 has the old dismax code. The culprit? SOLR-3278 : "edismax support for negative boost with "bq" parameter" introduced the error, back in March.
          Hide
          Hoss Man added a comment -

          jack: thanks for digging in and tracking down the bug .. i'll try to get to this for 4.0

          (Mathos: FWIW, details really matter bug reports like – i tried what you said and couldn't reproduce, if you mentioned that you were using edismax instead of dismax i probably would have been able to reproduce, and the stack trace would have helped me find the bug in the code)

          Show
          Hoss Man added a comment - jack: thanks for digging in and tracking down the bug .. i'll try to get to this for 4.0 (Mathos: FWIW, details really matter bug reports like – i tried what you said and couldn't reproduce, if you mentioned that you were using edismax instead of dismax i probably would have been able to reproduce, and the stack trace would have helped me find the bug in the code)
          Hide
          Hoss Man added a comment -

          Committed revision 1383708. - trunk
          Committed revision 1383716. - 4x

          thanks everybody for helping track this down.

          Show
          Hoss Man added a comment - Committed revision 1383708. - trunk Committed revision 1383716. - 4x thanks everybody for helping track this down.
          Hide
          James Dyer added a comment -

          Hoss,

          I appreciate you fixing this, but I would rather get a fix that preserves the negative boost support (SOLR-3278). I guess I don't understand the bug this issue was addressing. Is it simply that "bq" would fail if extra whitespace was in the query? Could we write a failing testcase for that? Do you see a reason why it would be difficult to fix this and retain the negative boosts?

          The discussion of LUCENE-4378 is pertinent: we have products in our index that we either do not sell or we know most of our customer do not want. Yet they often score very high. The only way I can reliably prevent these from becoming top hits is to use a negative boost. I would imagine this is a frequent requirement.

          I'm more than willing to contribute for this, but I couldn't tell that this issue was an actual problem or a case of users putting whitespace where it doesn't belong and prior versions being more forgiving.

          Show
          James Dyer added a comment - Hoss, I appreciate you fixing this, but I would rather get a fix that preserves the negative boost support ( SOLR-3278 ). I guess I don't understand the bug this issue was addressing. Is it simply that "bq" would fail if extra whitespace was in the query? Could we write a failing testcase for that? Do you see a reason why it would be difficult to fix this and retain the negative boosts? The discussion of LUCENE-4378 is pertinent: we have products in our index that we either do not sell or we know most of our customer do not want. Yet they often score very high. The only way I can reliably prevent these from becoming top hits is to use a negative boost. I would imagine this is a frequent requirement. I'm more than willing to contribute for this, but I couldn't tell that this issue was an actual problem or a case of users putting whitespace where it doesn't belong and prior versions being more forgiving.
          Hide
          Erick Erickson added a comment -

          James:

          The problem was quite the opposite. When there was NO space in the bq clause it'd fail like this, i.e.
          bq=(stuff). And when there was space, I don't think it worked at all....

          But yeah, it'd be good to have both parens and negative boosts...

          Show
          Erick Erickson added a comment - James: The problem was quite the opposite. When there was NO space in the bq clause it'd fail like this, i.e. bq=(stuff). And when there was space, I don't think it worked at all.... But yeah, it'd be good to have both parens and negative boosts...
          Hide
          Hoss Man added a comment -

          I couldn't tell that this issue was an actual problem or a case of users putting whitespace where it doesn't belong and prior versions being more forgiving.

          James: the core of the bug was your use of SolrPluginUtils.parseFieldBoosts to try and parse the bq params.

          This is not safe – if you look at the method it is an extremely trivial utility that is specific for parsing qf/pf style strings containing a list of field names and boosts. it's not a safe way to parse an arbitrary query string, and any non trivial query string can cause problems with it.

          AS you noted in SOLR-3278, parseFieldBoosts is used for parsing the "bf" param and that's actually a long standing unsafe bug as well (SOLR-2014) but since functions tend to be much simpler, it's historically been less problematic. when people run into problems with it, the workarround is to use "bq=

          {!func}

          ..." instead.

          I would rather get a fix that preserves the negative boost support

          Since SOLR-3278 had not been released publicly outside of the ALPHA/BETA, my first priority was to fix the regression compared to 3.x where non-trivial bq queries worked fine.

          The documented method of dealing with "negative boosting" in solr is actually the type of query that was the crux of this bug report, and i updated the tests you added in SOLR-3278 to use that pattern...

          https://wiki.apache.org/solr/SolrRelevancyFAQ#How_do_I_give_a_negative_.28or_very_low.29_boost_to_documents_that_match_a_query.3F

          I have no objections to supporting "true" negative boosts, but i think the right way to do it is in the query parsers / QParsers themselves (so that the boosts can be on any clause) and not just as a special hack for bq/bf (the fact that it works in bf is actualy just a fluke of it's buggy implementation) but as you can see in LUCENE-4378 this is a contentious idea.

          Show
          Hoss Man added a comment - I couldn't tell that this issue was an actual problem or a case of users putting whitespace where it doesn't belong and prior versions being more forgiving. James: the core of the bug was your use of SolrPluginUtils.parseFieldBoosts to try and parse the bq params. This is not safe – if you look at the method it is an extremely trivial utility that is specific for parsing qf/pf style strings containing a list of field names and boosts. it's not a safe way to parse an arbitrary query string, and any non trivial query string can cause problems with it. AS you noted in SOLR-3278 , parseFieldBoosts is used for parsing the "bf" param and that's actually a long standing unsafe bug as well ( SOLR-2014 ) but since functions tend to be much simpler, it's historically been less problematic. when people run into problems with it, the workarround is to use "bq= {!func} ..." instead. I would rather get a fix that preserves the negative boost support Since SOLR-3278 had not been released publicly outside of the ALPHA/BETA, my first priority was to fix the regression compared to 3.x where non-trivial bq queries worked fine. The documented method of dealing with "negative boosting" in solr is actually the type of query that was the crux of this bug report, and i updated the tests you added in SOLR-3278 to use that pattern... https://wiki.apache.org/solr/SolrRelevancyFAQ#How_do_I_give_a_negative_.28or_very_low.29_boost_to_documents_that_match_a_query.3F I have no objections to supporting "true" negative boosts, but i think the right way to do it is in the query parsers / QParsers themselves (so that the boosts can be on any clause) and not just as a special hack for bq/bf (the fact that it works in bf is actualy just a fluke of it's buggy implementation) but as you can see in LUCENE-4378 this is a contentious idea.
          Hide
          James Dyer added a comment -

          Hoss, Thank you for working through this and opening Lucene-4378 to at least investigate changing the parser grammar. I understand the issue with what I had done initially and appreciate your help on this.

          Show
          James Dyer added a comment - Hoss, Thank you for working through this and opening Lucene-4378 to at least investigate changing the parser grammar. I understand the issue with what I had done initially and appreciate your help on this.
          Hide
          Erik Hatcher added a comment -

          Sorry to chime in after this issue has been marked as resolved, but .... it doesn't seem right to me that bq is parsed with the defType specified parser. Rather it should, like fq's, default to the "lucene" query parser. No?

          Show
          Erik Hatcher added a comment - Sorry to chime in after this issue has been marked as resolved, but .... it doesn't seem right to me that bq is parsed with the defType specified parser. Rather it should, like fq's, default to the "lucene" query parser. No?
          Hide
          Jack Krupansky added a comment - - edited

          bq is parsed with the defType specified parser. Rather it should, like fq's, default to the "lucene" query parser.

          Sounds reasonable for fq and bq to be parsed the same way, but it seems unrelated to this specific Jira.

          And this also begs the question of why bq shouldn't simply be a global query parameter like fq. I don't see any reason why a multiplicative boost shouldn't apply to a non-dismax/edismax query - but again this is outside the scope of this specific Jira.

          Show
          Jack Krupansky added a comment - - edited bq is parsed with the defType specified parser. Rather it should, like fq's, default to the "lucene" query parser. Sounds reasonable for fq and bq to be parsed the same way, but it seems unrelated to this specific Jira. And this also begs the question of why bq shouldn't simply be a global query parameter like fq. I don't see any reason why a multiplicative boost shouldn't apply to a non-dismax/edismax query - but again this is outside the scope of this specific Jira.
          Hide
          Hoss Man added a comment -

          Sorry to chime in after this issue has been marked as resolved, but .... it doesn't seem right to me that bq is parsed with the defType specified parser.

          I don't understand this comment – I see no evidence that "bq" is incorrectly using the defType parser in either dismax or edismax.

          In both cases null is passed to the subQuery() call for bq allowing it inherit the defTYpe from the localparams of the main "q" string (which is appropriate since the bq is an implicit subquery added to the main query) but not a global defType. ie, in the first example below, the "lucene" parser would be used to parse the bq, while in the second example the "yyy" parser would be used because of the explicit defType localparam...

          /select?qf=name+text&defType=xxx&q={!edismax}mainquery&bq=myboost
          
          /select?qf=name+text&defType=xxx&q={!edismax defType=yyy}mainquery&bq=myboost
          

          ...this is how subqueries have always worked with QParsers. if you think this should be reconsidered, please open a specific issue.

          (NOTE: fq's are not subqueries of the main query, so they don't look at the "q" at all)

          Show
          Hoss Man added a comment - Sorry to chime in after this issue has been marked as resolved, but .... it doesn't seem right to me that bq is parsed with the defType specified parser. I don't understand this comment – I see no evidence that "bq" is incorrectly using the defType parser in either dismax or edismax. In both cases null is passed to the subQuery() call for bq allowing it inherit the defTYpe from the localparams of the main "q" string (which is appropriate since the bq is an implicit subquery added to the main query) but not a global defType. ie, in the first example below, the "lucene" parser would be used to parse the bq, while in the second example the "yyy" parser would be used because of the explicit defType localparam... /select?qf=name+text&defType=xxx&q={!edismax}mainquery&bq=myboost /select?qf=name+text&defType=xxx&q={!edismax defType=yyy}mainquery&bq=myboost ...this is how subqueries have always worked with QParsers. if you think this should be reconsidered, please open a specific issue. (NOTE: fq's are not subqueries of the main query, so they don't look at the "q" at all)
          Hide
          Jack Krupansky added a comment -

          ...this is how subqueries have always worked with QParsers

          Is there any wiki/Javadoc that documents this behavior?

          I would also note that both the dismax and edismax wiki pages for "bq" say:

          A raw query string (in the SolrQuerySyntax)

          which is not right. We know it uses defType, but we clearly need the precise language as it is supposed to be (noting the comments by hossman.)

          See:
          http://wiki.apache.org/solr/DisMaxQParserPlugin#bq_.28Boost_Query.29
          http://wiki.apache.org/solr/ExtendedDisMax#bq_.28Boost_Query.29

          Show
          Jack Krupansky added a comment - ...this is how subqueries have always worked with QParsers Is there any wiki/Javadoc that documents this behavior? I would also note that both the dismax and edismax wiki pages for "bq" say: A raw query string (in the SolrQuerySyntax) which is not right. We know it uses defType, but we clearly need the precise language as it is supposed to be (noting the comments by hossman.) See: http://wiki.apache.org/solr/DisMaxQParserPlugin#bq_.28Boost_Query.29 http://wiki.apache.org/solr/ExtendedDisMax#bq_.28Boost_Query.29
          Hide
          Erik Hatcher added a comment -

          I see no evidence that "bq" is incorrectly using the defType parser in either dismax or edismax

          Ok, sorry I misinterpreted the nature of this issue apparently.

          Show
          Erik Hatcher added a comment - I see no evidence that "bq" is incorrectly using the defType parser in either dismax or edismax Ok, sorry I misinterpreted the nature of this issue apparently.
          Hide
          Commit Tag Bot added a comment -

          [branch_4x commit] Chris M. Hostetter
          http://svn.apache.org/viewvc?view=revision&revision=1383716

          SOLR-3823: Fix 'bq' parsing in edismax. Please note that this required reverting the negative boost support added by SOLR-3278 (merge r1383708)

          Show
          Commit Tag Bot added a comment - [branch_4x commit] Chris M. Hostetter http://svn.apache.org/viewvc?view=revision&revision=1383716 SOLR-3823 : Fix 'bq' parsing in edismax. Please note that this required reverting the negative boost support added by SOLR-3278 (merge r1383708)
          Hide
          Uwe Schindler added a comment -

          Closed after release.

          Show
          Uwe Schindler added a comment - Closed after release.

            People

            • Assignee:
              Hoss Man
              Reporter:
              Mathos Marcer
            • Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development