Solr
  1. Solr
  2. SOLR-3377

eDismax: A fielded query wrapped by parens is not recognized

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Critical Critical
    • Resolution: Fixed
    • Affects Version/s: 3.6
    • Fix Version/s: 4.0-BETA
    • Component/s: query parsers
    • Labels:
      None

      Description

      As reported by "bernd" on the user list, a query like this

      q=(name:test)

      will yield 0 hits in 3.6 while it worked in 3.5. It works without the parens.

      1. SOLR-3377.patch
        2 kB
        Bernd Fehling
      2. SOLR-3377.patch
        2 kB
        Bernd Fehling
      3. SOLR-3377.patch
        2 kB
        Bernd Fehling
      4. SOLR-3377.patch
        0.6 kB
        Jan Høydahl

        Issue Links

          Activity

          Hide
          Jan Høydahl added a comment -

          The query is parsed into:

          +(id:name:test^10.0 | text:name:test^0.5 | cat:name:test^1.4 | manu:name:test^1.1 | name:name:test^1.2 | features:name:test | sku:nametest^1.5)

          suggesting that eDismax falls back to literal matching because the field name (name is not found. Perhaps introduced by SOLR-3026?

          A temporary workaround seems to be to insert a space before all field names:

          q=( name:test )

          Show
          Jan Høydahl added a comment - The query is parsed into: +(id:name:test^10.0 | text:name:test^0.5 | cat:name:test^1.4 | manu:name:test^1.1 | name:name:test^1.2 | features:name:test | sku:nametest^1.5) suggesting that eDismax falls back to literal matching because the field name (name is not found. Perhaps introduced by SOLR-3026 ? A temporary workaround seems to be to insert a space before all field names: q= ( name:test )
          Hide
          Jan Høydahl added a comment -

          This first patch only reproduces the bug with a failing unit test

          Show
          Jan Høydahl added a comment - This first patch only reproduces the bug with a failing unit test
          Hide
          Bernd Fehling added a comment -

          A suggestion, how would it be to do a simple "cleanup" as a first step of query string handling by removing the outermost parantheses (if there are any)?
          This won't touch the inner edismax logic and produce another bad sideeffect.

          Show
          Bernd Fehling added a comment - A suggestion, how would it be to do a simple "cleanup" as a first step of query string handling by removing the outermost parantheses (if there are any)? This won't touch the inner edismax logic and produce another bad sideeffect.
          Hide
          Bernd Fehling added a comment -

          I have enhanced the test a bit more and supplied a patch which works.
          It also passes all other tests.

          Please test and give feedback.

          Show
          Bernd Fehling added a comment - I have enhanced the test a bit more and supplied a patch which works. It also passes all other tests. Please test and give feedback.
          Hide
          Jan Høydahl added a comment -

          Tried the patch, it fails one of the existing tests for me @ TestExtendedDismaxParser.java:164, the query is
          star OR (-star)
          Does this test pass for you?

          Show
          Jan Høydahl added a comment - Tried the patch, it fails one of the existing tests for me @ TestExtendedDismaxParser.java:164, the query is star OR (-star) Does this test pass for you?
          Hide
          Bernd Fehling added a comment -

          You're right.
          New patch is doing the job now but I'm not very happy with it.

          The patch should be a more general solution but this is currently a specific solution.

          At least it does not change the original query as a replacement from "(" to "( + space" would do.

          Show
          Bernd Fehling added a comment - You're right. New patch is doing the job now but I'm not very happy with it. The patch should be a more general solution but this is currently a specific solution. At least it does not change the original query as a replacement from "(" to "( + space" would do.
          Hide
          Bernd Fehling added a comment -

          Now this is a general patch which passes all tests.
          It can handle one or more parentheses directly before a valid field name.

          Please test and give feedback.

          Show
          Bernd Fehling added a comment - Now this is a general patch which passes all tests. It can handle one or more parentheses directly before a valid field name. Please test and give feedback.
          Hide
          Bernd Fehling added a comment -

          If there are no further complains about the patch can it then be committed to 3.6.1 / trunk and the issue closed?

          Who is doing the commit / has the permissions to commit?

          Show
          Bernd Fehling added a comment - If there are no further complains about the patch can it then be committed to 3.6.1 / trunk and the issue closed? Who is doing the commit / has the permissions to commit?
          Hide
          Jan Høydahl added a comment -

          I think we need better test coverage before this is ready.
          We should add a bunch of tests with queries involving parens, to verify that they behave as expected. Both tests involving parens as intended grouping for boolean precedence as well as parens not intended as boolean sugar but as plain text pasted from somewhere:

          q=(foo OR title:bar) AND (title:foo OR bar)
          q=Meeting at noon (room:Auditorium)
          

          The first should obey the instructed boolean order, while the last should return docs with the literal token "room:Autirium" in any of the qf fields.

          The key goal of dismax is to be very robust so people can paste in all kind of garbage, and get matches. So if the query parses as valid boolean logic, that should be used.

          Show
          Jan Høydahl added a comment - I think we need better test coverage before this is ready. We should add a bunch of tests with queries involving parens, to verify that they behave as expected. Both tests involving parens as intended grouping for boolean precedence as well as parens not intended as boolean sugar but as plain text pasted from somewhere: q=(foo OR title:bar) AND (title:foo OR bar) q=Meeting at noon (room:Auditorium) The first should obey the instructed boolean order, while the last should return docs with the literal token "room:Autirium" in any of the qf fields. The key goal of dismax is to be very robust so people can paste in all kind of garbage, and get matches. So if the query parses as valid boolean logic, that should be used.
          Hide
          Bernd Fehling added a comment -

          Shoot me an enhanced unit test which covers your requests and i will look into this.

          But, while looking through all the test cases I think we really need a clear definition of rules and define a BNF syntax description for edismax and then implement the BNF syntax. This has two advantages, the user knows how to construct a valid query and we can clean up the patch work inside edismax. This can also obey fallback mechanism to always return a valid query.
          How about that?

          Show
          Bernd Fehling added a comment - Shoot me an enhanced unit test which covers your requests and i will look into this. But, while looking through all the test cases I think we really need a clear definition of rules and define a BNF syntax description for edismax and then implement the BNF syntax. This has two advantages, the user knows how to construct a valid query and we can clean up the patch work inside edismax. This can also obey fallback mechanism to always return a valid query. How about that?
          Hide
          Jan Høydahl added a comment -

          Have not time to work on this for the next few months, so unassigning for now

          Show
          Jan Høydahl added a comment - Have not time to work on this for the next few months, so unassigning for now
          Hide
          Jack Krupansky added a comment -

          This seems like a rather serious bug to me, quite a black eye for Solr 3.6. There must be some committer willing to commit the proposed patch.

          Show
          Jack Krupansky added a comment - This seems like a rather serious bug to me, quite a black eye for Solr 3.6. There must be some committer willing to commit the proposed patch.
          Hide
          Bernd Fehling added a comment - - edited

          I was willing to supply a final fix to this and was hoping that it will make it to release 4.x.
          But unfortunately:

          • I got no enhanced unit test
          • noone comitted this/my patch either
          • the problem is still there

          So I said "was willing", thats true, I gave up on this and thinking now about switching to ElasticSearch because they really appreciate any help.

          Show
          Bernd Fehling added a comment - - edited I was willing to supply a final fix to this and was hoping that it will make it to release 4.x. But unfortunately: I got no enhanced unit test noone comitted this/my patch either the problem is still there So I said "was willing", thats true, I gave up on this and thinking now about switching to ElasticSearch because they really appreciate any help.
          Hide
          Jan Høydahl added a comment -

          Bernd, I agree that this is a bug that absolutely should be fixed.

          I have followed it through this far but have not yet had the chance to go the last mile until committing, but I am definitely keen to pick it up again after summer holidays and parental leave, hopefully before. The reason I unassigned myself is to signal to the other committers that I'm not actively working on this and let others step in if they wish.

          This is the way Apache works - we are all volunteers, and I am sure that with some patience this will make it through in time for 4.0 final. You've done a great job so far with the patch. It may be "final" and good to go, but personally I'd write some more tests since this particular area has been lacking - before committing.

          Show
          Jan Høydahl added a comment - Bernd, I agree that this is a bug that absolutely should be fixed. I have followed it through this far but have not yet had the chance to go the last mile until committing, but I am definitely keen to pick it up again after summer holidays and parental leave, hopefully before. The reason I unassigned myself is to signal to the other committers that I'm not actively working on this and let others step in if they wish. This is the way Apache works - we are all volunteers, and I am sure that with some patience this will make it through in time for 4.0 final. You've done a great job so far with the patch. It may be "final" and good to go, but personally I'd write some more tests since this particular area has been lacking - before committing.
          Hide
          Jan Høydahl added a comment -

          Upgrading priority to signal the severity - i.e. a valid user query may return 0 hits, which may be pretty critical for some.

          Show
          Jan Høydahl added a comment - Upgrading priority to signal the severity - i.e. a valid user query may return 0 hits, which may be pretty critical for some.
          Hide
          Yonik Seeley added a comment - - edited

          I took a quick look at the current edismax code.
          The major flaw seems to be the attempt to check user fields in splitIntoClauses(). That method was never meant to be an exact replication of Lucene query parsing. Is there a reason we aren't checking user fields in the parser itself (the ExtendedSolrQueryParser.getFieldQuery)?

          edit: thinking a little more about it, I guess one reason is so whitespace or other potentially significant syntax isn't discarded?

          Show
          Yonik Seeley added a comment - - edited I took a quick look at the current edismax code. The major flaw seems to be the attempt to check user fields in splitIntoClauses(). That method was never meant to be an exact replication of Lucene query parsing. Is there a reason we aren't checking user fields in the parser itself (the ExtendedSolrQueryParser.getFieldQuery)? edit: thinking a little more about it, I guess one reason is so whitespace or other potentially significant syntax isn't discarded?
          Hide
          Hoss Man added a comment -

          bulk fixing the version info for 4.0-ALPHA and 4.0 all affected issues have "hoss20120711-bulk-40-change" in comment

          Show
          Hoss Man added a comment - bulk fixing the version info for 4.0-ALPHA and 4.0 all affected issues have "hoss20120711-bulk-40-change" in comment
          Hide
          Yonik Seeley added a comment -

          Thanks Bernd, this looks like an improvement.
          After some ad-hoc testing, it seems we still have problems with q=(+id:42)

          Another minor concern: the change to clause.field to exclude things like '(' also means that when it's not a valid lucene query, our reconstructed query will currently drop the paren.

          Example: A query of (a:b with a qf=id correctly produces id:"(a:b"
          but a query of (id:b produces id:b
          That type of thing should really only affect exact match type fields where punctuation isn't dropped - not sure how much of an issue it really is.

          Show
          Yonik Seeley added a comment - Thanks Bernd, this looks like an improvement. After some ad-hoc testing, it seems we still have problems with q=(+id:42) Another minor concern: the change to clause.field to exclude things like '(' also means that when it's not a valid lucene query, our reconstructed query will currently drop the paren. Example: A query of (a:b with a qf=id correctly produces id:"(a:b" but a query of (id:b produces id:b That type of thing should really only affect exact match type fields where punctuation isn't dropped - not sure how much of an issue it really is.
          Hide
          Yonik Seeley added a comment -

          I fixed the issue with +,- and committed to trunk and 4x
          http://svn.apache.org/viewvc?view=revision&revision=1361091

          note: I'll start adding the commit URL since JIRA is not currently linking up the commit automatically.

          Show
          Yonik Seeley added a comment - I fixed the issue with +,- and committed to trunk and 4x http://svn.apache.org/viewvc?view=revision&revision=1361091 note: I'll start adding the commit URL since JIRA is not currently linking up the commit automatically.
          Hide
          Jack Krupansky added a comment -

          Checking CHANGES.TXT and the revision history, I don't think this fix got backported to 3.6.1 when it was backported to 4.x. I would lobby for it to be included in 3.6.2 since it is such a serious problem.

          Show
          Jack Krupansky added a comment - Checking CHANGES.TXT and the revision history, I don't think this fix got backported to 3.6.1 when it was backported to 4.x. I would lobby for it to be included in 3.6.2 since it is such a serious problem.
          Hide
          Shawn Heisey added a comment -

          I have come across something similar to this in Solr 3.5.0. I am wondering if this problem is fixed by this issue in the upcoming 4.0. I don't have a 4.0 index at the moment that I can test on.

          query:

          ((feature:depphotos) OR (feature:glowimages) OR (feature:ipurestockx)  (Kitchen))
          

          handler definition:

              <str name="defType">edismax</str>
              <str name="echoParams">all</str>
              <int name="rows">70</int>
              <str name="shards">idxa2.REDACTED.com:8981/solr/inclive,idxa1.REDACTED.com:8981/solr/s0live,idxa1.REDACTED.com:8981/solr/s1live,idxa1.REDACTED.com:8981/solr/s2live,idxa2.REDACTED.com:8981/solr/s3live,idxa2.REDACTED.com:8981/solr/s4live,idxa2.REDACTED.com:8981/solr/s5live</str>
              <float name="tie">0.1</float>
              <int name="qs">3</int>
              <int name="ps">3</int>
              <str name="qf">catchall</str>
              <str name="pf">catchall^2</str>
              <str name="boost">
                recip(ms(NOW/DAY,pd),3.16e-11,1,1)
              </str>
              <str name="fl">score,*</str>
              <str name="mm">100%</str>
              <str name="q.alt">*:*</str>
              <bool name="lowercaseOperators">false</bool>
          

          parsedquery_toString:

          boost(+(feature:depphotos feature:glowimages feature:ipurestockx (catchall:kitchen)~0.1) (catchall:"(feature:depphotos feature) depphotos (feature:glowimages feature) glowimages (feature:ipurestockx feature) ipurestockx kitchen"~3^2.0)~0.1,1.0/(3.16E-11*float(ms(const(1349654400000),date(pd)))+1.0))
          

          Adding some spaces fixes it:

          (( feature:depphotos ) OR ( feature:glowimages ) OR ( feature:ipurestockx )  (Kitchen))
          

          becomes:

          boost(+(feature:depphotos feature:glowimages feature:ipurestockx (catchall:kitchen)~0.1) (),1.0/(3.16E-11*float(ms(const(1349654400000),date(pd)))+1.0))
          

          It also suffers from the problem where the default operator (AND in our case) is ignored, but I believe that issue is already filed.

          Show
          Shawn Heisey added a comment - I have come across something similar to this in Solr 3.5.0. I am wondering if this problem is fixed by this issue in the upcoming 4.0. I don't have a 4.0 index at the moment that I can test on. query: ((feature:depphotos) OR (feature:glowimages) OR (feature:ipurestockx) (Kitchen)) handler definition: <str name= "defType" >edismax</str> <str name= "echoParams" >all</str> < int name= "rows" >70</ int > <str name= "shards" >idxa2.REDACTED.com:8981/solr/inclive,idxa1.REDACTED.com:8981/solr/s0live,idxa1.REDACTED.com:8981/solr/s1live,idxa1.REDACTED.com:8981/solr/s2live,idxa2.REDACTED.com:8981/solr/s3live,idxa2.REDACTED.com:8981/solr/s4live,idxa2.REDACTED.com:8981/solr/s5live</str> < float name= "tie" >0.1</ float > < int name= "qs" >3</ int > < int name= "ps" >3</ int > <str name= "qf" >catchall</str> <str name= "pf" >catchall^2</str> <str name= "boost" > recip(ms(NOW/DAY,pd),3.16e-11,1,1) </str> <str name= "fl" >score,*</str> <str name= "mm" >100%</str> <str name= "q.alt" >*:*</str> <bool name= "lowercaseOperators" > false </bool> parsedquery_toString: boost(+(feature:depphotos feature:glowimages feature:ipurestockx (catchall:kitchen)~0.1) (catchall: "(feature:depphotos feature) depphotos (feature:glowimages feature) glowimages (feature:ipurestockx feature) ipurestockx kitchen" ~3^2.0)~0.1,1.0/(3.16E-11* float (ms( const (1349654400000),date(pd)))+1.0)) Adding some spaces fixes it: (( feature:depphotos ) OR ( feature:glowimages ) OR ( feature:ipurestockx ) (Kitchen)) becomes: boost(+(feature:depphotos feature:glowimages feature:ipurestockx (catchall:kitchen)~0.1) (),1.0/(3.16E-11* float (ms( const (1349654400000),date(pd)))+1.0)) It also suffers from the problem where the default operator (AND in our case) is ignored, but I believe that issue is already filed.
          Hide
          Shawn Heisey added a comment -

          I have the relevant pieces of schema.xml ready to include, but I will wait until I know whether this is the appropriate issue, or a new issue should be filed.

          Show
          Shawn Heisey added a comment - I have the relevant pieces of schema.xml ready to include, but I will wait until I know whether this is the appropriate issue, or a new issue should be filed.
          Hide
          Shawn Heisey added a comment - - edited

          I confirmed (using the solr example) that 4.0-BETA and lucene_solr_4_0 can parse a query similar to my test query perfectly. Query URL created by filling out the admin query interface:

          http://server:8983/solr/collection1/select?q=((cat%3Astring1)+OR+(cat%3Astring2)+OR+(cat%3Astring3)+(Kitchen+Sink))&wt=xml&debugQuery=true&defType=edismax&qf=text&pf=text%5E2

          Show
          Shawn Heisey added a comment - - edited I confirmed (using the solr example) that 4.0-BETA and lucene_solr_4_0 can parse a query similar to my test query perfectly. Query URL created by filling out the admin query interface: http://server:8983/solr/collection1/select?q=((cat%3Astring1)+OR+(cat%3Astring2)+OR+(cat%3Astring3)+(Kitchen+Sink))&wt=xml&debugQuery=true&defType=edismax&qf=text&pf=text%5E2
          Hide
          Leonhard Maylein added a comment -

          I do not agree that this issue is solved.

          I've tried the following combination with SOLR 4.0.0

          q: +sw(a b) +ti:(c d)
          qf: freitext exttext^0.5
          pf: freitext^6 exttext^3

          The result is:

          <str name="rawquerystring">+sw:(a b) +ti:(c d)</str>

          <str name="querystring">+sw:(a b) +ti:(c d)</str>

          <str name="parsedquery">(((sw:a sw:b) +(ti:c ti:d)) DisjunctionMaxQuery((freitext:"b d"^6.0)) DisjunctionMaxQuery((exttext:"b d"^3.0)))/no_coord</str>

          There should be no splitting on the qf/pf fields and therefore no DisjunctionMaxQueries.

          The query '+(sw:a sw:b) +(ti:c ti:d)' works as expected.

          Show
          Leonhard Maylein added a comment - I do not agree that this issue is solved. I've tried the following combination with SOLR 4.0.0 q: +sw(a b) +ti:(c d) qf: freitext exttext^0.5 pf: freitext^6 exttext^3 The result is: <str name="rawquerystring">+sw:(a b) +ti:(c d)</str> <str name="querystring">+sw:(a b) +ti:(c d)</str> <str name="parsedquery">( ( (sw:a sw:b) +(ti:c ti:d)) DisjunctionMaxQuery((freitext:"b d"^6.0)) DisjunctionMaxQuery((exttext:"b d"^3.0)))/no_coord</str> There should be no splitting on the qf/pf fields and therefore no DisjunctionMaxQueries. The query '+(sw:a sw:b) +(ti:c ti:d)' works as expected.
          Hide
          Jack Krupansky added a comment -

          Leonhard, your use case seems rather different from that of this Jira.

          I presume that you are referring to the generated phrase query boost being a little odd, or maybe that the phrase boost should not occur when the terms are queried against fields not listed in the "pf" parameter. Feel free to raise that as a separate issue.

          You refer to "splitting", but I don't see any term splitting in this example.

          Show
          Jack Krupansky added a comment - Leonhard, your use case seems rather different from that of this Jira. I presume that you are referring to the generated phrase query boost being a little odd, or maybe that the phrase boost should not occur when the terms are queried against fields not listed in the "pf" parameter. Feel free to raise that as a separate issue. You refer to "splitting", but I don't see any term splitting in this example.
          Hide
          Leonhard Maylein added a comment -

          Ok, I understand.
          The phrase boost queries are separated from the normal query expansion via the qf paramter.

          But, all terms are (equally) qualified by a field (field sw for the terms a and b, field ti for the terms c and d).
          Why do the eDismax handler only use the terms b and d to build the phrase boost query?
          Isn't it a bug?

          Show
          Leonhard Maylein added a comment - Ok, I understand. The phrase boost queries are separated from the normal query expansion via the qf paramter. But, all terms are (equally) qualified by a field (field sw for the terms a and b, field ti for the terms c and d). Why do the eDismax handler only use the terms b and d to build the phrase boost query? Isn't it a bug?
          Hide
          Jack Krupansky added a comment -

          Yes, it looks like a bug, but distinct from this current Jira. Actually, two bugs:

          1. Fielded terms should not be used in phrase boost except for the specified field.
          2. Some terms appear to have been skipped for phrase boost.

          Show
          Jack Krupansky added a comment - Yes, it looks like a bug, but distinct from this current Jira. Actually, two bugs: 1. Fielded terms should not be used in phrase boost except for the specified field. 2. Some terms appear to have been skipped for phrase boost.

            People

            • Assignee:
              Yonik Seeley
              Reporter:
              Jan Høydahl
            • Votes:
              1 Vote for this issue
              Watchers:
              9 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development