Uploaded image for project: 'Solr'
  1. Solr
  2. SOLR-8409

Complex q param in Streaming Expression results in a bad query

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 6.0
    • Fix Version/s: 6.0
    • Component/s: SolrJ

      Description

      When providing an expression like

      stream=search(people, fl="id,first", sort="first asc", q="presentTitles:\"chief executive officer\" AND age:[36 TO *]")
      

      the following error is seen.

      no field name specified in query and no default specified via 'df' param
      

      I believe the issue is related to the \" (escaped quotes) and the spaces in the q field. If I remove the spaces then the query returns results as expected (though I've yet to validate if those results are accurate).

      This requires some investigation to get down to the root cause. I would like to fix it before Solr 6 is cut.

      1. SOLR-8409.patch
        5 kB
        Dennis Gove
      2. SOLR-8409.patch
        0.9 kB
        Dennis Gove

        Issue Links

          Activity

          Hide
          dpgove Dennis Gove added a comment -

          I've been unable to replicate this in a unit test but have seen it in a fully packaged version of trunk. (ant package was run and then the tarball was unpacked).

          Differences between unit test and packaged version:

          • unit test is using dynamic fields while packaged version is using static fields
          • unit test is not going through the StreamHandler
          Show
          dpgove Dennis Gove added a comment - I've been unable to replicate this in a unit test but have seen it in a fully packaged version of trunk. (ant package was run and then the tarball was unpacked). Differences between unit test and packaged version: unit test is using dynamic fields while packaged version is using static fields unit test is not going through the StreamHandler
          Hide
          joel.bernstein Joel Bernstein added a comment -

          What does the query look like after the Streaming Expression been's parsed?

          Show
          joel.bernstein Joel Bernstein added a comment - What does the query look like after the Streaming Expression been's parsed?
          Hide
          dpgove Dennis Gove added a comment -

          It looks like this

          presentTitles:\"chief executive officer\" AND age:[36 TO *]
          

          I suspect that the \" is the culprit here because the streaming expression parser does not remove the \ before the quote. As such, and this is a hunch, I suspect that the query parser is seeing \" and not considering it a quote that is starting a phase but instead a quote that is just part of the string being searched.

          chief executive officer
          

          I believe this can be fixed by adding logic into the expression parser that will transform \" into " and in fact I've written that code (very simple) but my lack of ability to replicate in a unit test is preventing me from ensuring the issue is actually fixed.

          Show
          dpgove Dennis Gove added a comment - It looks like this presentTitles:\ "chief executive officer\" AND age:[36 TO *] I suspect that the \" is the culprit here because the streaming expression parser does not remove the \ before the quote. As such, and this is a hunch, I suspect that the query parser is seeing \" and not considering it a quote that is starting a phase but instead a quote that is just part of the string being searched. chief executive officer I believe this can be fixed by adding logic into the expression parser that will transform \" into " and in fact I've written that code (very simple) but my lack of ability to replicate in a unit test is preventing me from ensuring the issue is actually fixed.
          Hide
          dpgove Dennis Gove added a comment -

          Backing up my hunch is that if I change the q to be

          presentTitles:\"chief\" AND age:[36 TO *]
          

          I get results back but a very small subset of the results I would expect to get back.

          I've yet to visually verify the source data but I would guess that there is a record containing a field value
          <presentTitles>"chief"</presentTitles>

          I'll check for that the next time I'm looking into this (by Monday I suspect) but I'd wager that I'll find it.

          Show
          dpgove Dennis Gove added a comment - Backing up my hunch is that if I change the q to be presentTitles:\ "chief\" AND age:[36 TO *] I get results back but a very small subset of the results I would expect to get back. I've yet to visually verify the source data but I would guess that there is a record containing a field value <presentTitles>"chief"</presentTitles> I'll check for that the next time I'm looking into this (by Monday I suspect) but I'd wager that I'll find it.
          Hide
          dpgove Dennis Gove added a comment - - edited

          This patch appears to fix the issues. Still am unable to replicate in a unit test but I have confirmed that the issue I was seeing in a packaged setup is fixed with this patch.

          I'll want to wait until I can get a replicated test before I commit this.

          Show
          dpgove Dennis Gove added a comment - - edited This patch appears to fix the issues. Still am unable to replicate in a unit test but I have confirmed that the issue I was seeing in a packaged setup is fixed with this patch. I'll want to wait until I can get a replicated test before I commit this.
          Hide
          dpgove Dennis Gove added a comment -

          Interestingly, if I leave the q param out entirely I don't see any raised exception. Also, if I leave out a field to filter on I also don't see any raised exception. I've confirmed the solrconfig-streaming.xml doesn't include either default q or df settings so I'd expect to see an exception in both of these cases.

          search(collection1, fl="id,a_s,a_i,a_f", sort="a_f asc, a_i asc")
          search(collection1, fl="id,a_s,a_i,a_f", sort="a_f asc, a_i asc", q="foo")
          
          Show
          dpgove Dennis Gove added a comment - Interestingly, if I leave the q param out entirely I don't see any raised exception. Also, if I leave out a field to filter on I also don't see any raised exception. I've confirmed the solrconfig-streaming.xml doesn't include either default q or df settings so I'd expect to see an exception in both of these cases. search(collection1, fl= "id,a_s,a_i,a_f" , sort= "a_f asc, a_i asc" ) search(collection1, fl= "id,a_s,a_i,a_f" , sort= "a_f asc, a_i asc" , q= "foo" )
          Hide
          dpgove Dennis Gove added a comment -

          I take that back. The file schema-streaming.xml contains the default query field

          <defaultSearchField>text</defaultSearchField>
          

          If I comment out that setting then I am able to replicate the failure described in this ticket - finally. I will create a couple valid tests replicating the issue and will commit the fix as soon as I can.

          Show
          dpgove Dennis Gove added a comment - I take that back. The file schema-streaming.xml contains the default query field <defaultSearchField>text</defaultSearchField> If I comment out that setting then I am able to replicate the failure described in this ticket - finally. I will create a couple valid tests replicating the issue and will commit the fix as soon as I can.
          Hide
          joel.bernstein Joel Bernstein added a comment -

          Dennis Gove, this looks like it's almost ready to go. Still shooting to get this in for 6.0?

          Show
          joel.bernstein Joel Bernstein added a comment - Dennis Gove , this looks like it's almost ready to go. Still shooting to get this in for 6.0?
          Hide
          dpgove Dennis Gove added a comment -

          Yeah. Planning to commit tomorrow.

          Show
          dpgove Dennis Gove added a comment - Yeah. Planning to commit tomorrow.
          Hide
          dpgove Dennis Gove added a comment -

          This new patch ensures that escaped quotes in the parameters of a CloudSolrStream stream are properly re-escaped when converting back to a string expression. This is necessary for situations where the expression is passed off to workers.

          Note the comment in CloudSolrStream

                // SOLR-8409: This is a special case where the params contain a " character
                // Do note that in any other BASE streams with parameters where a " might come into play
                // that this same replacement needs to take place.
                value = value.replace("\"", "\\\"");
          

          I hope that this comment is heeded if additional base streams dealing with solr params are ever added.

          Show
          dpgove Dennis Gove added a comment - This new patch ensures that escaped quotes in the parameters of a CloudSolrStream stream are properly re-escaped when converting back to a string expression. This is necessary for situations where the expression is passed off to workers. Note the comment in CloudSolrStream // SOLR-8409: This is a special case where the params contain a " character // Do note that in any other BASE streams with parameters where a " might come into play // that this same replacement needs to take place. value = value.replace( "\" ", " \\\""); I hope that this comment is heeded if additional base streams dealing with solr params are ever added.
          Hide
          joel.bernstein Joel Bernstein added a comment - - edited

          One thing to check is how the SQLHandler is handling quotes. All queries are quoted by default. The TupleStreams are first built through constructors but they are serialized to streaming expressions and sent to worker nodes. Somehow we never tripped this bug in that scenario.

          Show
          joel.bernstein Joel Bernstein added a comment - - edited One thing to check is how the SQLHandler is handling quotes. All queries are quoted by default. The TupleStreams are first built through constructors but they are serialized to streaming expressions and sent to worker nodes. Somehow we never tripped this bug in that scenario.
          Hide
          joel.bernstein Joel Bernstein added a comment -

          Just tried to trip this bug with this test case:

          params.put(CommonParams.QT, "/sql");
                params.put("numWorkers", "2");
                params.put("stmt", "select str_s, count(*), sum(field_i), min(field_i), max(field_i), avg(field_i) from collection1 where text='XXXX XXXX' group by str_s order by sum(field_i) asc limit 2");
          
          

          This would have a query like this: text:"XXXX XXXX" which would be serialized and sent to workers. It works, so somehow the serialization was handling things.

          Show
          joel.bernstein Joel Bernstein added a comment - Just tried to trip this bug with this test case: params.put(CommonParams.QT, "/sql" ); params.put( "numWorkers" , "2" ); params.put( "stmt" , "select str_s, count(*), sum(field_i), min(field_i), max(field_i), avg(field_i) from collection1 where text='XXXX XXXX' group by str_s order by sum(field_i) asc limit 2" ); This would have a query like this: text:"XXXX XXXX" which would be serialized and sent to workers. It works, so somehow the serialization was handling things.
          Hide
          dpgove Dennis Gove added a comment - - edited

          I believe if it were changed to

          text="XXXX XXXX"
          

          it would trip the failure. The issue is with double quotes because double quotes are used to wrap the parameter values. So a q param of

          search(....., q="text:\"XXXX XXXX\"", ..... )
          

          would show the issue.

          The issue is that the q param given to Solr cannot include the \" and should instead include just the " around XXXX XXXX. So, to deal with that we have to replace \" with ". That then also requires that on the toExpression(...) we have to replace " with \".

          Show
          dpgove Dennis Gove added a comment - - edited I believe if it were changed to text= "XXXX XXXX" it would trip the failure. The issue is with double quotes because double quotes are used to wrap the parameter values. So a q param of search(....., q= "text:\" XXXX XXXX\"", ..... ) would show the issue. The issue is that the q param given to Solr cannot include the \" and should instead include just the " around XXXX XXXX. So, to deal with that we have to replace \" with ". That then also requires that on the toExpression(...) we have to replace " with \".
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 3528cc32cb634137cf389beaa9ecdc2036588d96 in lucene-solr's branch refs/heads/master from Dennis Gove
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=3528cc3 ]

          SOLR-8409: Ensures that quotes in solr params (eg. q param) are properly handled

          Show
          jira-bot ASF subversion and git services added a comment - Commit 3528cc32cb634137cf389beaa9ecdc2036588d96 in lucene-solr's branch refs/heads/master from Dennis Gove [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=3528cc3 ] SOLR-8409 : Ensures that quotes in solr params (eg. q param) are properly handled
          Hide
          joel.bernstein Joel Bernstein added a comment -

          Ok this interesting. What's below is before this patch:

          I ran this test, which I expected to break because the parser would have trouble with quotes inside the quotes. I was surprised to find that it worked perfectly.

           expression = StreamExpressionParser.parse("parallel(collection1,"
                                                        + "rollup("
                                                          + "search(collection1, q=\"a_s:(\"hello0\" OR \"hello3\" OR \"hello4\")\", fl=\"a_s,a_i,a_f\", sort=\"a_s asc\", partitionKeys=\"a_s\"),"
                                                          + "over=\"a_s\","
                                                          + "sum(a_i),"
                                                          + "sum(a_f),"
                                                          + "min(a_i),"
                                                          + "min(a_f),"
                                                          + "max(a_i),"
                                                          + "max(a_f),"
                                                          + "avg(a_i),"
                                                          + "avg(a_f),"
                                                          + "count(*)"
                                                        + "),"
                                                        + "workers=\"2\", zkHost=\""+zkServer.getZkAddress()+"\", sort=\"a_s asc\")"
                                                        );
          

          Below is the what was sent to the workers. Notice the quotes are not escaped at all:

          rollup(search(collection1,q="a_s:("hello0" OR "hello3" OR "hello4")",fl="a_s,a_i,a_f",sort="a_s asc",partitionKeys=a_s,zkHost="127.0.0.1:52813/solr"),over=a_s,sum(a_i),sum(a_f),min(a_i),min(a_f),max(a_i),max(a_f),avg(a_i),avg(a_f),count(*))
          
          rollup(search(collection1,q="a_s:("hello0" OR "hello3" OR "hello4")",fl="a_s,a_i,a_f",sort="a_s asc",partitionKeys=a_s,zkHost="127.0.0.1:52813/solr"),over=a_s,sum(a_i),sum(a_f),min(a_i),min(a_f),max(a_i),max(a_f),avg(a_i),avg(a_f),count(*))
           

          Below is what was sent from the workers to the shards. Notice that the query is correct.

             [junit4]   2> 36804 INFO  (qtp1665513927-67) [n:127.0.0.1:52821_ c:collection1 s:shard2 r:core_node1 x:collection1] o.a.s.c.S.Request [collection1]  webapp= path=/select params={q=a_s:("hello0"+OR+"hello3"+OR+"hello4")&distrib=false&fl=a_s,a_i,a_f&fq={!hash+workers%3D2+worker%3D0}&sort=a_s+asc&partitionKeys=a_s&wt=json&version=2.2} hits=3 status=0 QTime=1
             [junit4]   2> 36804 INFO  (qtp637545437-123) [n:127.0.0.1:52829_ c:collection1 s:shard2 r:core_node3 x:collection1] o.a.s.c.S.Request [collection1]  webapp= path=/select params={q=a_s:("hello0"+OR+"hello3"+OR+"hello4")&distrib=false&fl=a_s,a_i,a_f&fq={!hash+workers%3D2+worker%3D1}&sort=a_s+asc&partitionKeys=a_s&wt=json&version=2.2} hits=3 status=0 QTime=1
             [junit4]   2> 36804 INFO  (qtp682745908-97) [n:127.0.0.1:52825_ c:collection1 s:shard1 r:core_node2 x:collection1] o.a.s.c.S.Request [collection1]  webapp= path=/select params={q=a_s:("hello0"+OR+"hello3"+OR+"hello4")&distrib=false&fl=a_s,a_i,a_f&fq={!hash+workers%3D2+worker%3D1}&sort=a_s+asc&partitionKeys=a_s&wt=json&version=2.2} hits=1 status=0 QTime=0
             [junit4]   2> 36804 INFO  (qtp682745908-471) [n:127.0.0.1:52825_ c:collection1 s:shard1 r:core_node2 x:collection1] o.a.s.c.S.Request [collection1]  webapp= path=/select params={q=a_s:("hello0"+OR+"hello3"+OR+"hello4")&distrib=false&fl=a_s,a_i,a_f&fq={!hash+workers%3D2+worker%3D0}&sort=a_s+asc&partitionKeys=a_s&wt=json&version=2.2} hits=3 status=0 QTime=0
          

          So it appears that the parser doesn't even need the quotes nested within quotes to be escaped. I'm assuming that's because it's splitting on the comma and the equals sign and then removing the outer quotes.

          Show
          joel.bernstein Joel Bernstein added a comment - Ok this interesting. What's below is before this patch: I ran this test, which I expected to break because the parser would have trouble with quotes inside the quotes. I was surprised to find that it worked perfectly. expression = StreamExpressionParser.parse( "parallel(collection1," + "rollup(" + "search(collection1, q=\" a_s:(\ "hello0\" OR \ "hello3\" OR \ "hello4\" )\ ", fl=\" a_s,a_i,a_f\ ", sort=\" a_s asc\ ", partitionKeys=\" a_s\ ")," + "over=\" a_s\ "," + "sum(a_i)," + "sum(a_f)," + "min(a_i)," + "min(a_f)," + "max(a_i)," + "max(a_f)," + "avg(a_i)," + "avg(a_f)," + "count(*)" + ")," + "workers=\" 2\ ", zkHost=\" "+zkServer.getZkAddress()+" \ ", sort=\" a_s asc\ ")" ); Below is the what was sent to the workers. Notice the quotes are not escaped at all: rollup(search(collection1,q= "a_s:(" hello0 " OR " hello3 " OR " hello4 ")" ,fl= "a_s,a_i,a_f" ,sort= "a_s asc" ,partitionKeys=a_s,zkHost= "127.0.0.1:52813/solr" ),over=a_s,sum(a_i),sum(a_f),min(a_i),min(a_f),max(a_i),max(a_f),avg(a_i),avg(a_f),count(*)) rollup(search(collection1,q= "a_s:(" hello0 " OR " hello3 " OR " hello4 ")" ,fl= "a_s,a_i,a_f" ,sort= "a_s asc" ,partitionKeys=a_s,zkHost= "127.0.0.1:52813/solr" ),over=a_s,sum(a_i),sum(a_f),min(a_i),min(a_f),max(a_i),max(a_f),avg(a_i),avg(a_f),count(*)) Below is what was sent from the workers to the shards. Notice that the query is correct. [junit4] 2> 36804 INFO (qtp1665513927-67) [n:127.0.0.1:52821_ c:collection1 s:shard2 r:core_node1 x:collection1] o.a.s.c.S.Request [collection1] webapp= path=/select params={q=a_s:( "hello0" +OR+ "hello3" +OR+ "hello4" )&distrib= false &fl=a_s,a_i,a_f&fq={!hash+workers%3D2+worker%3D0}&sort=a_s+asc&partitionKeys=a_s&wt=json&version=2.2} hits=3 status=0 QTime=1 [junit4] 2> 36804 INFO (qtp637545437-123) [n:127.0.0.1:52829_ c:collection1 s:shard2 r:core_node3 x:collection1] o.a.s.c.S.Request [collection1] webapp= path=/select params={q=a_s:( "hello0" +OR+ "hello3" +OR+ "hello4" )&distrib= false &fl=a_s,a_i,a_f&fq={!hash+workers%3D2+worker%3D1}&sort=a_s+asc&partitionKeys=a_s&wt=json&version=2.2} hits=3 status=0 QTime=1 [junit4] 2> 36804 INFO (qtp682745908-97) [n:127.0.0.1:52825_ c:collection1 s:shard1 r:core_node2 x:collection1] o.a.s.c.S.Request [collection1] webapp= path=/select params={q=a_s:( "hello0" +OR+ "hello3" +OR+ "hello4" )&distrib= false &fl=a_s,a_i,a_f&fq={!hash+workers%3D2+worker%3D1}&sort=a_s+asc&partitionKeys=a_s&wt=json&version=2.2} hits=1 status=0 QTime=0 [junit4] 2> 36804 INFO (qtp682745908-471) [n:127.0.0.1:52825_ c:collection1 s:shard1 r:core_node2 x:collection1] o.a.s.c.S.Request [collection1] webapp= path=/select params={q=a_s:( "hello0" +OR+ "hello3" +OR+ "hello4" )&distrib= false &fl=a_s,a_i,a_f&fq={!hash+workers%3D2+worker%3D0}&sort=a_s+asc&partitionKeys=a_s&wt=json&version=2.2} hits=3 status=0 QTime=0 So it appears that the parser doesn't even need the quotes nested within quotes to be escaped. I'm assuming that's because it's splitting on the comma and the equals sign and then removing the outer quotes.
          Hide
          dpgove Dennis Gove added a comment -

          Try putting a space inside the quote for the q param. That is where I saw the failure that prompted this ticket. So

          search(collection1, q=\"a_s:(\"hello0 hellospace\" OR \"hello3\" OR \"hello4\")\", fl=\"a_s,a_i,a_f\", sort=\"a_s asc\", partitionKeys=\"a_s\")
          
          Show
          dpgove Dennis Gove added a comment - Try putting a space inside the quote for the q param. That is where I saw the failure that prompted this ticket. So search(collection1, q=\ "a_s:(\" hello0 hellospace\ " OR \" hello3\ " OR \" hello4\ ")\" , fl=\ "a_s,a_i,a_f\" , sort=\ "a_s asc\" , partitionKeys=\ "a_s\" )
          Hide
          joel.bernstein Joel Bernstein added a comment -

          It took me while to grasp this ticket because it's so hard to reproduce the exception. But I think I've got it now. Basically the instructions to users are:

          If you have to use quotes in your query then escape the quotes:

          search(collection1, q="blah \"stuff stuff\" blah", ...)
          

          I'll do some more testing with the SQLHandler to make sure everything is working properly with the quotes.

          Show
          joel.bernstein Joel Bernstein added a comment - It took me while to grasp this ticket because it's so hard to reproduce the exception. But I think I've got it now. Basically the instructions to users are: If you have to use quotes in your query then escape the quotes: search(collection1, q= "blah \" stuff stuff\ " blah" , ...) I'll do some more testing with the SQLHandler to make sure everything is working properly with the quotes.
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 3528cc32cb634137cf389beaa9ecdc2036588d96 in lucene-solr's branch refs/heads/lucene-6997 from Dennis Gove
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=3528cc3 ]

          SOLR-8409: Ensures that quotes in solr params (eg. q param) are properly handled

          Show
          jira-bot ASF subversion and git services added a comment - Commit 3528cc32cb634137cf389beaa9ecdc2036588d96 in lucene-solr's branch refs/heads/lucene-6997 from Dennis Gove [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=3528cc3 ] SOLR-8409 : Ensures that quotes in solr params (eg. q param) are properly handled

            People

            • Assignee:
              dpgove Dennis Gove
              Reporter:
              dpgove Dennis Gove
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development