Lucene - Core / LUCENE-1823

QueryParser with new features for Lucene 3

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 4.9, 5.0
    • Component/s: core/queryparser
    • Labels: None
    • Lucene Fields: New

      Description

      I'd like to have a new QueryParser implementation in Lucene 3.1, ideally based on the new QP framework in contrib. It should share as much code as possible with the current StandardQueryParser implementation for easy maintainability.

      Wish list (feel free to extend):

      1. Operator precedence: Support operator precedence for boolean operators
      2. Opaque terms: Ability to plug in an external parser for certain syntax extensions, e.g. XML query terms
      3. Improved RangeQuery syntax: Use more intuitive <=, =, >= instead of [] and {}
      4. Support for trierange queries: See LUCENE-1768
      5. Complex phrases: See LUCENE-1486
      6. ANY operator: E.g. (a b c d) ANY 3 should match if 3 of the 4 terms occur in the same document
      7. New syntax for Span queries: I think the surround parser supports this?
      8. Escaped wildcards: See LUCENE-588

          Activity

          Luis Alves added a comment - - edited

          2 Opaque terms

          I propose the following examples for the syntax

          syntax1:
          +a -b ::complexPhrase('other syntax') ::xml('/bookstore/book[price>35.00]') ::googlesyntax('2..20 doughnuts')
          
          syntax2:
          +a -b complexPhrase::'other syntax' xml::'/bookstore/book[price>35.00]' googlesyntax::'2..20 doughnuts'
          
          syntax3:
          +a -b complexPhrase:'other syntax' xml:'/bookstore/book[price>35.00]' googlesyntax:'2..20 doughnuts'
          

          We can also have a default SyntaxExtension to make the syntax easier. For example, if complexPhrase were the default syntax extension,
          the queries above could be written like this:

          syntax1:
          +a -b ::('other syntax') ::xml('/bookstore/book[price>35.00]') ::googlesyntax('2..20 doughnuts')
          syntax2:
          +a -b ::'other syntax' xml::'/bookstore/book[price>35.00]' googlesyntax::'2..20 doughnuts'
          syntax3:
          +a -b 'other syntax' xml:'/bookstore/book[price>35.00]' googlesyntax:'2..20 doughnuts'
          

          I would like to call it Query Parser Syntax extensions instead of Opaque Terms.

          +1 for syntax 1
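          To illustrate how syntax 1 could be recognized, here is a minimal sketch. It assumes a hypothetical helper class, SyntaxExtensionScanner, that is not part of Lucene or of any patch on this issue; it only pulls ::name('body') calls out of a query string and falls back to a default extension when the name is omitted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: finds ::name('body') calls (syntax 1).
final class SyntaxExtensionScanner {
    // The name is optional so that ::('...') can select a default extension.
    // The body may contain backslash-escaped characters, including \'.
    private static final Pattern CALL =
        Pattern.compile("::(\\w*)\\('((?:[^'\\\\]|\\\\.)*)'\\)");

    /** One extension call: the extension name and the raw text handed to it. */
    static final class Call {
        final String name;
        final String body;

        Call(String name, String body) {
            this.name = name;
            this.body = body;
        }
    }

    /** Returns every extension call found in the query, in order of appearance. */
    static List<Call> scan(String query, String defaultExtension) {
        List<Call> calls = new ArrayList<>();
        Matcher m = CALL.matcher(query);
        while (m.find()) {
            String name = m.group(1).isEmpty() ? defaultExtension : m.group(1);
            calls.add(new Call(name, m.group(2)));
        }
        return calls;
    }
}
```

          The body is returned untouched, so the extension that owns it decides how to interpret it. Multi-parameter calls are not handled by this sketch.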

          Luis Alves added a comment -

          1. Operator precedence

          The new queryparser already supports this internally by disabling the GroupQueryNodeProcessor. But we don't have any test cases, and we need to add a simpler interface for users by creating a new Lucene3QueryParser class (with a different name).

          2. Opaque terms
          5. Complex phrases

          We should also implement number 5 (Complex phrases) using number 2 (Opaque terms).

          8 Escaped wildcards

          LUCENE-1820 is also related to this.

          Luis Alves added a comment -

          8 Escaped wildcards

          The new queryparser already supports this in the StandardSyntaxParser and most processors; we just need to make it visible to the underlying Lucene classes in the builders.

          Michael Busch added a comment - - edited

          Hmm, Syntax 2 looks more intuitive to me... looks a bit strange in syntax one to have the :: in front of the syntax name?

          Luis Alves added a comment -

          syntax1 is more similar to a function call. In the future we might extend it to support more parameters:

          ::xml('syntax', param2)
          
          Luis Alves added a comment - - edited

          looks a bit strange in syntax one to have the :: in front of the syntax name

          The current list of escaped chars for the current Lucene syntax is:
          "+", "-", "!", "(", ")", ":", "^", "[", "]", "\"", "{", "}", "~", "*", "?", "\\"

          I was trying to avoid adding extra ones, so I reused the ':',
          but we can select another char combination that makes more sense.

          • syntax1: the main idea is to make it look like a function call.
          • syntax2: very similar to the Lucene field query syntax, but uses the :: operator to avoid overloading the field name syntax.
          • syntax3: overloads the field name syntax and looks very similar to the current syntax, but uses single quotes to indicate that it is calling a syntax extension.

          I hope this helps.

          Grant Ingersoll added a comment -

          Boosting*Query's support?

          Mark Miller added a comment -

          Option to not die on bad syntax - keep what works and the rest become terms - or something along those lines.

          Michael Busch added a comment -

          I think Solr has a feature similar to what I called 'Opaque terms': Nested Queries.

          Luis Alves added a comment -

          We can also implement:

          • foo~(>=1) should really just map to foo.

          details in LUCENE-950 issue.

          Luis Alves added a comment - - edited

          LUCENE-167 will be implemented by item 1.

          Item 1 will support NOT - + AND OR operators with precedence.

          Luis Alves added a comment -

          Item 3 will address LUCENE-995 using a new syntax with >=, <= and =.

          Luis Alves added a comment -

          I also want to fix LUCENE-375 as part of this issue:

          • fish*~ parses to PrefixQuery - should be a parse exception
          Adriano Crestani added a comment -

          We can also implement:

          • foo~(>=1) should really just map to foo.

          details in LUCENE-950 issue.

          This patch fixes this bug in contrib/queryparser, as discussed in LUCENE-950. It no longer throws IllegalArgumentException for fuzzy values greater than or equal to 1; it just ignores the fuzzy value and creates a simple field query. JUnit tests are also included.

          I used 'ant javacc-contrib-queryparser' to regenerate the StandardSyntaxParser with javacc 4.2.
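
          The fuzzy handling described in this comment can be sketched as follows. This is only an illustration under stated assumptions: FuzzyNormalizer is a hypothetical class, not the StandardSyntaxParser code from the patch, and it works on query strings rather than on query nodes.

```java
// Hypothetical illustration, not the patch itself: a fuzzy value >= 1
// is ignored and the clause degrades to a plain term.
final class FuzzyNormalizer {
    /** Returns the term alone when similarity >= 1, otherwise term~similarity. */
    static String normalize(String term, float similarity) {
        if (similarity >= 1.0f) {
            return term; // no exception: the fuzzy modifier is simply dropped
        }
        return term + "~" + similarity;
    }
}
```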

          Ali Oral added a comment -

          Proximity query support could be very nice. This definitely requires span queries.

          (john OR james OR mar*) NEAR/5 ( smith OR mil*)

          Luis Alves added a comment -

          Hi Ali,

          Here is another suggestion for the proximity syntax:
          ( (john OR james OR mar*) ( smith OR mil*) ) WITHIN 5

          I'll see if I have time to put that on the new parser.

          Luis Alves added a comment - - edited

          This is the first patch to implement the features described in LUCENE-1823.
          It contains:

          • Operator precedence
          • Opaque terms
          • ANY operator

          The new parser is named standard2. I'm open to changing this name; please post suggestions.

          Also included is an implementation for regex using the syntax discussed in LUCENE-2039. I wrote a simple JUnit test and a RegexQueryParser in the test folder. This implementation uses the Opaque terms implementation.

          Luis Alves added a comment -

          Operator precedence order is

          ANY, ~, ^, +, -, NOT, AND, OR
          

          For example:

          a OR b AND c 
          

          will now be executed as

          (a OR (b AND c))
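
          The grouping above can be reproduced with a tiny recursive-descent parser. This is a sketch under stated assumptions, not the contrib queryparser implementation: PrecedenceDemo is a hypothetical class, it handles only whitespace-separated terms with AND and OR, and it returns the query with explicit parentheses.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

// Illustrative only: shows AND binding tighter than OR.
final class PrecedenceDemo {
    private final Deque<String> tokens;

    private PrecedenceDemo(String query) {
        tokens = new ArrayDeque<>(Arrays.asList(query.trim().split("\\s+")));
    }

    /** Returns the query with explicit parentheses showing how it groups. */
    static String group(String query) {
        return new PrecedenceDemo(query).parseOr();
    }

    // OR has the lowest precedence, so it is parsed at the outermost level.
    private String parseOr() {
        String left = parseAnd();
        while ("OR".equals(tokens.peek())) {
            tokens.pop();
            left = "(" + left + " OR " + parseAnd() + ")";
        }
        return left;
    }

    // AND binds tighter than OR, so each OR operand is a complete AND chain.
    private String parseAnd() {
        String left = tokens.pop();
        while ("AND".equals(tokens.peek())) {
            tokens.pop();
            left = "(" + left + " AND " + tokens.pop() + ")";
        }
        return left;
    }
}
```

          Calling PrecedenceDemo.group("a OR b AND c") yields the grouping shown in this comment.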
          

          The syntax for the ANY operator is:

          ( a b c d ) ANY 2 
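
          The intended semantics of ANY can be sketched in plain Java. AnyOperatorDemo is a hypothetical class used only for illustration, and it treats a document as a bag of whitespace-separated, lowercased words. Lucene's BooleanQuery has a minimum-number-should-match setting with similar semantics, but whether the patch relies on it is not stated here.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: "( a b c d ) ANY 2" matches when at least 2 of the
// listed terms occur in the same document.
final class AnyOperatorDemo {
    /** True if at least minMatches distinct query terms occur in the document. */
    static boolean matchesAny(List<String> terms, int minMatches, String document) {
        // Assumption: terms are already lowercase; the document is lowercased here.
        Set<String> docTerms =
            new HashSet<>(Arrays.asList(document.toLowerCase().split("\\s+")));
        long hits = terms.stream().distinct().filter(docTerms::contains).count();
        return hits >= minMatches;
    }
}
```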
          

          Opaque syntax is:

          extensionName:field:term
          extensionName:field:"phrase"
          

          Default field:

          extensionName::term
          extensionName::"phrase"
          

          In the standard2 test folder there is an Opaque implementation for regex (contrib component).
          The test RegexQueryParser accepts all the Lucene syntax and the above, plus:

          regex:field:"regular expression"
          

          For example:

          regex::"^.[aeiou]c.*$"
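
          A minimal sketch of how a clause in this extensionName:field:text form could be split into its parts. OpaqueTerm is a hypothetical class, not the one in the patch, and it assumes that neither the extension name nor the field contains a colon.

```java
// Hypothetical illustration of splitting extensionName:field:text.
final class OpaqueTerm {
    final String extension;
    final String field;
    final String text;

    private OpaqueTerm(String extension, String field, String text) {
        this.extension = extension;
        this.field = field;
        this.text = text;
    }

    /** Parses extensionName:field:text; an empty field selects defaultField. */
    static OpaqueTerm parse(String clause, String defaultField) {
        int first = clause.indexOf(':');
        int second = first < 0 ? -1 : clause.indexOf(':', first + 1);
        if (first <= 0 || second < 0) {
            throw new IllegalArgumentException("not an opaque term: " + clause);
        }
        String field = clause.substring(first + 1, second);
        String text = clause.substring(second + 1);
        // Strip the surrounding double quotes of a phrase, if present.
        if (text.length() >= 2 && text.startsWith("\"") && text.endsWith("\"")) {
            text = text.substring(1, text.length() - 1);
        }
        return new OpaqueTerm(clause.substring(0, first),
                              field.isEmpty() ? defaultField : field,
                              text);
    }
}
```

          Everything after the second colon is kept as-is, so the extension receives the raw text, including any further colons.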
          
          Luis Alves added a comment - - edited

          I forgot to say that the patch includes LUCENE-1937 and LUCENE-1938 from Adriano Crestani to enable the precedence code.

          Michael Busch added a comment -

          Luis and Adriano,

          the QP config looks quite overwhelming with all the Attributes. I'm not sure if the AttributeSource/Attribute stuff is a good fit for this type of configuration.

          Couldn't we achieve the same with a Properties (Hashtable) approach and constants, or something similar? This would be a good place to start reducing the complexity of the new QP.

          Shai Erera added a comment -

          I prefer syntax 2 for the opaque terms. If the idea is to plug in another QP for that opaque term, then it would be best IMO if that QP received the entire string and did what it knows with it. That way, I could pass my::'field1:value OR field2:value2 AND (something else)', and 'my' QP would parse the entire string.
          I don't see how this can be achieved w/ <syntax>:<field>:query, meaning, how can I pass a clause which contains two fields ORed or ANDed? IMO, the simpler the better, and it's easy to explain that whatever comes after the '::' (double colons) is passed on as-is to the assigned QP.

          Simon Willnauer added a comment -

          Linked issues for reference and heads up. @Luis, are you still working on that stuff and would you be willing to further maintain the QueryParser in Contrib?

          Adriano Crestani added a comment -

          I agree with Michael: AttributeSource was designed for another purpose and does not really fit configuration.

          The map idea is really good and fits well as configuration for the QP, but I would like to restrict the key type so the user doesn't use a String object as key. String keys may lead to runtime errors, mainly when they are written inline. I would prefer to use enums as keys; that would force the user to always pass the same object as key when referencing the same configuration. It also avoids duplicated configuration keys, since each enum constant has only one instance per JVM.

          If nobody complains about using a Map<Enum<?>, Object> as configuration for the QP framework, I will start working on a new patch including these changes soon.
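
          A minimal sketch of the enum-keyed map idea. The class name QueryParserConfig and the key names are hypothetical and chosen only for illustration; this is not the API that was eventually committed.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of a Map<Enum<?>, Object> configuration.
final class QueryParserConfig {
    /** Illustrative configuration keys; the names are assumptions. */
    enum Key { DEFAULT_FIELD, ALLOW_LEADING_WILDCARD, FUZZY_MIN_SIMILARITY }

    private final Map<Enum<?>, Object> values = new HashMap<>();

    /** Stores a value under an enum key and returns this for chaining. */
    QueryParserConfig set(Enum<?> key, Object value) {
        values.put(key, value);
        return this;
    }

    /** Returns the value cast to the requested type, or null if unset. */
    <T> T get(Enum<?> key, Class<T> type) {
        return type.cast(values.get(key));
    }
}
```

          Because the key must be an enum constant, a misspelled key is a compile-time error rather than a silent miss at runtime. The value type is still only checked when it is read, which is the trade-off of storing Object.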

          Mark Harwood added a comment -

          Another one for the wishlist - support for the nested documents offered in LUCENE-2454

          An example query of a resume with a parent person doc and multiple child "employment" documents might be as follows:

          +(name:frederick OR name:fred) +CHILD(type:contract AND skill:java AND date>=2005 )

          The new feature here is the CHILD(...) construct that shifts query context to a nested document.
          I imagine there is some more formal syntax we could consider lifting from XPath but I thought I'd throw this in while you are contemplating new features.

          Adriano Crestani added a comment -

          Just a reminder that Luis's patch was blocked by LUCENE-1938, which is now resolved, so the patch can finally be reviewed/committed.

          Robert Muir added a comment -

          Adriano, I will take a look at the patch.

          A few things have changed:

          Regarding the LUCENE-950 issue: I changed the FuzzyQuery syntax to allow foo~1 and foo~2 in order to support exact edit distances, so I don't think we need to change anything there.

          Additionally, we also added proper regular expression support (via Lucene core's RegexpQuery).

          But I'll play with the patch and see if I can bring it up to trunk.

          As far as using a Map instead of Attributes for configuration, I think this would be a really good step! Are you still interested in working up a patch for this one? At the moment I think all the attributes scare people away from the contrib/queryparser.

          Robert Muir added a comment -

          Looking at the patch, it's a bit difficult to review since the patch creates a whole new queryparser (Standard2).

          This is just my opinion here:

          1. I think it would be good to just modify "Standard" with the improvements presented here. I think for contrib/queryparser to succeed, we should worry less about providing exact imitations of the core queryparser, and instead focus on trying to provide a framework and concrete implementation that solves the problems people are facing. In other words, fix what we don't like and provide a parser that works the way we want, and forget about exact compatibility with the core queryparser... if someone wants its exact behavior, they can just use it.
          2. It would be much easier if improvements could be on separate patches rather than bundled: For example, LUCENE-1938 was easy for me to commit because it was well-contained and covered one single improvement/feature.
          Adriano Crestani added a comment -

          Hi Robert,

          I completely agree with your statement; the config API scares me also. I would love to submit a patch for it, but I am working for IBM now and, as a committer, I need to go through some bureaucratic paperwork before doing any new feature for Lucene, and it might still take some time.

          I had a better idea, I will propose it to be a GSOC project for this year. This way we can also get one more contributor to contrib QP.

          Olivier Favre added a comment - - edited

          Relates to LUCENE-3343: Open range comparison operator >,>=,<,<= and =.

          Jan Høydahl added a comment -

          @Luis, still working on this?

          Jan Høydahl added a comment -

          Luis, are you there? Can you give a status on this? I think this issue needs a general update of the title and description, and a strategy for how to proceed. I agree with Robert that incremental progress is better than trying to solve everything in one go.

          Starting to adopt the new Flex QP framework would accelerate QP development also in Solr camp.

          Simon Willnauer added a comment -

          Moving this over to 4.1; it seems dead to me, though.

          Steve Rowe added a comment -

          Bulk move 4.4 issues to 4.5 and 5.0

          Uwe Schindler added a comment -

          Move issue to Lucene 4.9.


            People

            • Assignee: Luis Alves
            • Reporter: Michael Busch
            • Votes: 6
            • Watchers: 12