Lucene - Core
  1. Lucene - Core
  2. LUCENE-2039

Regex support and beyond in JavaCC QueryParser

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Critical Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0-ALPHA, 4.0
    • Component/s: core/queryparser
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Since the early days the standard query parser was limited to the queries living in core, adding other queries or extending the parser in any way always forced people to change the grammar file and regenerate. Even if you change the grammar you have to be extremely careful how you modify the parser so that other parts of the standard parser are affected by customisation changes. Eventually you had to live with all the limitation the current parser has like tokenizing on whitespaces before a tokenizer / analyzer has the chance to look at the tokens.
      I was thinking about how to overcome the limitation and add regex support to the query parser without introducing any dependency to core. I added a new special character that basically prevents the parser from interpreting any of the characters enclosed in the new special characters. I choose the forward slash '/' as the delimiter so that everything in between two forward slashes is basically escaped and ignored by the parser. All chars embedded within forward slashes are treated as one token even if it contains other special chars like * []?{} or whitespaces. This token is subsequently passed to a pluggable "parser extension" with builds a query from the embedded string. I do not interpret the embedded string in any way but leave all the subsequent work to the parser extension. Such an extension could be another full featured query parser itself or simply a ctor call for regex query. The interface remains quiet simple but makes the parser extendible in an easy way compared to modifying the javaCC sources.

      The downsides of this patch is clearly that I introduce a new special char into the syntax but I guess that would not be that much of a deal as it is reflected in the escape method though. It would truly be nice to have more than once extension an have this even more flexible so treat this patch as a kickoff though.

      Another way of solving the problem with RegexQuery would be to move the JDK version of regex into the core and simply have another method like:

      protected Query newRegexQuery(Term t) {
        ... 
      }
      

      which I would like better as it would be more consistent with the idea of the query parser to be a very strict and defined parser.

      I will upload a patch in a second which implements the extension based approach I guess I will add a second patch with regex in core soon too.

      1. LUCENE-2039.patch
        19 kB
        Simon Willnauer
      2. LUCENE-2039_field_ext.patch
        37 kB
        Simon Willnauer
      3. LUCENE-2039_field_ext.patch
        38 kB
        Simon Willnauer
      4. LUCENE-2039_field_ext.patch
        38 kB
        Simon Willnauer
      5. LUCENE-2039_field_ext.patch
        44 kB
        Simon Willnauer
      6. LUCENE-2039_field_ext.patch
        33 kB
        Simon Willnauer
      7. LUCENE-2039_wrap_master_parser.patch
        2 kB
        Simon Willnauer

        Issue Links

          Activity

          Hide
          Simon Willnauer added a comment -

          attached extension based patch

          Show
          Simon Willnauer added a comment - attached extension based patch
          Hide
          Uwe Schindler added a comment - - edited

          I do not like this extension.

          In my opinion, we should simply use the new QueryParser framework for it, where it is quite easy to plugin support for RegExQueries even if they live in contrib.

          Show
          Uwe Schindler added a comment - - edited I do not like this extension. In my opinion, we should simply use the new QueryParser framework for it, where it is quite easy to plugin support for RegExQueries even if they live in contrib.
          Hide
          Grant Ingersoll added a comment -

          The new QP framework is not proven out and doesn't have very many people using it and is still in contrib. This extension allows for a pretty simple way for people to add simple extensions to the current QP without having to do a whole lot of programming.

          Show
          Grant Ingersoll added a comment - The new QP framework is not proven out and doesn't have very many people using it and is still in contrib. This extension allows for a pretty simple way for people to add simple extensions to the current QP without having to do a whole lot of programming.
          Hide
          Luis Alves added a comment -

          I agree with Uwe,

          I think we should implement this on the new queryparser using the opaque terms framework described in LUCENE-1823.

          The current implementation of this patch will create backward compatibility syntax problems, for queries using "/" characters
          for example "file paths" or "urls" would be affected. If we are doing this we should change the syntax to allow for opaque terms.

          When we have support for opaque terms in the new queryparser, we can implement regex support with it.

          Opaque terms, is a framework to extend the queryparser syntax to bypass parts of the query to a smaller parsing code (not a full parser), or a analyzer, and allow extensions of the query syntax as needed, without requiring changing the lucene code.

          Show
          Luis Alves added a comment - I agree with Uwe, I think we should implement this on the new queryparser using the opaque terms framework described in LUCENE-1823 . The current implementation of this patch will create backward compatibility syntax problems, for queries using "/" characters for example "file paths" or "urls" would be affected. If we are doing this we should change the syntax to allow for opaque terms. When we have support for opaque terms in the new queryparser, we can implement regex support with it. Opaque terms, is a framework to extend the queryparser syntax to bypass parts of the query to a smaller parsing code (not a full parser), or a analyzer, and allow extensions of the query syntax as needed, without requiring changing the lucene code.
          Hide
          Grant Ingersoll added a comment -

          I have a need for this in the Lucene Query Parser. It simply isn't practical for me to switch to using the contrib Query Parser as that would involve a fair amount of changes in the application. As for the back compat issue, I think we can work around that by having a flag set. I'll look into it a bit more.

          Show
          Grant Ingersoll added a comment - I have a need for this in the Lucene Query Parser. It simply isn't practical for me to switch to using the contrib Query Parser as that would involve a fair amount of changes in the application. As for the back compat issue, I think we can work around that by having a flag set. I'll look into it a bit more.
          Hide
          Robert Muir added a comment -

          regardless of which query parser, I think it would be nice to have regex support in some query parser available.

          doesn't query parser now take Version as a required argument? Maybe the back compat issue could be solved with that???

          Show
          Robert Muir added a comment - regardless of which query parser, I think it would be nice to have regex support in some query parser available. doesn't query parser now take Version as a required argument? Maybe the back compat issue could be solved with that???
          Hide
          Simon Willnauer added a comment -

          I totally see you point but on the other hand I really miss the option to extend the old-fashion query parser. I do not see the new parser being THE lucene query parser by now.Many many people are using the javaCC parser and will do so in the future. I possibly have another solution which preserves backwards compatibility and would support the query extension too.

          The alternative idea is to utilize the fact that queries enclosed in double quotes are passed to getFieldQuery() and are not interpreted by the grammar. Extension queries could be embedded in quotes while the content needs to be escaped. (that is already the case though. To identify which extension should be used we could utilize the field name and a pattern so that users could plug in extension mapped to some field name pattern. Something like: re_field:"^.*$" -> (re_field, RegexExtension)

          that would not change anything in the parser as long as no extension is registered. No new character and no backwards compat issues.

          Thoughts?

          Show
          Simon Willnauer added a comment - I totally see you point but on the other hand I really miss the option to extend the old-fashion query parser. I do not see the new parser being THE lucene query parser by now.Many many people are using the javaCC parser and will do so in the future. I possibly have another solution which preserves backwards compatibility and would support the query extension too. The alternative idea is to utilize the fact that queries enclosed in double quotes are passed to getFieldQuery() and are not interpreted by the grammar. Extension queries could be embedded in quotes while the content needs to be escaped. (that is already the case though. To identify which extension should be used we could utilize the field name and a pattern so that users could plug in extension mapped to some field name pattern. Something like: re_field:"^.*$" -> (re_field, RegexExtension) that would not change anything in the parser as long as no extension is registered. No new character and no backwards compat issues. Thoughts?
          Hide
          Simon Willnauer added a comment -

          I think we can work around that by having a flag set. I'll look into it a bit more.

          Grant, JavaCC only generates parsers, a flag is a semantic check. You need to do a lot more work to do those checks. First step would be to build a tree using jjtree. Then you need to build the symbol table and then you can traverse the tree to do your checks.

          One solution would be creating a parser from two javacc files one for < 3.0 and one or 3.0 - something like robert suggested. Then we could use the Version to choose the corresponding parser impl.

          simon

          Show
          Simon Willnauer added a comment - I think we can work around that by having a flag set. I'll look into it a bit more. Grant, JavaCC only generates parsers, a flag is a semantic check. You need to do a lot more work to do those checks. First step would be to build a tree using jjtree. Then you need to build the symbol table and then you can traverse the tree to do your checks. One solution would be creating a parser from two javacc files one for < 3.0 and one or 3.0 - something like robert suggested. Then we could use the Version to choose the corresponding parser impl. simon
          Hide
          Robert Muir added a comment -

          Simon, personally I would prefer the Version argument used for such things.

          I know this isn't popular, but I'd actually be for having say, a 3.0 javacc grammar file that differs from the 2.9 one, with version driving it.

          yeah it would be duplicated code, but its mostly auto-generated code anyway, and I think it would be simple to understand what is going on.

          Show
          Robert Muir added a comment - Simon, personally I would prefer the Version argument used for such things. I know this isn't popular, but I'd actually be for having say, a 3.0 javacc grammar file that differs from the 2.9 one, with version driving it. yeah it would be duplicated code, but its mostly auto-generated code anyway, and I think it would be simple to understand what is going on.
          Hide
          Luis Alves added a comment - - edited

          Hi Simon,

          I think one problem lucene has today, is that the queryparser code in very tightly integrated with the javacc code. If we continue to do that it will always be very difficult to create a standard way of making small changes to the current queryparser.

          I like the implementation proposed by Simon, is very similar to the opaque term idea, but I would prefer not to overload the fileds names.

          The alternative idea is to utilize the fact that queries enclosed in double quotes are passed to getFieldQuery() and are not interpreted by the grammar. Extension queries could be embedded in quotes while the content needs to be escaped. (that is already the case though. To identify which extension should be used we could utilize the field name and a pattern so that users could plug in extension mapped to some field name pattern. Something like: re_field:"^.*$" -> (re_field, RegexExtension)

          We should decouple the user extensions from the JAVACC generated code. Just like in the new queryparser framework does, the queryparser should allow for the user to register these extensions at run time, and have Interface that extensions should implement.

          For example, something like this:

          QueryParser  qp = QueryParserFactory.getInstance("3.0");
          qp.registerOpaqueTerm("regexp", new QueryParserRegExpParser());
          qp.registerOpaqueTerm("complex_phrases", new QueryParserComplexPhraseParser());
          ...
          qp.parser(" regexp:\"/blah*/\" complex_phrase:\"(sun OR sunny) sky\" ",...);
          

          Of course this is not possible with the lucene queryparser code today ,
          but this is the idea I think we should try to implement.

          For the problem of field overload:
          In your proposal we lose the field name information for the extensions, so we need to another solution that would allow the fieldname to be available for the extensions.

          Here is another idea, that would allow for fieldnames not to be overloaded,
          and allow regular term or phrase syntax for extensions.

          syntax:
          extension:fieldname:"syntax"
          
          examples:
          regexp:title:"/blah[a-z]+[0-9]+/"  <- regexp extension, title index field
          complex_phrase:title:"(sun OR sunny) sky" <- complex_phrase extension, title index field
          
          regexp_phrase::"/blah[a-z]+[0-9]+/"  <- regexp extension, default field
          complex_phrase::"(sun OR sunny) sky" <- complex_phrase extension, default field
          
          title:"blah" <- regular field query
          
          
          Show
          Luis Alves added a comment - - edited Hi Simon, I think one problem lucene has today, is that the queryparser code in very tightly integrated with the javacc code. If we continue to do that it will always be very difficult to create a standard way of making small changes to the current queryparser. I like the implementation proposed by Simon, is very similar to the opaque term idea, but I would prefer not to overload the fileds names. The alternative idea is to utilize the fact that queries enclosed in double quotes are passed to getFieldQuery() and are not interpreted by the grammar. Extension queries could be embedded in quotes while the content needs to be escaped. (that is already the case though. To identify which extension should be used we could utilize the field name and a pattern so that users could plug in extension mapped to some field name pattern. Something like: re_field:"^.*$" -> (re_field, RegexExtension) We should decouple the user extensions from the JAVACC generated code. Just like in the new queryparser framework does, the queryparser should allow for the user to register these extensions at run time, and have Interface that extensions should implement. For example, something like this: QueryParser qp = QueryParserFactory.getInstance( "3.0" ); qp.registerOpaqueTerm( "regexp" , new QueryParserRegExpParser()); qp.registerOpaqueTerm( "complex_phrases" , new QueryParserComplexPhraseParser()); ... qp.parser( " regexp:\" /blah*/\ " complex_phrase:\" (sun OR sunny) sky\ " " ,...); Of course this is not possible with the lucene queryparser code today , but this is the idea I think we should try to implement. For the problem of field overload: In your proposal we lose the field name information for the extensions, so we need to another solution that would allow the fieldname to be available for the extensions. Here is another idea, that would allow for fieldnames not to be overloaded, and allow regular term or phrase syntax for extensions. syntax: extension:fieldname: "syntax" examples: regexp:title: "/blah[a-z]+[0-9]+/" <- regexp extension, title index field complex_phrase:title: "(sun OR sunny) sky" <- complex_phrase extension, title index field regexp_phrase:: "/blah[a-z]+[0-9]+/" <- regexp extension, default field complex_phrase:: "(sun OR sunny) sky" <- complex_phrase extension, default field title: "blah" <- regular field query
          Hide
          Luis Alves added a comment - - edited

          Grant, JavaCC only generates parsers, a flag is a semantic check. You need to do a lot more work to do those checks.
          First step would be to build a tree using jjtree.
          Then you need to build the symbol table and then you can traverse the tree to do your checks.

          In the new queryparser we don't use jjtree, but the same concept is implemented in the new queryparser,
          the ouput from the SyntaxParser interface is a syntax tree, this tree is not related with any lucene objects just like jjtree.
          But I think this is a ugly solution.

          I think if we use the new queryparser, it allows for multiple SyntaxParsers to use the same Processors and the Builders.
          And with a small implementation of a SyntaxParser(javacc, jflex, antlr, java tokenizer, etc), you can use the same Processors and Builders to create a lucene query.
          This will avoid duplicate code and allow for multiple syntaxes.

          I don't want to be preacher here, but some of these problems are already solved in the new queryparser framework, we just need to keep improving it, by adding more syntaxes, extensions and features to it.

          I know the new queryparser is not in main, but that can be fixed in 3.1.
          If the community thinks it is stable, we should move it to main.

          Show
          Luis Alves added a comment - - edited Grant, JavaCC only generates parsers, a flag is a semantic check. You need to do a lot more work to do those checks. First step would be to build a tree using jjtree. Then you need to build the symbol table and then you can traverse the tree to do your checks. In the new queryparser we don't use jjtree, but the same concept is implemented in the new queryparser, the ouput from the SyntaxParser interface is a syntax tree, this tree is not related with any lucene objects just like jjtree. But I think this is a ugly solution. I think if we use the new queryparser, it allows for multiple SyntaxParsers to use the same Processors and the Builders. And with a small implementation of a SyntaxParser(javacc, jflex, antlr, java tokenizer, etc), you can use the same Processors and Builders to create a lucene query. This will avoid duplicate code and allow for multiple syntaxes. I don't want to be preacher here, but some of these problems are already solved in the new queryparser framework, we just need to keep improving it, by adding more syntaxes, extensions and features to it. I know the new queryparser is not in main, but that can be fixed in 3.1. If the community thinks it is stable, we should move it to main.
          Hide
          Yonik Seeley added a comment -

          I think one problem lucene has today, is that the queryparser code in very tightly integrated with the javacc code.

          This almost seems more of an issue for core lucene developers - it's an annoyance that one needs to recompile the javacc grammar when just tweaking what one of the methods does. Seems like this could easily be solved by just separating into two files... the javacc grammar would have a base class that left things like getFieldQuery() unimplemented, and then the standard QueryParser (in a different java file) would override and implement those methods.

          We should decouple the user extensions from the JAVACC generated code.

          It already is today via subclassing QueryParser and overriding methods like getFieldQuery... that's very simple for users to understand and to leverage.

          Just like in the new queryparser framework does, the queryparser should allow for the user to register these extensions at run time, and have Interface that extensions should implement.

          I don't understand the motivation for this - it's complex and harder for a user to understand. Java's own extension mechanism (overriding) has worked perfectly fine in the past.

          Show
          Yonik Seeley added a comment - I think one problem lucene has today, is that the queryparser code in very tightly integrated with the javacc code. This almost seems more of an issue for core lucene developers - it's an annoyance that one needs to recompile the javacc grammar when just tweaking what one of the methods does. Seems like this could easily be solved by just separating into two files... the javacc grammar would have a base class that left things like getFieldQuery() unimplemented, and then the standard QueryParser (in a different java file) would override and implement those methods. We should decouple the user extensions from the JAVACC generated code. It already is today via subclassing QueryParser and overriding methods like getFieldQuery... that's very simple for users to understand and to leverage. Just like in the new queryparser framework does, the queryparser should allow for the user to register these extensions at run time, and have Interface that extensions should implement. I don't understand the motivation for this - it's complex and harder for a user to understand. Java's own extension mechanism (overriding) has worked perfectly fine in the past.
          Hide
          Luis Alves added a comment -

          Hi Yonik,

          This almost seems more of an issue for core lucene developers - it's an annoyance that one needs to recompile the javacc grammar when just tweaking what one of the methods does. Seems like this could easily be solved by just separating into two files... the javacc grammar would have a base class that left things like getFieldQuery() unimplemented, and then the standard QueryParser (in a different java file) would override and implement those methods.

          This solution does not fix the problem of having multiple syntaxes sharing the same lucene processing code. For example if you have one javacc grammar and one in antlr, you can't use lucene QueryParser, to process the output of both. You will need to re-implement the QueryParser recursive logic in a diff class to be able to use antlr.

          It already is today via subclassing QueryParser and overriding methods like getFieldQuery... that's very simple for users to understand and to leverage.

          True. This is simple, but is not customizable.

          • You can't change the syntax.
          • You can't reuse the QueryParser logic with other parsers
          • If you do have to change syntax, you can't reuse QueryParser class anymore, you need to maintain your own copy of the class.

          You can read LUCENE-1567 to understand the reasons for the new queryparser.
          But the focus of the new queryparser is extensibility and customization,
          without changing lucene code, but reusing lucene logic as much as possible.

          If you look at TestSpanQueryParserSimpleSample in queryparser contrib, or LUCENE-1938 Precedence query parser.
          It illustrates two cases that would be very difficult to do in the current QueryParser in lucene by overriding methods.

          Actually the a implementation PrecedenceQueryParser exists today in contrib/misc. That contains a seperated javacc grammar and does not share any code with the main lucene Queryparser, and it illustrates the problem I described above (code duplication, impossible to reuse if grammar is different, easily gets outdated when the core queryparser changes)

          I'm not trying to say the QueryParser in main is worst than the one in contrib,

          What I'm trying to describe is that the one in contrib is more modular and if we build the modules
          for the lucene users. The users will be able to build smarter and more sophisticated solutions using Lucene in less time.
          Users can decide what modules to use in the queryparser and build their query pipelines with less work.

          Users can also use the pre-built ones like StandardQueryParser or PrecedenceQueryParser, these should be as easy to use as the old queryparser in main.

          Show
          Luis Alves added a comment - Hi Yonik, This almost seems more of an issue for core lucene developers - it's an annoyance that one needs to recompile the javacc grammar when just tweaking what one of the methods does. Seems like this could easily be solved by just separating into two files... the javacc grammar would have a base class that left things like getFieldQuery() unimplemented, and then the standard QueryParser (in a different java file) would override and implement those methods. This solution does not fix the problem of having multiple syntaxes sharing the same lucene processing code. For example if you have one javacc grammar and one in antlr, you can't use lucene QueryParser, to process the output of both. You will need to re-implement the QueryParser recursive logic in a diff class to be able to use antlr. It already is today via subclassing QueryParser and overriding methods like getFieldQuery... that's very simple for users to understand and to leverage. True. This is simple, but is not customizable. You can't change the syntax. You can't reuse the QueryParser logic with other parsers If you do have to change syntax, you can't reuse QueryParser class anymore, you need to maintain your own copy of the class. You can read LUCENE-1567 to understand the reasons for the new queryparser. But the focus of the new queryparser is extensibility and customization, without changing lucene code, but reusing lucene logic as much as possible. If you look at TestSpanQueryParserSimpleSample in queryparser contrib, or LUCENE-1938 Precedence query parser. It illustrates two cases that would be very difficult to do in the current QueryParser in lucene by overriding methods. Actually the a implementation PrecedenceQueryParser exists today in contrib/misc. That contains a seperated javacc grammar and does not share any code with the main lucene Queryparser, and it illustrates the problem I described above (code duplication, impossible to reuse if grammar is different, easily gets outdated when the core queryparser changes) I'm not trying to say the QueryParser in main is worst than the one in contrib, What I'm trying to describe is that the one in contrib is more modular and if we build the modules for the lucene users. The users will be able to build smarter and more sophisticated solutions using Lucene in less time. Users can decide what modules to use in the queryparser and build their query pipelines with less work. Users can also use the pre-built ones like StandardQueryParser or PrecedenceQueryParser, these should be as easy to use as the old queryparser in main.
          Hide
          Adriano Crestani added a comment -

          This is a new feature already suggested by Luis and Shai (maybe others too) before, the ability to delegate to another parser the syntax processing of certain piece of the query string. This feature is a new feature to both: core QP and contrib QP.

          So, I think we should focus more on how/when a query substring will be delegated to another parser and not discuss about how/when any logic will be applied to it. I think in both QPs, this part is already defined.

          First, to identify this substring we would need a open and close token. It could be either double-quote, slash or whatever. The ideal solution would allow the user to specify these two tokens. Unfortunately, I think JavaCC is not so flexible to allow defining these tokens programatically (after parser generation by JavaCC). So we need to stick with some specific open/close token, that's one decision we need to take. Maybe we could provide a property file, where the user could specify the open/close token and regenerate Lucene QP using 'ant javacc' (which is pretty easy today). Anyway, by default, we could use any new token. I don't agree with double-quotes (as I think someone suggested), it's already used by phrases, so, slash is fine for me, as already defined in Simon's patch.

          Now, about any semantic(logic) processing performed on any query substring, it will be up to the QP implementation. In the core QP, its own extension would be responsible to do this processing. In the contrib QP, the extension parser would only parse the substring and return a QueryNode, which will be later processed, after the syntax parsing is complete, by the query node processors. As I said before, this part is defined and I don't think we should discuss it on this topic.

          I like Simon's patch, I think the same approach can be applied to the contrib QP. The only part I disagree is when you pass the fieldname to the extension parser, I wouldn't implement that on the contrib parser, because it assumes the syntax always has field names. Anyway, for the core QP, I see the reason why you pass the fieldname, and it's completely related to the way the core QP implements the semantic (logic) processing. So, in future, if the main core QP needs to pass a new info to its extension parser, the extension parser interface would have to be changed :S...here I go again starting a new discussion about how semantic (logic) processing should be handled

          Show
          Adriano Crestani added a comment - This is a new feature already suggested by Luis and Shai (maybe others too) before, the ability to delegate to another parser the syntax processing of certain piece of the query string. This feature is a new feature to both: core QP and contrib QP. So, I think we should focus more on how/when a query substring will be delegated to another parser and not discuss about how/when any logic will be applied to it. I think in both QPs, this part is already defined. First, to identify this substring we would need a open and close token. It could be either double-quote, slash or whatever. The ideal solution would allow the user to specify these two tokens. Unfortunately, I think JavaCC is not so flexible to allow defining these tokens programatically (after parser generation by JavaCC). So we need to stick with some specific open/close token, that's one decision we need to take. Maybe we could provide a property file, where the user could specify the open/close token and regenerate Lucene QP using 'ant javacc' (which is pretty easy today). Anyway, by default, we could use any new token. I don't agree with double-quotes (as I think someone suggested), it's already used by phrases, so, slash is fine for me, as already defined in Simon's patch. Now, about any semantic(logic) processing performed on any query substring, it will be up to the QP implementation. In the core QP, its own extension would be responsible to do this processing. In the contrib QP, the extension parser would only parse the substring and return a QueryNode, which will be later processed, after the syntax parsing is complete, by the query node processors. As I said before, this part is defined and I don't think we should discuss it on this topic. I like Simon's patch, I think the same approach can be applied to the contrib QP. The only part I disagree is when you pass the fieldname to the extension parser, I wouldn't implement that on the contrib parser, because it assumes the syntax always has field names. Anyway, for the core QP, I see the reason why you pass the fieldname, and it's completely related to the way the core QP implements the semantic (logic) processing. So, in future, if the main core QP needs to pass a new info to its extension parser, the extension parser interface would have to be changed :S...here I go again starting a new discussion about how semantic (logic) processing should be handled
          Hide
          Luis Alves added a comment -

          Simon and Adriano,
          Can you comment on the example below.

          syntax:
          extension:fieldname:"syntax"

          examples:
          regexp:title:"/blah[a-z][0-9]/" <- regexp extension, title index field
          complex_phrase:title:"(sun OR sunny) sky" <- complex_phrase extension, title index field

          regexp_phrase::"/blah[a-z][0-9]/" <- regexp extension, default field
          complex_phrase::"(sun OR sunny) sky" <- complex_phrase extension, default field

          title:"blah" <- regular field query

          This would allow the filedname and phrases or terms to be passed to a extension, and still be very compatible with the old syntax.
          (only double quotes and backslash need to be escaped in a phrase, so it should cover a big number of future extensions)

          Something like this would work for base64, but it would be target at programmatic layer, since users will not be able to generate that base64 strings, and it is supported by the syntax described above.

          binary:image:"base64:TWFuIGlzIGRpc3Rpbmd1aXNoZWQsIG5vdCBvbmx5IGJ5IGhpcyByZWFzb24sIGJ1dCBieSB0aGlz"

          For extensions that won't work well with escaping double quotes and back-slash, we probably need some other delimiter, probably more than a single character
          some sugestions below:

          xml style:
          1) xpath:xmlfield:<[[ //title[@lang="c:\windowspath\folder" ]]>
          2) xpath:xmlfield:<![CDATA[ //title[@lang="c:\windowspath\folder" ]]>

          another one
          3) xpath:xmlfield:![CDATA[ //title[@lang="c:\windowspath\folder" ]]!

          Any of the sequences above is good OK with me.
          This should not affect old queries very much since the new syntax tokens would be
          ":<[[ " and "]]>" and these shouldn't be common on any lucene queries.
          Still not very user friendly, but better than the base64 approach.

          Show
          Luis Alves added a comment - Simon and Adriano, Can you comment on the example below. syntax: extension:fieldname:"syntax" examples: regexp:title:"/blah [a-z] [0-9] /" <- regexp extension, title index field complex_phrase:title:"(sun OR sunny) sky" <- complex_phrase extension, title index field regexp_phrase::"/blah [a-z] [0-9] /" <- regexp extension, default field complex_phrase::"(sun OR sunny) sky" <- complex_phrase extension, default field title:"blah" <- regular field query This would allow the filedname and phrases or terms to be passed to a extension, and still be very compatible with the old syntax. (only double quotes and backslash need to be escaped in a phrase, so it should cover a big number of future extensions) Something like this would work for base64, but it would be target at programmatic layer, since users will not be able to generate that base64 strings, and it is supported by the syntax described above. binary:image:"base64:TWFuIGlzIGRpc3Rpbmd1aXNoZWQsIG5vdCBvbmx5IGJ5IGhpcyByZWFzb24sIGJ1dCBieSB0aGlz" For extensions that won't work well with escaping double quotes and back-slash, we probably need some other delimiter, probably more than a single character some sugestions below: xml style: 1) xpath:xmlfield:<[[ //title [@lang="c:\windowspath\folder" ] ]> 2) xpath:xmlfield:<![CDATA[ //title [@lang="c:\windowspath\folder" ] ]> another one 3) xpath:xmlfield:![CDATA[ //title [@lang="c:\windowspath\folder" ] ]! Any of the sequences above is good OK with me. This should not affect old queries very much since the new syntax tokens would be ":<[[ " and "]]>" and these shouldn't be common on any lucene queries. Still not very user friendly, but better than the base64 approach.
          Hide
          Simon Willnauer added a comment -

          Luis,

          syntax:
          extension:fieldname:"syntax"

          examples:
          regexp:title:"/blah[a-z][0-9]/" <- regexp extension, title index field
          complex_phrase:title:"(sun OR sunny) sky" <- complex_phrase extension, title index field

          regexp_phrase::"/blah[a-z][0-9]/" <- regexp extension, default field
          complex_phrase::"(sun OR sunny) sky" <- complex_phrase extension, default field

          title:"blah" <- regular field query

          This is pretty much what I suggested above. We can extend the queryparser without breaking the backwards compatibility just by adding some code which is aware of the fieldname scheme. Even this could be extendable. FieldNames are terms and therefore they can not contain unescaped special chars like : { ] ... I would not even hard code the separator into the query parser but have the field name processed by something pluggable. So If somebody wants to have a regex extension they could use re\:field: or re\:: or re_field:....
          Escaping a field is easy, just like you would do it with a term.
          More interesting is that we do not change any syntax, no special character but we can add a default implementation with a default implementation for extensions. This could be a whole API which takes are of creating and escaping the field name, building the query once it is passed to the extension etc.
          In a first step we can resolve the extension the second step calls the extension and build the query. If no extension is registered the query parser works like in previous versions so it is all up to the user.

          @Adriano:

          The only part I disagree is when you pass the fieldname to the extension parser, I wouldn't implement that on the contrib parser, because it assumes the syntax always has field names. Anyway, for the core QP, I see the reason why you pass the fieldname

          You need the field to create you query in the extension, the field will always be set to either the default field or the explicitly defined field in the query. No reason why we should not pass it.
          I agree with you that we should wrap the information in a class so that we do not need to change the method signature if something has to be changed in the future. Instead we just add a new member to the wrapper though.

          Show
          Simon Willnauer added a comment - Luis, syntax: extension:fieldname:"syntax" examples: regexp:title:"/blah [a-z] [0-9] /" <- regexp extension, title index field complex_phrase:title:"(sun OR sunny) sky" <- complex_phrase extension, title index field regexp_phrase::"/blah [a-z] [0-9] /" <- regexp extension, default field complex_phrase::"(sun OR sunny) sky" <- complex_phrase extension, default field title:"blah" <- regular field query This is pretty much what I suggested above. We can extend the queryparser without breaking the backwards compatibility just by adding some code which is aware of the fieldname scheme. Even this could be extendable. FieldNames are terms and therefore they can not contain unescaped special chars like : { ] ... I would not even hard code the separator into the query parser but have the field name processed by something pluggable. So If somebody wants to have a regex extension they could use re\:field: or re\:: or re_field:.... Escaping a field is easy, just like you would do it with a term. More interesting is that we do not change any syntax, no special character but we can add a default implementation with a default implementation for extensions. This could be a whole API which takes are of creating and escaping the field name, building the query once it is passed to the extension etc. In a first step we can resolve the extension the second step calls the extension and build the query. If no extension is registered the query parser works like in previous versions so it is all up to the user. @Adriano: The only part I disagree is when you pass the fieldname to the extension parser, I wouldn't implement that on the contrib parser, because it assumes the syntax always has field names. Anyway, for the core QP, I see the reason why you pass the fieldname You need the field to create you query in the extension, the field will always be set to either the default field or the explicitly defined field in the query. No reason why we should not pass it. I agree with you that we should wrap the information in a class so that we do not need to change the method signature if something has to be changed in the future. Instead we just add a new member to the wrapper though.
          Hide
          Adriano Crestani added a comment -

          This is pretty much what I suggested above. We can extend the queryparser without breaking the backwards compatibility just by adding some code which is aware of the fieldname scheme. Even this could be extendable. FieldNames are terms and therefore they can not contain unescaped special chars like : { ] ... I would not even hard code the separator into the query parser but have the field name processed by something pluggable. So If somebody wants to have a regex extension they could use re\:field: or re\:: or re_field:....
          Escaping a field is easy, just like you would do it with a term.
          More interesting is that we do not change any syntax, no special character but we can add a default implementation with a default implementation for extensions. This could be a whole API which takes are of creating and escaping the field name, building the query once it is passed to the extension etc.
          In a first step we can resolve the extension the second step calls the extension and build the query. If no extension is registered the query parser works like in previous versions so it is all up to the user.

          +1

          I agree with you that we should wrap the information in a class so that we do not need to change the method signature if something has to be changed in the future. Instead we just add a new member to the wrapper though.

          A Map should solve this problem

          Show
          Adriano Crestani added a comment - This is pretty much what I suggested above. We can extend the queryparser without breaking the backwards compatibility just by adding some code which is aware of the fieldname scheme. Even this could be extendable. FieldNames are terms and therefore they can not contain unescaped special chars like : { ] ... I would not even hard code the separator into the query parser but have the field name processed by something pluggable. So If somebody wants to have a regex extension they could use re\:field: or re\:: or re_field:.... Escaping a field is easy, just like you would do it with a term. More interesting is that we do not change any syntax, no special character but we can add a default implementation with a default implementation for extensions. This could be a whole API which takes are of creating and escaping the field name, building the query once it is passed to the extension etc. In a first step we can resolve the extension the second step calls the extension and build the query. If no extension is registered the query parser works like in previous versions so it is all up to the user. +1 I agree with you that we should wrap the information in a class so that we do not need to change the method signature if something has to be changed in the future. Instead we just add a new member to the wrapper though. A Map should solve this problem
          Hide
          Luis Alves added a comment - - edited

          +1

          I'm work on changing the queryparser on Contrib, to implement that syntax for the opaque terms.

          Show
          Luis Alves added a comment - - edited +1 I'm work on changing the queryparser on Contrib, to implement that syntax for the opaque terms.
          Hide
          Simon Willnauer added a comment -

          This patch implements the field:ext: approach. I will do some more work on the javadoc - pushing it out for comments!

          Comments on class naming are welcome too

          Show
          Simon Willnauer added a comment - This patch implements the field:ext: approach. I will do some more work on the javadoc - pushing it out for comments! Comments on class naming are welcome too
          Hide
          Luis Alves added a comment -

          Hi Simon,

          I also posted a patch in LUCENE-1823, that implements the ext:field approach, and added a junit that implements a new QParser for regex.

          If you have time can you take a look at the classes in the standart2 test folder, RegexQueryParser amd TestOpaqueExtensionQuery and review the testcase

          Show
          Luis Alves added a comment - Hi Simon, I also posted a patch in LUCENE-1823 , that implements the ext:field approach, and added a junit that implements a new QParser for regex. If you have time can you take a look at the classes in the standart2 test folder, RegexQueryParser amd TestOpaqueExtensionQuery and review the testcase
          Hide
          David Kaelbling added a comment - - edited

          I apologize if I haven't read the comments carefully enough, but in LUCENE-2039_field_ext.patch why is ExtendableQueryParser final? That means (for example) that ComplexPhraseQueryParser cannot subclass it. In the earlier LUCENE-2039.patch the complex phrase parser picked up the changes for free.

          And would RegexParserExtension maybe be easier to use if it set the RegexCapabilities on the new RegexQuery it is returning?

          Show
          David Kaelbling added a comment - - edited I apologize if I haven't read the comments carefully enough, but in LUCENE-2039 _field_ext.patch why is ExtendableQueryParser final? That means (for example) that ComplexPhraseQueryParser cannot subclass it. In the earlier LUCENE-2039 .patch the complex phrase parser picked up the changes for free. And would RegexParserExtension maybe be easier to use if it set the RegexCapabilities on the new RegexQuery it is returning?
          Hide
          Robert Muir added a comment -

          Hi, in my opinion RegexParserExtension should not be tied to RegexQuery/RegexCapabilities.
          This is only one possible implementation of regex support and has some scalability problems.

          Show
          Robert Muir added a comment - Hi, in my opinion RegexParserExtension should not be tied to RegexQuery/RegexCapabilities. This is only one possible implementation of regex support and has some scalability problems.
          Hide
          Simon Willnauer added a comment -

          That means (for example) that ComplexPhraseQueryParser cannot subclass it

          This patch was not meant to include ComplexPhraseQueryParser it is rather a proposal for the concept of field "overloading". But you are right the parser should not be final at all especially if you wanna override a get*query method it should be expendable.

          Hi, in my opinion RegexParserExtension should not be tied to RegexQuery/RegexCapabilities.

          This is only one possible implementation of regex support and has some scalability problems.

          Also true, but again this is just a POC to show how it would look like. Comments on the concept would be more useful by now.
          I did write that up during a train ride and aimed to get some comments. I already have worked on it and will upload a new patch soon which includes RegexCapabilities + tests.
          Thanks again for the pointer with the final class.

          Show
          Simon Willnauer added a comment - That means (for example) that ComplexPhraseQueryParser cannot subclass it This patch was not meant to include ComplexPhraseQueryParser it is rather a proposal for the concept of field "overloading". But you are right the parser should not be final at all especially if you wanna override a get*query method it should be expendable. Hi, in my opinion RegexParserExtension should not be tied to RegexQuery/RegexCapabilities. This is only one possible implementation of regex support and has some scalability problems. Also true, but again this is just a POC to show how it would look like. Comments on the concept would be more useful by now. I did write that up during a train ride and aimed to get some comments. I already have worked on it and will upload a new patch soon which includes RegexCapabilities + tests. Thanks again for the pointer with the final class.
          Hide
          Simon Willnauer added a comment -

          Updated the patch

          • removed final modifier from ExtendableQueryParser
          • added RegexCapabilities ctor to RegexParserExtension

          I still need to work on the Extensions JavaDoc - and I'm not too happy with the name.

          Comments on the concept are very welcome.

          Show
          Simon Willnauer added a comment - Updated the patch removed final modifier from ExtendableQueryParser added RegexCapabilities ctor to RegexParserExtension I still need to work on the Extensions JavaDoc - and I'm not too happy with the name. Comments on the concept are very welcome.
          Hide
          Mark Miller added a comment -

          It looks like the patch puts this in core? Any compelling reason? Offhand I'd think it would go in the misc contrib with the other queryparsers that extend the core queryparser.

          Show
          Mark Miller added a comment - It looks like the patch puts this in core? Any compelling reason? Offhand I'd think it would go in the misc contrib with the other queryparsers that extend the core queryparser.
          Hide
          Simon Willnauer added a comment -

          Offhand I'd think it would go in the misc contrib with the other queryparsers that extend the core queryparser.

          For sure. I will attach another patch - did not thing about that too much when I moved from the first proposal which modified the core one.

          Show
          Simon Willnauer added a comment - Offhand I'd think it would go in the misc contrib with the other queryparsers that extend the core queryparser. For sure. I will attach another patch - did not thing about that too much when I moved from the first proposal which modified the core one.
          Hide
          Simon Willnauer added a comment -

          moved ext parser to contrib/misc

          Show
          Simon Willnauer added a comment - moved ext parser to contrib/misc
          Hide
          Simon Willnauer added a comment -

          Took over from Grant

          Show
          Simon Willnauer added a comment - Took over from Grant
          Hide
          Simon Willnauer added a comment -

          I finished the JavaDoc, added package.html file.
          I again refactored RegexParserExtension to use a factory method to obtain an instance of RegexCapabilities. RegexCapabilities is stateful and can not be shared so subclassing to change seems to be reasonable. JavaUtil seems to be reasonable anyway after the latest Jakarta Regexp drama .

          I had to introduce a compile time dependency from regex to misc to build the extension - should add misc as a dependency to maven in this case?

          Show
          Simon Willnauer added a comment - I finished the JavaDoc, added package.html file. I again refactored RegexParserExtension to use a factory method to obtain an instance of RegexCapabilities. RegexCapabilities is stateful and can not be shared so subclassing to change seems to be reasonable. JavaUtil seems to be reasonable anyway after the latest Jakarta Regexp drama . I had to introduce a compile time dependency from regex to misc to build the extension - should add misc as a dependency to maven in this case?
          Hide
          Robert Muir added a comment -

          JavaUtil seems to be reasonable anyway after the latest Jakarta Regexp drama

          yeah, we shouldn't mislead anyone into believing the constant prefix with jakarta actually works even now.
          for example the constant prefix of (ab|ac) is not "a" but instead empty string.

          Show
          Robert Muir added a comment - JavaUtil seems to be reasonable anyway after the latest Jakarta Regexp drama yeah, we shouldn't mislead anyone into believing the constant prefix with jakarta actually works even now. for example the constant prefix of (ab|ac) is not "a" but instead empty string.
          Hide
          Simon Willnauer added a comment -

          The contrib/regex dependency on contrib/misc buggs me a bit though. I have the impression that this regex default extension should not be part of this patch. The extension seems to be so trivial that users could implement it on their own. This would save us the dependency and IMO would not be a problem for users though.

          Any thoughts?

          Show
          Simon Willnauer added a comment - The contrib/regex dependency on contrib/misc buggs me a bit though. I have the impression that this regex default extension should not be part of this patch. The extension seems to be so trivial that users could implement it on their own. This would save us the dependency and IMO would not be a problem for users though. Any thoughts?
          Hide
          Simon Willnauer added a comment -

          I removed the RegexExtension in this patch as in my opinion this is not worth the dependency. As soon as roberts automation patch is in core we won't have this problem anymore anyway.
          Users should add trivial extensions like on their own.
          One other thing I was thinking about is adding another ExtensionParser which subclasses ComplexPhraseQueryParser. This is a benefit from moving it into misc, I think we should do it.
          Thoughts?

          Show
          Simon Willnauer added a comment - I removed the RegexExtension in this patch as in my opinion this is not worth the dependency. As soon as roberts automation patch is in core we won't have this problem anymore anyway. Users should add trivial extensions like on their own. One other thing I was thinking about is adding another ExtensionParser which subclasses ComplexPhraseQueryParser. This is a benefit from moving it into misc, I think we should do it. Thoughts?
          Hide
          Mark Miller added a comment -

          One other thing I was thinking about is adding another ExtensionParser which subclasses ComplexPhraseQueryParser.

          We should be careful here - the thought at one time was to remove ComplexPhraseQueryParser soon and replace it with one written for the new QueryParser. In that case, we might not want to add more that relies on it - the new one might not match the feature/results of the old one. Just an FYI though. In my mind, who knows if that will end up happening, and it shouldn't necessarily block other good work just because it might. Just thought I'd mention it.

          Show
          Mark Miller added a comment - One other thing I was thinking about is adding another ExtensionParser which subclasses ComplexPhraseQueryParser. We should be careful here - the thought at one time was to remove ComplexPhraseQueryParser soon and replace it with one written for the new QueryParser. In that case, we might not want to add more that relies on it - the new one might not match the feature/results of the old one. Just an FYI though. In my mind, who knows if that will end up happening, and it shouldn't necessarily block other good work just because it might. Just thought I'd mention it.
          Hide
          Simon Willnauer added a comment -

          ...was to remove ComplexPhraseQueryParser soon and replace it with one written for the new QueryParser.

          Thanks Mark, I do not recall this so good that you mention it again. Either way, this was just a suggestion which came to my mind. Another suggestion for maybe a future development is to move this into the core query parser as it does not change any behavior as long as you do not explicitly specify any extension.
          To me these ideas are future considerations which could be done in sep. issues once this is in. I plan to commit the current patch until 12/06/09 if nobody objects.

          Show
          Simon Willnauer added a comment - ...was to remove ComplexPhraseQueryParser soon and replace it with one written for the new QueryParser. Thanks Mark, I do not recall this so good that you mention it again. Either way, this was just a suggestion which came to my mind. Another suggestion for maybe a future development is to move this into the core query parser as it does not change any behavior as long as you do not explicitly specify any extension. To me these ideas are future considerations which could be done in sep. issues once this is in. I plan to commit the current patch until 12/06/09 if nobody objects.
          Hide
          Simon Willnauer added a comment -

          I will commit this tomorrow if nobody objects.

          Show
          Simon Willnauer added a comment - I will commit this tomorrow if nobody objects.
          Hide
          Simon Willnauer added a comment -

          Commited in revision 887533

          Show
          Simon Willnauer added a comment - Commited in revision 887533
          Hide
          David Kaelbling added a comment -

          Currently the master parser doesn't pass settings down to the extension parsers (things like setAllowLeadingWildcard, setMultiTermRewriteMethod, etc.) Should it?

          Show
          David Kaelbling added a comment - Currently the master parser doesn't pass settings down to the extension parsers (things like setAllowLeadingWildcard, setMultiTermRewriteMethod, etc.) Should it?
          Hide
          Simon Willnauer added a comment - - edited

          Currently the master parser doesn't pass settings down to the extension parsers (things like setAllowLeadingWildcard, setMultiTermRewriteMethod, etc.) Should it?

          David, so some sort of parsers this would make sense though. Yet, we either reflect all of the setting from the "master" parser or none of it. I would suggest to modify the ExtensionQuery ctor to take a QueryParser instance and add the corresponding getters to it. That way we can maintain a consistent view on the setting even if they are reset on the master parser without overriding all setters though. Would that make sense?
          Another way of enable this is to pass the query parser instance into ExtensionQuery and simply add a getter so extension parsers can access the parser and its utilities too.
          Anyhow, we should open a new issue for that. I will do so

          Show
          Simon Willnauer added a comment - - edited Currently the master parser doesn't pass settings down to the extension parsers (things like setAllowLeadingWildcard, setMultiTermRewriteMethod, etc.) Should it? David, so some sort of parsers this would make sense though. Yet, we either reflect all of the setting from the "master" parser or none of it. I would suggest to modify the ExtensionQuery ctor to take a QueryParser instance and add the corresponding getters to it. That way we can maintain a consistent view on the setting even if they are reset on the master parser without overriding all setters though. Would that make sense? Another way of enable this is to pass the query parser instance into ExtensionQuery and simply add a getter so extension parsers can access the parser and its utilities too. Anyhow, we should open a new issue for that. I will do so
          Hide
          David Kaelbling added a comment -

          > I would suggest to modify the ExtensionQuery ctor to take a QueryParser instance and add
          > the corresponding getters to it. That way we can maintain a consistent view on the setting
          > even if they are reset on the master parser without overriding all setters though. Would
          > that make sense?

          Simon, it sounds like the right direction! But relying on ExtensionParser implementors to manually copy all the parent settings to the child seems like a maintenance problem. We add new settings relatively often. Unfortunately there's nothing like the token Attribute stuff for QueryParser...

          > Another way of enable this is to pass the query parser instance into ExtensionQuery and
          > simply add a getter so extension parsers can access the parser and its utilities too.

          Umm, I don't quite follow. If the extension wraps an existing parser, the extension would have to subclass it, override all the getters/setters to delegate to the parent, and trust that everyone uses them? That's not currently true – for example QueryParser.getPrefixQuery() directly accesses allowLeadingWildcard, without using the getter. There are probably other cases too, that's just the first one I checked.

          Show
          David Kaelbling added a comment - > I would suggest to modify the ExtensionQuery ctor to take a QueryParser instance and add > the corresponding getters to it. That way we can maintain a consistent view on the setting > even if they are reset on the master parser without overriding all setters though. Would > that make sense? Simon, it sounds like the right direction! But relying on ExtensionParser implementors to manually copy all the parent settings to the child seems like a maintenance problem. We add new settings relatively often. Unfortunately there's nothing like the token Attribute stuff for QueryParser... > Another way of enable this is to pass the query parser instance into ExtensionQuery and > simply add a getter so extension parsers can access the parser and its utilities too. Umm, I don't quite follow. If the extension wraps an existing parser, the extension would have to subclass it, override all the getters/setters to delegate to the parent, and trust that everyone uses them? That's not currently true – for example QueryParser.getPrefixQuery() directly accesses allowLeadingWildcard, without using the getter. There are probably other cases too, that's just the first one I checked.
          Hide
          Simon Willnauer added a comment - - edited

          Simon, it sounds like the right direction! But relying on ExtensionParser implementors to manually copy all the parent settings to the child seems like a maintenance problem. We add new settings relatively often. Unfortunately there's nothing like the token Attribute stuff for QueryParser...

          yeah that is the reason why I suggested passing the top level parser instance itself to the extension and expose it there.

          Umm, I don't quite follow. If the extension wraps an existing parser, the extension would have to subclass it, override all the getters/setters to delegate to the parent, and trust that everyone uses them? That's not currently true - for example QueryParser.getPrefixQuery() directly accesses allowLeadingWildcard, without using the getter. There are probably other cases too, that's just the first one I checked.

          David here is an example of what I mean by wrapping the top level parser. This would give you access to all the settings right away inside your extension.

          Show
          Simon Willnauer added a comment - - edited Simon, it sounds like the right direction! But relying on ExtensionParser implementors to manually copy all the parent settings to the child seems like a maintenance problem. We add new settings relatively often. Unfortunately there's nothing like the token Attribute stuff for QueryParser... yeah that is the reason why I suggested passing the top level parser instance itself to the extension and expose it there. Umm, I don't quite follow. If the extension wraps an existing parser, the extension would have to subclass it, override all the getters/setters to delegate to the parent, and trust that everyone uses them? That's not currently true - for example QueryParser.getPrefixQuery() directly accesses allowLeadingWildcard, without using the getter. There are probably other cases too, that's just the first one I checked. David here is an example of what I mean by wrapping the top level parser. This would give you access to all the settings right away inside your extension.
          Hide
          David Kaelbling added a comment -

          > David here is an example of what I mean by wrapping the top level parser. This would give
          > you access to all the settings right away inside your extension.

          Maybe I'm oversimplifying. As a lazy implementor I want to write something like this:

          ext.add("regex", new ParserExtension() {
          public Query parse(ExtensionQuery q) throws ParseException

          { return new RegexQuery(new Term(q.getField(), q.getRawQueryString())); }

          } );

          My ParserExtension can query any top level parser settings it likes, but how do I get those
          values down into RegexQuery? Or whatever full parser I'm invoking?

          Show
          David Kaelbling added a comment - > David here is an example of what I mean by wrapping the top level parser. This would give > you access to all the settings right away inside your extension. Maybe I'm oversimplifying. As a lazy implementor I want to write something like this: ext.add("regex", new ParserExtension() { public Query parse(ExtensionQuery q) throws ParseException { return new RegexQuery(new Term(q.getField(), q.getRawQueryString())); } } ); My ParserExtension can query any top level parser settings it likes, but how do I get those values down into RegexQuery? Or whatever full parser I'm invoking?
          Hide
          Simon Willnauer added a comment - - edited

          Am I missing something? Why not using the getTopLevelParser() method?
          like that:

           extensions.add("regex", new ParserExtension() {
          		
          		@Override
          		public Query parse(ExtensionQuery query) throws ParseException {
          			boolean enableWildcards = query.getTopLevelParser().getAllowLeadingWildcard();
          			return new RegexQuery(new Term(query.getField(), query.getRawQueryString()));
          		}
          	});
          
          

          I mean enableWildcards doesn't do anything here but you can take whatever settings you like as shown above.
          simon

          Show
          Simon Willnauer added a comment - - edited Am I missing something? Why not using the getTopLevelParser() method? like that: extensions.add( "regex" , new ParserExtension() { @Override public Query parse(ExtensionQuery query) throws ParseException { boolean enableWildcards = query.getTopLevelParser().getAllowLeadingWildcard(); return new RegexQuery( new Term(query.getField(), query.getRawQueryString())); } }); I mean enableWildcards doesn't do anything here but you can take whatever settings you like as shown above. simon
          Hide
          David Kaelbling added a comment -

          Hi Simon – sorry, an example using a full query parser rather than just RegexQuery would have been clearer. What I'm worried about is that I don't in general know all the parent settings that need to be mirrored in the nested parser. Writing nestedParser.setFoo(topLevelParser.getFoo()) calls for all possible Foo, past, present, and future, is going to bite me eventually.

          Show
          David Kaelbling added a comment - Hi Simon – sorry, an example using a full query parser rather than just RegexQuery would have been clearer. What I'm worried about is that I don't in general know all the parent settings that need to be mirrored in the nested parser. Writing nestedParser.setFoo(topLevelParser.getFoo()) calls for all possible Foo, past, present, and future, is going to bite me eventually.
          Hide
          Simon Willnauer added a comment -

          David, I assume you use a subclass of QueryParser inside your ParserExtension and your are worried about reflecting all properties of the top level parser into your sub parser?! I understand what your issues are but I do not see another way than pulling the properties from the top level parser once you need it. If you rely on an already existing parser that does not allow you to call the getters in place you probably have to set them each time if you really need all of those. I don't see another way to do it, do you?

          simon

          Show
          Simon Willnauer added a comment - David, I assume you use a subclass of QueryParser inside your ParserExtension and your are worried about reflecting all properties of the top level parser into your sub parser?! I understand what your issues are but I do not see another way than pulling the properties from the top level parser once you need it. If you rely on an already existing parser that does not allow you to call the getters in place you probably have to set them each time if you really need all of those. I don't see another way to do it, do you? simon
          Hide
          David Kaelbling added a comment - - edited

          I do not see another way than pulling the properties from the top level parser once you need it.

          Exactly! I don't see another way either, hence my original lament that QueryParser didn't have anything like Token attributes where all the state was extracted into a separate sharable entity.

          I can live with hard-coding knowledge of both the top level- and embedded-parser settings in the ParserExtension implementation. I was just hoping for a better way.

          Show
          David Kaelbling added a comment - - edited I do not see another way than pulling the properties from the top level parser once you need it. Exactly! I don't see another way either, hence my original lament that QueryParser didn't have anything like Token attributes where all the state was extracted into a separate sharable entity. I can live with hard-coding knowledge of both the top level- and embedded-parser settings in the ParserExtension implementation. I was just hoping for a better way.
          Hide
          Simon Willnauer added a comment -

          Exactly! I don't see another way either, hence my original lament that QueryParser didn't have anything like Token attributes where all the state was extracted into a separate sharable entity.

          David, FYI I created a new issue for that (LUCENE-2162)

          Show
          Simon Willnauer added a comment - Exactly! I don't see another way either, hence my original lament that QueryParser didn't have anything like Token attributes where all the state was extracted into a separate sharable entity. David, FYI I created a new issue for that ( LUCENE-2162 )
          Hide
          Yonik Seeley added a comment -

          Reopening, as I believe the grammar as implemented is a bit flawed.

          A simple query of foo/bar will now fail since the slash in the middle of the term is seen as the start of a regex.

          Show
          Yonik Seeley added a comment - Reopening, as I believe the grammar as implemented is a bit flawed. A simple query of foo/bar will now fail since the slash in the middle of the term is seen as the start of a regex.
          Hide
          Hoss Man added a comment -

          Yonik: strongly suggest you open a new issue to address this, since LUCENE-2039 is already listed as a feature added in 4.0-ALPHA.

          if you "fix" this mid string slash issue, you're going to want a unique jira id to refer to when citing the bug fix as a CHANGES.txt for 4.0-final, or no one will have any clear idea what works where.

          Show
          Hoss Man added a comment - Yonik: strongly suggest you open a new issue to address this, since LUCENE-2039 is already listed as a feature added in 4.0-ALPHA. if you "fix" this mid string slash issue, you're going to want a unique jira id to refer to when citing the bug fix as a CHANGES.txt for 4.0-final, or no one will have any clear idea what works where.
          Hide
          Alexandre Rafalovitch added a comment -

          This can be re-closed as it was fixed in beta by SOLR-3467 .

          Show
          Alexandre Rafalovitch added a comment - This can be re-closed as it was fixed in beta by SOLR-3467 .
          Hide
          Uwe Schindler added a comment -

          Closed after release.

          Show
          Uwe Schindler added a comment - Closed after release.

            People

            • Assignee:
              Simon Willnauer
              Reporter:
              Simon Willnauer
            • Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development