Lucene - Core
  1. Lucene - Core
  2. LUCENE-1486

Wildcards, ORs etc inside Phrase queries

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 2.4
    • Fix Version/s: 4.8
    • Component/s: core/queryparser
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      An extension to the default QueryParser that overrides the parsing of PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries.

      The implementation feels a little hacky - this is arguably better handled in QueryParser itself. This works as a proof of concept for much of the query parser syntax. Examples from the Junit test include:

      checkMatches("\"j* smyth~\"", "1,2"); //wildcards and fuzzies are OK in phrases
      checkMatches("\"(jo* -john) smith\"", "2"); // boolean logic works
      checkMatches("\"jo* smith\"~2", "1,2,3"); // position logic works.

      checkBadQuery("\"jo* id:1 smith\""); //mixing fields in a phrase is bad
      checkBadQuery("\"jo* \"smith\" \""); //phrases inside phrases is bad
      checkBadQuery("\"jo* [sma TO smZ]\" \""); //range queries inside phrases not supported

      Code plus Junit test to follow...

      1. LUCENE-1486.patch
        6 kB
        Ahmet Arslan
      2. LUCENE-1486.patch
        6 kB
        Tomás Fernández Löbbe
      3. LUCENE-1486.patch
        43 kB
        Mark Miller
      4. Lucene-1486 non default field.patch
        1 kB
        Mark Harwood
      5. junit_complex_phrase_qp_07_22_2009.patch
        1 kB
        Luis Alves
      6. junit_complex_phrase_qp_07_21_2009.patch
        8 kB
        Adriano Crestani
      7. LUCENE-1486.patch
        21 kB
        Mark Miller
      8. LUCENE-1486.patch
        21 kB
        Mark Miller
      9. LUCENE-1486.patch
        19 kB
        Mark Harwood
      10. LUCENE-1486.patch
        18 kB
        Mark Miller
      11. TestComplexPhraseQuery.java
        5 kB
        Mark Harwood
      12. ComplexPhraseQueryParser.java
        12 kB
        Mark Harwood

        Issue Links

          Activity

          Hide
          Mark Harwood added a comment -

          Junit test

          Show
          Mark Harwood added a comment - Junit test
          Hide
          Mark Harwood added a comment -

          QueryParser extension

          Show
          Mark Harwood added a comment - QueryParser extension
          Hide
          Mark Harwood added a comment -

          Added tests for range queries and plain PhraseQueries

          Show
          Mark Harwood added a comment - Added tests for range queries and plain PhraseQueries
          Hide
          Mark Harwood added a comment -

          Fixed bug with plain phrase query, added support for range queries

          Show
          Mark Harwood added a comment - Fixed bug with plain phrase query, added support for range queries
          Hide
          Michael McCandless added a comment -

          (Added 2.9 fix version in addition to 2.4.1).

          Show
          Michael McCandless added a comment - (Added 2.9 fix version in addition to 2.4.1).
          Hide
          Mark Harwood added a comment -

          Added support for "Nots" in phrase queries e.g. "-not interested"

          Show
          Mark Harwood added a comment - Added support for "Nots" in phrase queries e.g. "-not interested"
          Hide
          Mark Harwood added a comment -

          More tests for Nots

          Show
          Mark Harwood added a comment - More tests for Nots
          Hide
          Mark Harwood added a comment -

          Updated to cater for phrase clauses that produce no matches

          Show
          Mark Harwood added a comment - Updated to cater for phrase clauses that produce no matches
          Hide
          Mark Harwood added a comment -

          Updated Junit test to test for phrases with clauses that produce no matches

          Show
          Mark Harwood added a comment - Updated Junit test to test for phrases with clauses that produce no matches
          Hide
          Mark Miller added a comment -

          What do you think about this for 2.9 Mark H?

          The implementation feels a little hacky - this is arguably better handled in QueryParser itself. This works as a proof of concept for much of the query parser syntax.

          That leads me to think we might want to push to 3.0? Or have you moved beyond that with all of these updates?

          Show
          Mark Miller added a comment - What do you think about this for 2.9 Mark H? The implementation feels a little hacky - this is arguably better handled in QueryParser itself. This works as a proof of concept for much of the query parser syntax. That leads me to think we might want to push to 3.0? Or have you moved beyond that with all of these updates?
          Hide
          Mark Harwood added a comment -

          Perhaps "hacky" was too strong a word. I think it's a reasonable approach to handling the complexity involved in this logic.

          A colleague of mine has this running in production on a big installation with lots of users

          Show
          Mark Harwood added a comment - Perhaps "hacky" was too strong a word. I think it's a reasonable approach to handling the complexity involved in this logic. A colleague of mine has this running in production on a big installation with lots of users
          Hide
          Michael McCandless added a comment -

          Is there some reason not to include this in QueryParser instead? Ie, it accepts a superset of QueryParser's current syntax?

          Show
          Michael McCandless added a comment - Is there some reason not to include this in QueryParser instead? Ie, it accepts a superset of QueryParser's current syntax?
          Hide
          Mark Harwood added a comment -

          The primary reason (and perhaps not a particularly good one) was I didn't want to wade around in the Javacc syntax of the .jj file that generates the QueryParser and the required extensions could be made in a subclass.

          Also there is invariably a performance hit for supporting things like wildcards in phrase queries so rather than adding another "off by default" flag in the main parser and conditional logic to test if "wildcards etc in phrases" are allowed, the subclass could be seen as a specialised extension that is to be used by those that understand the trade-offs between functionality and performance.

          I can sympathise with the purist approach of having all parser syntax defined in Javacc though.

          Show
          Mark Harwood added a comment - The primary reason (and perhaps not a particularly good one) was I didn't want to wade around in the Javacc syntax of the .jj file that generates the QueryParser and the required extensions could be made in a subclass. Also there is invariably a performance hit for supporting things like wildcards in phrase queries so rather than adding another "off by default" flag in the main parser and conditional logic to test if "wildcards etc in phrases" are allowed, the subclass could be seen as a specialised extension that is to be used by those that understand the trade-offs between functionality and performance. I can sympathise with the purist approach of having all parser syntax defined in Javacc though.
          Hide
          Mark Miller added a comment -

          Should this go in contrib rather than core? That seems to have been the approach so far, any reason to vary it up here?

          Well, actually, looks like I see the multi field parser in core. Makes sense to put subclasses there I guess.

          You think this is ready to commit Mark? If so, I should be able to review it (unless you want to commit it yourself).

          Show
          Mark Miller added a comment - Should this go in contrib rather than core? That seems to have been the approach so far, any reason to vary it up here? Well, actually, looks like I see the multi field parser in core. Makes sense to put subclasses there I guess. You think this is ready to commit Mark? If so, I should be able to review it (unless you want to commit it yourself).
          Hide
          Mark Miller added a comment -

          Reformatted to lucene formatting, removed author tag, removed a couple unused fields, changed to patch format

          Tests don't pass because it doesnt work quite correctly with the new constantscore multi term queries yet.

          Show
          Mark Miller added a comment - Reformatted to lucene formatting, removed author tag, removed a couple unused fields, changed to patch format Tests don't pass because it doesnt work quite correctly with the new constantscore multi term queries yet.
          Hide
          Mark Miller added a comment -

          Hey Mark, this doesn't work correctly with the new constant score mode. I'm hesitant to put something in core that only works with boolean expansion.

          I'm not sure what needs to be done (I started and realized my interest wasn't high enough). Could you update this? Otherwise I'm tempted to push off to 3.0...

          Unless another brave soul steps of course. Or I may jump back in - my brain is fickle.

          Show
          Mark Miller added a comment - Hey Mark, this doesn't work correctly with the new constant score mode. I'm hesitant to put something in core that only works with boolean expansion. I'm not sure what needs to be done (I started and realized my interest wasn't high enough). Could you update this? Otherwise I'm tempted to push off to 3.0... Unless another brave soul steps of course. Or I may jump back in - my brain is fickle.
          Hide
          Mark Harwood added a comment -

          Added fix for ConstantScoreQuery changes

          Show
          Mark Harwood added a comment - Added fix for ConstantScoreQuery changes
          Hide
          Mark Harwood added a comment -

          The fix was relatively straight-forward from what I could see. Just temporarily unset the QueryParser's ConstantScoreRewrite mode when performing the pass that is just evaluating query elements inside phrase queries. These clauses need to resolve to traditional BooleanQuery-full-of-termQueries in order that they can be inspected and rewritten as Span equivalents for complex phrases.

          Should do the job.

          Cheers
          Mark
          (Been far too busy with other things and missing getting my hands dirty here with Lucene!)

          Show
          Mark Harwood added a comment - The fix was relatively straight-forward from what I could see. Just temporarily unset the QueryParser's ConstantScoreRewrite mode when performing the pass that is just evaluating query elements inside phrase queries. These clauses need to resolve to traditional BooleanQuery-full-of-termQueries in order that they can be inspected and rewritten as Span equivalents for complex phrases. Should do the job. Cheers Mark (Been far too busy with other things and missing getting my hands dirty here with Lucene!)
          Hide
          Mark Miller added a comment -

          Figured thats all it would take. I just was feeling a bit too lazy to try and understand the whole class after I put it up in front of me for a few seconds Figured I'd try and pawn off a piece. I made some adjustments to the patch last time, but they were basically cosmetic.

          Looks like I didnt escape much work this time though - I'll review and commit shortly.

          Thanks a lot.

          Show
          Mark Miller added a comment - Figured thats all it would take. I just was feeling a bit too lazy to try and understand the whole class after I put it up in front of me for a few seconds Figured I'd try and pawn off a piece. I made some adjustments to the patch last time, but they were basically cosmetic. Looks like I didnt escape much work this time though - I'll review and commit shortly. Thanks a lot.
          Hide
          Mark Miller added a comment -

          Whoops - almost let some 1.5 slip by: throw new IllegalArgumentException(pe.getMessage(), pe) is not in 1.4.

          Last patch. I'll commit later today.

          Show
          Mark Miller added a comment - Whoops - almost let some 1.5 slip by: throw new IllegalArgumentException(pe.getMessage(), pe) is not in 1.4. Last patch. I'll commit later today.
          Hide
          Mark Harwood added a comment -

          Hi Mark,
          Mind if I try committing this patch?
          I've just switched from PC to Mac and my dev environment is all changed (Subclipse vs TortoiseSvn etc) and I wouldn't mind checking my config and commit rights still work in this new environment.
          If anyone has any mac/subclipse-related "gotchas" I should be aware of, do let me know.

          Cheers
          Mark

          Show
          Mark Harwood added a comment - Hi Mark, Mind if I try committing this patch? I've just switched from PC to Mac and my dev environment is all changed (Subclipse vs TortoiseSvn etc) and I wouldn't mind checking my config and commit rights still work in this new environment. If anyone has any mac/subclipse-related "gotchas" I should be aware of, do let me know. Cheers Mark
          Hide
          Mark Miller added a comment -

          Please, by all means !

          Show
          Mark Miller added a comment - Please, by all means !
          Hide
          Mark Harwood added a comment -
          Show
          Mark Harwood added a comment - Committed in 791579 - http://svn.apache.org/viewvc?rev=791579&view=rev
          Hide
          Adriano Crestani added a comment -

          Hi,

          I'm trying to understand what kind of syntax this query parser supports. I read the code and it does not say much. Is there any documentation (wiki, javadoc, etc) that specifies the syntax? Because it's not clear for me.

          Thanks in advance,
          Adriano Crestani Campos

          Show
          Adriano Crestani added a comment - Hi, I'm trying to understand what kind of syntax this query parser supports. I read the code and it does not say much. Is there any documentation (wiki, javadoc, etc) that specifies the syntax? Because it's not clear for me. Thanks in advance, Adriano Crestani Campos
          Hide
          Mark Miller added a comment -

          You might check the test class - it has a few basic examples. Its not much different than whats posted in the summary:

          Just experiment.

          + checkMatches("\"john smith\"", "1"); // Simple multi-term still works
          + checkMatches("\"j* smyth~\"", "1,2"); // wildcards and fuzzies are OK in
          + // phrases
          + checkMatches("\"(jo* -john) smith\"", "2"); // boolean logic works
          + checkMatches("\"jo* smith\"~2", "1,2,3"); // position logic works.
          + checkMatches("\"jo* [sma TO smZ]\" ", "1,2"); // range queries supported
          + checkMatches("\"john\"", "1,3"); // Simple single-term still works
          + checkMatches("\"(john OR johathon) smith\"", "1,2"); // boolean logic with
          + // brackets works.
          + checkMatches("\"(jo* -john) smyth~\"", "2"); // boolean logic with
          + // brackets works.
          +
          + // checkMatches("\"john -percival\"", "1"); // not logic doesn't work
          + // currently .
          +
          + checkMatches("\"john nosuchword*\"", ""); // phrases with clauses producing
          + // empty sets
          +
          + checkBadQuery("\"jo* id:1 smith\""); // mixing fields in a phrase is bad
          + checkBadQuery("\"jo* \"smith\" \""); // phrases inside phrases is bad

          Show
          Mark Miller added a comment - You might check the test class - it has a few basic examples. Its not much different than whats posted in the summary: Just experiment. + checkMatches("\"john smith\"", "1"); // Simple multi-term still works + checkMatches("\"j* smyth~\"", "1,2"); // wildcards and fuzzies are OK in + // phrases + checkMatches("\"(jo* -john) smith\"", "2"); // boolean logic works + checkMatches("\"jo* smith\"~2", "1,2,3"); // position logic works. + checkMatches("\"jo* [sma TO smZ] \" ", "1,2"); // range queries supported + checkMatches("\"john\"", "1,3"); // Simple single-term still works + checkMatches("\"(john OR johathon) smith\"", "1,2"); // boolean logic with + // brackets works. + checkMatches("\"(jo* -john) smyth~\"", "2"); // boolean logic with + // brackets works. + + // checkMatches("\"john -percival\"", "1"); // not logic doesn't work + // currently . + + checkMatches("\"john nosuchword*\"", ""); // phrases with clauses producing + // empty sets + + checkBadQuery("\"jo* id:1 smith\""); // mixing fields in a phrase is bad + checkBadQuery("\"jo* \"smith\" \""); // phrases inside phrases is bad
          Hide
          Adriano Crestani added a comment -

          Thanks for the quick response Mark!

          OK, I'm trying now to figure out what is supported reading the junits only, and I ran into some issues:

          What do you mean on the last check by phrase inside phrase, I don't see any phrase inside a phrase (I'm not sure either what it would be, because there is no open and close phrase delimiter), all I see is a phrase <"jo*">, followed by a term <smith> and an empty phrase <" ">. And the check passes because the query parser throws an exception complaning about the empty phrase, it seems to not be supported. I just changed the empty phrase to a valid phrase and the query works (failing the test case). But as I said, I'm not sure what you were exactly trying to do there, could you give me more explation about that?

          I'm also getting a java.util.ConcurrentModificationException when I type an escaped double quotes inside phrases. So, I suppose it's not supported, but shouldn't it throw a better exception?

          I also have an issue with the parse exceptions, if it comes from inside a phrase, it does not tell the correct position in the query string. I think it considers the beginning of the phrase as the beginning of the query and it only prints the phrase that contains the problem.

          I'm attaching some changes I did in the TestComplexPhraseQuery junit that shows these problems I'm getting, I think it's easier to understand if you read and run it.

          Sorry for so many questions, but I'm just trying to understand what exactly this query parser supports or not.

          Thanks,
          Adriano Crestani Campos

          Show
          Adriano Crestani added a comment - Thanks for the quick response Mark! OK, I'm trying now to figure out what is supported reading the junits only, and I ran into some issues: What do you mean on the last check by phrase inside phrase, I don't see any phrase inside a phrase (I'm not sure either what it would be, because there is no open and close phrase delimiter), all I see is a phrase <"jo*">, followed by a term <smith> and an empty phrase <" ">. And the check passes because the query parser throws an exception complaning about the empty phrase, it seems to not be supported. I just changed the empty phrase to a valid phrase and the query works (failing the test case). But as I said, I'm not sure what you were exactly trying to do there, could you give me more explation about that? I'm also getting a java.util.ConcurrentModificationException when I type an escaped double quotes inside phrases. So, I suppose it's not supported, but shouldn't it throw a better exception? I also have an issue with the parse exceptions, if it comes from inside a phrase, it does not tell the correct position in the query string. I think it considers the beginning of the phrase as the beginning of the query and it only prints the phrase that contains the problem. I'm attaching some changes I did in the TestComplexPhraseQuery junit that shows these problems I'm getting, I think it's easier to understand if you read and run it. Sorry for so many questions, but I'm just trying to understand what exactly this query parser supports or not. Thanks, Adriano Crestani Campos
          Hide
          Mark Miller added a comment -

          You may have to wait for the author, Mark Harwood to respond. I just reviewed the issue. A couple points though:

          What do you mean on the last check by phrase inside phrase, I don't see any phrase inside a phrase (I'm not sure either what it would be, because there is no open and close phrase delimiter), all I see is a phrase <"jo*">, followed by a term <smith> and an empty phrase <" ">

          Its kind of a phrase within a phrase (though the "smith" phrase could be turned into a term query) - unescaped: "jo* "smith"" - the full thing is phrase one, and smith is the inner phrase (though yes, only a term in the phrase).

          If Mark Harwood doesn't have time to answer soon, I'll dig in more and respond to your other questions/comments.

          Show
          Mark Miller added a comment - You may have to wait for the author, Mark Harwood to respond. I just reviewed the issue. A couple points though: What do you mean on the last check by phrase inside phrase, I don't see any phrase inside a phrase (I'm not sure either what it would be, because there is no open and close phrase delimiter), all I see is a phrase <"jo*">, followed by a term <smith> and an empty phrase <" "> Its kind of a phrase within a phrase (though the "smith" phrase could be turned into a term query) - unescaped: "jo* "smith"" - the full thing is phrase one, and smith is the inner phrase (though yes, only a term in the phrase). If Mark Harwood doesn't have time to answer soon, I'll dig in more and respond to your other questions/comments.
          Hide
          Michael Busch added a comment -

          Looking at the problems Adriano is seeing it almost seems like this was a bit prematurely committed? It seems like a lot of queries you could enter here are not really supported and might throw strange exceptions.

          Maybe it should live in contrib for now (with experimental warnings)?

          Show
          Michael Busch added a comment - Looking at the problems Adriano is seeing it almost seems like this was a bit prematurely committed? It seems like a lot of queries you could enter here are not really supported and might throw strange exceptions. Maybe it should live in contrib for now (with experimental warnings)?
          Hide
          Mark Miller added a comment - - edited

          I originally thought it might live in contrib as well (see above), but I'm personally fine with it being in core.

          It seems like a lot of queries you could enter here are not really supported and might throw strange exceptions.

          A lot of queries? I think Adriano is just having trouble with phrases inside phrases, which is unsupported. Other things that are not supported might throw exceptions too, but I think thats to be expected? I see what Adriano was talking about now - technically the first 2 quotes would match, and then the second two - I think Mark H was just demonstrating that you shouldn't try that query though - a user might think they are quoting smith, but for the example, it doesn't matter. I think he just trying to show that you shouldn't try and "nest" phrases - even though they wouldn't be interpreted that way anyway.

          It only supports a limited subset of the Lucene query language - perhaps we could improve the exceptions being thrown, but the exceptions the queryparser throws often leave just as much to be desired. I don't think its experimental because of that.

          Personally, I think the class does what it intends - allows a limited subset of the Lucene query language in phrases. Though of course it could be improved.

          I'll let Mark H respond though. I also don't mind seeing it moved to contrib, but I'm not sure anything glaring points to it being moved at the moment. It lives up to its limited contract I think.

          Show
          Mark Miller added a comment - - edited I originally thought it might live in contrib as well (see above), but I'm personally fine with it being in core. It seems like a lot of queries you could enter here are not really supported and might throw strange exceptions. A lot of queries? I think Adriano is just having trouble with phrases inside phrases, which is unsupported. Other things that are not supported might throw exceptions too, but I think thats to be expected? I see what Adriano was talking about now - technically the first 2 quotes would match, and then the second two - I think Mark H was just demonstrating that you shouldn't try that query though - a user might think they are quoting smith, but for the example, it doesn't matter. I think he just trying to show that you shouldn't try and "nest" phrases - even though they wouldn't be interpreted that way anyway. It only supports a limited subset of the Lucene query language - perhaps we could improve the exceptions being thrown, but the exceptions the queryparser throws often leave just as much to be desired. I don't think its experimental because of that. Personally, I think the class does what it intends - allows a limited subset of the Lucene query language in phrases. Though of course it could be improved. I'll let Mark H respond though. I also don't mind seeing it moved to contrib, but I'm not sure anything glaring points to it being moved at the moment. It lives up to its limited contract I think.
          Hide
          Adriano Crestani added a comment -

          I see what Adriano was talking about now - technically the first 2 quotes would match, and then the second two - I think Mark H was just demonstrating that you shouldn't try that query though - a user might think they are quoting smith, but for the example, it doesn't matter. I think he just trying to show that you shouldn't try and "nest" phrases - even though they wouldn't be interpreted that way anyway.

          Well, if you guessed his intention correctly, the comment is misleading: "phrases inside phrases is bad". But lets wait for his response.

          Other things that are not supported might throw exceptions too

          I think a user would expect a ParseException. Probably, every query parser user catches ParserException and show a nice message to its final user. Now, if the query parser starts throwing random exception to say the syntax is invalid, every software that uses Lucene query parser is gonna start crashing. For me it's like if a compiler started throwing segmentation fault every time you forget a } in the code.

          Show
          Adriano Crestani added a comment - I see what Adriano was talking about now - technically the first 2 quotes would match, and then the second two - I think Mark H was just demonstrating that you shouldn't try that query though - a user might think they are quoting smith, but for the example, it doesn't matter. I think he just trying to show that you shouldn't try and "nest" phrases - even though they wouldn't be interpreted that way anyway. Well, if you guessed his intention correctly, the comment is misleading: "phrases inside phrases is bad". But lets wait for his response. Other things that are not supported might throw exceptions too I think a user would expect a ParseException. Probably, every query parser user catches ParserException and show a nice message to its final user. Now, if the query parser starts throwing random exception to say the syntax is invalid, every software that uses Lucene query parser is gonna start crashing. For me it's like if a compiler started throwing segmentation fault every time you forget a } in the code.
          Hide
          Mark Miller added a comment -

          I think a user would expect a ParseException. Probably, every query parser user catches ParserException and show a nice message to its final user. Now, if the query parser starts throwing random exception to say the syntax is invalid, every software that uses Lucene query parser is gonna start crashing. For me it's like if a compiler started throwing segmentation fault every time you forget a } in the code.

          That's a fair point - addressable though - we can likely catch and rethrow in the worst case.

          I'll admit, the ... non exactness ... of this parser troubled me at first - one of the reasons I liked contrib as a landing spot early on. I took it for what it is in the end I suppose. I think the shortfalls brought up so far can be addressed to a large degree though.

          Show
          Mark Miller added a comment - I think a user would expect a ParseException. Probably, every query parser user catches ParserException and show a nice message to its final user. Now, if the query parser starts throwing random exception to say the syntax is invalid, every software that uses Lucene query parser is gonna start crashing. For me it's like if a compiler started throwing segmentation fault every time you forget a } in the code. That's a fair point - addressable though - we can likely catch and rethrow in the worst case. I'll admit, the ... non exactness ... of this parser troubled me at first - one of the reasons I liked contrib as a landing spot early on. I took it for what it is in the end I suppose. I think the shortfalls brought up so far can be addressed to a large degree though.
          Hide
          Mark Miller added a comment -

          Well, if you guessed his intention correctly, the comment is misleading: "phrases inside phrases is bad". But lets wait for his response.

          I think thats a bit of judgement call. We know that the way the query is parsed, you cannot really ever do "phrases inside phrases". However, a user of this parser might think, that like the other syntax, perhaps you can use "phrases inside phrases" - and if you thought that, the example given is likely how you'd imagine it to work. The outside phrase, and then the inside phrase. I certainly agree some comments would clear it up, but I think its a useful example.

          Show
          Mark Miller added a comment - Well, if you guessed his intention correctly, the comment is misleading: "phrases inside phrases is bad". But lets wait for his response. I think thats a bit of judgement call. We know that the way the query is parsed, you cannot really ever do "phrases inside phrases". However, a user of this parser might think, that like the other syntax, perhaps you can use "phrases inside phrases" - and if you thought that, the example given is likely how you'd imagine it to work. The outside phrase, and then the inside phrase. I certainly agree some comments would clear it up, but I think its a useful example.
          Hide
          Adriano Crestani added a comment -

          I'll admit, the ... non exactness ... of this parser troubled me at first - one of the reasons I liked contrib as a landing spot early on. I took it for what it is in the end I suppose. I think the shortfalls brought up so far can be addressed to a large degree though.

          I think contrib would be a good place for now, until it gets more stable and better documented.

          Show
          Adriano Crestani added a comment - I'll admit, the ... non exactness ... of this parser troubled me at first - one of the reasons I liked contrib as a landing spot early on. I took it for what it is in the end I suppose. I think the shortfalls brought up so far can be addressed to a large degree though. I think contrib would be a good place for now, until it gets more stable and better documented.
          Hide
          Mark Miller added a comment -

          I think contrib would be a good place for now, until it gets more stable and better documented.

          If Mark H thinks it should be moved, I won't disagree. But I still don't see a convincing reason. It could use some more documentation, but so could quite a few other classes in core. Its something of a subjective call, and more importantly, it can be addressed now.

          I'm not yet convinced its unstable - the only major issue I see so far is the exception issue - but that wouldn't seem to prompt a move to contrib, but an update to address the concern. Moving to contrib is always an option, but I don't think its the default move based on whats been brought up. The standard move would be to address whatever issues are brought up ... so far I am just seeing the exception issue as a large one, and I think that is fairly easily addressable.

          Show
          Mark Miller added a comment - I think contrib would be a good place for now, until it gets more stable and better documented. If Mark H thinks it should be moved, I won't disagree. But I still don't see a convincing reason. It could use some more documentation, but so could quite a few other classes in core. Its something of a subjective call, and more importantly, it can be addressed now. I'm not yet convinced its unstable - the only major issue I see so far is the exception issue - but that wouldn't seem to prompt a move to contrib, but an update to address the concern. Moving to contrib is always an option, but I don't think its the default move based on whats been brought up. The standard move would be to address whatever issues are brought up ... so far I am just seeing the exception issue as a large one, and I think that is fairly easily addressable.
          Hide
          Michael Busch added a comment -

          It only supports a limited subset of the Lucene query language - perhaps we could improve the exceptions being thrown, but the exceptions the queryparser throws often leave just as much to be desired. I don't think its experimental because of that.

          Because it only supports a limited subset of the language, I feel like we could have taken a different approach here? Why not add the features that are supported and make sense to the main query parser?

          The documentation does not tell me what is supported and what is not currently. And looking through the code some methods now throw RuntimeExceptions, because the overridden methods themselves don't throw anything. These things feel a bit unfinished.

          I'm not saying these issues are not fixable. But maybe we should rethink the design. My biggest concern is that this new parser doesn't seem to have a well-defined syntax. So since it doesn't check if a query is actually valid or not, it might be hard to maintain. E.g. if you add new language features to the main QP, it's currently not defined what will happen if you use them with this one.

          That's why I'm proposing to move it to contrib and mark it as experimental. Then we have more time to decide if the approach of adding the new features to the main QP makes more sense.

          Show
          Michael Busch added a comment - It only supports a limited subset of the Lucene query language - perhaps we could improve the exceptions being thrown, but the exceptions the queryparser throws often leave just as much to be desired. I don't think its experimental because of that. Because it only supports a limited subset of the language, I feel like we could have taken a different approach here? Why not add the features that are supported and make sense to the main query parser? The documentation does not tell me what is supported and what is not currently. And looking through the code some methods now throw RuntimeExceptions, because the overridden methods themselves don't throw anything. These things feel a bit unfinished. I'm not saying these issues are not fixable. But maybe we should rethink the design. My biggest concern is that this new parser doesn't seem to have a well-defined syntax. So since it doesn't check if a query is actually valid or not, it might be hard to maintain. E.g. if you add new language features to the main QP, it's currently not defined what will happen if you use them with this one. That's why I'm proposing to move it to contrib and mark it as experimental. Then we have more time to decide if the approach of adding the new features to the main QP makes more sense.
          Hide
          Luis Alves added a comment - - edited

          I share same opinion as Michael,
          the implementation has a lot of undefined/undocumented behaviors,
          simple because it reuses the queryparser to parse the text inside a phrase.
          All the lucene syntax needs to be accounted on this design, but it does not seem to be the case.

          Problems like Adriano described, phrase inside a phrase, position reporting for errors.

          I also have a lot of concerns about having the full lucene syntax inside phrases
          and trying to restrict this by throwing exceptions for particular cases does not seem the best design.

          Here is a example of with OR, AND, PARENTESIS with a proximity search
          "(( jakarta OR green) AND (blue AND orange) AND black~0.5) apache"~10

          What should a user expect from this query, without looking at the code. I'm not sure.
          Does it even make sense to support this complex syntax? In my opinion. no

          I think we should define what is the subset of the language we want to support inside the phrases with a well defined behavior.
          If Mark describes all the syntax he wants to support inside phrases, I actually don't mind to implement a new parser.for this.

          My view is, contrib is probably a better place to have this code, until we figure out a implementation that does not impose as many restrictions on changes to the original queryparser and describes a well defined syntax to be applied inside phrases.

          Show
          Luis Alves added a comment - - edited I share same opinion as Michael, the implementation has a lot of undefined/undocumented behaviors, simple because it reuses the queryparser to parse the text inside a phrase. All the lucene syntax needs to be accounted on this design, but it does not seem to be the case. Problems like Adriano described, phrase inside a phrase, position reporting for errors. I also have a lot of concerns about having the full lucene syntax inside phrases and trying to restrict this by throwing exceptions for particular cases does not seem the best design. Here is a example of with OR, AND, PARENTESIS with a proximity search "(( jakarta OR green) AND (blue AND orange) AND black~0.5) apache"~10 What should a user expect from this query, without looking at the code. I'm not sure. Does it even make sense to support this complex syntax? In my opinion. no I think we should define what is the subset of the language we want to support inside the phrases with a well defined behavior. If Mark describes all the syntax he wants to support inside phrases, I actually don't mind to implement a new parser.for this. My view is, contrib is probably a better place to have this code, until we figure out a implementation that does not impose as many restrictions on changes to the original queryparser and describes a well defined syntax to be applied inside phrases.
          Hide
          Luis Alves added a comment - - edited

          I added 2 testcases that return doc 3.
          These queries do not make much sense,
          I added it just to prove the point that we need more information
          describing the use case for complex phrase qp.
          We also should define a subset of the supported syntax we want to support inside phrases,
          with well defined behaviors.

          checkMatches("\"(goos~0.5 AND (mike OR smith) AND NOT ( percival AND john) ) vacation\"~3","3"); // proximity with fuzzy, OR, AND, NOT
          checkMatches("\"(goos~0.5 AND (mike OR smith) AND ( percival AND john) ) vacation\"~3","3"); // proximity with fuzzy, OR, AND

          Show
          Luis Alves added a comment - - edited I added 2 testcases that return doc 3. These queries do not make much sense, I added it just to prove the point that we need more information describing the use case for complex phrase qp. We also should define a subset of the supported syntax we want to support inside phrases, with well defined behaviors. checkMatches("\"(goos~0.5 AND (mike OR smith) AND NOT ( percival AND john) ) vacation\"~3","3"); // proximity with fuzzy, OR, AND, NOT checkMatches("\"(goos~0.5 AND (mike OR smith) AND ( percival AND john) ) vacation\"~3","3"); // proximity with fuzzy, OR, AND
          Hide
          Mark Harwood added a comment -

          I'll try and catch up with some of the issues raised here:

          What do you mean on the last check by phrase inside phrase, I don't see any phrase inside a phrase

          Correct, the "inner phrase" example was a term not a phrase. This is perhaps a better example:

          checkBadQuery("\"jo* \"percival smith\" \""); //phrases inside phrases is bad

          I'm trying now to figure out what is supported

          The Junit is currently the main form of documentation - unlike the XMLQueryParser (which has a DTD) there is no syntax to formally capture the logic.
          Here is a basic summary of the syntax supported and how it differs from normal non-phrase use of the same operators:

          • Wildcard/fuzzy/range clauses can be used to define a phrase element (as opposed to simply single terms)
          • Brackets are used to group/define the acceptable variations for a given phrase element e.g. "(john OR jonathon) smith"
          • "AND" is irrelevant - there is effectively an implied "AND_NEXT_TO" binding all phrase elements

          To move this forward I would suggest we consider following one of these options:

          1) Keep in core and improve error reporting and documentation
          2) Move into "contrib" as experimental
          3) Retain in core but simplify it to support only the simplest syntax (as in my Britney~ example)
          4) Re-engineer the QueryParser.jj to support a formally defined syntax for acceptable "within phrase" operators e.g. *, ~, ( )

          I think 1) is achievable if we carefully define where the existing parser breaks (e.g. ANDs and nested brackets)
          2) is unnecessary if we can achieve 1).
          3) would be a shame if we lost useful features for some very convoluted edge cases
          4) is beyond my JavaCC skills.

          Show
          Mark Harwood added a comment - I'll try and catch up with some of the issues raised here: What do you mean on the last check by phrase inside phrase, I don't see any phrase inside a phrase Correct, the "inner phrase" example was a term not a phrase. This is perhaps a better example: checkBadQuery("\"jo* \"percival smith\" \""); //phrases inside phrases is bad I'm trying now to figure out what is supported The Junit is currently the main form of documentation - unlike the XMLQueryParser (which has a DTD) there is no syntax to formally capture the logic. Here is a basic summary of the syntax supported and how it differs from normal non-phrase use of the same operators: Wildcard/fuzzy/range clauses can be used to define a phrase element (as opposed to simply single terms) Brackets are used to group/define the acceptable variations for a given phrase element e.g. "(john OR jonathon) smith" "AND" is irrelevant - there is effectively an implied "AND_NEXT_TO" binding all phrase elements To move this forward I would suggest we consider following one of these options: 1) Keep in core and improve error reporting and documentation 2) Move into "contrib" as experimental 3) Retain in core but simplify it to support only the simplest syntax (as in my Britney~ example) 4) Re-engineer the QueryParser.jj to support a formally defined syntax for acceptable "within phrase" operators e.g. *, ~, ( ) I think 1) is achievable if we carefully define where the existing parser breaks (e.g. ANDs and nested brackets) 2) is unnecessary if we can achieve 1). 3) would be a shame if we lost useful features for some very convoluted edge cases 4) is beyond my JavaCC skills.
          Hide
          Mark Miller added a comment -

          My first thought is, if we can address some of the issues brought up, there is no reason to keep this out of core IMHO.

          My second thought is, I have a feeling a lot of this concern stems from the fact that these guys (or one of them) has to duplicate this thing with the QueryParser code in contrib. That could be reason enough to move it to contrib. But it doesn't solve the issue longer term when the old QueryParser is removed. It would need to be replaced then, or dropped from contrib.

          With the new info from Mark H, how hard would it be to create a new imp for the new parser that did a lot of this, in a more defined way? It seems you basically just want to be able to use multiterm queries and group/or things, right? We could even relax a little if we have to. This hasn't been released, so there is still a lot of wiggle room I think. But there does have to be a resolution with this and the new parser at some point either way.

          Show
          Mark Miller added a comment - My first thought is, if we can address some of the issues brought up, there is no reason to keep this out of core IMHO. My second thought is, I have a feeling a lot of this concern stems from the fact that these guys (or one of them) has to duplicate this thing with the QueryParser code in contrib. That could be reason enough to move it to contrib. But it doesn't solve the issue longer term when the old QueryParser is removed. It would need to be replaced then, or dropped from contrib. With the new info from Mark H, how hard would it be to create a new imp for the new parser that did a lot of this, in a more defined way? It seems you basically just want to be able to use multiterm queries and group/or things, right? We could even relax a little if we have to. This hasn't been released, so there is still a lot of wiggle room I think. But there does have to be a resolution with this and the new parser at some point either way.
          Hide
          Adriano Crestani added a comment -

          Hi Mark H.,

          Thanks for the response, some comments inline:

          Correct, the "inner phrase" example was a term not a phrase. This is perhaps a better example:

          checkBadQuery("\"jo* \"percival smith\" \""); //phrases inside phrases is bad

          I think you did not get what I meant, even with your new example, there is no inner phrase, it is: a phrase <"jo* ">, followed by a term <percival>, followed by another term <smith>, and an empty phrase <" ">. So, with your change, the junit passes, but for the wrong reason. It gets an exception complaining about the empty phrase and not because there is an inner phrase (I still don't see how you can type an inner phrase with the current syntax). I think it's not a big deal, but I'm just trying to understand and raise a probable wrong test. I expect you understood what I mean, let me know if I did not make it clear.

          The Junit is currently the main form of documentation

          But not the ideal, because the source code (junit code) is not released in the binary release. So, the ideal place should be in the javadocs.

          • Wildcard/fuzzy/range clauses can be used to define a phrase element (as opposed to simply single terms)
          • Brackets are used to group/define the acceptable variations for a given phrase element e.g. "(john OR jonathon) smith"
          • "AND" is irrelevant - there is effectively an implied "AND_NEXT_TO" binding all phrase elements

          Thanks, now it's clearer for me what is supported or not. I have some questions:

          I understand this AND_NEXT_TO implicit operator between the queries inside the phrase. However, what happens if the user do not type any explicit boolean operator between two terms inside parentheses: "(query parser) lucene". Is the operator between 'query' and 'parser' the implicit AND_NEXT_TO or the default boolean operator (usually OR)?

          What happens if I type "(query AND parser) lucene". In my point of view it is: "(query AND parser) AND_NEXT_TO lucene". Which means for me: find any document that contains the term 'query' and the term 'parser' in the position x, and the term 'lucene' in the position x+1. Is this the expected behaviour?

          1) Keep in core and improve error reporting and documentation
          2) Move into "contrib" as experimental
          3) Retain in core but simplify it to support only the simplest syntax (as in my Britney~ example)
          4) Re-engineer the QueryParser.jj to support a formally defined syntax for acceptable "within phrase" operators e.g. *, ~, ( )

          1 is good, but I would prefer 4 too. Documentation and throw the right exception are necessary. I just don't feel confortable on the complex phrase query parser relying on the main query parser syntax, any change on the main one could easialy brake the complex phrase QP. Anyway, 4 may be done in future

          Mark M.:

          With the new info from Mark H, how hard would it be to create a new imp for the new parser that did a lot of this, in a more defined way? It seems you basically just want to be able to use multiterm queries and group/or things, right? We could even relax a little if we have to. This hasn't been released, so there is still a lot of wiggle room I think. But there does have to be a resolution with this and the new parser at some point either way.

          Yes, I am working on the new query parser code. I started recently to read and understand how the ComplexPhraseQP works, so I could reproduce the behaviour using the new QP framework. I first tried to look at this QP as a user and could not figure out what exactly I can or not do with it. I think now we are hitting a big problem, which is related to documentation. That is why I started raising these question, because others could also have the same issues in future.

          So, yes, I can start coding some equivalent QP using the new QP framework, I'm just questioning and trying to understand everything before I start any coding. I don't wanna code anything that wil throw ConcurrentModificationExceptions, that's why I'm raising these issues now, before I start moving it to the new QP.

          Best Regards,
          Adriano Crestani Campos

          Show
          Adriano Crestani added a comment - Hi Mark H., Thanks for the response, some comments inline: Correct, the "inner phrase" example was a term not a phrase. This is perhaps a better example: checkBadQuery("\"jo* \"percival smith\" \""); //phrases inside phrases is bad I think you did not get what I meant, even with your new example, there is no inner phrase, it is: a phrase <"jo* ">, followed by a term <percival>, followed by another term <smith>, and an empty phrase <" ">. So, with your change, the junit passes, but for the wrong reason. It gets an exception complaining about the empty phrase and not because there is an inner phrase (I still don't see how you can type an inner phrase with the current syntax). I think it's not a big deal, but I'm just trying to understand and raise a probable wrong test. I expect you understood what I mean, let me know if I did not make it clear. The Junit is currently the main form of documentation But not the ideal, because the source code (junit code) is not released in the binary release. So, the ideal place should be in the javadocs. Wildcard/fuzzy/range clauses can be used to define a phrase element (as opposed to simply single terms) Brackets are used to group/define the acceptable variations for a given phrase element e.g. "(john OR jonathon) smith" "AND" is irrelevant - there is effectively an implied "AND_NEXT_TO" binding all phrase elements Thanks, now it's clearer for me what is supported or not. I have some questions: I understand this AND_NEXT_TO implicit operator between the queries inside the phrase. However, what happens if the user do not type any explicit boolean operator between two terms inside parentheses: "(query parser) lucene". Is the operator between 'query' and 'parser' the implicit AND_NEXT_TO or the default boolean operator (usually OR)? What happens if I type "(query AND parser) lucene". In my point of view it is: "(query AND parser) AND_NEXT_TO lucene". Which means for me: find any document that contains the term 'query' and the term 'parser' in the position x, and the term 'lucene' in the position x+1. Is this the expected behaviour? 1) Keep in core and improve error reporting and documentation 2) Move into "contrib" as experimental 3) Retain in core but simplify it to support only the simplest syntax (as in my Britney~ example) 4) Re-engineer the QueryParser.jj to support a formally defined syntax for acceptable "within phrase" operators e.g. *, ~, ( ) 1 is good, but I would prefer 4 too. Documentation and throw the right exception are necessary. I just don't feel confortable on the complex phrase query parser relying on the main query parser syntax, any change on the main one could easialy brake the complex phrase QP. Anyway, 4 may be done in future Mark M.: With the new info from Mark H, how hard would it be to create a new imp for the new parser that did a lot of this, in a more defined way? It seems you basically just want to be able to use multiterm queries and group/or things, right? We could even relax a little if we have to. This hasn't been released, so there is still a lot of wiggle room I think. But there does have to be a resolution with this and the new parser at some point either way. Yes, I am working on the new query parser code. I started recently to read and understand how the ComplexPhraseQP works, so I could reproduce the behaviour using the new QP framework. I first tried to look at this QP as a user and could not figure out what exactly I can or not do with it. I think now we are hitting a big problem, which is related to documentation. That is why I started raising these question, because others could also have the same issues in future. So, yes, I can start coding some equivalent QP using the new QP framework, I'm just questioning and trying to understand everything before I start any coding. I don't wanna code anything that wil throw ConcurrentModificationExceptions, that's why I'm raising these issues now, before I start moving it to the new QP. Best Regards, Adriano Crestani Campos
          Hide
          Michael Busch added a comment -

          I think the best thing to do here is do exactly define what syntax is supposed to be supported (which Mark H. did in his latest comment), and then implement the new syntax with the new queryparser. It will enforce correct syntax and give meaningful exceptions if a query was entered that is not supported.

          I think we can still reuse big portions of Mark's patch: we should be able to write a new QueryBuilder that produces the new ComplexPhraseQuery.

          Adriano/Luis: how long would it take to implement? Can we contain it for 2.9?

          This would mean that these new features would go into contrib in 2.9 as part of the new query parser framework, and then be moved to core in 3.0. Also from 3.0 these new features would then be part of Lucene's main query syntax. Would this makes sense?

          Show
          Michael Busch added a comment - I think the best thing to do here is do exactly define what syntax is supposed to be supported (which Mark H. did in his latest comment), and then implement the new syntax with the new queryparser. It will enforce correct syntax and give meaningful exceptions if a query was entered that is not supported. I think we can still reuse big portions of Mark's patch: we should be able to write a new QueryBuilder that produces the new ComplexPhraseQuery. Adriano/Luis: how long would it take to implement? Can we contain it for 2.9? This would mean that these new features would go into contrib in 2.9 as part of the new query parser framework, and then be moved to core in 3.0. Also from 3.0 these new features would then be part of Lucene's main query syntax. Would this makes sense?
          Hide
          Michael Busch added a comment -

          Reopening this issues; we haven't made a final decision on how we want to go forward yet, but in any case there's remaining work here.

          Show
          Michael Busch added a comment - Reopening this issues; we haven't made a final decision on how we want to go forward yet, but in any case there's remaining work here.
          Hide
          Luis Alves added a comment -

          Hi Mark H

          I would like to propose 5,
          5) Re-engineer the QueryParser.jj to support a formally defined syntax for acceptable "within phrase" operators e.g. *, ~, ( )
          I propose doing this using using the new QP implementation. (I can write the new javacc QP for this)
          (this implies that the code will be in contrib in 2.9 and be part of core on 3.0)

          I also want to propose to change the complexphrase to use single quotes,
          this way we can have both implementation for phrases.

          Here is a summary:

          • the complexqueryparser would support all Lucene syntax even for phrases
          • and we could add singlequoted text to identify complexphrases
            1) Wildcard/fuzzy/range clauses can be used to define a phrase element (as opposed to simply single terms)
            2) Brackets are used to group/define the acceptable variations for a given phrase element e.g. "(john OR jonathon) smith"
            3) supported operators: OR, *, ~, ( ), ?
            4) disallow fields, proximity, boosting and operators on single quoted phrases (I'm making an assumption here, Mark H please comment)
            5) singlequotes need to be escaped, double quotes will be treated as regular punctuation characters inside single quoted strings

          Mark H, can you please elaborate more on the these other operators "+" "-" "^" "AND" "&&" "||" "NOT" "!" ":" "[" "]" "

          {" "}

          ".

          Example:
          A query with single quoted (complexphrase) followed by a term and a normal phrase:

          query: '(john OR jonathon) smith~0.3 order*' order:sell "stock market"

          Show
          Luis Alves added a comment - Hi Mark H I would like to propose 5, 5) Re-engineer the QueryParser.jj to support a formally defined syntax for acceptable "within phrase" operators e.g. *, ~, ( ) I propose doing this using using the new QP implementation. (I can write the new javacc QP for this) (this implies that the code will be in contrib in 2.9 and be part of core on 3.0) I also want to propose to change the complexphrase to use single quotes, this way we can have both implementation for phrases. Here is a summary: the complexqueryparser would support all Lucene syntax even for phrases and we could add singlequoted text to identify complexphrases 1) Wildcard/fuzzy/range clauses can be used to define a phrase element (as opposed to simply single terms) 2) Brackets are used to group/define the acceptable variations for a given phrase element e.g. "(john OR jonathon) smith" 3) supported operators: OR, *, ~, ( ), ? 4) disallow fields, proximity, boosting and operators on single quoted phrases (I'm making an assumption here, Mark H please comment) 5) singlequotes need to be escaped, double quotes will be treated as regular punctuation characters inside single quoted strings Mark H, can you please elaborate more on the these other operators "+" "-" "^" "AND" "&&" "||" "NOT" "!" ":" " [" "] " " {" "} ". Example: A query with single quoted (complexphrase) followed by a term and a normal phrase: query: '(john OR jonathon) smith~0.3 order*' order:sell "stock market"
          Hide
          Luis Alves added a comment - - edited

          Mark H -

          Question 1)

          I added a doc 5 and 6

          TestComplexPhraseQuery.java
          ...
            DocData docsContent[] = { new DocData("john smith", "1"),
                new DocData("johathon smith", "2"),      
                new DocData("john percival smith goes on  a b c vacation", "3"),
                new DocData("jackson waits tom", "4"),
                new DocData("johathon smith john", "5"),
                new DocData("johathon mary gomes smith", "6"),
                };
          ...
          

          for test
          checkMatches("\"(jo* -john) smyth\"", "2"); // boolean logic with

          would document 5 be returned or just doc 2 should be returned,
          I'm assuming position is always important and doc 5 is supposed to be returned.
          Is this the correct behavior?

          Question 2)
          Should these 2 queries behave the same when we fix the problem
          // checkMatches("\"john -percival\"", "1"); // not logic doesn't work
          // checkMatches("\"john (-percival)\"", "1"); // not logic doesn't work

          Question 3)
          for query:
          checkMatches("\"jo* smith\"~2", "1,2,3,5"); // position logic works.
          doc 6 is also returned, so this feature does not seem to be working.

          Question 4)
          The usage of AND and AND_NEXT_TO is confusing to me
          the query
          checkMatches("\"(jo* AND mary) smith\"", "1,2,5"); // boolean logic with

          returns 1,2,5 and not 6, but I was only expecting 6 to be returned,
          seems that like the AND is converted into a OR.
          What is the behavior you want to implement?

          Show
          Luis Alves added a comment - - edited Mark H - Question 1) I added a doc 5 and 6 TestComplexPhraseQuery.java ... DocData docsContent[] = { new DocData( "john smith" , "1" ), new DocData( "johathon smith" , "2" ), new DocData( "john percival smith goes on a b c vacation" , "3" ), new DocData( "jackson waits tom" , "4" ), new DocData( "johathon smith john" , "5" ), new DocData( "johathon mary gomes smith" , "6" ), }; ... for test checkMatches("\"(jo* -john) smyth\"", "2"); // boolean logic with would document 5 be returned or just doc 2 should be returned, I'm assuming position is always important and doc 5 is supposed to be returned. Is this the correct behavior? Question 2) Should these 2 queries behave the same when we fix the problem // checkMatches("\"john -percival\"", "1"); // not logic doesn't work // checkMatches("\"john (-percival)\"", "1"); // not logic doesn't work Question 3) for query: checkMatches("\"jo* smith\"~2", "1,2,3,5"); // position logic works. doc 6 is also returned, so this feature does not seem to be working. Question 4) The usage of AND and AND_NEXT_TO is confusing to me the query checkMatches("\"(jo* AND mary) smith\"", "1,2,5"); // boolean logic with returns 1,2,5 and not 6, but I was only expecting 6 to be returned, seems that like the AND is converted into a OR. What is the behavior you want to implement?
          Hide
          Luis Alves added a comment -

          Sorry for all the emails,
          I'm still new to JIRA and only now I realized that for every edit I do,a email is sent.

          But now that I found the preview button, it won't happen again.

          Show
          Luis Alves added a comment - Sorry for all the emails, I'm still new to JIRA and only now I realized that for every edit I do,a email is sent. But now that I found the preview button, it won't happen again.
          Hide
          Mark Harwood added a comment -

          I think it's not a big deal, but I'm just trying to understand and raise a probable wrong test.

          Granted, the test fails for a reason other than the one for which I wanted it to fail.
          We can probably strike the test and leave a note saying phrase-within-a-phrase just does not make sense and is not supported.

          Is the operator between 'query' and 'parser' the implicit AND_NEXT_TO or the default boolean operator (usually OR)?

          In brackets it's an OR - the brackets are used to suggest that the current phrase element at position X is composed of some choices that are evaluated as a subclause in the same way that in normal query logic sub-clauses are defined in brackets e.g. +a +(b OR c). There seems to be a reasonable logic to this.

          Ideally the ComplexPhraseQueryParser should explicitly turn this setting on while evaluating the bracketed innards of phrases just in case the base class has AND as the default.

          Mark H, can you please elaborate more on the these other operators "+" "-" "^" "AND" "&&" "||" "NOT" "!" ":" "[" "]" "{" "}".

          OK I'll try and deal with them one by one but these are not necessarily definitive answers or guarantees of correctly implemented support

          OR,||,+, AND, && ..... ignored. The implicit operator is AND_NEXT_TO apart from in bracketed sections where all elements at this level are ORed
          ^ .....boosts are carried through from TermQuerys to SpanTermQuerys
          NOT, ! ....Creates SpanNotQueries
          []{} ....range queries are supported as are wildcards *, fuzzies ~, ?

          query: '(john OR jonathon) smith~0.3 order*' order:sell "stock market"

          I'll post the XML query syntax equivalent of what should be parsed here shortly (just seen your next comment come in)

          Show
          Mark Harwood added a comment - I think it's not a big deal, but I'm just trying to understand and raise a probable wrong test. Granted, the test fails for a reason other than the one for which I wanted it to fail. We can probably strike the test and leave a note saying phrase-within-a-phrase just does not make sense and is not supported. Is the operator between 'query' and 'parser' the implicit AND_NEXT_TO or the default boolean operator (usually OR)? In brackets it's an OR - the brackets are used to suggest that the current phrase element at position X is composed of some choices that are evaluated as a subclause in the same way that in normal query logic sub-clauses are defined in brackets e.g. +a +(b OR c). There seems to be a reasonable logic to this. Ideally the ComplexPhraseQueryParser should explicitly turn this setting on while evaluating the bracketed innards of phrases just in case the base class has AND as the default. Mark H, can you please elaborate more on the these other operators "+" "-" "^" "AND" "&&" "||" "NOT" "!" ":" " [" "] " "{" "}". OK I'll try and deal with them one by one but these are not necessarily definitive answers or guarantees of correctly implemented support OR,||,+, AND, && ..... ignored. The implicit operator is AND_NEXT_TO apart from in bracketed sections where all elements at this level are ORed ^ .....boosts are carried through from TermQuerys to SpanTermQuerys NOT, ! ....Creates SpanNotQueries []{} ....range queries are supported as are wildcards *, fuzzies ~, ? query: '(john OR jonathon) smith~0.3 order*' order:sell "stock market" I'll post the XML query syntax equivalent of what should be parsed here shortly (just seen your next comment come in)
          Hide
          Mark Harwood added a comment -

          for test checkMatches("\"(jo* -john) smyth\"", "2");
          would document 5 be returned or just doc 2 should be returned,

          I presume you mean smith not smyth here otherwise nothing would match? If so, doc 5 should match and position is relevant (subject to slop factors).

          Question 2)
          Should these 2 queries behave the same when we fix the problem
          // checkMatches("\"john -percival\"", "1"); // not logic doesn't work
          // checkMatches("\"john (-percival)\"", "1"); // not logic doesn't work

          I suppose there's an open question as to if the second example is legal (the brackets are unnecessary)

          Question 3)
          checkMatches("\"jo* smith\"~2", "1,2,3,5"); // position logic works.
          doc 6 is also returned, so this feature does not seem to be working.

          That looks like a bug related to slop factor?

          Question 4)
          The usage of AND and AND_NEXT_TO is confusing to me
          the query
          checkMatches("\"(jo* AND mary) smith\"", "1,2,5"); // boolean logic with

          ANDs are ignored and turned into ORs (see earlier comments) but maybe a query parse error should be thrown to emphasise this.

          Show
          Mark Harwood added a comment - for test checkMatches("\"(jo* -john) smyth\"", "2"); would document 5 be returned or just doc 2 should be returned, I presume you mean smith not smyth here otherwise nothing would match? If so, doc 5 should match and position is relevant (subject to slop factors). Question 2) Should these 2 queries behave the same when we fix the problem // checkMatches("\"john -percival\"", "1"); // not logic doesn't work // checkMatches("\"john (-percival)\"", "1"); // not logic doesn't work I suppose there's an open question as to if the second example is legal (the brackets are unnecessary) Question 3) checkMatches("\"jo* smith\"~2", "1,2,3,5"); // position logic works. doc 6 is also returned, so this feature does not seem to be working. That looks like a bug related to slop factor? Question 4) The usage of AND and AND_NEXT_TO is confusing to me the query checkMatches("\"(jo* AND mary) smith\"", "1,2,5"); // boolean logic with ANDs are ignored and turned into ORs (see earlier comments) but maybe a query parse error should be thrown to emphasise this.
          Hide
          Mark Harwood added a comment -

          query: '(john OR jonathon) smith~0.3 order*' order:sell "stock market"

          Would be parsed as follows (shown as equivalent XMLQueryParser syntax)

           
          <BooleanQuery>
            <Clause occurs="should">
               <SpanNear >		
          			<SpanOr>
          				<SpanOrTerms>john jonathon </SpanOrTerms>
          			</SpanOr>
          			<SpanOr>
          				<SpanOrTerms> smith smyth</SpanOrTerms>
          			</SpanOr>
          			<SpanOr>
          				<SpanOrTerms> order orders</SpanOrTerms>
          			</SpanOr>
             </SpanNear>
           </Clause>
          <Clause occurs="should">
               <TermQuery fieldName="order" >sell</TermQuery>		
           </Clause>
          <Clause occurs="should">
               <UserQuery>"stock market"</UserQuery >		
           </Clause>
          </BooleanQuery> 
          
          Show
          Mark Harwood added a comment - query: '(john OR jonathon) smith~0.3 order*' order:sell "stock market" Would be parsed as follows (shown as equivalent XMLQueryParser syntax) <BooleanQuery> <Clause occurs= "should" > <SpanNear > <SpanOr> <SpanOrTerms> john jonathon </SpanOrTerms> </SpanOr> <SpanOr> <SpanOrTerms> smith smyth </SpanOrTerms> </SpanOr> <SpanOr> <SpanOrTerms> order orders </SpanOrTerms> </SpanOr> </SpanNear> </Clause> <Clause occurs= "should" > <TermQuery fieldName= "order" > sell </TermQuery> </Clause> <Clause occurs= "should" > <UserQuery> "stock market" </UserQuery > </Clause> </BooleanQuery>
          Hide
          Adriano Crestani added a comment -

          I propose doing this using using the new QP implementation. (I can write the new javacc QP for this)
          (this implies that the code will be in contrib in 2.9 and be part of core on 3.0)

          That would be good!

          Granted, the test fails for a reason other than the one for which I wanted it to fail.
          We can probably strike the test and leave a note saying phrase-within-a-phrase just does not make sense and is not supported.

          Cool, I agree to remove it. But I still don't see how an user can type a phrase inside a phrase with the current syntax definition, can you give me an example?

          In brackets it's an OR - the brackets are used to suggest that the current phrase element at position X is composed of some choices that are evaluated as a subclause in the same way that in normal query logic sub-clauses are defined in brackets e.g. +a +(b OR c). There seems to be a reasonable logic to this.

          Ideally the ComplexPhraseQueryParser should explicitly turn this setting on while evaluating the bracketed innards of phrases just in case the base class has AND as the default.

          If we use the implemented java cc code Luis suggested, we would have already a query parser that throws ParseExceptions whenever the user types an AND inside a phrase.

          OR,||,+, AND, && ..... ignored

          So we should throw an excpetion if any of these is found inside a phrase. It could confuse the user if we just ignore it.

          Question 2)
          Should these 2 queries behave the same when we fix the problem
          // checkMatches("\"john -percival\"", "1"); // not logic doesn't work
          // checkMatches("\"john (-percival)\"", "1"); // not logic doesn't work

          I suppose there's an open question as to if the second example is legal (the brackets are unnecessary)

          Yes, the second is unnecessary, but I don't think it's illegal. The user could type <(smith)> outside the phrase, it makes sense to support it inside also.

          Question 3)
          checkMatches("\"jo* smith\"~2", "1,2,3,5"); // position logic works.
          doc 6 is also returned, so this feature does not seem to be working.

          That looks like a bug related to slop factor?

          I have not checked yet, but I think it's working fine. The slop means how many switches between the terms inside the phrase is allowed to match the query. It matches doc 6, because the term <smith> switches twice to the right and matched "johathon mary gomes smith". Twice = slop 2

          ANDs are ignored and turned into ORs (see earlier comments) but maybe a query parse error should be thrown to emphasise this.

          I think we could support AND also. I agree there are few cases where the user would use that. It would work as I explained before:

          What happens if I type "(query AND parser) lucene". In my point of view it is: "(query AND parser) AND_NEXT_TO lucene". Which means for me: find any document that contains the term 'query' and the term 'parser' in the position x, and the term 'lucene' in the position x+1. Is this the expected behaviour?

          Show
          Adriano Crestani added a comment - I propose doing this using using the new QP implementation. (I can write the new javacc QP for this) (this implies that the code will be in contrib in 2.9 and be part of core on 3.0) That would be good! Granted, the test fails for a reason other than the one for which I wanted it to fail. We can probably strike the test and leave a note saying phrase-within-a-phrase just does not make sense and is not supported. Cool, I agree to remove it. But I still don't see how an user can type a phrase inside a phrase with the current syntax definition, can you give me an example? In brackets it's an OR - the brackets are used to suggest that the current phrase element at position X is composed of some choices that are evaluated as a subclause in the same way that in normal query logic sub-clauses are defined in brackets e.g. +a +(b OR c). There seems to be a reasonable logic to this. Ideally the ComplexPhraseQueryParser should explicitly turn this setting on while evaluating the bracketed innards of phrases just in case the base class has AND as the default. If we use the implemented java cc code Luis suggested, we would have already a query parser that throws ParseExceptions whenever the user types an AND inside a phrase. OR,||,+, AND, && ..... ignored So we should throw an excpetion if any of these is found inside a phrase. It could confuse the user if we just ignore it. Question 2) Should these 2 queries behave the same when we fix the problem // checkMatches("\"john -percival\"", "1"); // not logic doesn't work // checkMatches("\"john (-percival)\"", "1"); // not logic doesn't work I suppose there's an open question as to if the second example is legal (the brackets are unnecessary) Yes, the second is unnecessary, but I don't think it's illegal. The user could type <(smith)> outside the phrase, it makes sense to support it inside also. Question 3) checkMatches("\"jo* smith\"~2", "1,2,3,5"); // position logic works. doc 6 is also returned, so this feature does not seem to be working. That looks like a bug related to slop factor? I have not checked yet, but I think it's working fine. The slop means how many switches between the terms inside the phrase is allowed to match the query. It matches doc 6, because the term <smith> switches twice to the right and matched "johathon mary gomes smith". Twice = slop 2 ANDs are ignored and turned into ORs (see earlier comments) but maybe a query parse error should be thrown to emphasise this. I think we could support AND also. I agree there are few cases where the user would use that. It would work as I explained before: What happens if I type "(query AND parser) lucene". In my point of view it is: "(query AND parser) AND_NEXT_TO lucene". Which means for me: find any document that contains the term 'query' and the term 'parser' in the position x, and the term 'lucene' in the position x+1. Is this the expected behaviour?
          Hide
          Ahmet Arslan added a comment - - edited

          Hi everyone,

          I am using your ComplexPhraseQueryParser. I integrated it into Solr.
          I am interested in it mainly because it supports OR operator and wildcards inside proximity search.

          Specifically : "(john johathon) smith"~10 and "j* smith"
          They both work perfectly, thank you for your work.

          I downloaded source code of it from http://svn.apache.org/viewvc?view=rev&revision=791579
          And then edited the code a little bit since I am using lucene 2.4.1:

          I replaced those:
          1-) TermRangeQuery to RangeQuery.
          2-) getConstantScoreRewrite() to getUseOldRangeQuery();
          3-) setConstantScoreRewrite(false); to setUseOldRangeQuery(true);
          4-) On line 168 of ComplexPhraseQueryParser.java there are two semicolons ( ; ; )

          I am not sure what I did is the way to start using this query parser with latest versions of lucene/solr.
          If it is not can you suggest me better ways or where to get/download latest source code of query parser.

          I am having problems with multi-field searches.

          Query "(john johathon) smith"~10 works on default field, e.g. text.

          But when I want to run the same query on another field (other than default field)
          title:"(john johathon) smith"~10
          it gives exception below:
          Cannot have clause for field "text" nested in phrase for field "title"

          When I ran the query distibuting field name to all terms it works:
          title:"(title:john title:johathon) title:smith"~10

          Is there an easy way to set field of all terms (without specifying)?

          And about boosts of multi-field queries, is this query legal? (default operator = OR, default field = text)

          title:"(title:john title:johathon) title:smith"~10^1.5 OR "(john johathon) smith"~10^3.0

          Shortly I want to use this queryparser to query on multi-fields with different boosts.

          I am not sure if I am allowed to ask such question in here, if not please accept my apologies.

          Thank you for your consideration.

          Ahmet Arslan

          Show
          Ahmet Arslan added a comment - - edited Hi everyone, I am using your ComplexPhraseQueryParser. I integrated it into Solr. I am interested in it mainly because it supports OR operator and wildcards inside proximity search. Specifically : "(john johathon) smith"~10 and "j* smith" They both work perfectly, thank you for your work. I downloaded source code of it from http://svn.apache.org/viewvc?view=rev&revision=791579 And then edited the code a little bit since I am using lucene 2.4.1: I replaced those: 1-) TermRangeQuery to RangeQuery. 2-) getConstantScoreRewrite() to getUseOldRangeQuery(); 3-) setConstantScoreRewrite(false); to setUseOldRangeQuery(true); 4-) On line 168 of ComplexPhraseQueryParser.java there are two semicolons ( ; ; ) I am not sure what I did is the way to start using this query parser with latest versions of lucene/solr. If it is not can you suggest me better ways or where to get/download latest source code of query parser. I am having problems with multi-field searches. Query "(john johathon) smith"~10 works on default field, e.g. text. But when I want to run the same query on another field (other than default field) title:"(john johathon) smith"~10 it gives exception below: Cannot have clause for field "text" nested in phrase for field "title" When I ran the query distibuting field name to all terms it works: title:"(title:john title:johathon) title:smith"~10 Is there an easy way to set field of all terms (without specifying)? And about boosts of multi-field queries, is this query legal? (default operator = OR, default field = text) title:"(title:john title:johathon) title:smith"~10^1.5 OR "(john johathon) smith"~10^3.0 Shortly I want to use this queryparser to query on multi-fields with different boosts. I am not sure if I am allowed to ask such question in here, if not please accept my apologies. Thank you for your consideration. Ahmet Arslan
          Hide
          Mark Harwood added a comment -

          Fix for phrases using QueryParser's non-default field e.g.
          author:"j* smith"

          Show
          Mark Harwood added a comment - Fix for phrases using QueryParser's non-default field e.g. author:"j* smith"
          Hide
          Mark Miller added a comment -

          If we don't have a clear path for this very soon I think we should pull it from this release.

          Show
          Mark Miller added a comment - If we don't have a clear path for this very soon I think we should pull it from this release.
          Hide
          Luis Alves added a comment -

          My understanding is that with "New flexible query parser" (LUCENE-1567),
          the old QueryParser classes will be deprecated in 2.9
          and removed in 3.0 (or moved to contrib in 3.0).

          This change will also make ComplexPhraseQueryParser deprecated
          because it currently extends the old queryparser.

          ComplexPhraseQueryParser was not part of any lucene release
          and was only checked in 2 months ago in trunk.

          For the reasons above I think we should re-implement this functionality
          using the new flexible query parser.

          3.0 and 2.9 releases will be very similar
          but 3.0 will have all deprecated APIs removed (at least this is my understanding).

          In my view the path should be:

          • Wait for LUCENE-1567 to be in trunk
          • re-implement this feature using the "New flexible query parser"
          • and probably do it using a super set of the current syntax with a new TextParser.

          I'm not sure if I'll have the time to implement a compatible implementation of
          ComplexPhraseQueryParser before 2.9 release

          I'm currently working on 1567 to finalize the patch,
          cleaning up javadocs and some small clean up to the APIs.

          I'll try to work on ComplexPhraseQueryParser,
          once lucene-1567 is in the trunk.

          So in my view, ComplexPhraseQueryParser depends on 1567,
          and will require some extra work after 1567 is in the trunk.

          I think we have the following, options:

          1. We could wait until 1567 is in trunk and wait for a compatible implementation of ComplexPhraseQueryParser using 1567,
            before we release 2.9. (this would still remove the current ComplexPhraseQueryParser class, and provide this features with LuceneQueryParserHelper class, or with a new TextParser name complexphrase)
          2. We can release 2.9 with only 1567, but that will require ComplexPhraseQueryParser to be removed from trunk or at least deprecated in 2.9, and in 3.X re-implement it using the "New flexible query parser" APIs

          I hope this helps

          Show
          Luis Alves added a comment - My understanding is that with "New flexible query parser" ( LUCENE-1567 ), the old QueryParser classes will be deprecated in 2.9 and removed in 3.0 (or moved to contrib in 3.0). This change will also make ComplexPhraseQueryParser deprecated because it currently extends the old queryparser. ComplexPhraseQueryParser was not part of any lucene release and was only checked in 2 months ago in trunk. For the reasons above I think we should re-implement this functionality using the new flexible query parser. 3.0 and 2.9 releases will be very similar but 3.0 will have all deprecated APIs removed (at least this is my understanding). In my view the path should be: Wait for LUCENE-1567 to be in trunk re-implement this feature using the "New flexible query parser" and probably do it using a super set of the current syntax with a new TextParser. I'm not sure if I'll have the time to implement a compatible implementation of ComplexPhraseQueryParser before 2.9 release I'm currently working on 1567 to finalize the patch, cleaning up javadocs and some small clean up to the APIs. I'll try to work on ComplexPhraseQueryParser, once lucene-1567 is in the trunk. So in my view, ComplexPhraseQueryParser depends on 1567, and will require some extra work after 1567 is in the trunk. I think we have the following, options: We could wait until 1567 is in trunk and wait for a compatible implementation of ComplexPhraseQueryParser using 1567, before we release 2.9. (this would still remove the current ComplexPhraseQueryParser class, and provide this features with LuceneQueryParserHelper class, or with a new TextParser name complexphrase) We can release 2.9 with only 1567, but that will require ComplexPhraseQueryParser to be removed from trunk or at least deprecated in 2.9, and in 3.X re-implement it using the "New flexible query parser" APIs I hope this helps
          Hide
          Mark Miller added a comment -

          Okay thanks. I think we should pull it for 2.9.

          Show
          Mark Miller added a comment - Okay thanks. I think we should pull it for 2.9.
          Hide
          Mark Miller added a comment - - edited

          Okay, so I guess the question is - who objects to pulling this from 2.9? I don't think we should release a class that extends a deprecated class and I don't think we want to hold up 2.9 waiting for an adequate non deprecated replacement.

          Show
          Mark Miller added a comment - - edited Okay, so I guess the question is - who objects to pulling this from 2.9? I don't think we should release a class that extends a deprecated class and I don't think we want to hold up 2.9 waiting for an adequate non deprecated replacement.
          Hide
          Mark Harwood added a comment -

          No objections to pulling from core given the impending deprecation of the QueryParser base class.

          I know of at least 2 folks using it so moving it to contrib would help provide somewhere to maintain fixes while we wait for the new QueryParser to incorporate the complex phrase features.

          Show
          Mark Harwood added a comment - No objections to pulling from core given the impending deprecation of the QueryParser base class. I know of at least 2 folks using it so moving it to contrib would help provide somewhere to maintain fixes while we wait for the new QueryParser to incorporate the complex phrase features.
          Hide
          Michael Busch added a comment -

          +1 for moving it to conrib. Then the users Mark H. mentioned can consume it from a contrib jar until these features are in the new QP.

          Show
          Michael Busch added a comment - +1 for moving it to conrib. Then the users Mark H. mentioned can consume it from a contrib jar until these features are in the new QP.
          Hide
          Mark Miller added a comment -

          Alright, then - do you have time to handle that soon Mark H? If not I can probably make some time for it.

          Show
          Mark Miller added a comment - Alright, then - do you have time to handle that soon Mark H? If not I can probably make some time for it.
          Hide
          Mark Miller added a comment -

          patch that moves to contrib

          Show
          Mark Miller added a comment - patch that moves to contrib
          Hide
          Michael McCandless added a comment -

          Reopening so we don't forget to do this one...

          Come 3.0, how will this work, even in contrib? (Because the plan is to replace the old queryParser with the new one for 3.0).

          Show
          Michael McCandless added a comment - Reopening so we don't forget to do this one... Come 3.0, how will this work, even in contrib? (Because the plan is to replace the old queryParser with the new one for 3.0).
          Hide
          Mark Miller added a comment -

          The plan is to remove it and add a replacement built on the new QueryParser. The replacement may not be exactly the same, but it should be very similar.

          My inclination was to leave it out of this release - its a single class and so easy to manage and plug it in separately if you want. I don't know that we should release a class that may or may not get a replacement (promises, promises ) and extends a deprecated class. Contrib was once called sandbox though and consensus appeared to be to put it in contrib - so I went with that and added a warning that the class might change soon.

          Show
          Mark Miller added a comment - The plan is to remove it and add a replacement built on the new QueryParser. The replacement may not be exactly the same, but it should be very similar. My inclination was to leave it out of this release - its a single class and so easy to manage and plug it in separately if you want. I don't know that we should release a class that may or may not get a replacement (promises, promises ) and extends a deprecated class. Contrib was once called sandbox though and consensus appeared to be to put it in contrib - so I went with that and added a warning that the class might change soon.
          Hide
          Grant Ingersoll added a comment -

          I'm not sure why the ComplexPhraseQuery itself is buried in the Parser. Can't the query stand on it's own? Seems like it could be a useful class outside of the specific content of a QueryParser, no?

          Show
          Grant Ingersoll added a comment - I'm not sure why the ComplexPhraseQuery itself is buried in the Parser. Can't the query stand on it's own? Seems like it could be a useful class outside of the specific content of a QueryParser, no?
          Hide
          Mark Harwood added a comment -

          It does not stand on it's own as it is merely a temporary object used as a peculiarity in the way the parsing works. The SpanQuery family would be the legitimate standalone equivalents of this class.

          ComplexPhraseQuery objects are constructed during the the first pass of parsing to capture everything between quotes as an opaque string.
          The ComplexPhraseQueryParser then calls "parsePhraseElements(...)" on these objects to complete the process of parsing in a second pass where in this context any brackets etc take on a different meaning
          There is no merit in making this externally visible.

          Show
          Mark Harwood added a comment - It does not stand on it's own as it is merely a temporary object used as a peculiarity in the way the parsing works. The SpanQuery family would be the legitimate standalone equivalents of this class. ComplexPhraseQuery objects are constructed during the the first pass of parsing to capture everything between quotes as an opaque string. The ComplexPhraseQueryParser then calls "parsePhraseElements(...)" on these objects to complete the process of parsing in a second pass where in this context any brackets etc take on a different meaning There is no merit in making this externally visible.
          Hide
          Luis Alves added a comment -

          We hope to implement this on LUCENE-1823, along with other features.

          Show
          Luis Alves added a comment - We hope to implement this on LUCENE-1823 , along with other features.
          Hide
          Uwe Schindler added a comment -

          Move to 3.1 as this is a new feature.

          Show
          Uwe Schindler added a comment - Move to 3.1 as this is a new feature.
          Hide
          Ahmet Arslan added a comment -

          Hi Mark,

          Up to now, I was consuming ComplexPhraseQueryParser.java by means of copy paste into my source code, so I didn't notice.
          Today I find out that ComplexPhraseQuery.java in Lucene 2.9.1 Misc has missed the non default field.patch.
          It gives exception with author:"fred* smith" style queries.
          I am writing a solr plugin to contribute for this query parser and Solr 1.4.0 directly depends on lucene-misc-2.9.1.jar.
          Should I edit and include source code of ComplexPhraseQueryParser.java in my patch to solve this problem?
          Or is there a more convenient way to do it?

          Thank you for your consideration.

          Show
          Ahmet Arslan added a comment - Hi Mark, Up to now, I was consuming ComplexPhraseQueryParser.java by means of copy paste into my source code, so I didn't notice. Today I find out that ComplexPhraseQuery.java in Lucene 2.9.1 Misc has missed the non default field.patch. It gives exception with author:"fred* smith" style queries. I am writing a solr plugin to contribute for this query parser and Solr 1.4.0 directly depends on lucene-misc-2.9.1.jar. Should I edit and include source code of ComplexPhraseQueryParser.java in my patch to solve this problem? Or is there a more convenient way to do it? Thank you for your consideration.
          Hide
          Mark Harwood added a comment -

          Ugh. There's probably two separate actions required here then:
          1) a bug needs raising on Lucene.
          2) guidance needed from the Solr team about preferred course of action

          Show
          Mark Harwood added a comment - Ugh. There's probably two separate actions required here then: 1) a bug needs raising on Lucene. 2) guidance needed from the Solr team about preferred course of action
          Hide
          David Kaelbling added a comment -

          Could someone link the new Lucene bug mentioned above to this issue? I couldn't find it.

          Show
          David Kaelbling added a comment - Could someone link the new Lucene bug mentioned above to this issue? I couldn't find it.
          Hide
          Mark Harwood added a comment -

          Double Ugh. Applying the patch for the "non-default field" bug doesn't work any more because the latest ComplexPhraseQueryParser source sitting in contrib now has a different package to the QueryParser base class . This means that this subclass doesn't have the required write access to the package-protected "field" variable. This is needed to temporarily set the context of the parser when processing the inner contents of the phrase.

          Fixing this would require changing the package name of ComplexPhraseQueryParser or changing the visibility of "field" in the QueryParser base class to "protected".
          Anyone have any strong feelings about which of these is the most acceptable?

          Show
          Mark Harwood added a comment - Double Ugh. Applying the patch for the "non-default field" bug doesn't work any more because the latest ComplexPhraseQueryParser source sitting in contrib now has a different package to the QueryParser base class . This means that this subclass doesn't have the required write access to the package-protected "field" variable. This is needed to temporarily set the context of the parser when processing the inner contents of the phrase. Fixing this would require changing the package name of ComplexPhraseQueryParser or changing the visibility of "field" in the QueryParser base class to "protected". Anyone have any strong feelings about which of these is the most acceptable?
          Hide
          Terje Eggestad added a comment -

          Hi

          I'm about begin using the ComplexPhraseQueryParser with 3.0.2 as we need wildcard with phrases and proximity

          Our customers have a habit of including '-' in phrases which seem to trigger a bug :

          If you add the following tests to the TestComplexPhraseQueryParser class:

          checkMatches("\"joe john nosuchword\"", "");
          checkMatches("\"joe-john-nosuchword\"", "");
          checkMatches("\"john-nosuchword smith\"", "");

          AND add a rewrite() in checkMatches() just after parse :
          Query q = qp.parse(qString);
          IndexReader reader = searcher.getIndexReader(); // need for rewrite
          q = q.rewrite(reader);

          The first two is OK, and is rewritten to:

          spanNear([name:joe, name:john, name:nosuchword], 0, true)
          name:"joe john nosuchword"

          The third bomb out on

          java.lang.IllegalArgumentException: Unknown query type "org.apache.lucene.search.PhraseQuery" found in phrase query string "john-nosuchword smith"
          at org.apache.lucene.queryParser.ComplexPhraseQueryParser$ComplexPhraseQuery.rewrite(ComplexPhraseQueryParser.java:281)
          at org.apache.lucene.queryParser.TestComplexPhraseQuery.checkMatches(TestComplexPhraseQuery.java:120)
          .
          .
          .

          I made a fix that seem to fixit, but I feel on very shaky ground here.
          I've made so many debugging hack around that I can't make a propper patch, but I added this fix to ComplexPhraseQueryParser::rewrite()
          just before the place the exception is thrown:

          } else {
          if (qc instanceof TermQuery)

          { TermQuery tq = (TermQuery) qc; allSpanClauses[i] = new SpanTermQuery(tq.getTerm()); // START FIX "A-B C" phrases }

          else if (qc instanceof PhraseQuery) {
          PhraseQuery pq = (PhraseQuery) qc;
          Term[] subterms = pq.getTerms();

          SpanQuery[] clauses = new SpanQuery[subterms.length];
          for (int j = 0; j < subterms.length; j++)

          { clauses[j] = new SpanTermQuery(subterms[j]); }

          allSpanClauses[i] = new SpanNearQuery(clauses, 0, true);
          // END FIX
          } else

          { throw new IllegalArgumentException("Unknown query type \"" + qc.getClass().getName() + "\" found in phrase query string \"" + phrasedQueryStringContents + "\""); }
          Show
          Terje Eggestad added a comment - Hi I'm about begin using the ComplexPhraseQueryParser with 3.0.2 as we need wildcard with phrases and proximity Our customers have a habit of including '-' in phrases which seem to trigger a bug : If you add the following tests to the TestComplexPhraseQueryParser class: checkMatches("\"joe john nosuchword\"", ""); checkMatches("\"joe-john-nosuchword\"", ""); checkMatches("\"john-nosuchword smith\"", ""); AND add a rewrite() in checkMatches() just after parse : Query q = qp.parse(qString); IndexReader reader = searcher.getIndexReader(); // need for rewrite q = q.rewrite(reader); The first two is OK, and is rewritten to: spanNear( [name:joe, name:john, name:nosuchword] , 0, true) name:"joe john nosuchword" The third bomb out on java.lang.IllegalArgumentException: Unknown query type "org.apache.lucene.search.PhraseQuery" found in phrase query string "john-nosuchword smith" at org.apache.lucene.queryParser.ComplexPhraseQueryParser$ComplexPhraseQuery.rewrite(ComplexPhraseQueryParser.java:281) at org.apache.lucene.queryParser.TestComplexPhraseQuery.checkMatches(TestComplexPhraseQuery.java:120) . . . I made a fix that seem to fixit, but I feel on very shaky ground here. I've made so many debugging hack around that I can't make a propper patch, but I added this fix to ComplexPhraseQueryParser::rewrite() just before the place the exception is thrown: } else { if (qc instanceof TermQuery) { TermQuery tq = (TermQuery) qc; allSpanClauses[i] = new SpanTermQuery(tq.getTerm()); // START FIX "A-B C" phrases } else if (qc instanceof PhraseQuery) { PhraseQuery pq = (PhraseQuery) qc; Term[] subterms = pq.getTerms(); SpanQuery[] clauses = new SpanQuery [subterms.length] ; for (int j = 0; j < subterms.length; j++) { clauses[j] = new SpanTermQuery(subterms[j]); } allSpanClauses [i] = new SpanNearQuery(clauses, 0, true); // END FIX } else { throw new IllegalArgumentException("Unknown query type \"" + qc.getClass().getName() + "\" found in phrase query string \"" + phrasedQueryStringContents + "\""); }
          Hide
          Mark Miller added a comment -

          changing the visibility of "field" in the QueryParser base class to "protected".

          This seems reasonable?

          Show
          Mark Miller added a comment - changing the visibility of "field" in the QueryParser base class to "protected". This seems reasonable?
          Hide
          Tomás Fernández Löbbe added a comment -

          Hi I'm working in this change to allow field queries. I noted that queries like:
          name:"de*"
          name:de*
          fail due to the exception thrown in the "rewrite" method:

              public Query rewrite(IndexReader reader) throws IOException {
                // ArrayList spanClauses = new ArrayList();
                if (contents instanceof TermQuery) {
                  return contents;
                }
                
                // Build a sequence of Span clauses arranged in a SpanNear - child
                // clauses can be complex
                // Booleans e.g. nots and ors etc
                int numNegatives = 0;
                if (!(contents instanceof BooleanQuery)) {
                  throw new IllegalArgumentException("Unknown query type \""
                      + contents.getClass().getName()
                      + "\" found in phrase query string \"" + phrasedQueryStringContents
                      + "\"");
                }
          ...
          

          By changing it to something like:

          if (!(contents instanceof BooleanQuery)) {
                  return contents;
          }
          

          queries like the one above work, together with all the other queries available in the unit test. Is there something I'm missing with the previous change? I know the ComplexPhraseQueryParser is not intended to be used for queries like the ones I'm proposing, but why does it needs to fail in those cases?

          Show
          Tomás Fernández Löbbe added a comment - Hi I'm working in this change to allow field queries. I noted that queries like: name:"de*" name:de* fail due to the exception thrown in the "rewrite" method: public Query rewrite(IndexReader reader) throws IOException { // ArrayList spanClauses = new ArrayList(); if (contents instanceof TermQuery) { return contents; } // Build a sequence of Span clauses arranged in a SpanNear - child // clauses can be complex // Booleans e.g. nots and ors etc int numNegatives = 0; if (!(contents instanceof BooleanQuery)) { throw new IllegalArgumentException( "Unknown query type \" " + contents.getClass().getName() + "\" found in phrase query string \"" + phrasedQueryStringContents + "\" "); } ... By changing it to something like: if (!(contents instanceof BooleanQuery)) { return contents; } queries like the one above work, together with all the other queries available in the unit test. Is there something I'm missing with the previous change? I know the ComplexPhraseQueryParser is not intended to be used for queries like the ones I'm proposing, but why does it needs to fail in those cases?
          Hide
          Tomás Fernández Löbbe added a comment -

          I attached a patch with the change of my previous comment plus the change that allows fielded queries.

          Show
          Tomás Fernández Löbbe added a comment - I attached a patch with the change of my previous comment plus the change that allows fielded queries.
          Hide
          Ahmet Arslan added a comment -

          Mark's and Tomas' non default field patches are combined.

          Show
          Ahmet Arslan added a comment - Mark's and Tomas' non default field patches are combined.
          Hide
          Ahmet Arslan added a comment -

          Thanks for looking into this, Mark and Tomas. Do you think this issue is the right place to introduce boolean inOrder parameter? Currently always inOrder=true is passed to SpanNearQuery's ctor.

          Show
          Ahmet Arslan added a comment - Thanks for looking into this, Mark and Tomas. Do you think this issue is the right place to introduce boolean inOrder parameter? Currently always inOrder=true is passed to SpanNearQuery's ctor.
          Hide
          Tomás Fernández Löbbe added a comment -

          Ahmet, I created a Jira for the "inOrder" in the ComplexPhraseQueryParser. See LUCENE-3758.

          Show
          Tomás Fernández Löbbe added a comment - Ahmet, I created a Jira for the "inOrder" in the ComplexPhraseQueryParser. See LUCENE-3758 .
          Hide
          Otis Gospodnetic added a comment -

          The JIRA cleaning man is here. I thought this was committed long ago, but I just noticed it's open and set for 4.1. Huh?
          Last activity on this pretty popular issue was from Mark Harwood back in 2008!

          Show
          Otis Gospodnetic added a comment - The JIRA cleaning man is here. I thought this was committed long ago, but I just noticed it's open and set for 4.1. Huh? Last activity on this pretty popular issue was from Mark Harwood back in 2008!
          Hide
          Dmitry Kan added a comment -

          Can someone give me a hand on this parser (despite the jira is so old)?

          We need to have the NOT logic work properly in the boolean sense, that is the following should work correctly:

          a AND NOT b
          a AND NOT (b OR c)
          a AND NOT ((b OR c) AND (d OR e))

          Can anybody guide me here? Is it at all possible to accomplish this with this original CPQP implementation? I would not be afraid of changing QueryParser.jj lexical specification, if the task requires it.

          Show
          Dmitry Kan added a comment - Can someone give me a hand on this parser (despite the jira is so old)? We need to have the NOT logic work properly in the boolean sense, that is the following should work correctly: a AND NOT b a AND NOT (b OR c) a AND NOT ((b OR c) AND (d OR e)) Can anybody guide me here? Is it at all possible to accomplish this with this original CPQP implementation? I would not be afraid of changing QueryParser.jj lexical specification, if the task requires it.
          Hide
          Dmitry Kan added a comment -

          OK, after some study, here is what we did:

          we treat the AND clauses as spanNearQuery objects. So, the

          a AND b

          becomes %a b%slop, where %% operator is an unordered SpanNear query (change to QueryParser.jj was required for this).

          When there is a case of NOT clause with nested clauses:

          NOT( (a AND b) OR (c AND d) ) = NOT ( %a b%~slop OR %c d%~slop ) ,

          we need to handle SpanNearQueries in the addComplexPhraseClause method. In order to handle this, we just added to the if statement:

          [code]
          if (qc instanceof BooleanQuery) {
          [/code]

          the following else if statement:

          [code]
          else if (childQuery instanceof SpanNearQuery)

          { ors.add((SpanQuery)childQuery); }

          [/code]

          Show
          Dmitry Kan added a comment - OK, after some study, here is what we did: we treat the AND clauses as spanNearQuery objects. So, the a AND b becomes %a b% slop, where %% operator is an unordered SpanNear query (change to QueryParser.jj was required for this). When there is a case of NOT clause with nested clauses: NOT( (a AND b) OR (c AND d) ) = NOT ( %a b%~slop OR %c d%~slop ) , we need to handle SpanNearQueries in the addComplexPhraseClause method. In order to handle this, we just added to the if statement: [code] if (qc instanceof BooleanQuery) { [/code] the following else if statement: [code] else if (childQuery instanceof SpanNearQuery) { ors.add((SpanQuery)childQuery); } [/code]
          Hide
          Steve Rowe added a comment -

          Bulk move 4.4 issues to 4.5 and 5.0

          Show
          Steve Rowe added a comment - Bulk move 4.4 issues to 4.5 and 5.0
          Hide
          Nikhil Chhaochharia added a comment -

          The patch posted by Ahmet Arslan on 8th Feb 2012 looks good to me. I have been using it in production for some time and did not find any issues.

          I will request a committer to kindly look into this and help get this included into Solr 4.7. If any further work is required, then I will be happy to give it a shot.

          Show
          Nikhil Chhaochharia added a comment - The patch posted by Ahmet Arslan on 8th Feb 2012 looks good to me. I have been using it in production for some time and did not find any issues. I will request a committer to kindly look into this and help get this included into Solr 4.7. If any further work is required, then I will be happy to give it a shot.
          Hide
          Tim Allison added a comment -

          Any interest in LUCENE-5205?

          Show
          Tim Allison added a comment - Any interest in LUCENE-5205 ?
          Hide
          Ahmet Arslan added a comment -

          Hi Tim, of course. Do you think can we include TestComplexPhraseQuery.java in your work? To demonstrate it can replace ComplexPhraseQueryParser?

          Show
          Ahmet Arslan added a comment - Hi Tim, of course. Do you think can we include TestComplexPhraseQuery.java in your work? To demonstrate it can replace ComplexPhraseQueryParser?
          Hide
          Tim Allison added a comment -

          Happily. Will add tomorrow and update patch. Thank you!

          Show
          Tim Allison added a comment - Happily. Will add tomorrow and update patch. Thank you!
          Hide
          Tim Allison added a comment -

          Attached update with tests from TestComplexPhraseQuery to LUCENE-5205.

          Show
          Tim Allison added a comment - Attached update with tests from TestComplexPhraseQuery to LUCENE-5205 .
          Hide
          Nikhil Chhaochharia added a comment -

          LUCENE-5205 is very interesting, thanks for pointing me to it.

          However, we should still try to get LUCENE-1486 closed - most of the work has already been done and it may be useful in certain cases where the full power of LUCENE-5205 is not required.

          Show
          Nikhil Chhaochharia added a comment - LUCENE-5205 is very interesting, thanks for pointing me to it. However, we should still try to get LUCENE-1486 closed - most of the work has already been done and it may be useful in certain cases where the full power of LUCENE-5205 is not required.
          Hide
          Tim Allison added a comment -

          Agreed. ComplexPhraseQueryParser has actually been fielded

          Show
          Tim Allison added a comment - Agreed. ComplexPhraseQueryParser has actually been fielded
          Hide
          Ahmet Arslan added a comment -

          I agree with Nikhil Chhaochharia, This fix is all about accidentally forgotten thing to an already committed code. Also I am not sure why this issue is reopened. Any feeling towards a separate issue would be a better fit for this?

          Show
          Ahmet Arslan added a comment - I agree with Nikhil Chhaochharia , This fix is all about accidentally forgotten thing to an already committed code. Also I am not sure why this issue is reopened. Any feeling towards a separate issue would be a better fit for this?
          Hide
          Tim Allison added a comment -

          +1

          Show
          Tim Allison added a comment - +1
          Hide
          Nikhil Chhaochharia added a comment -

          It looks like there is a problem with stopwords also - a query like "A for B" where 'for' is a stopword is parsed as "A B" and does not match documents containing "A for B".

          Show
          Nikhil Chhaochharia added a comment - It looks like there is a problem with stopwords also - a query like "A for B" where 'for' is a stopword is parsed as "A B" and does not match documents containing "A for B".
          Hide
          Erick Erickson added a comment -

          OK, this seems like it's completely obsolete, any objections to closing? Should we raise Nikhil Chhaochharia's comment in a new JIRA to test at least?

          Show
          Erick Erickson added a comment - OK, this seems like it's completely obsolete, any objections to closing? Should we raise Nikhil Chhaochharia's comment in a new JIRA to test at least?
          Hide
          Ahmet Arslan added a comment -

          Should we raise Nikhil Chhaochharia's comment in a new JIRA to test at least?

          I created LUCENE-5530 for fielded query support.

          Show
          Ahmet Arslan added a comment - Should we raise Nikhil Chhaochharia's comment in a new JIRA to test at least? I created LUCENE-5530 for fielded query support.
          Hide
          Erick Erickson added a comment -

          What about the stopwords bit? yet another JIRA?

          Show
          Erick Erickson added a comment - What about the stopwords bit? yet another JIRA?
          Hide
          Ahmet Arslan added a comment -

          What about the stopwords bit? yet another JIRA?

          There is no patch/solution for that in ComplexPhraseQueryParser. Tim says about the topic :

          The root of this problem is that SpanNearQuery has no good way to handle stopwords in a way analagous to PhraseQuery.

          I suggested Nikhil Chhaochharia to use a modified StopwordFilter ( I sent the filter to him offlist) that does not remove but instead reduces given stop words to an impossible token.
          "the" => "ImpossibleToken"
          "a" => "ImpossibleToken"
          "for" => "ImpossibleToken"

          I think we don't need a jira for this functionality but we can document this as limitation and workaround for this.

          Show
          Ahmet Arslan added a comment - What about the stopwords bit? yet another JIRA? There is no patch/solution for that in ComplexPhraseQueryParser. Tim says about the topic : The root of this problem is that SpanNearQuery has no good way to handle stopwords in a way analagous to PhraseQuery. I suggested Nikhil Chhaochharia to use a modified StopwordFilter ( I sent the filter to him offlist) that does not remove but instead reduces given stop words to an impossible token. "the" => "ImpossibleToken" "a" => "ImpossibleToken" "for" => "ImpossibleToken" I think we don't need a jira for this functionality but we can document this as limitation and workaround for this.
          Hide
          Ahmet Arslan added a comment -

          One thing is , Tomás Fernández Löbbe has reported one problem in his comment . And he has provided a solution too.
          Its about "fred*" kind of queries. There is only one term inside quotes.

          Show
          Ahmet Arslan added a comment - One thing is , Tomás Fernández Löbbe has reported one problem in his comment . And he has provided a solution too. Its about "fred*" kind of queries. There is only one term inside quotes.
          Hide
          Ahmet Arslan added a comment -

          One last thing that might come out of this jira is Terje Eggestad this comment and his fix. However I couldn't re-produce the problem he reported with new MockAnalyzer(random()); This problem could be analyzer specific.

          Show
          Ahmet Arslan added a comment - One last thing that might come out of this jira is Terje Eggestad this comment and his fix. However I couldn't re-produce the problem he reported with new MockAnalyzer(random()); This problem could be analyzer specific.
          Hide
          Erick Erickson added a comment -

          The re-opens were from 2009. This stuff has been in Lucene for some time, and the comment "Reopening so we don't forget to do this one" makes me think this should have been closed a long time ago.

          NOTE: we're also doing more work with this in the 4.8 time frame, thus it's getting some attention now.

          Show
          Erick Erickson added a comment - The re-opens were from 2009. This stuff has been in Lucene for some time, and the comment "Reopening so we don't forget to do this one" makes me think this should have been closed a long time ago. NOTE: we're also doing more work with this in the 4.8 time frame, thus it's getting some attention now.
          Hide
          Ahmet Arslan added a comment -

          Thanks Erick Erickson for closing this beast

          Show
          Ahmet Arslan added a comment - Thanks Erick Erickson for closing this beast
          Hide
          Uwe Schindler added a comment -

          Close issue after release of 4.8.0

          Show
          Uwe Schindler added a comment - Close issue after release of 4.8.0

            People

            • Assignee:
              Erick Erickson
              Reporter:
              Mark Harwood
            • Votes:
              13 Vote for this issue
              Watchers:
              23 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development