  1. Lucene - Core
  2. LUCENE-489

Allow QP subclasses to support Wildcard Queries with leading "*"

    Details

    • Type: Wish
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.1
    • Component/s: core/queryparser
    • Labels: None
    • Lucene Fields: Patch Available

      Description

      It would be useful for some users if the logic that prevents QueryParser from creating WildcardQueries with leading wildcard characters ("?" or "*") were moved from the grammar into the base implementation of getWildcardQuery, so that it could be overridden in subclasses without needing to modify the grammar directly.
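
A minimal sketch of the requested design, not the actual Lucene code: the grammar passes every wildcard term to an overridable hook, and only the base implementation of that hook rejects a leading "*" or "?". The classes SketchParser, PermissiveSketchParser and LeadingWildcardException are hypothetical stand-ins, and queries are represented as plain strings to keep the example self-contained; only the method name getWildcardQuery is taken from the request.

```java
// Hypothetical stand-ins for illustration; none of these are Lucene classes.
class LeadingWildcardException extends RuntimeException {
    LeadingWildcardException(String message) {
        super(message);
    }
}

class SketchParser {
    // Stands in for the grammar: it no longer checks for leading wildcards
    // itself and simply delegates every wildcard term to the hook below.
    String parseWildcardTerm(String field, String termText) {
        return getWildcardQuery(field, termText);
    }

    // Base implementation keeps the restrictive default behaviour.
    protected String getWildcardQuery(String field, String termText) {
        if (termText.startsWith("*") || termText.startsWith("?")) {
            throw new LeadingWildcardException(
                "'*' or '?' not allowed as first character: " + termText);
        }
        return field + ":" + termText;
    }
}

// A subclass opts in to leading wildcards without touching the grammar.
class PermissiveSketchParser extends SketchParser {
    @Override
    protected String getWildcardQuery(String field, String termText) {
        return field + ":" + termText;
    }
}
```

With this shape, new SketchParser().parseWildcardTerm("content", "*cene") throws, while the same call on PermissiveSketchParser returns the query.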

      1. qp.diff
        30 kB
        Daniel Naber
      2. LUCENE-489.patch
        22 kB
        Steven Parkes
      3. LUCENE-489.patch
        24 kB
        Steven Parkes

        Activity

        otis Otis Gospodnetic added a comment -

        You can do this if you create your WildcardQuery objects programmatically (i.e. not via QueryParser).
        Support for that is not in QueryParser because leading wildcards may not perform well.
        This may be mentioned in the FAQ, but I didn't check.

        nightrider Peter Schäfer added a comment -

        Thanks, I know that those queries perform badly.

        Do you have a hint on how to improve those kinds of queries?
        Or is there a chance that we will see a more efficient implementation in the future?

        appler Cheolgoo Kang added a comment -

        Why don't you reverse those analyzed tokens in another field?

        If you have a field named 'CONTENT', add another field, 'CONTENT_R', with all indexed terms reversed, e.g. CONTENT:lucene, CONTENT_R:enecul. Then the query "CONTENT:*xyz" is the same as "CONTENT_R:zyx*", and it would work well with a custom QueryParser that overrides the QueryParser.getWildcardQuery() method.
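
The reversed-field idea can be sketched as follows, under some assumptions: the class ReversedFieldSketch and its methods are hypothetical, queries are plain strings rather than Lucene query objects, and a sorted set stands in for the term dictionary of the CONTENT_R field. Only terms with a single leading "*" are rewritten; everything else passes through unchanged.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper for illustration; not part of Lucene.
class ReversedFieldSketch {
    static String reverse(String term) {
        return new StringBuilder(term).reverse().toString();
    }

    // Rewrites "*xyz" on `field` into the prefix term "zyx*" on `field`_R.
    // Terms with any other wildcard layout are returned unchanged.
    static String rewrite(String field, String termText) {
        boolean leadingStarOnly = termText.length() > 1
                && termText.startsWith("*")
                && termText.indexOf('*', 1) < 0
                && termText.indexOf('?') < 0;
        if (!leadingStarOnly) {
            return field + ":" + termText;
        }
        return field + "_R:" + reverse(termText.substring(1)) + "*";
    }

    // Prefix scan over a sorted dictionary of reversed terms. Matching terms
    // are contiguous, so the scan stops at the first non-matching entry.
    // Hits are returned in their original (un-reversed) form.
    static List<String> prefixMatches(TreeSet<String> reversedTerms, String prefix) {
        List<String> hits = new ArrayList<>();
        for (String reversed : reversedTerms.tailSet(prefix)) {
            if (!reversed.startsWith(prefix)) {
                break;
            }
            hits.add(reverse(reversed));
        }
        return hits;
    }
}
```

For example, rewrite("CONTENT", "*xyz") yields "CONTENT_R:zyx*", and a suffix search for "cene" becomes a prefix scan for "enec" over the reversed terms.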

        nightrider Peter Schäfer added a comment -

        Great idea, thanks!

        But what about *xyz*?

        jch John Haxby added a comment -

        I'm sure someone mentioned this on one of the lists a while back, but there's a technique that we used for an LDAP server that's applicable here. It's a bit like injecting synonyms: you'd have, say, a SubwordFilter that, given "brown", would emit "rown" and "own" at the same position. A "*own" query would then simply drop the leading wildcard and look for the word. We stopped at three letters in the LDAP server. An alternative is to use a ReverseAlternativeFilter (say) that emits "brown" and "nworb" at the same position, but that deals only with a prefix or a postfix wildcard, not both at once.

        I'm not sure how you'd stop "own" matching "brown" though. If someone could come up with some example code I don't suppose I'd be the only one who would be interested!
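
A rough sketch of the suffix expansion described above, assuming a hypothetical SubwordSketch helper rather than a real Lucene TokenFilter: it returns the token plus every proper suffix down to a minimum length, which an analyzer would emit at the same position. It does not address the open question of keeping a plain "own" query from matching "brown"; one possible option (an assumption, not something proposed in this thread) would be to index the suffixes in a separate field.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not a Lucene TokenFilter.
class SubwordSketch {
    // Returns the token followed by each proper suffix that still has
    // at least minLength characters, longest first.
    static List<String> expand(String token, int minLength) {
        List<String> out = new ArrayList<>();
        out.add(token);
        for (int start = 1; token.length() - start >= minLength; start++) {
            out.add(token.substring(start));
        }
        return out;
    }

    // Query side: a term with a leading '*' is looked up as a plain term.
    static String stripLeadingWildcard(String queryTerm) {
        return queryTerm.startsWith("*") ? queryTerm.substring(1) : queryTerm;
    }
}
```

With a minimum length of three, expand("brown", 3) gives "brown", "rown" and "own", so the query "*own" reduces to an exact lookup of "own".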

        ehatcher Erik Hatcher added a comment -

        There are term rotation techniques that allow for efficient wildcard querying. For example, the word "cat" can be indexed as "cat", "$cat", "t$ca", and "at$c". For a query of "*a*", the search can be rotated to search for "a*".
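
One way to sketch term rotation, under stated assumptions: the RotationSketch class is hypothetical, "$" is used as the end-of-term marker, and the unrotated form is stored as "cat$" (with the marker) so that exact lookups work too. A query with a single "*" is rotated until the wildcard is last, and a query of the form "*X*" becomes the prefix "X*"; either way the lookup is a prefix match against the rotations.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Lucene.
class RotationSketch {
    static final char END = '$';

    // All rotations of term + END, e.g. "cat" -> cat$, at$c, t$ca, $cat.
    static List<String> rotations(String term) {
        String marked = term + END;
        List<String> out = new ArrayList<>();
        for (int i = 0; i < marked.length(); i++) {
            out.add(marked.substring(i) + marked.substring(0, i));
        }
        return out;
    }

    // Rotates a query so that any wildcard ends up in last position:
    //   "cat" -> "cat$", "X*Y" -> "Y$X*", "*X*" -> "X*".
    static String rotateQuery(String query) {
        int first = query.indexOf('*');
        int last = query.lastIndexOf('*');
        if (first < 0) {
            return query + END;
        }
        if (first == last) {
            return query.substring(first + 1) + END + query.substring(0, first) + "*";
        }
        String inner = query.substring(1, query.length() - 1);
        boolean wrapped = first == 0 && last == query.length() - 1;
        if (wrapped && !inner.isEmpty() && inner.indexOf('*') < 0) {
            return inner + "*";
        }
        throw new IllegalArgumentException("unsupported wildcard pattern: " + query);
    }

    // True if some rotation of the term satisfies the rotated query.
    static boolean matches(String term, String query) {
        String rotated = rotateQuery(query);
        boolean prefix = rotated.endsWith("*");
        String key = prefix ? rotated.substring(0, rotated.length() - 1) : rotated;
        for (String rotation : rotations(term)) {
            if (prefix ? rotation.startsWith(key) : rotation.equals(key)) {
                return true;
            }
        }
        return false;
    }
}
```

The cost of this approach is index size: a term of n characters contributes n + 1 rotated entries.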

        ehatcher Erik Hatcher added a comment -

        FYI - Actually it would not be possible to override getWildcardQuery to reverse a "*foo" query term. The parser prevents *foo from being parsed before even getting to getWildcardQuery without a change to the .jj grammar.

        eyalp Eyal Post added a comment -

        I'd like to ask that this issue be reconsidered. I suggest the following:
        1. Turn on the built in QueryParser support for leading wildcards (in QueryParser.jj)
        2. Disable the support for leading wildcards in the default QueryParser java class implementation but allow users to override that class and enable it there.

        I see many people going for different approaches to handling leading wildcard queries through QueryParser, and I believe most of them eventually recompile QueryParser.jj after making the relevant changes there.

        This might not be an issue for Java users (simply run JavaCC again and you have the QueryParser.java source), but it is especially important for users of the ported versions of Lucene (in my case DotLucene). For every Lucene version I have to recreate QueryParser.java using JavaCC and then do the porting job from Java to C#.

        ejain Eric Jain added a comment -

        Would be nice if this request could be revisited: For those people who do need to add support for wildcards at the beginning of terms (and for whom performance is not an issue) it is rather intimidating to have to write a custom query parser rather than e.g. just override a single method somewhere!

        hossman Hoss Man added a comment -

        I think the recent comments in this feature request make a legitimate argument about the extensibility of wildcard support in the QueryParser; I see no reason not to reopen this request given a slight change in title and description.

        This doesn't mean I know of any active work to implement this change (patches are always welcome), just that I think it's a worthwhile request to leave open.

        lucenebugs@danielnaber.de Daniel Naber added a comment -

        I wrote this patch that lets users enable the leading wildcard using a method call. It applies to 1.9, but if someone wants to test it and clean it up (so it applies to 2.0), I'd commit it.

        aptosca Steven Parkes added a comment -

        I was looking in this area (wildcard prefixes) so I figured I might as well do the cleanup.

        Dan's patch, with newly generated javacc-3.2 files. Also added test cases, both w/ and w/o wildcard prefixes enabled.

        Includes patches to the javacc files generated from javacc-3.2 (but see also LUCENE-667).

        otis Otis Gospodnetic added a comment -

        Steven: this patch looks good to me. Why not (manually) remove those 2 deprecated methods, getColumn and getLine?

        steven_parkes Steven Parkes added a comment -

        I guess because I am uncomfortable manually modifying automatically generated code. If there's a compelling reason for it, I'd consider it, but it has to be pretty compelling because of the effort required to (remember to) maintain the local modifications. If someone runs javacc themselves, they'll get a different result and have to look at the code to see why the results are different. I don't see that the benefit of removing a few deprecated methods is worth the potential confusion (and time taken to resolve the confusion) (and time taken to remember to do the local mod every time).

        steven_parkes Steven Parkes added a comment -

        Uhh ... can I ask why the assignee change? I shouldn't work on this anymore?

        otis Otis Gospodnetic added a comment -

        Ooops, sorry, I thought it was ready to be committed. Can you commit? Take it back, all yours!

        steven_parkes Steven Parkes added a comment -

        Ah. I get it.

        Yeah, it is ready to be committed (at least I think it is). In terms of the "patch available" flag (which I appreciate you watching), I figure that's the flag that the assignee thinks it's ready to be committed. I can imagine adding patches that aren't ready for commit, in which case I wouldn't set the flag.

        As far as doing the commit, I think on Hadoop, the committer just does the svn commit and resolves the Jira issue, w/o changing the assignee.

        We don't have to do it that way, of course, but I do kinda like it that way. I figure if contributors (as opposed to committers) are the lead on any follow-up discussion, that's a good thing, in terms of load balancing?

        otis Otis Gospodnetic added a comment -

        Q: why is this property called "allowZeroLengthPrefixQuery"? Because instead of XXX*YYY, one can now have just *YYY? I think "allowLeadingWildcard" would be more descriptive, no?

        steven_parkes Steven Parkes added a comment -

        I think "allowLeadingWildcard" would be more descriptive

        Agree. Changed.

        otis Otis Gospodnetic added a comment -

        Applied, thanks.

        mikemccand Michael McCandless added a comment -

        Closing all issues that were resolved for 2.1.


          People

          • Assignee: Otis Gospodnetic (otis)
          • Reporter: Peter Schäfer (nightrider)
          • Votes: 5
          • Watchers: 4
