Lucene - Core
  1. Lucene - Core
  2. LUCENE-5410

Add fuzziness support to SimpleQueryParser

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 4.7
    • Fix Version/s: 4.7, 6.0
    • Component/s: core/queryparser
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      It would be nice to add fuzzy query support to the SimpleQueryParser so that:

      foo~2

      generates a FuzzyQuery with an max edit distance of 2 and:

      "foo bar"~2

      generates a PhraseQuery with a slop of 2.

      1. LUCENE-5410.patch
        24 kB
        Robert Muir
      2. LUCENE-5410.patch
        16 kB
        Lee Hinman
      3. LUCENE-5410.patch
        13 kB
        Lee Hinman
      4. LUCENE-5410.patch
        19 kB
        Lee Hinman

        Activity

        Hide
        Lee Hinman added a comment -

        I would like to work on this also, if that's alright.

        Show
        Lee Hinman added a comment - I would like to work on this also, if that's alright.
        Hide
        Lee Hinman added a comment -

        Attached patch that supports both edit distance for single terms and slop for phrases.

        I chose to add only a single flag (FUZZINESS_OPERATOR) for enabling/disabling both behaviors, but it would be easy to separate them if desired.

        Show
        Lee Hinman added a comment - Attached patch that supports both edit distance for single terms and slop for phrases. I chose to add only a single flag ( FUZZINESS_OPERATOR ) for enabling/disabling both behaviors, but it would be easy to separate them if desired.
        Hide
        Robert Muir added a comment -

        Hi Lee, in general this seems like a good approach.

        Can we avoid the changes to QueryBuilder? The purpose of QueryBuilder is to interact with the analysis chain, and thats not typically how fuzzy queries are handled (for the same reason QueryBuilder has no prefix logic, so I think the case should be handled in a similar fashion). This way we can avoid adding methods like createFuzzyBooleanQuery, newTermOrFuzzyQuery and so on.

        As far as the parser itself, I don't like having both newPhraseQuery(String) and newPhraseQuery(String, int). Can we just nuke the first one and have the latter instead? Exact phrase queries can just past 0 here.

        Can we add the tilde operator to the list in testRandomQueries2 ?

        Show
        Robert Muir added a comment - Hi Lee, in general this seems like a good approach. Can we avoid the changes to QueryBuilder? The purpose of QueryBuilder is to interact with the analysis chain, and thats not typically how fuzzy queries are handled (for the same reason QueryBuilder has no prefix logic, so I think the case should be handled in a similar fashion). This way we can avoid adding methods like createFuzzyBooleanQuery, newTermOrFuzzyQuery and so on. As far as the parser itself, I don't like having both newPhraseQuery(String) and newPhraseQuery(String, int). Can we just nuke the first one and have the latter instead? Exact phrase queries can just past 0 here. Can we add the tilde operator to the list in testRandomQueries2 ?
        Hide
        Upayavira added a comment -

        If the point of this query parser is to support the sort of thing that users type into search boxes, are they likely to enter fuzzy or phrase slop searches? This seems unlikely to me, and seems to take a SimpleQueryParser away from being "simple". I'd suggest that if that functionality is to be added, it should be added somewhere where it is under programmatic control, not user control, such as how the edismax query parser accepts additional request parameters.

        Thus ps=2 would set a phrase slop. fs=2 could set a value of 2 for fuzzy search also. This would seem more in keeping with the goal of a clean, simple query parser.

        Show
        Upayavira added a comment - If the point of this query parser is to support the sort of thing that users type into search boxes, are they likely to enter fuzzy or phrase slop searches? This seems unlikely to me, and seems to take a SimpleQueryParser away from being "simple". I'd suggest that if that functionality is to be added, it should be added somewhere where it is under programmatic control, not user control, such as how the edismax query parser accepts additional request parameters. Thus ps=2 would set a phrase slop. fs=2 could set a value of 2 for fuzzy search also. This would seem more in keeping with the goal of a clean, simple query parser.
        Hide
        Robert Muir added a comment -

        Upayavira: I see your point, however one of the key advantages of this parser is the ability to selectively enable or disable any operator. The same arguments could be made for prefix queries or even the NOT operator , so it was added with that in mind (as well as to provide a way to do multiword synonyms: disable WHITESPACE).

        As far as setting a parameter to make every term fuzzy, I don't think thats a good idea. In such a case I really think you should do n-gram indexing

        Show
        Robert Muir added a comment - Upayavira: I see your point, however one of the key advantages of this parser is the ability to selectively enable or disable any operator. The same arguments could be made for prefix queries or even the NOT operator , so it was added with that in mind (as well as to provide a way to do multiword synonyms: disable WHITESPACE). As far as setting a parameter to make every term fuzzy, I don't think thats a good idea. In such a case I really think you should do n-gram indexing
        Hide
        Lee Hinman added a comment -

        Upayavira: That does bring up a good point, how should the extra fuzzy characters be treated if fuzziness is turned off? Currently the patch treats the token foo~2 as a TermQuery for "foo~2" if FUZZINESS_OPERATOR is disabled. Should it be changed to silently swallow the "~2" even if fuzziness is disabled?

        Show
        Lee Hinman added a comment - Upayavira: That does bring up a good point, how should the extra fuzzy characters be treated if fuzziness is turned off? Currently the patch treats the token foo~2 as a TermQuery for "foo~2" if FUZZINESS_OPERATOR is disabled. Should it be changed to silently swallow the "~2" even if fuzziness is disabled?
        Hide
        Upayavira added a comment -

        If fuzziness is disabled, then that should mean that the ~2 syntax has no meaning whatever, and should be treated as a regular token. Otherwise, you're turning the feature off, but not the syntax, and that will catch people out, and make the component more complex than necessary.

        Show
        Upayavira added a comment - If fuzziness is disabled, then that should mean that the ~2 syntax has no meaning whatever, and should be treated as a regular token. Otherwise, you're turning the feature off, but not the syntax, and that will catch people out, and make the component more complex than necessary.
        Hide
        Lee Hinman added a comment -

        Gotcha. Next version of the patch will swallow ~XXX if fuzziness is disabled, and won't touch QueryBuilder.java, as per Roberts comments.

        Show
        Lee Hinman added a comment - Gotcha. Next version of the patch will swallow ~XXX if fuzziness is disabled, and won't touch QueryBuilder.java, as per Roberts comments.
        Hide
        Robert Muir added a comment -

        Wait, why swallow it? When disabling an operator, it means its treated as it does not exist (no special meaning).

        E.g. for PREFIX today, if you disable that operator, it means it will literally make a term of foo*

        I think this is also agrees with Upayavira's thoughts, but maybe I'm missing something.

        Show
        Robert Muir added a comment - Wait, why swallow it? When disabling an operator, it means its treated as it does not exist (no special meaning). E.g. for PREFIX today, if you disable that operator, it means it will literally make a term of foo* I think this is also agrees with Upayavira's thoughts, but maybe I'm missing something.
        Hide
        Lee Hinman added a comment - - edited

        Okay, I misunderstood then, I was thinking "the ~2 syntax has no meaning whatever" meant no meaning (ie, ignore it entirely). I will keep the current behavior of foo~2 being a TermQuery for "foo~2" and "foo bar"~2 being a BooleanQuery of the PhraseQuery "foo bar" and a TermQuery for "~2".

        Show
        Lee Hinman added a comment - - edited Okay, I misunderstood then, I was thinking "the ~2 syntax has no meaning whatever" meant no meaning (ie, ignore it entirely). I will keep the current behavior of foo~2 being a TermQuery for "foo~2" and "foo bar"~2 being a BooleanQuery of the PhraseQuery "foo bar" and a TermQuery for "~2".
        Hide
        Lee Hinman added a comment -

        New version of the patch with these changes:

        • No changes made to QueryBuilder.java
        • Only a single method for newPhraseQuery instead of two
        • Add "~" to testRandomQueries2
        Show
        Lee Hinman added a comment - New version of the patch with these changes: No changes made to QueryBuilder.java Only a single method for newPhraseQuery instead of two Add "~" to testRandomQueries2
        Hide
        Jack Conradson added a comment -

        I think this patch is a good start. Personally, there's two changes I'd like to see.

        • The first was discussed briefly before, but I'm strongly in favor of having phrase slop be a different operator from fuzzy terms. I can see users wanting phrase slop without wanting fuzzy terms.
        • The second is I'd strongly prefer to see a separate method for the parsing logic related to slop/fuzzy where it would account for checking if there's a '~' and the integer afterwards. I think it will help clean up the consumeTerm and consumePhrase methods significantly. Once a term/phrase has been found, if fuzzy/slop is on, the method can be called to check for it.
        Show
        Jack Conradson added a comment - I think this patch is a good start. Personally, there's two changes I'd like to see. The first was discussed briefly before, but I'm strongly in favor of having phrase slop be a different operator from fuzzy terms. I can see users wanting phrase slop without wanting fuzzy terms. The second is I'd strongly prefer to see a separate method for the parsing logic related to slop/fuzzy where it would account for checking if there's a '~' and the integer afterwards. I think it will help clean up the consumeTerm and consumePhrase methods significantly. Once a term/phrase has been found, if fuzzy/slop is on, the method can be called to check for it.
        Hide
        Lee Hinman added a comment -

        Okay, the first is very simple (I'll add a SLOP_OPERATOR flag and make FUZZINESS_OPERATOR only work for fuzzy terms.

        As for the second, I think this is doable, but I think it will still require a bit of special logic in consumeTerm and consumePhrase based on the differences in how they consume/increment state.data and state.index. I'll work on another revision doing this.

        Show
        Lee Hinman added a comment - Okay, the first is very simple (I'll add a SLOP_OPERATOR flag and make FUZZINESS_OPERATOR only work for fuzzy terms. As for the second, I think this is doable, but I think it will still require a bit of special logic in consumeTerm and consumePhrase based on the differences in how they consume/increment state.data and state.index. I'll work on another revision doing this.
        Hide
        Lee Hinman added a comment -

        Here's a new version of the patch with these changes:

        • Make only minimal changes to consumeToken and consumePhrase, all the logic lives in separate "parseFuzziness" function used by both.
        • Separate the flags for edit distance and slop, there is now FUZZINESS_OPERATOR and SLOP_OPERATOR.
        • More tests
        Show
        Lee Hinman added a comment - Here's a new version of the patch with these changes: Make only minimal changes to consumeToken and consumePhrase , all the logic lives in separate "parseFuzziness" function used by both. Separate the flags for edit distance and slop, there is now FUZZINESS_OPERATOR and SLOP_OPERATOR . More tests
        Hide
        Robert Muir added a comment -

        This patch looks great! I'll do a detailed review now, but I really like the latest iteration.

        Thanks Lee!

        Show
        Robert Muir added a comment - This patch looks great! I'll do a detailed review now, but I really like the latest iteration. Thanks Lee!
        Hide
        Robert Muir added a comment -

        Updated patch with a few tweaks:

        • rename FUZZINESS_OPERATOR -> FUZZY_OPERATOR
        • rename SLOP_OPERATOR -> NEAR_OPERATOR
        • add test and check for negative slop in SimpleQP
        • add negative slop IAE to Phrase/MultiPhraseQuery
        • update solr support for new operators
        • move constants to params so users have access to them.
        • add some simple updates to the class javadocs
        Show
        Robert Muir added a comment - Updated patch with a few tweaks: rename FUZZINESS_OPERATOR -> FUZZY_OPERATOR rename SLOP_OPERATOR -> NEAR_OPERATOR add test and check for negative slop in SimpleQP add negative slop IAE to Phrase/MultiPhraseQuery update solr support for new operators move constants to params so users have access to them. add some simple updates to the class javadocs
        Hide
        Michael McCandless added a comment -

        +1

        Show
        Michael McCandless added a comment - +1
        Hide
        ASF subversion and git services added a comment -

        Commit 1563558 from Robert Muir in branch 'dev/trunk'
        [ https://svn.apache.org/r1563558 ]

        LUCENE-5410: add fuzzy and near to SimpleQueryParser

        Show
        ASF subversion and git services added a comment - Commit 1563558 from Robert Muir in branch 'dev/trunk' [ https://svn.apache.org/r1563558 ] LUCENE-5410 : add fuzzy and near to SimpleQueryParser
        Hide
        ASF subversion and git services added a comment -

        Commit 1563562 from Robert Muir in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1563562 ]

        LUCENE-5410: add fuzzy and near to SimpleQueryParser

        Show
        ASF subversion and git services added a comment - Commit 1563562 from Robert Muir in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1563562 ] LUCENE-5410 : add fuzzy and near to SimpleQueryParser
        Hide
        Robert Muir added a comment -

        Thanks Lee!

        Show
        Robert Muir added a comment - Thanks Lee!

          People

          • Assignee:
            Robert Muir
            Reporter:
            Lee Hinman
          • Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Time Tracking

              Estimated:
              Original Estimate - 168h
              168h
              Remaining:
              Remaining Estimate - 168h
              168h
              Logged:
              Time Spent - Not Specified
              Not Specified

                Development