Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: 4.0-ALPHA
    • Component/s: core/search
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      We worked a lot on FuzzyQuery, but you need to be a rocket scientist to ensure good results.

      The main problem is that the default distance is 0.5f, which doesn't take into account the length of the string.
      To add insult to injury, the default number of expansions is 1024 (traditionally from BooleanQuery maxClauseCount)

      I propose:

      • The syntax of FuzzyQuery is enhanced, so that you can specify raw edits too: such as foobar~2 (all terms within 2 levenshtein edits of foobar). Previously if you specified any amount >=1, you got IllegalArgumentException, so this won't break anyone. You can still use foobar~0.5, and it works just as before
      • The default for minimumSimilarity then becomes LevenshteinAutomata.MAXIMUM_SUPPORTED_DISTANCE, which is 2. This way if you just do foobar~, its always fast.
      • The size of the priority queue is reduced by default from 1024 to a much more reasonable value: 50. This is what FuzzyLikeThis uses.

      I think its best to just change the defaults for this query, since it was so aweful before. We can add notes in migrate.txt that if you care about using the old values, then you should provide them explicitly, and you will get the same results!

      1. LUCENE-2667_contrib.patch
        9 kB
        Robert Muir
      2. LUCENE-2667.patch
        36 kB
        Robert Muir
      3. LUCENE-2667.patch
        35 kB
        Robert Muir

        Activity

        Hide
        Simon Willnauer added a comment -

        I think its best to just change the defaults for this query, since it was so aweful before. We can add notes in migrate.txt that if you care about using the old values, then you should provide them explicitly, and you will get the same results!

        +1

        Thanks robert for bringing this up. Changes to the queryparsers look good to me I only have one comment about the harmony code, could you put the svn path and revision into a comment so we can track possible changes later more easily? I personally think moving to 1.6 is far away

        I propose:

        +1 to all the proposals

        Show
        Simon Willnauer added a comment - I think its best to just change the defaults for this query, since it was so aweful before. We can add notes in migrate.txt that if you care about using the old values, then you should provide them explicitly, and you will get the same results! +1 Thanks robert for bringing this up. Changes to the queryparsers look good to me I only have one comment about the harmony code, could you put the svn path and revision into a comment so we can track possible changes later more easily? I personally think moving to 1.6 is far away I propose: +1 to all the proposals
        Hide
        Michael McCandless added a comment -

        This looks great! I'm always confused by the fuzzy distance measure (when it's ~X where X < 1).

        You needed that nextAfter method to ensure that the truncation used when computing initialMaxDistance always "inverts" correctly, for the X >= 1.0 case?

        Show
        Michael McCandless added a comment - This looks great! I'm always confused by the fuzzy distance measure (when it's ~X where X < 1). You needed that nextAfter method to ensure that the truncation used when computing initialMaxDistance always "inverts" correctly, for the X >= 1.0 case?
        Hide
        Robert Muir added a comment -

        could you put the svn path and revision into a comment so we can track possible changes later more easily?

        great idea... i didnt mean to forget it: i put it in my local.
        http://svn.apache.org/viewvc/harmony/enhanced/java/branches/java6/classlib/modules/luni/src/main/java/java/lang/StrictMath.java

        You needed that nextAfter method to ensure that the truncation used when computing initialMaxDistance always "inverts" correctly, for the X >= 1.0 case?

        right, our fuzzy's scaling is wierd. and it compares with '>'. so if ed=2 computes to 0.5, we have to nudge the float a bit, because minSimilarity is non-inclusive.

        Show
        Robert Muir added a comment - could you put the svn path and revision into a comment so we can track possible changes later more easily? great idea... i didnt mean to forget it: i put it in my local. http://svn.apache.org/viewvc/harmony/enhanced/java/branches/java6/classlib/modules/luni/src/main/java/java/lang/StrictMath.java You needed that nextAfter method to ensure that the truncation used when computing initialMaxDistance always "inverts" correctly, for the X >= 1.0 case? right, our fuzzy's scaling is wierd. and it compares with '>'. so if ed=2 computes to 0.5, we have to nudge the float a bit, because minSimilarity is non-inclusive.
        Hide
        Robert Muir added a comment -

        here's an updated patch, i think this is ready to commit.

        • use integer calculations internally to avoid the tricky float stuff
        • i added tests for the case where the edit distance is greater than the word.
          previously, it was not possible to issue these type of queries, as noted in the enum
                // this will return less than 0.0 when the edit distance is
                // greater than the number of characters in the shorter word.
                // but this was the formula that was previously used in FuzzyTermEnum,
                // so it has not been changed (even though minimumSimilarity must be
                // greater than 0.0)
          
        • i removed a TODO, so the linear enum gets an optimization from the priority queue,
          in that it uses the updated maxEdits to quickly reject too long/too short terms.
        Show
        Robert Muir added a comment - here's an updated patch, i think this is ready to commit. use integer calculations internally to avoid the tricky float stuff i added tests for the case where the edit distance is greater than the word. previously, it was not possible to issue these type of queries, as noted in the enum // this will return less than 0.0 when the edit distance is // greater than the number of characters in the shorter word. // but this was the formula that was previously used in FuzzyTermEnum, // so it has not been changed (even though minimumSimilarity must be // greater than 0.0) i removed a TODO, so the linear enum gets an optimization from the priority queue, in that it uses the updated maxEdits to quickly reject too long/too short terms.
        Hide
        Michael McCandless added a comment -

        Patch looks great Robert!

        Show
        Michael McCandless added a comment - Patch looks great Robert!
        Hide
        Robert Muir added a comment -

        Committed revision 1002214

        Show
        Robert Muir added a comment - Committed revision 1002214
        Hide
        Robert Muir added a comment -

        i missed some things in the contrib/queryparser when doing this.

        Show
        Robert Muir added a comment - i missed some things in the contrib/queryparser when doing this.
        Hide
        Robert Muir added a comment -

        Committed revision 1024498 for the contrib parsers... hopefully I don't find any more I missed!

        Show
        Robert Muir added a comment - Committed revision 1024498 for the contrib parsers... hopefully I don't find any more I missed!
        Hide
        Francisco Alvarez added a comment -

        The Fuzzy Search functionality has been dramatically limited with this new implementation.

        Before it was possible to search with edit distances higher than 2, which is really necessary in many situations.

        We have tried to increase the MAXIMUM_SUPPORTED_DISTANCE value but got the following error:

        java.lang.NullPointerException
        at org.apache.lucene.util.automaton.UTF32ToUTF8.convert(UTF32ToUTF8.java:259)
        at org.apache.lucene.util.automaton.CompiledAutomaton.<init>(CompiledAutomaton.java:163)
        at org.apache.lucene.search.FuzzyTermsEnum.initAutomata(FuzzyTermsEnum.java:182)
        at org.apache.lucene.search.FuzzyTermsEnum.getAutomatonEnum(FuzzyTermsEnum.java:153)
        at org.apache.lucene.search.FuzzyTermsEnum.maxEditDistanceChanged(FuzzyTermsEnum.java:217)

        We need a solution for fuzzy searches higher than 2 edit distances to keep consistent behaviour with Lucene 3.x

        Show
        Francisco Alvarez added a comment - The Fuzzy Search functionality has been dramatically limited with this new implementation. Before it was possible to search with edit distances higher than 2, which is really necessary in many situations. We have tried to increase the MAXIMUM_SUPPORTED_DISTANCE value but got the following error: java.lang.NullPointerException at org.apache.lucene.util.automaton.UTF32ToUTF8.convert(UTF32ToUTF8.java:259) at org.apache.lucene.util.automaton.CompiledAutomaton.<init>(CompiledAutomaton.java:163) at org.apache.lucene.search.FuzzyTermsEnum.initAutomata(FuzzyTermsEnum.java:182) at org.apache.lucene.search.FuzzyTermsEnum.getAutomatonEnum(FuzzyTermsEnum.java:153) at org.apache.lucene.search.FuzzyTermsEnum.maxEditDistanceChanged(FuzzyTermsEnum.java:217) We need a solution for fuzzy searches higher than 2 edit distances to keep consistent behaviour with Lucene 3.x
        Hide
        Uwe Schindler added a comment -

        Hi Francisco: The core FuzzyQuery does not support edit distances > 2, because the automatons used for this would be too big and slow. If you really want distances > 2, use http://lucene.apache.org/core/4_0_0-BETA/sandbox/org/apache/lucene/sandbox/queries/SlowFuzzyQuery.html from the sandbox module (lucene-sandbox.jar). This one is the same algorithm as the old 3.x FuzzyQuery (and is as slow).

        Show
        Uwe Schindler added a comment - Hi Francisco: The core FuzzyQuery does not support edit distances > 2, because the automatons used for this would be too big and slow. If you really want distances > 2, use http://lucene.apache.org/core/4_0_0-BETA/sandbox/org/apache/lucene/sandbox/queries/SlowFuzzyQuery.html from the sandbox module (lucene-sandbox.jar). This one is the same algorithm as the old 3.x FuzzyQuery (and is as slow).
        Hide
        Francisco Alvarez added a comment -

        I see, thanks for your quick reply!

        Show
        Francisco Alvarez added a comment - I see, thanks for your quick reply!

          People

          • Assignee:
            Robert Muir
            Reporter:
            Robert Muir
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development