Lucene - Core / LUCENE-4282

Automaton Fuzzy Query doesn't deliver all results

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: 4.0-BETA, 6.0
    • Component/s: core/search
    • Labels:
    • Lucene Fields:
      New

      Description

      Given a small index with n documents, where each document has one of the following terms:
      WEBER, WEBE, WEB, WBR, WE (and some more)
      The new FuzzyQuery (Automaton) with maxEdits=2 delivers only the expected terms WEBER and WEBE in the rewritten query. The terms WEB and WBR, which also have an edit distance of 2, are missing.
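      Both missing terms are indeed within two edits of WEBER: WEB drops the trailing E and R, and WBR drops both Es, while WE is three deletions away and is correctly excluded. A standard dynamic-programming Levenshtein routine (a standalone sketch, independent of Lucene) confirms these distances:

```java
public class Levenshtein {
    // Classic DP edit distance (insert / delete / substitute).
    static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int sub = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,   // deletion
                                            d[i][j - 1] + 1),  // insertion
                                   d[i - 1][j - 1] + sub);     // substitution
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        // WEBER -> 0, WEBE -> 1, WEB -> 2, WBR -> 2, WE -> 3
        for (String t : new String[] {"WEBER", "WEBE", "WEB", "WBR", "WE"}) {
            System.out.println(t + " -> " + distance("WEBER", t));
        }
    }
}
```

      So with maxEdits=2 a correct enum must accept WEBER, WEBE, WEB, and WBR, i.e. four terms.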

      1. LUCENE-4282.patch
        5 kB
        Robert Muir
      2. LUCENE-4282.patch
        5 kB
        Robert Muir
      3. LUCENE-4282-tests.patch
        2 kB
        Uwe Schindler
      4. ModifiedFuzzyTermsEnum.java
        18 kB
        Johannes Christen
      5. ModifiedFuzzyTermsEnum.java
        18 kB
        Johannes Christen

        Activity

        Uwe Schindler added a comment - edited

        This is caused by the rewrite method, not FuzzyQuery itself. The rewrite method uses an internal priority queue, where it collects all terms from the index that match the Levenshtein distance. If more terms are available than fit, some are dropped; this depends on their distance and other factors. If you want to use a larger PQ, create a separate instance of TopTermsScoringBooleanQueryRewrite, passing a queue size.
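        The dropping behavior described here can be modeled in plain Java: matching terms go into a bounded min-heap ordered by score, and once the heap is full the lowest-scoring term is evicted. This is a simplified model of a top-terms rewrite, not the actual Lucene code; the term scores are illustrative:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class BoundedTopTerms {
    static class ScoredTerm {
        final String term;
        final float score;
        ScoredTerm(String term, float score) { this.term = term; this.score = score; }
    }

    // Keep at most queueSize terms; when the heap overflows,
    // the lowest-scoring entry is dropped first.
    static List<String> topTerms(List<ScoredTerm> matches, int queueSize) {
        PriorityQueue<ScoredTerm> pq =
            new PriorityQueue<>(Comparator.comparingDouble((ScoredTerm t) -> t.score));
        for (ScoredTerm t : matches) {
            pq.add(t);
            if (pq.size() > queueSize) pq.poll(); // evict lowest score
        }
        List<String> kept = new ArrayList<>();
        for (ScoredTerm t : pq) kept.add(t.term);
        return kept;
    }

    public static void main(String[] args) {
        List<ScoredTerm> matches = List.of(
            new ScoredTerm("WEBER", 1.0f), new ScoredTerm("WEBE", 0.58f),
            new ScoredTerm("WEB", 0.33f), new ScoredTerm("WBR", 0.33f));
        // With a queue of 2, the two lowest-scoring terms are dropped.
        System.out.println(topTerms(matches, 2));
    }
}
```

        As the reporter points out next, though, his index is far too small for this eviction to be the cause.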

        Johannes Christen added a comment -

        Thanks for the quick response, Uwe.
        I don't think that is the cause. My test index is very small (fewer than 100 terms), so I don't think the terms get dropped. I think they are missed by the automaton.
        My rewritten query has only 2 terms: NAME:WEBE^0.5833334 NAME:WEBER

        Uwe Schindler added a comment -

        What was your query?

        Johannes Christen added a comment - edited

        Query query = new FuzzyQuery(new Term("NAME", "WEBER"),2,1);

        Here are all the terms for the field NAME in my index:
        LANGE
        LUETH
        PIRSING
        RIEGEL
        TRZECZIAK
        WALKER
        WBR
        WE
        WEB
        WEBE
        WEBER
        WITTKOPF
        WOJNAROWSKI
        WRICKE

        Uwe Schindler added a comment -

        There is indeed something strange; I have to wait for Robert to wake up. The following test fails (when added to TestFuzzyQuery.java):

          public void test2() throws Exception {
            Directory directory = newDirectory();
            RandomIndexWriter writer = new RandomIndexWriter(random(), directory, new MockAnalyzer(random(), MockTokenizer.KEYWORD, false));
            addDoc("LANGE", writer);
            addDoc("LUETH", writer);
            addDoc("PIRSING", writer);
            addDoc("RIEGEL", writer);
            addDoc("TRZECZIAK", writer);
            addDoc("WALKER", writer);
            addDoc("WBR", writer);
            addDoc("WE", writer);
            addDoc("WEB", writer);
            addDoc("WEBE", writer);
            addDoc("WEBER", writer);
            addDoc("WITTKOPF", writer);
            addDoc("WOJNAROWSKI", writer);
            addDoc("WRICKE", writer);
        
            IndexReader reader = writer.getReader();
            IndexSearcher searcher = newSearcher(reader);
            writer.close();
        
            FuzzyQuery query = new FuzzyQuery(new Term("field", "WEBER"), 2, 1);
            ScoreDoc[] hits = searcher.search(query, null, 1000).scoreDocs;
            assertEquals(4, hits.length);
        
            reader.close();
            directory.close();
        }
        

        The two missing terms have 2 deletions, so they are within the edit distance.

        Uwe Schindler added a comment -

        The same happens if I disable transpositions, so the transposition-supporting automata are not the problem.
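        That transpositions are irrelevant here is easy to see: WEB and WBR are reached from WEBER by deletions alone, so plain Levenshtein and a transposition-aware (optimal string alignment) distance agree on every term involved. A quick check, a sketch rather than Lucene code, with a hypothetical pair (WEBER vs. WBEER) where the two measures do differ:

```java
public class EditDistances {
    // Plain Levenshtein: insert / delete / substitute.
    static int levenshtein(String a, String b) {
        return compute(a, b, false);
    }

    // Optimal string alignment: additionally allows one adjacent transposition.
    static int osa(String a, String b) {
        return compute(a, b, true);
    }

    private static int compute(String a, String b, boolean transpositions) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
                if (transpositions && i > 1 && j > 1
                        && a.charAt(i - 1) == b.charAt(j - 2)
                        && a.charAt(i - 2) == b.charAt(j - 1)) {
                    d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1); // swap
                }
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        // Identical for the terms in this issue:
        System.out.println(levenshtein("WEBER", "WEB") + " " + osa("WEBER", "WEB"));
        // Different only when a swap is involved (WBEER = WEBER with E/B swapped):
        System.out.println(levenshtein("WEBER", "WBEER") + " " + osa("WEBER", "WBEER"));
    }
}
```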

        Johannes Christen added a comment -

        Yes, I tried this as well. The prefix is not the problem either.
        I expect the error is deep in the automaton.

        Uwe Schindler added a comment - edited

        I also added more terms that are in fact longer than WEBER (WEBERE and WEBERES); both are returned, only the shorter ones are not. WBRE also works. I don't think the automaton is broken; it may be the FuzzyTermsEnum, which does some stuff on top of AutomatonTermsEnum. We have to wait for Robert, he might understand what's going on.

        Johannes Christen added a comment -

        OK, I'll keep digging in the code and come back when I find something.

        Johannes Christen added a comment -

        Modification of the FuzzyTermsEnum class fixing issue LUCENE-4282

        Johannes Christen added a comment - edited

        Well, I think I found the solution.
        You were right, Uwe. It happens in the FuzzyTermsEnum.AutomatonFuzzyTermsEnum class.
        Calculating the similarity in the accept() method is based on the smaller of the lengths of the request term and the index term.

        I attached my ModifiedFuzzyTermsEnum class, where you can find the modification which makes it work.
        BTW, there are some more modifications, fixing bugs in calculating the similarity from the edit distance and vice versa.
        The modification of the boost factor was only necessary for my boolean address search approach and possibly doesn't apply here.
        The modified bits are marked with USERCODE_BEGIN and USERCODE_END tags.

        Uwe Schindler added a comment -

        Thanks for the help. We are starting to investigate what's wrong!

        I did another test in parallel:

        query.setRewriteMethod(FuzzyQuery.SCORING_BOOLEAN_QUERY_REWRITE);
        

        With that one it also fails, so the boost attribute itself is not the problem, because this rewrite method does not use it at all (no PriorityQueue).

        Also, the automaton itself is correct; if you pass the terms to it, they all pass:

        LevenshteinAutomata builder = new LevenshteinAutomata("EBER", true);
        Automaton a = builder.toAutomaton(2);
        a = BasicOperations.concatenate(BasicAutomata.makeChar('W'), a);
        System.out.println(BasicOperations.run(a, "WBR"));
        System.out.println(BasicOperations.run(a, "WEB"));
        System.out.println(BasicOperations.run(a, "WEBE"));
        System.out.println(BasicOperations.run(a, "WEBER"));
        
        Robert Muir added a comment -

        Thanks for reporting and looking into this!

        I think the bug is just the use of floats at all in this enum.

        -        if (similarity > minSimilarity) {
        +        if (ed <= maxEdits) {
                   boostAtt.setBoost((similarity - minSimilarity) * scale_factor);
                   //System.out.println("  yes");
                   return AcceptStatus.YES;
                 } else {
        +          System.out.println("reject: " + term.utf8ToString());
                   return AcceptStatus.NO;
                 }
        

        This seems to fix it for me. We should remove all the float crap from this enum;
        we don't need it, only a slower deprecated class in the sandbox needs it.
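        The arithmetic behind the rejection can be reproduced outside Lucene. Assuming accept() computed a similarity of 1 - ed / min(|term|, |query|) and compared it against a fractional minSimilarity threshold (0.4 here, a value that reproduces the reported boost of roughly 0.5833334 for WEBE), the short terms fall below the threshold even though their raw edit distance is within maxEdits. This is a hypothetical reconstruction of the comparison, not the actual Lucene source:

```java
public class FloatAcceptBug {
    // Legacy-style similarity: based on the shorter of the two strings,
    // as described in the comments in this issue (reconstruction, not Lucene code).
    static float similarity(int ed, String term, String query) {
        return 1.0f - (float) ed / Math.min(term.length(), query.length());
    }

    public static void main(String[] args) {
        String query = "WEBER";
        float minSimilarity = 0.4f;                    // assumed threshold for maxEdits=2
        float scale = 1.0f / (1.0f - minSimilarity);   // boost scale factor

        // {term, edit distance to "WEBER"}
        Object[][] terms = { {"WEBER", 0}, {"WEBE", 1}, {"WEB", 2}, {"WBR", 2} };
        for (Object[] t : terms) {
            String term = (String) t[0];
            int ed = (Integer) t[1];
            float sim = similarity(ed, term, query);
            boolean floatAccept = sim > minSimilarity;  // buggy float comparison
            boolean intAccept = ed <= 2;                // integer comparison from the patch
            System.out.println(term + " sim=" + sim
                + " floatAccept=" + floatAccept + " intAccept=" + intAccept
                + (floatAccept ? " boost=" + ((sim - minSimilarity) * scale) : ""));
        }
    }
}
```

        Under these assumptions WEB and WBR get similarity 1 - 2/3 = 0.333, below 0.4, and are rejected by the float comparison while the integer comparison accepts them, matching the symptom in this issue.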

        Uwe Schindler added a comment -

        Robert Muir: I added my tests as a patch. TestFuzzyQuery is currently not the best test we have: all terms there have equal length, which helps here. I added some more terms (longer ones, too); the 2 shorter ones still fail without a fix.

        I am now away, I hope that helps.

        Johannes Christen added a comment -

        Hi Robert. Yes, this might be right, but I am still using the float-based similarity stuff, since 2 edits on a three-letter word is a much bigger difference to me than 2 edits on a 10-letter word.
        If you apply the stuff I sent, it will work for both cases.

        Robert Muir added a comment -

        Johannes: we will have the same scoring when I say 'removing floats', only less code actually (we can remove this entire if, I think).

        The only floats will be what is put into the boost attribute, but no comparisons against floats; the latter is what causes the bug.

        Robert Muir added a comment -

        Here's a patch, with Uwe's test.

        The float comparison is wasted CPU for FuzzyQuery, as you already know the term is accepted by the automaton.

        But the deprecated SlowFuzzyQuery in the sandbox needs this, because it has crazier logic, so it overrides the logic and does the float comparison. We should really remove that one from trunk since it has been deprecated since 4.x; that will make it easier to clean this up to be much simpler.

        Robert Muir added a comment -

        I will think about this one more: the patch is correct for 'edits', but the scoring becomes crazy. This is because of the historical behavior of this query.

        Just try porting Uwe's test to 3.6 and you will see what I mean.

        I think it's too tricky for the query in core (also used by the spellchecker) to be the base for the SlowFuzzyQuery, which is supposed to mimic the old crazy behavior.

        Robert Muir added a comment -

        A simpler patch, which I also benchmarked.

        The problem is this comment in the legacy scoring (present in all previous Lucene versions):

              // this will return less than 0.0 when the edit distance is
              // greater than the number of characters in the shorter word.
              // but this was the formula that was previously used in FuzzyTermEnum,
              // so it has not been changed (even though minimumSimilarity must be
              // greater than 0.0)
        

        Because of that, it's really impossible to fix until we remove that deprecated one completely.

        So I think this one is good to commit; separately, I will look at removing the deprecated one from trunk and cleaning all this up when I have time (I would port the math-proof tests from the automata package to run as queries so we are sure).
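        The quoted comment's claim is easy to check numerically: with a similarity of 1 - ed / min(|a|, |b|), any edit distance larger than the shorter word's length drives the value below zero, even though minimumSimilarity must be greater than 0.0. A worked illustration of the quoted formula (not Lucene source):

```java
public class LegacySimilarity {
    // The similarity formula described in the legacy comment above.
    static float similarity(int ed, String a, String b) {
        return 1.0f - (float) ed / Math.min(a.length(), b.length());
    }

    public static void main(String[] args) {
        // 4 edits against a 1-character term: similarity goes negative.
        System.out.println(similarity(4, "WEBER", "W")); // prints -3.0
        // 2 edits against a 3-character term stays positive but small.
        System.out.println(similarity(2, "WEBER", "WEB"));
    }
}
```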

        Michael McCandless added a comment -

        +1

        Johannes Christen added a comment -

        Would that cleanup mean SlowFuzzyQuery is gone and we can only do fuzzy searches with a maximum of 2 edits? I hope not, because I also need the old slow version for greater edit distances.

        Uwe Schindler added a comment -

        +1 to fix with the easy patch for now

        Robert Muir added a comment -

        Thanks for reporting this!


  People

  • Assignee: Robert Muir
  • Reporter: Johannes Christen
  • Votes: 0
  • Watchers: 3