Lucene - Core

Lucene Fuzzy Search: BK-Tree can improve performance 3-20 times.

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.0
    • Fix Version/s: None
    • Component/s: core/search
    • Labels:
      None
    • Environment:
    • Lucene Fields:
      New, Patch Available

      Description

      W. Burkhard and R. Keller. Some approaches to best-match file searching, CACM, 1973
      http://portal.acm.org/citation.cfm?doid=362003.362025

      I was inspired by http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees (Nick Johnson, Google).

      Additionally, the simplified algorithm at http://www.catalysoft.com/articles/StrikeAMatch.html seems much more logically correct than Levenshtein distance, and it is 3-5 times faster (isolated tests).
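The "Strike a Match" algorithm from that article compares the sets of adjacent letter pairs of the two strings. A minimal Java sketch of the single-token case follows (class and method names are mine, not from the article or the attached files); note it yields a similarity in [0, 1], not an integer distance:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified "Strike a Match" (Simon White) similarity for single tokens.
class StrikeAMatch {

    // All adjacent two-letter pairs of a string, e.g. FRANCE -> FR,RA,AN,NC,CE.
    private static List<String> letterPairs(String s) {
        List<String> pairs = new ArrayList<String>();
        for (int i = 0; i < s.length() - 1; i++) {
            pairs.add(s.substring(i, i + 2));
        }
        return pairs;
    }

    // similarity = 2 * |intersection| / (|pairs(a)| + |pairs(b)|)
    static double similarity(String a, String b) {
        List<String> pairsA = letterPairs(a.toUpperCase());
        List<String> pairsB = letterPairs(b.toUpperCase());
        int union = pairsA.size() + pairsB.size();
        int intersection = 0;
        for (String p : pairsA) {
            if (pairsB.remove(p)) {   // remove so repeated pairs match only once
                intersection++;
            }
        }
        return union == 0 ? 0.0 : (2.0 * intersection) / union;
    }
}
```

The article's own example gives similarity("FRANCE", "FRENCH") = 0.4: the pair sets share only FR and NC.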

      Big list of distance implementations:
      http://www.dcs.shef.ac.uk/~sam/stringmetrics.htm

      1. FuzzyTermEnumNEW.java
        4 kB
        Fuad Efendi
      2. FuzzyTermEnumNEW.java
        4 kB
        Fuad Efendi
      3. DistanceImpl.java
        4 kB
        Fuad Efendi
      4. Distance.java
        0.1 kB
        Fuad Efendi
      5. BKTree.java
        3 kB
        Fuad Efendi

        Activity

        Fuad Efendi added a comment -

        First version of BKTree.java

        Fuad Efendi added a comment -

        FuzzyTermEnumNEW.java

        In order to use it with Lucene 2.9.1, compile all files (4 files) into a separate JAR file (Java 6).

        In a source distribution of Lucene 2.9.1, modify a single method of FuzzyQuery:

        protected FilteredTermEnum getEnum(IndexReader reader) throws IOException {
            return new FuzzyTermEnumNEW(reader, getTerm(), minimumSimilarity, prefixLength);
        }

        Then compile it (using default settings such as Java 1.4 compatibility); "ant jar-core" will do it.

        Use 2 new jars instead of lucene-core-2.9.1.jar

        Fuad Efendi added a comment - - edited

        Minor bug fixed (with cache warm-up)...

        Don't forget to disable QueryResultsCache and to enable large DocumentCache (if you are using SOLR); otherwise you won't see the difference. (SOLR users: there are some other tricks!)

        With Lucene 2.9.1:
        800ms

        With BKTree and the Levenshtein algorithm:
        200ms

        With BKTree and http://www.catalysoft.com/articles/StrikeAMatch.html
        70ms

        Average timing after many hours of tests. We may consider an "integer" distance instead of "float" for Lucene Query if we accept this algorithm; I tried my best to keep it close to the "float" distance.

        BKTree is cached at FuzzyTermEnumNEW. It needs warm-up if the index changed; the simplest way is to recalculate it at night (a separate thread will do it).
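The overnight recalculation in a separate thread could be sketched roughly as below; BKTreeHolder and its methods are illustrative names, not part of the attached patch. Queries keep reading the current tree while a daemon thread builds a replacement, which is then swapped in atomically:

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Hypothetical holder for a cached tree that is rebuilt off-peak.
class BKTreeHolder<T> {
    private final AtomicReference<T> current = new AtomicReference<T>();

    T get() { return current.get(); }

    // Builds a fresh tree on a daemon thread and swaps it in when done.
    // Returns the worker thread so a caller can join it if needed.
    Thread rebuildInBackground(final Supplier<T> builder) {
        Thread worker = new Thread(new Runnable() {
            public void run() { current.set(builder.get()); }
        }, "bktree-rebuild");
        worker.setDaemon(true);
        worker.start();
        return worker;
    }
}
```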

        ======
        P.S.

        FuzzyQuery constructs an instance of FuzzyTermEnum and passes an instance of IndexReader to the constructor. This is the way... If the IndexReader changed (new instance), we can simply repopulate the BKTree (or, for instance, compare the lists of terms and add only the terms missing from the BKTree)...

        Fuad Efendi added a comment - - edited

        After long-run load-stress tests...

        I used 2 boxes, one with SOLR, another one with a simple multithreaded stress simulator (with randomly generated fuzzy query samples); each box is 2x AMD Opteron 2350 (8 cores per box); 64-bit.

        I disabled all SOLR caches except the Document Cache (I want isolated tests; I want to ignore the time taken by disk I/O to load documents).

        Performance rose with the number of load-stress threads (on the "client" computer), then dropped:

        9 Threads:
        ==========
        TPS: 200 - 210
        Response: 45 - 50 (ms)

        10 Threads:
        ===========
        TPS: 200 - 215
        Response: 45 - 55 (ms)

        12 Threads:
        ===========
        TPS: 180 - 220
        Response: 50 - 90 (ms)

        16 Threads:
        ===========
        TPS: 60 - 65
        Response: 230 - 260 (ms)

        It can be explained by CPU-bound processing and the 8 cores available; the "top" command on the SOLR instance showed 750% - 790% CPU time (8-core) at the 3rd step (12 stressing threads), and 200% at the 4th step (16 stressing threads) - probably due to network I/O, Tomcat internals, etc.

        It's better to have Apache HTTPD in front of SOLR in production, with proxy_ajp (persistent connections) and HTTP caching enabled; and fine-tune Tomcat threads according to use case.

        BTW, my best counters for default SOLR/Lucene were:
        TPS: 12
        Response: 750ms

        "Fuzzy" queries were tuned such that the distance threshold was less than or equal to two. I used the "StrikeAMatch" distance...

        Thanks,
        http://www.tokenizer.ca
        +1 416-993-2060(cell)

        P.S.
        Before performing load-stress tests, I established a baseline in my environment: 1500 TPS by pinging http://x.x.x.x:8080/apache-solr-1.4/admin/ (a static JSP).
        And I reached 220 TPS for fuzzy search, starting from 12-15 TPS (default Lucene/SOLR)...

        P.P.S.
        Distance function must follow 3 axioms:

        D(a,a) = 0                  (identity)
        D(a,b) = D(b,a)             (symmetry)
        D(a,b) + D(b,c) >= D(a,c)   (triangle inequality)

        And, function must return Integer value.

        Otherwise, BKTree will produce wrong results.
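A candidate distance can be checked against these axioms mechanically. The sketch below (a hypothetical helper class, using the plain Levenshtein distance, not the attached DistanceImpl.java) brute-forces all three axioms over a sample vocabulary:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: brute-force the three metric axioms over a vocabulary.
class MetricCheck {

    // Classic two-row dynamic-programming Levenshtein distance (integer-valued).
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = cur; cur = tmp;
        }
        return prev[b.length()];
    }

    static boolean isMetric(List<String> words) {
        for (String a : words) {
            if (levenshtein(a, a) != 0) return false;                     // D(a,a) = 0
            for (String b : words) {
                if (levenshtein(a, b) != levenshtein(b, a)) return false; // symmetry
                for (String c : words) {                                  // triangle inequality
                    if (levenshtein(a, b) + levenshtein(b, c) < levenshtein(a, c)) return false;
                }
            }
        }
        return true;
    }
}
```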

        Also, it's mentioned somewhere in the Levenshtein algorithm's Javadocs (in the contrib folder, I believe) that an instance method runs faster than a static method; this needs testing with Java 6... most probably 'yes', depending on the JVM implementation; I can only guess that CPU internals are better optimized for instance methods...

        Dirk Goldhan added a comment -

        I have tried your patch on Lucene 3.0.0. I had to make a small change to get it to work.
        In the current implementation the enumerator is positioned before the first element. This is no problem in Lucene 2.9.1, as it simply does one more iteration in FuzzyQuery (rewrite). In Lucene 3.0.0, FuzzyQuery directly accesses the first element of the enumeration. As this is null, it simply stops further processing of terms (if (t == null) break).

        I made a small change in the FuzzyTermEnumNEW class to assign the first element to the enumerator during creation. I simply appended the following lines to the constructor:

        if (this.termIterator.hasNext()) {
            Term firstTerm = new Term(field, termMap.keySet().iterator().next());
            this.currentTerm = firstTerm;
        }

        Thank you for the patch.

        Uwe Schindler added a comment - - edited

        Hi Fuad,

        thanks for submitting your changed FuzzyQuery. After quickly looking through the classes I found the following problems:

        • The cache is incorrectly synchronized: The cache is static but access is synchronized against the instance.
        • The cache is not limited, maybe it should be a WeakHashMap. It can easily overflow the memory (as it consumes lots of memory).
        • When you build the tree, you use a class from spellchecker: org.apache.lucene.search.spell.LuceneDictionary. This adds additional memory consumption, esp. if the index has a large term dict. Why not simply iterate over the IndexReader's TermEnum?
        • The cache cannot work correctly with per segment search (since 2.9) or reopened IndexReaders, because it is only bound to the field name but not an index reader. To have a correct cache, do it like FieldCache and use a combined key from field name and IndexReader.getFieldCacheKey().
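The suggested cache layout, keyed on both the reader and the field name with the reader held weakly, could look roughly like the sketch below. All names here are illustrative, not the attached patch's API; in Lucene the outer key would come from IndexReader.getFieldCacheKey():

```java
import java.util.HashMap;
import java.util.Map;
import java.util.WeakHashMap;

// Outer map is weak on the reader's cache key, so entries disappear once the
// reader is garbage-collected; the inner map is per field name.
class PerReaderCache<V> {
    private final Map<Object, Map<String, V>> cache =
        new WeakHashMap<Object, Map<String, V>>();

    synchronized V get(Object readerKey, String field) {
        Map<String, V> perField = cache.get(readerKey);
        return perField == null ? null : perField.get(field);
    }

    synchronized void put(Object readerKey, String field, V value) {
        Map<String, V> perField = cache.get(readerKey);
        if (perField == null) {
            perField = new HashMap<String, V>();
            cache.put(readerKey, perField);
        }
        perField.put(field, value);
    }
}
```

This mirrors FieldCache's approach: reopening the index yields a new reader key, so stale trees are never served and the old entries become collectable.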

        Otherwise, it looks like a good approach, but the memory consumption is immense for large term dicts. We are currently developing a DFA-based FuzzyQuery, which will be provided when the new flex branch gets out: LUCENE-2089

        If you fix the problems in your classes, we can add this patch to contrib.

        Fuad Efendi added a comment -

        Hi Uwe,

        Thanks for the analysis! I spent only a few days on this basic PoC.

        I need to use the IndexReader (index version number, etc.) also to rewarm the cache; if a term disappeared from the index we can still leave it in the BKTree (not a problem; can't remove!), and if we have a new term we simply need to call

        public void add(E term)

        Synchronization should be significantly improved...

        Cache warming takes 10-15 seconds in my environment, about 250k tokens, and I use a TreeSet internally for fast lookup. I also believe the main performance issue is related to the Levenshtein algorithm (which is significantly improved in trunk; plus synchronization is removed from FuzzySearch: LUCENE-2258)

        Regarding memory requirements: BKTree is not heavy... I should use StringHelper.intern(fld) - it's already in memory... and FuzzyTermEnum uses almost the same amount of memory for processing as BKTree. I'll check FieldCache.

        BKTree-approach can be significantly improved.

        Uwe Schindler added a comment -

        Hi Fuad,

        Ok, thanks for the explanation about the cache. But there should still be some eviction algorithm, or at least a WeakHashMap, so the JVM can clean up the cache for unused fields. My problem with IndexReaders missing from the cache logic was that if you reopen the index, the BKTree contains terms that are no longer available, and the FuzzyTermEnum enumerates terms that no longer exist. This is bad practice and should not be done in query rewrite. Enumerating terms with no relation to a real term dict is not really supported by MultiTermQuery, but it works for fuzzy because it does not use the low-level segment-based term enumeration and linkage to TermDocs.

        The new LUCENE-2258 issue needs no warmup, as it uses a different algorithm for the Levenshtein distance and also does not scan the whole term dict. By the way, in flex, RegEx and Wildcard queries are also sped up. The reason trunk does not yet use this automaton package is that the AutomatonTermsEnum needs to do lots of seeking in the TermsEnum, which is improved in flex, and we do not want to do the work twice.

        Flex has a completely different API on the enum side, so TermEnumerations work differently, and so on.

        Fuad Efendi added a comment - - edited

        Hi Uwe,

        I am trying to study LUCENE-2258 right now...

        BKTree contains terms no longer available

        BKTree contains objects, not terms; in my sample it contains Strings: new BKTree<String>(new Distance()). It is a structure for fast lookup of close objects in a set of objects, given a predefined distance algorithm.

        It won't hurt if a String appears in the BKTree structure while the corresponding Term has disappeared from the index; search results will be the same. Simply, a search for <DisappearedTerm> OR <AnotherTerm> is the same as a search for <AnotherTerm>.
        At least, we can run background thread which will create new BKTree instance, without hurting end users.

        Yes, Term<->String is another thing to do... I recreate fake terms in TermEnum...

        BKTree allows iterating over only about 5-10% of the whole structure to find the closest matches, but only if the distance threshold is small (2). If it is 4, there is almost no improvement. And the classic Levenshtein distance is slow...
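That pruning follows from the triangle inequality: any match within the threshold of the query must sit in a subtree whose edge label is within the threshold of the distance from the query to the current node. A minimal sketch of such a BK-tree (illustrative, not the attached BKTree.java), with the classic Levenshtein distance as the integer metric:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal BK-tree over Strings; each child edge is labeled with the
// distance from the parent's term to the child's term.
class BKTree {
    private final String term;
    private final Map<Integer, BKTree> children = new HashMap<Integer, BKTree>();

    BKTree(String root) { this.term = root; }

    void add(String s) {
        int d = distance(term, s);
        if (d == 0) return;                          // term already present
        BKTree child = children.get(d);
        if (child == null) children.put(d, new BKTree(s));
        else child.add(s);
    }

    // Collect all stored terms within 'threshold' of the query; by the
    // triangle inequality only edges labeled d-threshold..d+threshold
    // can lead to matches, so most subtrees are skipped.
    List<String> search(String query, int threshold) {
        List<String> out = new ArrayList<String>();
        int d = distance(term, query);
        if (d <= threshold) out.add(term);
        for (int i = d - threshold; i <= d + threshold; i++) {
            BKTree child = children.get(i);
            if (child != null) out.addAll(child.search(query, threshold));
        }
        return out;
    }

    // Classic two-row dynamic-programming Levenshtein distance.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = cur; cur = tmp;
        }
        return prev[b.length()];
    }
}
```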

        Edited: trying to study LUCENE-2089...

        Fuad Efendi added a comment -

        LUCENE-2089 - extremely good stuff (Lucene flex branch; applicable to wildcard queries, RegEx, and fuzzy search). BKTree improves performance if the distance is 2; otherwise it is almost a full term scan.
        Some links borrowed:
        http://en.wikipedia.org/wiki/Deterministic_finite-state_machine
        http://rcmuir.wordpress.com/2009/12/04/finite-state-queries-for-lucene/
        http://www.amazon.com/Algorithms-Strings-Trees-Sequences-Computational/dp/0521585198

        Ron Mayer added a comment -

        Could this or a similar technique improve searches for things that use both leading and trailing wildcards - like a search for 'ildcar'?

        Fuad Efendi added a comment -

        I believe this issue should be closed due to significant performance improvements related to LUCENE-2089 and LUCENE-2258.
        I don't think there is any interest from the community in continuing with this (BK-tree and "Strike a Match") naive approach, although some people found it useful. Of course, we might have a few more distance implementations as a separate improvement.

        Please close it.

        Thanks


          People

          • Assignee: Unassigned
          • Reporter: Fuad Efendi
          • Votes: 0
          • Watchers: 9

          Time Tracking

          • Original Estimate: 1m
          • Remaining Estimate: 1m
          • Time Spent: Not Specified