-
Type: Improvement
-
Status: Open
-
Priority: Major
-
Resolution: Unresolved
-
Affects Version/s: 3.0
-
Fix Version/s: None
-
Component/s: core/search
-
Labels: None
-
Environment:
Lucene currently uses a brute-force scan over all terms and calculates the distance for each term. The new BKTree structure improves performance on average 20 times when the distance is 1, and 3 times when the distance is 3. I tested with an index of several million docs and 250,000 terms. The new algorithm uses integer distances between objects.
-
Lucene Fields: New, Patch Available
W. Burkhard and R. Keller. Some approaches to best-match file searching, CACM, 1973
http://portal.acm.org/citation.cfm?doid=362003.362025
I was inspired by http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees (Nick Johnson, Google).
Additionally, the simplified algorithm at http://www.catalysoft.com/articles/StrikeAMatch.html seems to be much more logically correct than Levenshtein distance, and it is 3-5 times faster (isolated tests).
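For readers unfamiliar with the linked article, here is a minimal sketch of the letter-pair comparison it describes: the score is twice the number of shared adjacent letter pairs divided by the total number of pairs in both strings. The class and method names (StrikeAMatchSketch, letterPairs, similarity) are made up for illustration and are not taken from the patch. Note that this yields a similarity in [0, 1], not an integer distance, so using it with a BK-tree would need an extra conversion step that is not shown here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the letter-pair similarity; not code from the patch.
class StrikeAMatchSketch {

    /** Collects adjacent letter pairs from each whitespace-separated word, upper-cased. */
    static List<String> letterPairs(String text) {
        List<String> pairs = new ArrayList<>();
        for (String word : text.toUpperCase().split("\\s+")) {
            for (int i = 0; i + 1 < word.length(); i++) {
                pairs.add(word.substring(i, i + 2));
            }
        }
        return pairs;
    }

    /** Returns 2 * |shared pairs| / (|pairs(a)| + |pairs(b)|), or 0 if there are no pairs. */
    static double similarity(String a, String b) {
        List<String> pairsA = letterPairs(a);
        List<String> pairsB = letterPairs(b);
        int total = pairsA.size() + pairsB.size();
        if (total == 0) {
            return 0.0;
        }
        int shared = 0;
        for (String pair : pairsA) {
            // Remove each matched pair so repeated pairs are not counted twice.
            if (pairsB.remove(pair)) {
                shared++;
            }
        }
        return 2.0 * shared / total;
    }
}
```

For example, "France" and "French" share the pairs FR and NC out of five pairs each, so similarity("France", "French") works out to 2 * 2 / 10.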
Big list of distance implementations:
http://www.dcs.shef.ac.uk/~sam/stringmetrics.htm
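To make the BK-tree idea from the description concrete, here is a minimal sketch, assuming Levenshtein edit distance as the integer metric. It is not the attached patch and does not use any Lucene API; the names BKTreeSketch, add, and search are hypothetical. Each node keeps its children keyed by their distance to that node, and a search with tolerance maxDist only descends into children whose key lies within [d - maxDist, d + maxDist], where d is the distance from the query to the current node. The triangle inequality guarantees that no match is skipped, which is how most terms are avoided compared with a full scan.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of a BK-tree over strings; not code from the patch.
class BKTreeSketch {

    /** Two-row dynamic-programming Levenshtein edit distance. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    private static final class Node {
        final String term;
        // Children keyed by their integer distance to this node's term.
        final Map<Integer, Node> children = new HashMap<>();

        Node(String term) {
            this.term = term;
        }
    }

    private Node root;

    /** Inserts a term by following the edge labelled with its distance at each level. */
    void add(String term) {
        if (root == null) {
            root = new Node(term);
            return;
        }
        Node node = root;
        while (true) {
            int d = levenshtein(term, node.term);
            if (d == 0) {
                return; // term already present
            }
            Node child = node.children.get(d);
            if (child == null) {
                node.children.put(d, new Node(term));
                return;
            }
            node = child;
        }
    }

    /** Returns all stored terms within maxDist edits of the query. */
    List<String> search(String query, int maxDist) {
        List<String> hits = new ArrayList<>();
        if (root == null) {
            return hits;
        }
        Deque<Node> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            Node node = stack.pop();
            int d = levenshtein(query, node.term);
            if (d <= maxDist) {
                hits.add(node.term);
            }
            // Triangle inequality: only subtrees with key in [d - maxDist, d + maxDist] can match.
            for (Map.Entry<Integer, Node> e : node.children.entrySet()) {
                int key = e.getKey();
                if (key >= d - maxDist && key <= d + maxDist) {
                    stack.push(e.getValue());
                }
            }
        }
        return hits;
    }
}
```

A caller would build the tree once from the term dictionary with add and then call search(query, 1) or search(query, 3) per fuzzy query. The smaller the tolerance, the narrower the key window and the more subtrees are pruned, which is consistent with the larger speedup reported for distance 1 than for distance 3.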