Issue Details (XML | Word | Printable)

Key: LUCENE-1183
Type: Improvement Improvement
Status: Closed Closed
Resolution: Fixed
Priority: Minor Minor
Assignee: Otis Gospodnetic
Reporter: Cédrik LIME
Votes: 0
Watchers: 1
Operations

If you were logged in you would be able to see more operations.
Lucene - Java

TRStringDistance uses way too much memory (with patch)

Created: 21/Feb/08 11:18 AM   Updated: Yesterday 04:47 PM
Return to search
Component/s: contrib/*
Affects Version/s: 1.9, 2.0.0, 2.1, 2.2, 2.3
Fix Version/s: 3.0

Time Tracking:
Original Estimate: 0.17h
Original Estimate - 0.17h
Remaining Estimate: 0.17h
Remaining Estimate - 0.17h
Time Spent: Not Specified
Remaining Estimate - 0.17h

File Attachments:
  Size
Text File Licensed for inclusion in ASF works FuzzyTermEnum.patch 2008-02-21 03:27 PM Cédrik LIME 7 kB
Java Source File Licensed for inclusion in ASF works TRStringDistance.java 2008-02-21 11:22 AM Cédrik LIME 4 kB
Text File Licensed for inclusion in ASF works TRStringDistance.patch 2008-02-21 11:22 AM Cédrik LIME 6 kB
Issue Links:
Reference

Lucene Fields: Patch Available, New
Resolution Date: 20/Oct/09 09:23 PM


 Description  « Hide
The implementation of TRStringDistance is based on version 2.1 of org.apache.commons.lang.StringUtils#getLevenshteinDistance(String, String), which uses an un-optimized implementation of the Levenshtein Distance algorithm (it uses way too much memory). Please see Bug 38911 (http://issues.apache.org/bugzilla/show_bug.cgi?id=38911) for more information.

The commons-lang implementation has been heavily optimized as of version 2.2 (3x speed-up). I have reported the new implementation to TRStringDistance.



 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Cédrik LIME added a comment - 21/Feb/08 11:22 AM
new TRStringDistance implementation, as patch and as complete source file.

Grant Ingersoll added a comment - 21/Feb/08 01:38 PM
It occurs to me that we apparently have two different implementations of Levenshtein, one in spellchecker and one for FuzzyQuery. I haven't analyzed them individually to know for sure, but if this is a much better implementation, then we should think about using it for FuzzyQuery, too.

The FuzzyQuery (FuzzyTermEnum) version claims to have a fast-fail mechanism, too:

<p>Embedded within this algorithm is a fail-fast Levenshtein distance

  • algorithm. The fail-fast algorithm differs from the standard Levenshtein
  • distance algorithm in that it is aborted if it is discovered that the
  • mimimum distance between the words is greater than some threshold.
    *
  • <p>

Cedrik, since you seem to know about these things, would you have time to look at FuzzyTermEnum? A 3x speedup there would be great for users, too.


Cédrik LIME added a comment - 21/Feb/08 01:57 PM
You caught me while I was finalizing a patch for FuzzyTermEnum... I'll add it to this bug report in the next few minutes.

Cédrik LIME added a comment - 21/Feb/08 01:59 PM
Patch for FuzzyTermEnum.

Karl Wettin added a comment - 21/Feb/08 02:40 PM
Also see LUCENE-691 if you are looking at optimizing FuzzyTermEnum.

Cédrik LIME added a comment - 21/Feb/08 03:05 PM
Well spotted Karl! My version is very similar to LUCENE-691, except I kept some smallish optimisations out the the sake of readability. I'll incorporate some of his changes/ideas and publish a new patch.
Can someone link those 2 issues together in the meantime? (There are too many options in the drop-down; don't know which one to choose.)

Cédrik LIME added a comment - 21/Feb/08 03:27 PM
New patch for FuzzyTermEnum, incorporating most of LUCENE-691.

Karl Wettin added a comment - 21/Feb/08 03:39 PM
on request by issue reporter

Otis Gospodnetic added a comment - 22/May/08 06:29 AM
Committed the TRStringDistance patch – thank you!

Committed revision 659016.

I'll leave the FuzzyTermEnum patch for a later date. Is there anything in Bob's FuzzyTermEnum that is not in this patch? Anything that you'd want to add, Cédrik?


Cédrik LIME added a comment - 26/May/08 09:39 AM
All of Bob's FuzzyTermEnum patch is in my patch. I only left some smallish optimizations that didn't bring much but did hurt code readability. In other words, should you commit my patch, you will have most of (99.9%) LUCENE-691.
I think this is an important patch for Lucene 2.4, as it brings vast performance improvements in fuzzy search (no hard numbers, sorry).

Cédrik LIME added a comment - 20/Oct/09 01:17 PM
Any news on the landing of this patch?
Now that Lucene 2.9 is out, the vastly better memory usage and speed-up would be a welcome addition to Lucene 3.0's fuzzy search!

Michael McCandless added a comment - 20/Oct/09 02:56 PM
Cédrik, could you update the patch to trunk? It sounds like a compelling improvement. We should get it in.

Cédrik LIME added a comment - 20/Oct/09 08:21 PM
Thanks Michael.
FuzzyTermEnum.java has not changed for more than 2 years. The uploaded patch (FuzzyTermEnum.patch) is still valid for trunk.

Michael McCandless added a comment - 20/Oct/09 08:50 PM
OK I had 2 hunks fail but I managed to apply them.

Michael McCandless added a comment - 20/Oct/09 09:23 PM
Thanks Cédrik!