Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-1183

TRStringDistance uses way too much memory (with patch)

Details

    • Improvement
    • Status: Closed
    • Minor
    • Resolution: Fixed
    • 1.9, 2.0.0, 2.1, 2.2, 2.3
    • 3.0
    • modules/other
    • None
    • New, Patch Available

    Description

      The implementation of TRStringDistance is based on version 2.1 of org.apache.commons.lang.StringUtils#getLevenshteinDistance(String, String), which uses an un-optimized implementation of the Levenshtein Distance algorithm (it uses way too much memory). Please see Bug 38911 (http://issues.apache.org/bugzilla/show_bug.cgi?id=38911) for more information.

      The commons-lang implementation has been heavily optimized as of version 2.2 (3x speed-up). I have reported the new implementation to TRStringDistance.

      Attachments

        1. TRStringDistance.patch
          6 kB
          CĂ©drik LIME
        2. TRStringDistance.java
          4 kB
          CĂ©drik LIME
        3. FuzzyTermEnum.patch
          7 kB
          CĂ©drik LIME
        4. LUCENE-1183.patch
          7 kB
          Robert Muir
        5. LUCENE-1183_3x.patch
          2 kB
          Robert Muir

        Issue Links

          Activity

            cedrik.lime@gmail.com CĂ©drik LIME added a comment -

            new TRStringDistance implementation, as patch and as complete source file.

            cedrik.lime@gmail.com CĂ©drik LIME added a comment - new TRStringDistance implementation, as patch and as complete source file.

            It occurs to me that we apparently have two different implementations of Levenshtein, one in spellchecker and one for FuzzyQuery. I haven't analyzed them individually to know for sure, but if this is a much better implementation, then we should think about using it for FuzzyQuery, too.

            The FuzzyQuery (FuzzyTermEnum) version claims to have a fast-fail mechanism, too:

            <p>Embedded within this algorithm is a fail-fast Levenshtein distance

            • algorithm. The fail-fast algorithm differs from the standard Levenshtein
            • distance algorithm in that it is aborted if it is discovered that the
            • mimimum distance between the words is greater than some threshold.
              *
            • <p>

            Cedrik, since you seem to know about these things, would you have time to look at FuzzyTermEnum? A 3x speedup there would be great for users, too.

            gsingers Grant Ingersoll added a comment - It occurs to me that we apparently have two different implementations of Levenshtein, one in spellchecker and one for FuzzyQuery. I haven't analyzed them individually to know for sure, but if this is a much better implementation, then we should think about using it for FuzzyQuery, too. The FuzzyQuery (FuzzyTermEnum) version claims to have a fast-fail mechanism, too: <p>Embedded within this algorithm is a fail-fast Levenshtein distance algorithm. The fail-fast algorithm differs from the standard Levenshtein distance algorithm in that it is aborted if it is discovered that the mimimum distance between the words is greater than some threshold. * <p> Cedrik, since you seem to know about these things, would you have time to look at FuzzyTermEnum? A 3x speedup there would be great for users, too.
            cedrik.lime@gmail.com CĂ©drik LIME added a comment -

            You caught me while I was finalizing a patch for FuzzyTermEnum... I'll add it to this bug report in the next few minutes.

            cedrik.lime@gmail.com CĂ©drik LIME added a comment - You caught me while I was finalizing a patch for FuzzyTermEnum... I'll add it to this bug report in the next few minutes.
            cedrik.lime@gmail.com CĂ©drik LIME added a comment -

            Patch for FuzzyTermEnum.

            cedrik.lime@gmail.com CĂ©drik LIME added a comment - Patch for FuzzyTermEnum.
            karl.wettin Karl Wettin added a comment -

            Also see LUCENE-691 if you are looking at optimizing FuzzyTermEnum.

            karl.wettin Karl Wettin added a comment - Also see LUCENE-691 if you are looking at optimizing FuzzyTermEnum.
            cedrik.lime@gmail.com CĂ©drik LIME added a comment -

            Well spotted Karl! My version is very similar to LUCENE-691, except I kept some smallish optimisations out the the sake of readability. I'll incorporate some of his changes/ideas and publish a new patch.
            Can someone link those 2 issues together in the meantime? (There are too many options in the drop-down; don't know which one to choose.)

            cedrik.lime@gmail.com CĂ©drik LIME added a comment - Well spotted Karl! My version is very similar to LUCENE-691 , except I kept some smallish optimisations out the the sake of readability. I'll incorporate some of his changes/ideas and publish a new patch. Can someone link those 2 issues together in the meantime? (There are too many options in the drop-down; don't know which one to choose.)
            cedrik.lime@gmail.com CĂ©drik LIME added a comment -

            New patch for FuzzyTermEnum, incorporating most of LUCENE-691.

            cedrik.lime@gmail.com CĂ©drik LIME added a comment - New patch for FuzzyTermEnum, incorporating most of LUCENE-691 .
            karl.wettin Karl Wettin added a comment -

            on request by issue reporter

            karl.wettin Karl Wettin added a comment - on request by issue reporter

            Committed the TRStringDistance patch – thank you!

            Committed revision 659016.

            I'll leave the FuzzyTermEnum patch for a later date. Is there anything in Bob's FuzzyTermEnum that is not in this patch? Anything that you'd want to add, CĂ©drik?

            otis Otis Gospodnetic added a comment - Committed the TRStringDistance patch – thank you! Committed revision 659016. I'll leave the FuzzyTermEnum patch for a later date. Is there anything in Bob's FuzzyTermEnum that is not in this patch? Anything that you'd want to add, Cédrik?
            cedrik.lime@gmail.com CĂ©drik LIME added a comment -

            All of Bob's FuzzyTermEnum patch is in my patch. I only left some smallish optimizations that didn't bring much but did hurt code readability. In other words, should you commit my patch, you will have most of (99.9%) LUCENE-691.
            I think this is an important patch for Lucene 2.4, as it brings vast performance improvements in fuzzy search (no hard numbers, sorry).

            cedrik.lime@gmail.com CĂ©drik LIME added a comment - All of Bob's FuzzyTermEnum patch is in my patch. I only left some smallish optimizations that didn't bring much but did hurt code readability. In other words, should you commit my patch, you will have most of (99.9%) LUCENE-691 . I think this is an important patch for Lucene 2.4, as it brings vast performance improvements in fuzzy search (no hard numbers, sorry).
            cedrik_lime CĂ©drik LIME added a comment -

            Any news on the landing of this patch?
            Now that Lucene 2.9 is out, the vastly better memory usage and speed-up would be a welcome addition to Lucene 3.0's fuzzy search!

            cedrik_lime CĂ©drik LIME added a comment - Any news on the landing of this patch? Now that Lucene 2.9 is out, the vastly better memory usage and speed-up would be a welcome addition to Lucene 3.0's fuzzy search!

            CĂ©drik, could you update the patch to trunk? It sounds like a compelling improvement. We should get it in.

            mikemccand Michael McCandless added a comment - CĂ©drik, could you update the patch to trunk? It sounds like a compelling improvement. We should get it in.
            cedrik_lime CĂ©drik LIME added a comment -

            Thanks Michael.
            FuzzyTermEnum.java has not changed for more than 2 years. The uploaded patch (FuzzyTermEnum.patch) is still valid for trunk.

            cedrik_lime CĂ©drik LIME added a comment - Thanks Michael. FuzzyTermEnum.java has not changed for more than 2 years. The uploaded patch (FuzzyTermEnum.patch) is still valid for trunk.

            OK I had 2 hunks fail but I managed to apply them.

            mikemccand Michael McCandless added a comment - OK I had 2 hunks fail but I managed to apply them.

            Thanks CĂ©drik!

            mikemccand Michael McCandless added a comment - Thanks CĂ©drik!
            rcmuir Robert Muir added a comment -

            these improvements only made it into fuzzyTERMenum but somehow didnt make it into fuzzyTERMSenum on flex branch, and aren't in trunk!

            rcmuir Robert Muir added a comment - these improvements only made it into fuzzyTERMenum but somehow didnt make it into fuzzyTERMSenum on flex branch, and aren't in trunk!
            rcmuir Robert Muir added a comment -

            here is the patch for FuzzyTermsEnum in trunk

            rcmuir Robert Muir added a comment - here is the patch for FuzzyTermsEnum in trunk
            rcmuir Robert Muir added a comment -

            while I'm here, here is a little patch for 3x_branch to eliminate some useless charAt bounds checks.

            rcmuir Robert Muir added a comment - while I'm here, here is a little patch for 3x_branch to eliminate some useless charAt bounds checks.
            rcmuir Robert Muir added a comment -

            Committed revision 983298 (trunk) and small optimizations 983299 (3x)

            rcmuir Robert Muir added a comment - Committed revision 983298 (trunk) and small optimizations 983299 (3x)
            tomoko Tomoko Uchida added a comment -

            This issue was moved to GitHub issue: #2260.

            tomoko Tomoko Uchida added a comment - This issue was moved to GitHub issue: #2260 .

            People

              rcmuir Robert Muir
              cedrik.lime@gmail.com CĂ©drik LIME
              Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - 10m
                  10m
                  Remaining:
                  Remaining Estimate - 10m
                  10m
                  Logged:
                  Time Spent - Not Specified
                  Not Specified