Lucene - Core
  1. Lucene - Core
  2. LUCENE-1183

TRStringDistance uses way too much memory (with patch)

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 1.9, 2.0.0, 2.1, 2.2, 2.3
    • Fix Version/s: 3.0
    • Component/s: modules/other
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      The implementation of TRStringDistance is based on version 2.1 of org.apache.commons.lang.StringUtils#getLevenshteinDistance(String, String), which uses an un-optimized implementation of the Levenshtein Distance algorithm (it uses way too much memory). Please see Bug 38911 (http://issues.apache.org/bugzilla/show_bug.cgi?id=38911) for more information.

      The commons-lang implementation has been heavily optimized as of version 2.2 (3x speed-up). I have reported the new implementation to TRStringDistance.

      1. TRStringDistance.patch
        6 kB
        Cédrik LIME
      2. TRStringDistance.java
        4 kB
        Cédrik LIME
      3. LUCENE-1183.patch
        7 kB
        Robert Muir
      4. LUCENE-1183_3x.patch
        2 kB
        Robert Muir
      5. FuzzyTermEnum.patch
        7 kB
        Cédrik LIME

        Issue Links

          Activity

          Hide
          Cédrik LIME added a comment -

          new TRStringDistance implementation, as patch and as complete source file.

          Show
          Cédrik LIME added a comment - new TRStringDistance implementation, as patch and as complete source file.
          Hide
          Grant Ingersoll added a comment -

          It occurs to me that we apparently have two different implementations of Levenshtein, one in spellchecker and one for FuzzyQuery. I haven't analyzed them individually to know for sure, but if this is a much better implementation, then we should think about using it for FuzzyQuery, too.

          The FuzzyQuery (FuzzyTermEnum) version claims to have a fast-fail mechanism, too:

          <p>Embedded within this algorithm is a fail-fast Levenshtein distance

          • algorithm. The fail-fast algorithm differs from the standard Levenshtein
          • distance algorithm in that it is aborted if it is discovered that the
          • mimimum distance between the words is greater than some threshold.
            *
          • <p>

          Cedrik, since you seem to know about these things, would you have time to look at FuzzyTermEnum? A 3x speedup there would be great for users, too.

          Show
          Grant Ingersoll added a comment - It occurs to me that we apparently have two different implementations of Levenshtein, one in spellchecker and one for FuzzyQuery. I haven't analyzed them individually to know for sure, but if this is a much better implementation, then we should think about using it for FuzzyQuery, too. The FuzzyQuery (FuzzyTermEnum) version claims to have a fast-fail mechanism, too: <p>Embedded within this algorithm is a fail-fast Levenshtein distance algorithm. The fail-fast algorithm differs from the standard Levenshtein distance algorithm in that it is aborted if it is discovered that the mimimum distance between the words is greater than some threshold. * <p> Cedrik, since you seem to know about these things, would you have time to look at FuzzyTermEnum? A 3x speedup there would be great for users, too.
          Hide
          Cédrik LIME added a comment -

          You caught me while I was finalizing a patch for FuzzyTermEnum... I'll add it to this bug report in the next few minutes.

          Show
          Cédrik LIME added a comment - You caught me while I was finalizing a patch for FuzzyTermEnum... I'll add it to this bug report in the next few minutes.
          Hide
          Cédrik LIME added a comment -

          Patch for FuzzyTermEnum.

          Show
          Cédrik LIME added a comment - Patch for FuzzyTermEnum.
          Hide
          Karl Wettin added a comment -

          Also see LUCENE-691 if you are looking at optimizing FuzzyTermEnum.

          Show
          Karl Wettin added a comment - Also see LUCENE-691 if you are looking at optimizing FuzzyTermEnum.
          Hide
          Cédrik LIME added a comment -

          Well spotted Karl! My version is very similar to LUCENE-691, except I kept some smallish optimisations out the the sake of readability. I'll incorporate some of his changes/ideas and publish a new patch.
          Can someone link those 2 issues together in the meantime? (There are too many options in the drop-down; don't know which one to choose.)

          Show
          Cédrik LIME added a comment - Well spotted Karl! My version is very similar to LUCENE-691 , except I kept some smallish optimisations out the the sake of readability. I'll incorporate some of his changes/ideas and publish a new patch. Can someone link those 2 issues together in the meantime? (There are too many options in the drop-down; don't know which one to choose.)
          Hide
          Cédrik LIME added a comment -

          New patch for FuzzyTermEnum, incorporating most of LUCENE-691.

          Show
          Cédrik LIME added a comment - New patch for FuzzyTermEnum, incorporating most of LUCENE-691 .
          Hide
          Karl Wettin added a comment -

          on request by issue reporter

          Show
          Karl Wettin added a comment - on request by issue reporter
          Hide
          Otis Gospodnetic added a comment -

          Committed the TRStringDistance patch – thank you!

          Committed revision 659016.

          I'll leave the FuzzyTermEnum patch for a later date. Is there anything in Bob's FuzzyTermEnum that is not in this patch? Anything that you'd want to add, Cédrik?

          Show
          Otis Gospodnetic added a comment - Committed the TRStringDistance patch – thank you! Committed revision 659016. I'll leave the FuzzyTermEnum patch for a later date. Is there anything in Bob's FuzzyTermEnum that is not in this patch? Anything that you'd want to add, Cédrik?
          Hide
          Cédrik LIME added a comment -

          All of Bob's FuzzyTermEnum patch is in my patch. I only left some smallish optimizations that didn't bring much but did hurt code readability. In other words, should you commit my patch, you will have most of (99.9%) LUCENE-691.
          I think this is an important patch for Lucene 2.4, as it brings vast performance improvements in fuzzy search (no hard numbers, sorry).

          Show
          Cédrik LIME added a comment - All of Bob's FuzzyTermEnum patch is in my patch. I only left some smallish optimizations that didn't bring much but did hurt code readability. In other words, should you commit my patch, you will have most of (99.9%) LUCENE-691 . I think this is an important patch for Lucene 2.4, as it brings vast performance improvements in fuzzy search (no hard numbers, sorry).
          Hide
          Cédrik LIME added a comment -

          Any news on the landing of this patch?
          Now that Lucene 2.9 is out, the vastly better memory usage and speed-up would be a welcome addition to Lucene 3.0's fuzzy search!

          Show
          Cédrik LIME added a comment - Any news on the landing of this patch? Now that Lucene 2.9 is out, the vastly better memory usage and speed-up would be a welcome addition to Lucene 3.0's fuzzy search!
          Hide
          Michael McCandless added a comment -

          Cédrik, could you update the patch to trunk? It sounds like a compelling improvement. We should get it in.

          Show
          Michael McCandless added a comment - Cédrik, could you update the patch to trunk? It sounds like a compelling improvement. We should get it in.
          Hide
          Cédrik LIME added a comment -

          Thanks Michael.
          FuzzyTermEnum.java has not changed for more than 2 years. The uploaded patch (FuzzyTermEnum.patch) is still valid for trunk.

          Show
          Cédrik LIME added a comment - Thanks Michael. FuzzyTermEnum.java has not changed for more than 2 years. The uploaded patch (FuzzyTermEnum.patch) is still valid for trunk.
          Hide
          Michael McCandless added a comment -

          OK I had 2 hunks fail but I managed to apply them.

          Show
          Michael McCandless added a comment - OK I had 2 hunks fail but I managed to apply them.
          Hide
          Michael McCandless added a comment -

          Thanks Cédrik!

          Show
          Michael McCandless added a comment - Thanks Cédrik!
          Hide
          Robert Muir added a comment -

          these improvements only made it into fuzzyTERMenum but somehow didnt make it into fuzzyTERMSenum on flex branch, and aren't in trunk!

          Show
          Robert Muir added a comment - these improvements only made it into fuzzyTERMenum but somehow didnt make it into fuzzyTERMSenum on flex branch, and aren't in trunk!
          Hide
          Robert Muir added a comment -

          here is the patch for FuzzyTermsEnum in trunk

          Show
          Robert Muir added a comment - here is the patch for FuzzyTermsEnum in trunk
          Hide
          Robert Muir added a comment -

          while I'm here, here is a little patch for 3x_branch to eliminate some useless charAt bounds checks.

          Show
          Robert Muir added a comment - while I'm here, here is a little patch for 3x_branch to eliminate some useless charAt bounds checks.
          Hide
          Robert Muir added a comment -

          Committed revision 983298 (trunk) and small optimizations 983299 (3x)

          Show
          Robert Muir added a comment - Committed revision 983298 (trunk) and small optimizations 983299 (3x)

            People

            • Assignee:
              Robert Muir
              Reporter:
              Cédrik LIME
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Time Tracking

                Estimated:
                Original Estimate - 10m
                10m
                Remaining:
                Remaining Estimate - 10m
                10m
                Logged:
                Time Spent - Not Specified
                Not Specified

                  Development