Lucene - Core / LUCENE-296

[PATCH] FuzzyTermEnum optimization and refactor

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: core/search
    • Labels:
      None
    • Environment:

      Operating System: All
      Platform: All

    • Bugzilla Id:
      31882

      Description

      I took a look at FuzzyTermEnum to see if it could be improved. I saw speed
      improvements of 20% - 40% by making a couple of changes.

      The patch is here: http://www.hagerfamily.com/patches/FuzzyTermEnumOptimizePatch.txt

      The patch is based on the HEAD of the CVS tree as of Oct 22, 2004.

      What Changed?

      Since a word was discarded whenever its edit distance was above a certain
      threshold, I updated the distance algorithm to abort as soon as it can
      determine, at any point during the calculation, that the best possible
      final edit distance will still exceed this threshold. The source code has
      a detailed explanation.
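The abort can be sketched as follows (a minimal standalone sketch, not the code from the patch; the method name, the two-row matrix layout, and the -1 sentinel are all illustrative). It relies on the fact that the minimum value in each row of the Levenshtein matrix never decreases from one row to the next, so once an entire row exceeds the threshold the word can be rejected immediately:

```java
public class EarlyAbortDistance {
    // Returns the Levenshtein distance between s and t, or -1 as soon as it is
    // certain the distance must exceed maxDistance (the early abort).
    static int distance(String s, String t, int maxDistance) {
        int n = s.length(), m = t.length();
        // Best case: the distance is at least the difference in lengths.
        if (Math.abs(n - m) > maxDistance) return -1;
        int[] prev = new int[m + 1];
        int[] curr = new int[m + 1];
        for (int j = 0; j <= m; j++) prev[j] = j;
        for (int i = 1; i <= n; i++) {
            curr[0] = i;
            int rowMin = curr[0];
            for (int j = 1; j <= m; j++) {
                int cost = (s.charAt(i - 1) == t.charAt(j - 1)) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
                rowMin = Math.min(rowMin, curr[j]);
            }
            // Row minima are non-decreasing, so if even the best cell in this
            // row is over the threshold, no later row can recover: abort.
            if (rowMin > maxDistance) return -1;
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[m] <= maxDistance ? prev[m] : -1;
    }
}
```

The payoff grows with word length, since long words that diverge early are rejected after only a few rows instead of filling the whole matrix.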

      I also reduced the amount of floating point math, reduced the potential
      size of the array's first dimension, removed a potential divide-by-zero
      error when one term is an empty string, and fixed a bug where an
      IllegalArgumentException was thrown only after the class had been
      initialized wrong, instead of when the arguments are first checked.
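The empty-term case can be illustrated with a hypothetical sketch (the class and method names are invented; the formula 1 - distance / min(|a|, |b|) is the one discussed in the comments on this issue). Without a guard, an empty term makes the denominator zero:

```java
public class EmptyTermGuard {
    // Hypothetical guard: with similarity defined as
    // 1 - distance / min(|a|, |b|), an empty term would divide by zero,
    // so that case is handled up front. The edit distance from "" to s
    // is simply |s|; two empty strings are identical.
    static float similarity(String a, String b) {
        if (a.isEmpty() || b.isEmpty()) {
            return (a.length() == b.length()) ? 1.0f : 0.0f;
        }
        int d = levenshtein(a, b);
        return 1.0f - (float) d / Math.min(a.length(), b.length());
    }

    // Plain two-row Levenshtein distance, no early abort.
    static int levenshtein(String s, String t) {
        int[] prev = new int[t.length() + 1];
        int[] curr = new int[t.length() + 1];
        for (int j = 0; j < prev.length; j++) prev[j] = j;
        for (int i = 1; i <= s.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= t.length(); j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[t.length()];
    }
}
```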

      The behavior is almost identical. The one exception is that the
      similarity is set to 0.0 as soon as it is guaranteed to fall below the
      minimum similarity.

      Results

      I saw the biggest improvement for longer words, which makes sense. My
      long word was "bridgetown", and I saw a 60% improvement on it. The gains
      are largest for words farthest from the median word length in the index:
      short words (1-3 characters) saw a 30% improvement, and medium words
      (5-7 characters) saw a 10% improvement. These numbers are with the
      prefix length set to 0.

        Activity

        daniel.naber@t-online.de Daniel Naber added a comment -

        Thanks, your patch has now been applied. I didn't do anything about that
        "similarity can be less than 0" issue. If anybody considers that a problem,
        please let us know and open a separate issue.

        goller@detego-software.de Christoph Goller added a comment -

        The following observation should not get lost:

        In his original mail Jonathan also mentioned that currently similarity
        may be below 0.0f. A possible fix, if this should be fixed, would be to relate
        the Levenstein distance to the maximum of the two length values instead of the
        minimum.
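        The arithmetic behind this observation can be sketched as follows (hypothetical class and method names, for illustration only). Since the Levenshtein distance can exceed the shorter term's length but never the longer one's, dividing by the minimum can push the similarity below zero, while dividing by the maximum cannot:

```java
public class SimilaritySketch {
    // Similarity as currently computed: relate distance to the SHORTER term.
    // The distance can exceed min(|a|, |b|), so this can go below 0.0f.
    static float similarityMin(int distance, int lenA, int lenB) {
        return 1.0f - (float) distance / Math.min(lenA, lenB);
    }

    // Proposed fix: relate distance to the LONGER term. Since the distance
    // is at most max(|a|, |b|), this result is always in [0, 1].
    static float similarityMax(int distance, int lenA, int lenB) {
        return 1.0f - (float) distance / Math.max(lenA, lenB);
    }
}
```

For example, comparing "ab" with "abcdef" gives a distance of 4: dividing by the minimum length 2 yields 1 - 4/2 = -1.0, while dividing by the maximum length 6 yields 1 - 4/6, roughly 0.33.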

        jhager@gmail.com Jonathan Hager added a comment -

        The "synchronized" was added to similarity() because it reuses the
        d[][] array across calls. If similarity() were called on the SAME
        FuzzyTermEnum object by two different threads, the results could not be
        guaranteed. Given how Lucene uses FuzzyTermEnum, this will never
        happen: Lucene currently creates the enum and lets it be garbage
        collected within a single method. This is more defensive programming
        than anything else.

        TYPICAL_LONGEST_WORD_IN_INDEX is used in two different contexts. The
        first is initializing d[][] to something smarter than [x][1]; this
        should save very little time, as d[][] quickly grows to the
        appropriate size anyway. The second is creating a lookup table for the
        max distance, so that it does not have to be recalculated for every
        word — the max distance is a constant for all words of the same
        length. I also saw only marginal gain from this change.

        I went back and estimated the actual amount of time saved by this
        change in isolation. On a modern machine, it is less than 200ms for 10
        million words (see the test below).

        public class Scratch {
            private static final int TYPICAL_LONGEST_WORD_IN_INDEX = 19;
            private static final float[] TYPICAL_WORD_LENGTH_DISTRIBUTION = {
                0, 0.01f, 0.015f, 0.025f, 0.05f, 0.10f, 0.125f, 0.15f, 0.155f,
                0.12f, 0.10f, 0.07f, 0.04f, 0.015f, 0.010f, 0.005f, 0.005f, 0.005f
            };

            private final int[] maxDistances = new int[TYPICAL_LONGEST_WORD_IN_INDEX];

            public Scratch() {
                for (int i = 0; i < maxDistances.length; i++) {
                    maxDistances[i] = calculateMaxDistance(i);
                }
            }

            private final int getMaxDistance(int m) {
                return (m < maxDistances.length) ? maxDistances[m] : calculateMaxDistance(m);
            }

            private int calculateMaxDistance(int m) {
                return (int) ((1 - 0.5) * (Math.min(7, m) + 0));
            }

            public static void main(String[] args) {
                long start = System.currentTimeMillis();
                Scratch s = new Scratch();
                final int totalNumberOfRuns = 10000000;
                for (int i = 0; i < TYPICAL_WORD_LENGTH_DISTRIBUTION.length; i++) {
                    float v = TYPICAL_WORD_LENGTH_DISTRIBUTION[i];
                    final int runsForASingleLength = (int) (v * totalNumberOfRuns);
                    //System.out.println(runsForASingleLength);
                    for (int j = 0; j < runsForASingleLength; j++) {
                        s.getMaxDistance(i);
                    }
                }
                long total = System.currentTimeMillis() - start;
                System.out.println("Time to lookup the records (in ms): " + total);

                long start2 = System.currentTimeMillis();
                for (int i = 0; i < TYPICAL_WORD_LENGTH_DISTRIBUTION.length; i++) {
                    float v = TYPICAL_WORD_LENGTH_DISTRIBUTION[i];
                    final int runsForASingleLength = (int) (v * totalNumberOfRuns);
                    for (int j = 0; j < runsForASingleLength; j++) {
                        s.calculateMaxDistance(i);
                    }
                }
                long total2 = System.currentTimeMillis() - start2;
                System.out.println("Time to recalculate the records (in ms): " + total2);
            }
        }

        daniel.naber@t-online.de Daniel Naber added a comment -

        The "synchronized" on the similarity() method should not be necessary,
        should it?

        I tried modifying the value of TYPICAL_LONGEST_WORD_IN_INDEX (e.g. to 1 or to
        100) and I didn't see any real difference. Are you sure this particular
        optimization is useful?

        As I mentioned, the speedup of all changes together is 20-40% in my short
        tests, so I'm sure this patch will be committed sooner or later.


          People

          • Assignee:
            java-dev@lucene.apache.org Lucene Developers
            Reporter:
            jhager@gmail.com Jonathan Hager
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:
