[LUCENE-691] Bob Carpenter's FuzzyTermEnum refactoring - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Duplicate
Affects Version/s: None
Fix Version/s: None
Component/s: core/search
Labels:
None

Lucene Fields:

New

Description

I'll just paste Bob's complete email here.

I refactored the org.apache.lucene.search.FuzzyTermEnum
edit distance implementation. It now only uses a single
pair of arrays, and those never get resized. That required
turning the order of text/target around in the loops. You'll
see that with the pair of arrays method, they get re-used
hand-over-hand, and are assigned to local variables in the
tight loops.

I removed the calculation of min distance and replaced
it with a boolean – the min wasn't needed, only the test vs.
the max. I also flipped some variables around so there's
one less addition in the very inner loop and the arrays are
now looping in the ordinary way (starting at 0 with a < comparison).
I also commented out the redundant definition of the public close()
[which just called super.close() and had none of its own doc.]
I also just compute the max distance each time rather than
fiddling with an array – it's just a little arithmetic done once
per term, but that could be put back.

I also rewrote min(int,int,int) to get rid of intermediate
assignments. Is there a lib for this kind of thing?

An intermediate refactoring that does the hand-over-hand
with the existing array and resizing strategy is in
FuzzyTermEnum.intermed.java.

I ran the unit tests as follows on both versions (my hat's off to the
build.xml author(s) – this all just worked out of the box and was
really easy to follow the first through):

C:\java\lucene-2.0.0>ant -Djunit.includes="" -Dtestcase=TestFuzzyQuery test
Buildfile: build.xml
javacc-uptodate-check:
javacc-notice:
init:
common.compile-core:
[javac] Compiling 1 source file to
C:\java\lucene-2.0.0\build\classes\java
compile-core:
compile-demo:
common.compile-test:
compile-test:
test:
[junit] Testsuite: org.apache.lucene.search.TestFuzzyQuery
[junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 0.453 sec
BUILD SUCCESSFUL
Total time: 2 seconds

Does anyone have regression/performance test harnesses?
The unit tests were pretty minimal (which is a good thing!).
It'd be nice to test the min impl (ternary vs. if/then)
and the assumption about not allocating an
array of max distances. All told, the refactored version
should be a modest speed improvement, mainly from
unfolding the arrays so they're local one-dimensional refs.

I don't know what the protocol is for one-off contributions.
I'm happy with the Apache license, so that shouldn't
be a problem. I also don't know whether you use tabs
or spaces – I untabified the final version and used your
two-space format in emacs.

Bob Carpenter
package org.apache.lucene.search;

/**

Copyright 2004 The Apache Software Foundation
*
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
*
http://www.apache.org/licenses/LICENSE-2.0
*
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

import java.io.IOException;

/** Subclass of FilteredTermEnum for enumerating all terms that are similiar

to the specified filter term.
*
Term enumerations are always ordered by Term.compareTo(). Each term in
the enumeration is greater than all that precede it.
*/
public final class FuzzyTermEnum extends FilteredTermEnum {

/* This should be somewhere around the average long word.

If it is longer, we waste time and space. If it is shorter, we waste a
little bit of time growing the array as we encounter longer words.
*/
private static final int TYPICAL_LONGEST_WORD_IN_INDEX = 19;

/* Allows us save time required to create a new array

everytime similarity is called. These are slices that
will be reused during dynamic programming hand-over-hand
style.
*/
private final int[] d0;
private final int[] d1;

private float similarity;
private boolean endEnum = false;

private Term searchTerm = null;
private final String field;
private final String text;
private final String prefix;

private final float minimumSimilarity;
private final float scale_factor;

/**

Creates a FuzzyTermEnum with an empty prefix and a minSimilarity of 0.5f.
After calling the constructor the enumeration is already pointing to the first
valid term if such a term exists.
*
@param reader
@param term
@throws IOException
@see #FuzzyTermEnum(IndexReader, Term, float, int)
*/
public FuzzyTermEnum(IndexReader reader, Term term) throws IOException { this(reader, term, FuzzyQuery.defaultMinSimilarity, FuzzyQuery.defaultPrefixLength); }

/**
* Creates a FuzzyTermEnum with an empty prefix.
* 
* After calling the constructor the enumeration is already pointing to the first
* valid term if such a term exists.
*
* @param reader
* @param term
* @param minSimilarity
* @throws IOException
* @see #FuzzyTermEnum(IndexReader, Term, float, int)
*/
public FuzzyTermEnum(IndexReader reader, Term term, float minSimilarity) throws IOException { this(reader, term, minSimilarity, FuzzyQuery.defaultPrefixLength); }

/**
* Constructor for enumeration of all terms from specified <code>reader</code> which share a prefix of
* length <code>prefixLength</code> with <code>term</code> and which have a fuzzy similarity >
* <code>minSimilarity</code>.
* 
* After calling the constructor the enumeration is already pointing to the first
* valid term if such a term exists.
*
* @param reader Delivers terms.
* @param term Pattern term.
* @param minSimilarity Minimum required similarity for terms from the reader. Default value is 0.5f.
* @param prefixLength Length of required common prefix. Default value is 0.
* @throws IOException
*/
public FuzzyTermEnum(IndexReader reader, Term term, final float minSimilarity, final int prefixLength) throws IOException { super(); if (minSimilarity >= 1.0f) throw new IllegalArgumentException("minimumSimilarity cannot be greater than or equal to 1"); else if (minSimilarity < 0.0f) throw new IllegalArgumentException("minimumSimilarity cannot be less than 0"); if(prefixLength < 0) throw new IllegalArgumentException("prefixLength cannot be less than 0"); this.minimumSimilarity = minSimilarity; this.scale_factor = 1.0f / (1.0f - minimumSimilarity); this.searchTerm = term; this.field = searchTerm.field(); //The prefix could be longer than the word. //It's kind of silly though. It means we must match the entire word. final int fullSearchTermLength = searchTerm.text().length(); final int realPrefixLength = prefixLength > fullSearchTermLength ? fullSearchTermLength : prefixLength; this.text = searchTerm.text().substring(realPrefixLength); this.prefix = searchTerm.text().substring(0, realPrefixLength); this.d0 = new int[this.text.length()+1]; this.d1 = new int[this.d0.length]; setEnum(reader.terms(new Term(searchTerm.field(), prefix))); }

/**
* The termCompare method in FuzzyTermEnum uses Levenshtein distance to
* calculate the distance between the given term and the comparing term.
*/
protected final boolean termCompare(Term term) {
if (field == term.field() && term.text().startsWith(prefix)) { final String target = term.text().substring(prefix.length()); this.similarity = similarity(target); return (similarity > minimumSimilarity); }
endEnum = true;
return false;
}

public final float difference() { return (float)((similarity - minimumSimilarity) * scale_factor); }

public final boolean endEnum() { return endEnum; }

/******************************
* Compute Levenshtein distance
******************************/

/**
* Finds and returns the smallest of three integers
*/
private static final int min(int a, int b, int c) {
// removed assignments to use double ternary
return (a < b)
? ((a < c) ? a : c)
: ((b < c) ? b: c);

// alt form is:
// if (a < b) { if (a < c) return a; else return c; }
// if (b < c) return b; else return c;
}

/**
* Similarity returns a number that is 1.0f or less (including negative numbers)
* based on how similar the Term is compared to a target term. It returns
* exactly 0.0f when
* <pre>
* editDistance < maximumEditDistance</pre>
* Otherwise it returns:
* <pre>
* 1 - (editDistance / length)</pre>
* where length is the length of the shortest term (text or target) including a
* prefix that are identical and editDistance is the Levenshtein distance for
* the two words.
*
* Embedded within this algorithm is a fail-fast Levenshtein distance
* algorithm. The fail-fast algorithm differs from the standard Levenshtein
* distance algorithm in that it is aborted if it is discovered that the
* mimimum distance between the words is greater than some threshold.
*
* To calculate the maximum distance threshold we use the following formula:
* <pre>
* (1 - minimumSimilarity) * length</pre>
* where length is the shortest term including any prefix that is not part of the
* similarity comparision. This formula was derived by solving for what maximum value
* of distance returns false for the following statements:
* <pre>
* similarity = 1 - ((float)distance / (float) (prefixLength + Math.min(textlen, targetlen)));
* return (similarity > minimumSimilarity);</pre>
* where distance is the Levenshtein distance for the two words.
* 
* Levenshtein distance (also known as edit distance) is a measure of similiarity
* between two strings where the distance is measured as the number of character
* deletions, insertions or substitutions required to transform one string to
* the other string.
* @param target the target word or phrase
* @return the similarity, 0.0 or less indicates that it matches less than the required
* threshold and 1.0 indicates that the text and target are identical
*/
private synchronized final float similarity(final String target) {
final int m = target.length();
final int n = text.length();
if (n == 0) { //we don't have anything to compare. That means if we just add //the letters for m we get the new word return prefix.length() == 0 ? 0.0f : 1.0f - ((float) m / prefix.length()); }
if (m == 0) { return prefix.length() == 0 ? 0.0f : 1.0f - ((float) n / prefix.length()); }

final int maxDistance = calculateMaxDistance(m);

if (maxDistance < Math.abs(m-n)) { //just adding the characters of m to n or vice-versa results in //too many edits //for example "pre" length is 3 and "prefixes" length is 8. We can see that //given this optimal circumstance, the edit distance cannot be less than 5. //which is 8-3 or more precisesly Math.abs(3-8). //if our maximum edit distance is 4, then we can discard this word //without looking at it. return 0.0f; }

int[] dLast = d0; // set locals for efficiency
int[] dCurrent = d1;
for (int j = 0; j <= n; j++) dCurrent[j] = j;

for (int i = 0; i < m; ) {
final char s_i = target.charAt;
int[] dTemp = dLast;
dLast = dCurrent; // previously: d[i-i]
dCurrent = dTemp; // previously: d[i]
boolean prune = (dCurrent[0] = ++i) > maxDistance; // true if d[i][0] is too large
for (int j = 0; j < n; j++) { dCurrent[j+1] = (s_i == text.charAt(j)) ? min(dLast[j+1]+1, dCurrent[j]+1, dLast[j]) : min(dLast[j+1], dCurrent[j], dLast[j])+1; if (prune && dCurrent[j+1] <= maxDistance) prune = false; }

// (prune==false) iff (dCurrent[j] < maxDistance) for some j
if (prune) { return 0.0f; }
}

// this will return less than 0.0 when the edit distance is
// greater than the number of characters in the shorter word.
// but this was the formula that was previously used in FuzzyTermEnum,
// so it has not been changed (even though minimumSimilarity must be
// greater than 0.0)
return 1.0F - dCurrent[n]/(float)(prefix.length() + Math.min(n,m));
}

private int calculateMaxDistance(int m) { return (int) ((1-minimumSimilarity) * (Math.min(text.length(), m) + prefix.length())); }

/* This is redundant
public void close() throws IOException { super.close(); //call super.close() and let the garbage collector do its work. }
*/

}
package org.apache.lucene.search;

/**
* Copyright 2004 The Apache Software Foundation
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

import java.io.IOException;

/** Subclass of FilteredTermEnum for enumerating all terms that are similiar
* to the specified filter term.
*
* Term enumerations are always ordered by Term.compareTo(). Each term in
* the enumeration is greater than all that precede it.
*/
public final class FuzzyTermEnum extends FilteredTermEnum {

/* This should be somewhere around the average long word.
* If it is longer, we waste time and space. If it is shorter, we waste a
* little bit of time growing the array as we encounter longer words.
*/
private static final int TYPICAL_LONGEST_WORD_IN_INDEX = 19;

/* Allows us save time required to create a new array
* everytime similarity is called. These are slices that
* will be reused during dynamic programming hand-over-hand
* style. They get resized, if necessary, by growDistanceArrays(int).
*/
private int[] d0;
private int[] d1;

private float similarity;
private boolean endEnum = false;

private Term searchTerm = null;
private final String field;
private final String text;
private final String prefix;

private final float minimumSimilarity;
private final float scale_factor;

/**
* Creates a FuzzyTermEnum with an empty prefix and a minSimilarity of 0.5f.
* 
* After calling the constructor the enumeration is already pointing to the first
* valid term if such a term exists.
*
* @param reader
* @param term
* @throws IOException
* @see #FuzzyTermEnum(IndexReader, Term, float, int)
*/
public FuzzyTermEnum(IndexReader reader, Term term) throws IOException { this(reader, term, FuzzyQuery.defaultMinSimilarity, FuzzyQuery.defaultPrefixLength); }

/**

Creates a FuzzyTermEnum with an empty prefix.
After calling the constructor the enumeration is already pointing to the first
valid term if such a term exists.
*
@param reader
@param term
@param minSimilarity
@throws IOException
@see #FuzzyTermEnum(IndexReader, Term, float, int)
*/
public FuzzyTermEnum(IndexReader reader, Term term, float minSimilarity) throws IOException { this(reader, term, minSimilarity, FuzzyQuery.defaultPrefixLength); }

/**

Constructor for enumeration of all terms from specified <code>reader</code> which share a prefix of
length <code>prefixLength</code> with <code>term</code> and which have a fuzzy similarity >
<code>minSimilarity</code>.
After calling the constructor the enumeration is already pointing to the first
valid term if such a term exists.
*
@param reader Delivers terms.
@param term Pattern term.
@param minSimilarity Minimum required similarity for terms from the reader. Default value is 0.5f.
@param prefixLength Length of required common prefix. Default value is 0.
@throws IOException
*/
public FuzzyTermEnum(IndexReader reader, Term term, final float minSimilarity, final int prefixLength) throws IOException { super(); if (minSimilarity >= 1.0f) throw new IllegalArgumentException("minimumSimilarity cannot be greater than or equal to 1"); else if (minSimilarity < 0.0f) throw new IllegalArgumentException("minimumSimilarity cannot be less than 0"); if(prefixLength < 0) throw new IllegalArgumentException("prefixLength cannot be less than 0"); this.minimumSimilarity = minSimilarity; this.scale_factor = 1.0f / (1.0f - minimumSimilarity); this.searchTerm = term; this.field = searchTerm.field(); //The prefix could be longer than the word. //It's kind of silly though. It means we must match the entire word. final int fullSearchTermLength = searchTerm.text().length(); final int realPrefixLength = prefixLength > fullSearchTermLength ? fullSearchTermLength : prefixLength; this.text = searchTerm.text().substring(realPrefixLength); this.prefix = searchTerm.text().substring(0, realPrefixLength); growDistanceArrays(TYPICAL_LONGEST_WORD_IN_INDEX); setEnum(reader.terms(new Term(searchTerm.field(), prefix))); }

/**

The termCompare method in FuzzyTermEnum uses Levenshtein distance to
calculate the distance between the given term and the comparing term.
*/
protected final boolean termCompare(Term term)
Unknown macro: { if (field == term.field() && term.text().startsWith(prefix)) { final String target = term.text().substring(prefix.length()); this.similarity = similarity(target); return (similarity > minimumSimilarity); } endEnum = true; return false; }

public final float difference()

{ return (float)((similarity - minimumSimilarity) * scale_factor); }

public final boolean endEnum()

{ return endEnum; }

/******************************

Compute Levenshtein distance
******************************/

/**

Finds and returns the smallest of three integers
*/
private static final int min(int a, int b, int c) {
// removed assignments to use double ternary
return (a < b)
? ((a < c) ? a : c)
: ((b < c) ? b: c);

// alt form is:
// if (a < b)

{ if (a < c) return a; else return c; }

// if (b < c) return b; else return c;
}

/**

Similarity returns a number that is 1.0f or less (including negative numbers)
based on how similar the Term is compared to a target term. It returns
exactly 0.0f when
<pre>
editDistance < maximumEditDistance</pre>
Otherwise it returns:
<pre>
1 - (editDistance / length)</pre>
where length is the length of the shortest term (text or target) including a
prefix that are identical and editDistance is the Levenshtein distance for
the two words.
*
Embedded within this algorithm is a fail-fast Levenshtein distance
algorithm. The fail-fast algorithm differs from the standard Levenshtein
distance algorithm in that it is aborted if it is discovered that the
mimimum distance between the words is greater than some threshold.
*
To calculate the maximum distance threshold we use the following formula:
<pre>
(1 - minimumSimilarity) * length</pre>
where length is the shortest term including any prefix that is not part of the
similarity comparision. This formula was derived by solving for what maximum value
of distance returns false for the following statements:
<pre>
similarity = 1 - ((float)distance / (float) (prefixLength + Math.min(textlen, targetlen)));
return (similarity > minimumSimilarity);</pre>
where distance is the Levenshtein distance for the two words.
Levenshtein distance (also known as edit distance) is a measure of similiarity
between two strings where the distance is measured as the number of character
deletions, insertions or substitutions required to transform one string to
the other string.
@param target the target word or phrase
@return the similarity, 0.0 or less indicates that it matches less than the required
threshold and 1.0 indicates that the text and target are identical
*/
private synchronized final float similarity(final String target) {
final int m = target.length();
final int n = text.length();
if (n == 0) { //we don't have anything to compare. That means if we just add //the letters for m we get the new word return prefix.length() == 0 ? 0.0f : 1.0f - ((float) m / prefix.length()); }
if (m == 0)
{ return prefix.length() == 0 ? 0.0f : 1.0f - ((float) n / prefix.length()); }

final int maxDistance = calculateMaxDistance(m);

if (maxDistance < Math.abs(m-n))

{ //just adding the characters of m to n or vice-versa results in //too many edits //for example "pre" length is 3 and "prefixes" length is 8. We can see that //given this optimal circumstance, the edit distance cannot be less than 5. //which is 8-3 or more precisesly Math.abs(3-8). //if our maximum edit distance is 4, then we can discard this word //without looking at it. return 0.0f; }

//let's make sure we have enough room in our array to do the distance calculations.
if (d0.length <= m)

{ growDistanceArrays(m); }

int[] dLast = d0; // set local vars for efficiency ~ the old d[i-1]
int[] dCurrent = d1; // ~ the old d[i]
for (int j = 0; j <= m; j++) dCurrent[j] = j;

for (int i = 0; i < n; ) {
final char s_i = text.charAt;
int[] dTemp = dLast;
dLast = dCurrent; // previously: d[i-i]
dCurrent = dTemp; // previously: d[i]
boolean prune = (dCurrent[0] = ++i) > maxDistance; // true if d[i][0] is too large
for (int j = 0; j < m; j++)

{ dCurrent[j+1] = (s_i == target.charAt(j)) ? min(dLast[j+1]+1, dCurrent[j]+1, dLast[j]) : min(dLast[j+1], dCurrent[j], dLast[j])+1; if (prune && dCurrent[j+1] <= maxDistance) prune = false; }

// (prune==false) iff (dCurrent[j] < maxDistance) for some j
if (prune)

{ return 0.0f; }

}

// this will return less than 0.0 when the edit distance is
// greater than the number of characters in the shorter word.
// but this was the formula that was previously used in FuzzyTermEnum,
// so it has not been changed (even though minimumSimilarity must be
// greater than 0.0)
return 1.0F - dCurrent[m]/(float)(prefix.length() + Math.min(n,m));
}

/**

Grow the second dimension of the array slices, so that we can
calculate the Levenshtein difference.
*/
private void growDistanceArrays(int m) { d0 = new int[m+1]; d1 = new int[m+1]; }

private int calculateMaxDistance(int m)

{ return (int) ((1-minimumSimilarity) * (Math.min(text.length(), m) + prefix.length())); }

/* This is redundant
public void close() throws IOException

{ super.close(); //call super.close() and let the garbage collector do its work. }

*/

}

Attachments

Issue Links

is related to

LUCENE-1183 TRStringDistance uses way too much memory (with patch)

Closed

relates to

LUCENE-1183 TRStringDistance uses way too much memory (with patch)

Closed

Bob Carpenter's FuzzyTermEnum refactoring

Details

Description

Attachments

Issue Links

Activity

People

Dates