[LUCENE-1124] short circuit FuzzyQuery.rewrite when input token length is small compared to minSimilarity - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Trivial
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 2.9.1, 3.0
Component/s: core/query/scoring
Labels: None
Lucene Fields: New

Description

I found this (unreplied to) email floating around in my Lucene folder from during the holidays...

From: Timo Nentwig
To: java-dev
Subject: Fuzzy makes no sense for short tokens
Date: Mon, 31 Dec 2007 16:01:11 +0100
Message-Id: <200712311601.12255.lucene@nitwit.de>

Hi!

it generally makes no sense to search fuzzy for short tokens because changing
even only a single character of course already results in a high edit
distance. So it actually only makes sense in this case:

           if( token.length() > 1f / (1f - minSimilarity) )

E.g. changing one character in a 3-letter token (foo) results in an edit
distance of 0.6. And if minSimilarity (which is by default: 0.5 :-) is higher
we can save all the expensive rewrite() logic.

I don't know much about FuzzyQueries, but this reasoning seems sound: FuzzyQuery.rewrite should be able to completely skip all TermEnumeration when the input token is shorter than a threshold derived from minSimilarity by some simple math. (I'm not smart enough to be certain that the math above is right, however. It's been a while since I looked at Levenshtein distances, so tests are needed.)
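The length check from the email can be sketched in isolation as below. This is only an illustration, not the patch attached to this issue: the class name `FuzzyShortCircuit` and the method `termLongEnough` are hypothetical. It assumes similarity is computed as 1 - (edits / token length), so a token of length n that differs by a single edit scores at most 1 - 1/n.

```java
// Illustrative sketch only; names are hypothetical and this is not Lucene code.
class FuzzyShortCircuit {

    /**
     * Returns true when a token is long enough that a one-edit variant could
     * still exceed minSimilarity, assuming similarity = 1 - edits / length.
     * When this returns false, only an exact match can qualify, so term
     * enumeration could be skipped.
     */
    static boolean termLongEnough(String token, float minSimilarity) {
        if (minSimilarity < 0f || minSimilarity >= 1f) {
            throw new IllegalArgumentException("minSimilarity must be in [0, 1)");
        }
        // Condition quoted in the email: length > 1 / (1 - minSimilarity)
        return token.length() > 1f / (1f - minSimilarity);
    }

    public static void main(String[] args) {
        System.out.println(termLongEnough("foo", 0.5f)); // true:  3 > 2.0
        System.out.println(termLongEnough("fo", 0.5f));  // false: 2 > 2.0 does not hold
        System.out.println(termLongEnough("foo", 0.7f)); // false: 3 > ~3.33 does not hold
    }
}
```

Under these assumptions, a caller would fall back to an exact term lookup whenever `termLongEnough` returns false, instead of enumerating terms.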

Attachments

LUCENE-1124.patch, 16/Oct/09 16:31, 3 kB, Michael McCandless
LUCENE-1124.patch, 04/Jan/09 15:35, 3 kB, Mark Miller
LUCENE-1124.patch, 17/Aug/08 14:44, 3 kB, Mark Miller
LUCENE-1124.patch, 17/Aug/08 14:35, 3 kB, Mark Miller

People

Assignee: Mark Miller

Reporter: Chris M. Hostetter

Votes: 1

Watchers: 0

Dates

Created: 08/Jan/08 22:41

Updated: 28/Aug/22 11:44

Resolved: 16/Oct/09 17:38