[LUCENE-2667] Fix FuzzyQuery's defaults, so its fast. - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 4.0-ALPHA
Fix Version/s: 4.0-ALPHA
Component/s: core/search
Labels:
None

Lucene Fields:

New

Description

We worked a lot on FuzzyQuery, but you need to be a rocket scientist to ensure good results.

The main problem is that the default distance is 0.5f, which doesn't take into account the length of the string.
To add insult to injury, the default number of expansions is 1024 (traditionally from BooleanQuery maxClauseCount)

I propose:

The syntax of FuzzyQuery is enhanced, so that you can specify raw edits too: such as foobar~2 (all terms within 2 levenshtein edits of foobar). Previously if you specified any amount >=1, you got IllegalArgumentException, so this won't break anyone. You can still use foobar~0.5, and it works just as before
The default for minimumSimilarity then becomes LevenshteinAutomata.MAXIMUM_SUPPORTED_DISTANCE, which is 2. This way if you just do foobar~, its always fast.
The size of the priority queue is reduced by default from 1024 to a much more reasonable value: 50. This is what FuzzyLikeThis uses.

I think its best to just change the defaults for this query, since it was so aweful before. We can add notes in migrate.txt that if you care about using the old values, then you should provide them explicitly, and you will get the same results!

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

LUCENE-2667.patch
24/Sep/10 23:45
35 kB
Robert Muir
LUCENE-2667.patch
28/Sep/10 13:50
36 kB
Robert Muir
LUCENE-2667_contrib.patch
20/Oct/10 01:53
9 kB
Robert Muir

Activity

People

Assignee:: Robert Muir

Reporter:: Robert Muir

Votes:: 0 Vote for this issue

Watchers:: 1 Start watching this issue

Dates

Created:: 24/Sep/10 23:45

Updated:: 28/Aug/22 12:33

Resolved:: 20/Oct/10 02:14