Lucene - Core
LUCENE-3842

Analyzing Suggester

Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.6, 4.0-ALPHA
    • Fix Version/s: 4.1, 6.0
    • Component/s: modules/spellchecker
    • Labels: None
    • Lucene Fields: New

    Description

      Since we added shortest-path wFSA search in LUCENE-3714, and generified the comparator in LUCENE-3801,
      I think we should look at implementing suggesters that have more capabilities than just basic prefix matching.

      In particular I think the most flexible approach is to integrate with Analyzer at both build and query time,
      such that we build a wFST with:
      input: analyzed text such as ghost0christmas0past <-- byte 0 here is an optional token separator
      output: surface form such as "the ghost of christmas past"
      weight: the weight of the suggestion
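      The analyzed-key idea above can be sketched in a few lines of Python. This is a toy model, not the actual Java patch: the STOPWORDS set and the analyzed_key name are made up for illustration.

```python
# Toy sketch of the proposed build-time key (not Lucene code): lowercase,
# drop stopwords, and join the surviving tokens with a 0x00 byte, which
# plays the role of the optional token separator described above.
STOPWORDS = {"the", "of", "a", "an"}  # hypothetical stopword set

def analyzed_key(surface: str) -> bytes:
    """Map a surface form to its analyzed, 0x00-separated key."""
    tokens = [t for t in surface.lower().split() if t not in STOPWORDS]
    return b"\x00".join(t.encode("utf-8") for t in tokens)

print(analyzed_key("the ghost of christmas past"))  # b'ghost\x00christmas\x00past'
```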

      We make an FST with PairOutputs<weight,output>, but run the shortest-path operation only on the weight side (like
      the test in LUCENE-3801), while simultaneously accumulating the output (surface form), which becomes the actual suggestion.
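      The PairOutputs idea can be mimicked with a plain dict whose keys are analyzed forms and whose values pair the weight with the surface form. This is only a stand-in sketch: a real FST shares key prefixes and runs shortest-path on the weight side while concatenating the surface-form output, whereas here a linear scan plays that role; the names fst and top_n are hypothetical.

```python
import heapq

# Stand-in for the PairOutputs<weight, surface> FST: analyzed key -> (weight, surface).
fst = {
    b"ghost\x00christmas\x00past": (50, "the ghost of christmas past"),
    b"ghost\x00stories":           (20, "ghost stories"),
}

def top_n(prefix: bytes, n: int = 5):
    """Suggestions whose analyzed key starts with prefix, best weight first."""
    hits = [(w, s) for key, (w, s) in fst.items() if key.startswith(prefix)]
    return [s for _, s in heapq.nlargest(n, hits)]

print(top_n(b"ghost"))  # ['the ghost of christmas past', 'ghost stories']
```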

      This allows a lot of flexibility:

      • Using even StandardAnalyzer means you can offer suggestions that ignore stopwords: e.g. if you type in "ghost of chr...",
        it will suggest "the ghost of christmas past"
      • we can add support for synonyms/WordDelimiterFilter/etc. at both index and query time (there are tradeoffs here, and this is not implemented!)
      • this is a basis for more complicated suggesters such as Japanese suggesters, where the analyzed form is in fact the reading;
        there we would add a TokenFilter that copies ReadingAttribute into the term text to support that
      • other general things, like offering "fuzzier" suggestions by using a plural stemmer, ignoring accents, and so on

      According to my benchmarks, suggestions are still very fast with the prototype (~100,000 QPS), and the FST size does not
      explode (it is just short of twice that of a regular wFST, but that is still far smaller than a TST, JaSpell, etc.).

      Attachments

        1. LUCENE-3842-TokenStream_to_Automaton.patch
          17 kB
          Michael McCandless
        2. LUCENE-3842.patch
          26 kB
          Robert Muir
        3. LUCENE-3842.patch
          44 kB
          Robert Muir
        4. LUCENE-3842.patch
          34 kB
          Robert Muir
        5. LUCENE-3842.patch
          34 kB
          Robert Muir
        6. LUCENE-3842.patch
          53 kB
          Michael McCandless
        7. LUCENE-3842.patch
          58 kB
          Michael McCandless
        8. LUCENE-3842.patch
          64 kB
          Michael McCandless
        9. LUCENE-3842.patch
          65 kB
          Robert Muir
        10. LUCENE-3842.patch
          74 kB
          Sudarshan Gaikaiwari
        11. LUCENE-3842.patch
          72 kB
          Michael McCandless
        12. LUCENE-3842.patch
          20 kB
          Michael McCandless
        13. LUCENE-3842.patch
          17 kB
          Michael McCandless
        14. LUCENE-3842.patch
          25 kB
          Michael McCandless
        15. LUCENE-3842.patch
          22 kB
          Michael McCandless
        16. LUCENE-3842.patch
          46 kB
          Michael McCandless
        17. LUCENE-3842.patch
          116 kB
          Michael McCandless
        18. LUCENE-3842.patch
          117 kB
          Michael McCandless
        19. LUCENE-3842.patch
          118 kB
          Michael McCandless

    People

      Assignee: mikemccand (Michael McCandless)
      Reporter: rcmuir (Robert Muir)
      Votes: 2
      Watchers: 6

    Dates

      Created:
      Updated:
      Resolved: