[LUCENE-1190] a lexicon object for merging spellchecker and synonyms from stemming - ASF JIRA

Details

Type: New Feature
Status: Resolved
Priority: Major
Resolution: Won't Fix
Affects Version/s: 2.3
Fix Version/s: None
Component/s: core/search, modules/other
Labels: None
Lucene Fields: New, Patch Available

Description

Some Lucene features need a reference list of words. Spellchecking is the basic example, but synonyms are another use. Other tools can work more smoothly with a list of words, without disturbing the main index: stemming and other simplifications of words (anagram, phonetic ...).
For that, I suggest a Lexicon object, which contains words (Term + frequency) and can be built from a Lucene Directory or from plain text files.
Classical TokenFilters can be used with the Lexicon (LowerCaseFilter and ISOLatin1AccentFilter should be the most useful).
The Lexicon uses a Lucene Directory: each Word is a Document, and each piece of metadata is a Field (word, ngram, phonetic, fields, anagram, size ...).
Above a minimum size, the number of different words used in an index can be considered stable. So a standard Lexicon (built from Wikipedia, for example) can be used.
A similarTokenFilter is provided.
A spellchecker will come soon.
A fuzzySearch implementation and a neutral synonym TokenFilter could also be built on top of it.
Unused words can be removed on demand (lazy delete?).
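
To make the proposal concrete, here is a minimal, Lucene-free sketch of the idea in plain Java. It is not the code from the attached patch: the class `LexiconSketch`, its `Entry` type, and all method names are hypothetical, the lexicon lives in an in-memory map rather than a Lucene Directory, and the `normalize` helper only approximates what LowerCaseFilter and ISOLatin1AccentFilter would do. It shows one entry per word with a frequency and a few derived fields (ngram, anagram, size), plus an on-demand purge of rare words.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory stand-in for the proposed Lexicon (not the patch's API). */
public class LexiconSketch {

    /** One lexicon entry: the word, its frequency, and derived "meta" fields. */
    public static final class Entry {
        public final String word;
        public final int frequency;
        public final List<String> ngrams;
        public final String anagram;
        public final int size;

        Entry(String word, int frequency) {
            this.word = word;
            this.frequency = frequency;
            this.ngrams = ngrams(word, 2);
            this.anagram = anagramKey(word);
            this.size = word.length();
        }
    }

    // Normalized word -> number of times it was added.
    private final Map<String, Integer> frequencies = new TreeMap<>();

    /** Lower-cases and strips combining accents (assumed analogue of the two filters). */
    public static String normalize(String token) {
        String decomposed = Normalizer.normalize(token, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "").toLowerCase(Locale.ROOT);
    }

    /** All contiguous character n-grams of the word, in order. */
    public static List<String> ngrams(String word, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= word.length(); i++) {
            grams.add(word.substring(i, i + n));
        }
        return grams;
    }

    /** Sorted letters: two words are anagrams exactly when their keys are equal. */
    public static String anagramKey(String word) {
        char[] chars = word.toCharArray();
        Arrays.sort(chars);
        return new String(chars);
    }

    /** Adds one occurrence of the token to the lexicon. */
    public void add(String token) {
        String word = normalize(token);
        if (!word.isEmpty()) {
            frequencies.merge(word, 1, Integer::sum);
        }
    }

    /** Returns the entry for the token, or null if the word is unknown. */
    public Entry entry(String token) {
        String word = normalize(token);
        Integer frequency = frequencies.get(word);
        return frequency == null ? null : new Entry(word, frequency);
    }

    /** Removes words seen fewer than minFrequency times; returns how many were dropped. */
    public int removeRarerThan(int minFrequency) {
        int before = frequencies.size();
        frequencies.values().removeIf(f -> f < minFrequency);
        return before - frequencies.size();
    }
}
```

In a Directory-backed version, each `Entry` would presumably become one Document and each of its members one Field, so that a spellchecker or similarity filter could query the ngram or anagram fields without touching the main index.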

Any criticism or suggestions?

Attachments

aphone+lexicon.patch, 25/Feb/08 20:46, 303 kB, Mathieu Lecarme
aphone+lexicon.patch, 29/Feb/08 19:11, 336 kB, Mathieu Lecarme

People

Assignee: Otis Gospodnetic

Reporter: Mathieu Lecarme

Votes: 0

Watchers: 3

Dates

Created: 25/Feb/08 20:45

Updated: 28/Aug/22 11:46

Resolved: 10/Mar/13 13:28