[LUCENE-5252] add NGramSynonymTokenizer - ASF JIRA

Details

Type: New Feature
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: modules/analysis
Labels: None
Lucene Fields: New

Description

I'd like to propose another n-gram tokenizer that can process synonyms: NGramSynonymTokenizer. Note that in this ticket the gram size is fixed, i.e. minGramSize = maxGramSize.

Today, I think we have the following problems when using SynonymFilter with NGramTokenizer.
For purposes of illustration, assume a synonym setting "ABC, DEFG" with expand=true and N = 2 (2-gram).

There is no consensus (I think) on how to assign offsets to the generated synonym tokens DE, EF and FG when expanding the source tokens AB and BC.
If the query pattern looks like ABCY, it cannot be matched even if there is a document "…ABCY…" in the index when autoGeneratePhraseQueries is set to true, because there is no "CY" token (but "GY" is there) in the index.

NGramSynonymTokenizer can solve these problems by providing the following methods.

NGramSynonymTokenizer reads the synonym settings (synonyms.txt) and does not split registered words into n-grams, e.g.:

source text	NGramTokenizer+SynonymFilter	NGramSynonymTokenizer
ABC	AB/DE/BC/EF/FG	ABC/DEFG

Immediately before and after a registered word, NGramSynonymTokenizer generates extra tokens with posInc=0, e.g.:

source text	NGramTokenizer+SynonymFilter	NGramSynonymTokenizer
XYZABC123	XY/YZ/ZA/AB/DE/BC/EF/C1/FG/12/23	XY/YZ/Z/ABC/DEFG/1/12/23

In the above sample, "Z" and "1" are the extra tokens.
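
The behavior in the two tables above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code in the attached patch: the function name ngram_synonym_tokens, the longest-match scan, and the choice of posInc=0 for the expanded synonym are all assumptions made for the sketch, as is the handling of gram sizes other than 2.

```python
def ngram_synonym_tokens(text, synonym_groups, n=2):
    """Return (term, pos_inc) pairs; a hypothetical sketch, not the patch's logic."""
    # Map every registered word to the synonym group it belongs to.
    group_of = {word: group for group in synonym_groups for word in group}
    # Try longer registered words first so the longest match wins (assumption).
    words = sorted(group_of, key=len, reverse=True)
    tokens = []

    def flush(run, after_word, before_word):
        """Emit grams for a run of characters that contains no registered word."""
        short = range(1, min(n - 1, len(run)) + 1)  # extra-gram sizes 1 .. n-1
        spans = []
        if after_word:  # extra leading grams right after a registered word
            spans += [(0, size, 0) for size in short]
        # Regular fixed-size n-grams.
        spans += [(i, i + n, 1) for i in range(len(run) - n + 1)]
        if before_word:  # extra trailing grams right before a registered word
            spans += [(len(run) - size, len(run), 0) for size in reversed(short)]
        seen = set()
        for start, end, pos_inc in spans:
            if (start, end) not in seen:  # avoid emitting the same span twice
                seen.add((start, end))
                tokens.append((run[start:end], pos_inc))

    run, i, after_word = "", 0, False
    while i < len(text):
        match = next((w for w in words if text.startswith(w, i)), None)
        if match is None:
            run += text[i]
            i += 1
            continue
        flush(run, after_word, before_word=True)
        run, after_word = "", True
        tokens.append((match, 1))  # the registered word is kept whole
        # Expanded synonyms are stacked on the same position (assumption).
        tokens.extend((syn, 0) for syn in group_of[match] if syn != match)
        i += len(match)
    flush(run, after_word, before_word=False)
    return tokens


synonyms = [["ABC", "DEFG"]]
print("/".join(term for term, _ in ngram_synonym_tokens("ABC", synonyms)))
print("/".join(term for term, _ in ngram_synonym_tokens("XYZABC123", synonyms)))
```

With the synonym setting "ABC, DEFG" and N = 2, the two print calls produce the term sequences ABC/DEFG and XY/YZ/Z/ABC/DEFG/1/12/23, matching the NGramSynonymTokenizer column of the tables.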

Attachments

LUCENE-5252_4x.patch, 18/Oct/13 13:59, 84 kB, Koji Sekiguchi
LUCENE-5252_4x.patch, 15/Oct/13 09:12, 84 kB, Koji Sekiguchi
LUCENE-5252_4x.patch, 11/Oct/13 13:27, 84 kB, Koji Sekiguchi
LUCENE-5252_4x.patch, 03/Oct/13 07:28, 83 kB, Koji Sekiguchi
LUCENE-5252_4x.patch, 02/Oct/13 09:09, 20 kB, Koji Sekiguchi

Issue Links

is duplicated by LUCENE-5253 add NGramSynonymTokenizer (Closed)

People

Assignee: Unassigned

Reporter: Koji Sekiguchi

Votes: 0

Watchers: 3

Dates

Created: 02/Oct/13 08:44

Updated: 28/Aug/22 13:54