Description
Our ngram tokenizers/filters could use some love. E.g., they output ngrams in multiple passes instead of "stacked", which messes up offsets/positions and requires too much buffering (it can hit OOME for long tokens). The tokenizers clip input at 1024 chars but the token filters don't. They also split surrogate pairs incorrectly.
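A small harness like the one below can make the offsets/positions problem visible. It is only a sketch, not part of this issue, and assumes a recent Lucene where NGramTokenizer has a (minGram, maxGram) constructor: it dumps each emitted ngram with its offsets and position increment. "Stacked" output would show posInc=0 for grams that start at the same input position, whereas multi-pass emission advances the position for every gram.
{code:java}
import java.io.StringReader;

import org.apache.lucene.analysis.ngram.NGramTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

public class NGramDump {
  public static void main(String[] args) throws Exception {
    // 1-2 grams over a short input; values chosen only for illustration
    try (NGramTokenizer tokenizer = new NGramTokenizer(1, 2)) {
      CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
      OffsetAttribute offsets = tokenizer.addAttribute(OffsetAttribute.class);
      PositionIncrementAttribute posInc = tokenizer.addAttribute(PositionIncrementAttribute.class);

      tokenizer.setReader(new StringReader("abc"));
      tokenizer.reset();
      while (tokenizer.incrementToken()) {
        // print term text, start/end offsets, and position increment for each ngram
        System.out.printf("term=%s offsets=[%d,%d] posInc=%d%n",
            term.toString(), offsets.startOffset(), offsets.endOffset(),
            posInc.getPositionIncrement());
      }
      tokenizer.end();
    }
  }
}
{code}
The same loop can be pointed at an NGramTokenFilter wrapped around another tokenizer to compare the filter's positions and offsets against the tokenizer's.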
Attachments
Issue Links
- is related to
  - LUCENE-3920 ngram tokenizer/filters create nonsense offsets if followed by a word combiner (Resolved)
  - LUCENE-4810 Positions are incremented for each ngram in EdgeNGramTokenFilter (Closed)
  - SOLR-11894 [Ref-Guide] Removed parameter mentioned in EdgeNgram tokenizer doc (Closed)
  - LUCENE-4641 Fix analyzer bugs documented in TestRandomChains (Patch Available)