Description
Our ngram tokenizers/filters could use some love. E.g., they output ngrams in multiple passes instead of "stacked", which messes up offsets/positions and requires too much buffering (it can hit OOME for long tokens). The tokenizers clip input at 1024 chars but the token filters don't. They also split surrogate pairs incorrectly.
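A small harness like the one below can make the offsets/positions problem visible. It is only a sketch, not part of this issue, and assumes a recent Lucene where NGramTokenizer has a (minGram, maxGram) constructor: it dumps each emitted ngram with its offsets and position increment. "Stacked" output would show posInc=0 for grams that start at the same input position, whereas multi-pass emission advances the position for every gram.
{code:java}
import java.io.StringReader;

import org.apache.lucene.analysis.ngram.NGramTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

public class NGramDump {
  public static void main(String[] args) throws Exception {
    // 1-2 grams over a short input; values chosen only for illustration
    try (NGramTokenizer tokenizer = new NGramTokenizer(1, 2)) {
      CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
      OffsetAttribute offsets = tokenizer.addAttribute(OffsetAttribute.class);
      PositionIncrementAttribute posInc = tokenizer.addAttribute(PositionIncrementAttribute.class);

      tokenizer.setReader(new StringReader("abc"));
      tokenizer.reset();
      while (tokenizer.incrementToken()) {
        // print term text, start/end offsets, and position increment for each ngram
        System.out.printf("term=%s offsets=[%d,%d] posInc=%d%n",
            term.toString(), offsets.startOffset(), offsets.endOffset(),
            posInc.getPositionIncrement());
      }
      tokenizer.end();
    }
  }
}
{code}
The same loop can be pointed at an NGramTokenFilter wrapped around another tokenizer to compare the filter's positions and offsets against the tokenizer's.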
Attachments
Issue Links
- is related to
  - LUCENE-3920 ngram tokenizer/filters create nonsense offsets if followed by a word combiner (Resolved)
  - LUCENE-4810 Positions are incremented for each ngram in EdgeNGramTokenFilter (Closed)
  - SOLR-11894 [Ref-Guide] Removed parameter mentioned in EdgeNgram tokenizer doc (Closed)
  - LUCENE-4641 Fix analyzer bugs documented in TestRandomChains (Patch Available)