Description
erickerickson: Is there a good reason that we hard-code a 256 character limit for CharTokenizer? Changing this limit currently requires copying incrementToken wholesale into a new class, since incrementToken is final and cannot be overridden.
The default for KeywordTokenizer (also 256 bytes) is easy to change, but doing so requires writing code rather than configuring it in the schema.
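As an illustration, here is a minimal sketch of the kind of one-off factory this currently forces on users. The class name and the maxTokenLen argument name are made up for this sketch; the KeywordTokenizer constructors it calls are existing API:

{code:java}
import java.util.Map;
import org.apache.lucene.analysis.core.KeywordTokenizer;
import org.apache.lucene.analysis.util.TokenizerFactory;
import org.apache.lucene.util.AttributeFactory;

// Hypothetical one-off factory: today, the only way to get a bigger
// buffer is to write and deploy custom code like this, then reference
// it from the schema instead of solr.KeywordTokenizerFactory.
public class BigBufferKeywordTokenizerFactory extends TokenizerFactory {
  private final int maxTokenLen;

  public BigBufferKeywordTokenizerFactory(Map<String, String> args) {
    super(args);
    maxTokenLen = getInt(args, "maxTokenLen", KeywordTokenizer.DEFAULT_BUFFER_SIZE);
    if (!args.isEmpty()) {
      throw new IllegalArgumentException("Unknown parameters: " + args);
    }
  }

  @Override
  public KeywordTokenizer create(AttributeFactory factory) {
    // KeywordTokenizer already exposes a buffer-size constructor;
    // it just isn't reachable from schema configuration.
    return new KeywordTokenizer(factory, maxTokenLen);
  }
}
{code}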
For KeywordTokenizer, the fix is Solr-only. For the CharTokenizer-derived classes (WhitespaceTokenizer, UnicodeWhitespaceTokenizer, and LetterTokenizer) and their factories, it would take adding a constructor to the base class in Lucene and using it in the factories.
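For concreteness, a hedged sketch of what the factory side of that change might look like. The WhitespaceTokenizer overload taking a max token length does not exist today; it stands in for exactly the constructor being proposed:

{code:java}
import java.util.Map;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.util.TokenizerFactory;
import org.apache.lucene.util.AttributeFactory;

public class WhitespaceTokenizerFactory extends TokenizerFactory {
  private final int maxTokenLen;

  public WhitespaceTokenizerFactory(Map<String, String> args) {
    super(args);
    // Hypothetical schema attribute; 256 mirrors the current hard-coded limit.
    maxTokenLen = getInt(args, "maxTokenLen", 256);
    if (!args.isEmpty()) {
      throw new IllegalArgumentException("Unknown parameters: " + args);
    }
  }

  @Override
  public Tokenizer create(AttributeFactory factory) {
    // Assumes the proposed constructor threading maxTokenLen from each
    // CharTokenizer subclass down to the base class; not in Lucene yet.
    return new WhitespaceTokenizer(factory, maxTokenLen);
  }
}
{code}

With something like that in place, a schema could simply say <tokenizer class="solr.WhitespaceTokenizerFactory" maxTokenLen="1024"/> instead of requiring custom code.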
Any objections?
Issue Links
- depends upon: SOLR-10229 See what it would take to shift many of our one-off schemas used for testing to managed schema and construct them as part of the tests (Open)
- is required by: SOLR-10186 Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length (Resolved)
- relates to: LUCENE-7857 CharTokenizer-derived tokenizers and KeywordTokenizer emit multiple tokens when the max length is exceeded (Resolved)