Description
erickerickson: Is there a good reason that we hard-code a 256 character limit for CharTokenizer? Changing this limit currently requires copying incrementToken wholesale into a new class, since incrementToken is final and cannot be overridden.
The default for KeywordTokenizer (also 256 bytes) is easy to change, but doing so requires writing code rather than configuring it in the schema.
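As an illustration, here is a minimal sketch of the kind of one-off factory this currently forces on users. The class name and the maxTokenLen argument name are made up for this sketch; the KeywordTokenizer constructors it calls are existing API:

{code:java}
import java.util.Map;
import org.apache.lucene.analysis.core.KeywordTokenizer;
import org.apache.lucene.analysis.util.TokenizerFactory;
import org.apache.lucene.util.AttributeFactory;

// Hypothetical one-off factory: today, the only way to get a bigger
// buffer is to write and deploy custom code like this, then reference
// it from the schema instead of solr.KeywordTokenizerFactory.
public class BigBufferKeywordTokenizerFactory extends TokenizerFactory {
  private final int maxTokenLen;

  public BigBufferKeywordTokenizerFactory(Map<String, String> args) {
    super(args);
    maxTokenLen = getInt(args, "maxTokenLen", KeywordTokenizer.DEFAULT_BUFFER_SIZE);
    if (!args.isEmpty()) {
      throw new IllegalArgumentException("Unknown parameters: " + args);
    }
  }

  @Override
  public KeywordTokenizer create(AttributeFactory factory) {
    // KeywordTokenizer already exposes a buffer-size constructor;
    // it just isn't reachable from schema configuration.
    return new KeywordTokenizer(factory, maxTokenLen);
  }
}
{code}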
For KeywordTokenizer, the fix is Solr-only. For the CharTokenizer-derived classes (WhitespaceTokenizer, UnicodeWhitespaceTokenizer, and LetterTokenizer) and their factories, it would take adding a constructor to the base class in Lucene and using it in the factories.
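For concreteness, a hedged sketch of what the factory side of that change might look like. The WhitespaceTokenizer overload taking a max token length does not exist today; it stands in for exactly the constructor being proposed:

{code:java}
import java.util.Map;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.util.TokenizerFactory;
import org.apache.lucene.util.AttributeFactory;

public class WhitespaceTokenizerFactory extends TokenizerFactory {
  private final int maxTokenLen;

  public WhitespaceTokenizerFactory(Map<String, String> args) {
    super(args);
    // Hypothetical schema attribute; 256 mirrors the current hard-coded limit.
    maxTokenLen = getInt(args, "maxTokenLen", 256);
    if (!args.isEmpty()) {
      throw new IllegalArgumentException("Unknown parameters: " + args);
    }
  }

  @Override
  public Tokenizer create(AttributeFactory factory) {
    // Assumes the proposed constructor threading maxTokenLen from each
    // CharTokenizer subclass down to the base class; not in Lucene yet.
    return new WhitespaceTokenizer(factory, maxTokenLen);
  }
}
{code}

With something like that in place, a schema could simply say <tokenizer class="solr.WhitespaceTokenizerFactory" maxTokenLen="1024"/> instead of requiring custom code.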
Any objections?
Issue Links
- depends upon: SOLR-10229 See what it would take to shift many of our one-off schemas used for testing to managed schema and construct them as part of the tests (Open)
- is required by: SOLR-10186 Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length (Resolved)
- relates to: LUCENE-7857 CharTokenizer-derived tokenizers and KeywordTokenizer emit multiple tokens when the max length is exceeded (Resolved)