Uploaded image for project: 'Solr'
  1. Solr
  2. SOLR-10186

Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Duplicate

    Description

      Is there a good reason that we hard-code a 256-character limit for CharTokenizer? Changing this limit currently requires copy/pasting incrementToken into a new class, since incrementToken is final.

      KeywordTokenizer can easily change the default (which is also 256), but doing so requires code rather than being configurable in the schema.

      For KeywordTokenizer, this is Solr-only. For the CharTokenizer subclasses (WhitespaceTokenizer, UnicodeWhitespaceTokenizer, and LetterTokenizer) and their factories, it would require adding a constructor to the base class in Lucene and using it in the factories.

      Any objections?
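
      To make the proposal concrete, here is a sketch of the kind of schema configuration this improvement would enable. The `maxTokenLen` attribute is an assumption for illustration, not an option that exists at the time of filing:

      ```xml
      <!-- Hypothetical schema.xml snippet: maxTokenLen is the proposed,
           not-yet-existing factory attribute for raising the 256 limit. -->
      <fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
        <analyzer>
          <tokenizer class="solr.WhitespaceTokenizerFactory" maxTokenLen="1024"/>
        </analyzer>
      </fieldType>
      ```

      The same attribute would apply to the other CharTokenizer-derived factories and to KeywordTokenizerFactory.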

      Attachments

        1. SOLR-10186.patch
          24 kB
          Amrit Sarkar
        2. SOLR-10186.patch
          14 kB
          Amrit Sarkar
        3. SOLR-10186.patch
          9 kB
          Amrit Sarkar

        Issue Links

        Activity


          People

            Assignee: Erick Erickson
            Reporter: Erick Erickson
            Votes: 0
            Watchers: 3

            Dates

              Created:
              Updated:
              Resolved:
