SOLR-7848

Strictly enforce charFilter/tokenizer/filter order in fieldType definitions



    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 5.2.1
    • Fix Version/s: None
    • Component/s: Schema and Analysis
    • Labels:


      Currently you can define a fieldType with the components specified backwards:

          <fieldType name="icu_test" class="solr.TextField">
            <analyzer>
              <filter class="solr.LowerCaseFilterFactory"/>
              <tokenizer class="solr.ICUTokenizerFactory"/>
              <charFilter class="solr.HTMLStripCharFilterFactory"/>
            </analyzer>
          </fieldType>

      This works (just tested in 5.2.1), but the components run in exactly the opposite order from the one in which they are defined: the charFilter first, then the tokenizer, then the filter.
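
      For comparison, here is a sketch of the same chain written in the order it is actually applied. It assumes the same three factories (the ICU tokenizer needs the analysis-extras contrib jars on the classpath); the name icu_test_ordered is made up for illustration.

```xml
<fieldType name="icu_test_ordered" class="solr.TextField">
  <analyzer>
    <!-- 1. charFilters see the raw character stream before tokenization -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <!-- 2. exactly one tokenizer splits the filtered text into tokens -->
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <!-- 3. token filters run on the tokenizer output, in the order listed -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

      Both definitions should analyze text the same way; the only difference is that this one reads top to bottom in execution order.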

      The MoinMoin wiki page for Analyzers, Tokenizers, and TokenFilters, in its section on HTMLStripCharFilterFactory, states that charFilter definitions must come before the tokenizer. That statement is wrong: as the example above shows, the order is not enforced.

      The easiest fix would be to correct the wiki page, but if the order in the config can be detected, we could emit a warning in 5.x when the order is wrong and fail to start the core in 6.0.

      When I was first building my schema, back in the 1.4 days, I was thoroughly confused when I tried to use PatternReplaceCharFilterFactory. It was being executed before tokenization, even though I had defined it AFTER the tokenizer. I eventually figured out my mistake and switched to PatternReplaceFilterFactory. If the correct order had been enforced, or the incorrect order had caused a warning in the log, I would have figured it out a lot sooner.




            • Assignee:
              elyograg Shawn Heisey


              • Created: