It appears you accidentally included other WIP.
Sorry, I probably generated the patch against the wrong base commit, hence these unrelated changes.
Why create a StringTokenStream; isn't KeywordTokenizer fine? Oh, I see that's in another module... IMO it seems like a generic utility that should be in core.
I'd be fine with having KeywordTokenizer in core too. Let's discuss it in another issue, and then potentially cut over to it if it makes it to core?
An easy optimization is to check if initReaderForNormalization returns the input StringReader. If so, simply set filteredText to text.
The way #normalize works is indeed not very efficient at the moment. In addition to this, it does not cache its analysis chain like we do for #tokenStream. But it's probably ok since this method should not be called as intensively as #tokenStream? (we can still improve in the future if this proves to be a bottleneck)
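The identity check suggested above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from the patch: the class `NormalizeFastPath`, the method `filterText`, and the `UnaryOperator<Reader>` standing in for `initReaderForNormalization` are all hypothetical names chosen so the example runs without Lucene on the classpath.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.function.UnaryOperator;

// Hypothetical helper, not part of Lucene.
class NormalizeFastPath {

    // initReader is an assumed stand-in for Analyzer#initReaderForNormalization:
    // it either returns the reader it was given (no char filters) or wraps it.
    static String filterText(String text, UnaryOperator<Reader> initReader) {
        Reader input = new StringReader(text);
        Reader filtered = initReader.apply(input);

        // Fast path: the same instance came back, so no char filter was
        // applied and the original string can be reused without copying.
        if (filtered == input) {
            return text;
        }

        // Slow path: drain the wrapped reader into a new string.
        StringBuilder sb = new StringBuilder();
        char[] buffer = new char[256];
        try (Reader r = filtered) {
            int read;
            while ((read = r.read(buffer)) != -1) {
                sb.append(buffer, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }
}
```

The check relies on reference equality, so it only helps analyzers that return the input reader unchanged; any wrapping, even a no-op one, falls through to the copy loop.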
It's a shame to call createComponents just to get the AttributeFactory.
Agreed, this one annoys me too. I initially wanted to add a method, but that seemed a pity since this information is already available. That said, maybe the method approach is better, since borrowing the attribute factory from the regular analysis chain forces us to close the token stream before it has been consumed, which some analysis chains might not like. I updated the patch.
I suppose a separate issue might be for Solr to do this when someone configures a custom Analyzer.
Solr already solves this problem in a different way: it has a separate analyzer for multi-term queries, which is computed using MultiTermAwareComponent. I agree it would be nice for it to switch to Analyzer#normalize, but this has a drawback. Either it would require dropping support for configuring a custom multi-term analyzer, or the integration would be a bit weird, i.e. it would have to use Analyzer.tokenStream on the multi-term analyzer if one is configured, and fall back to Analyzer.normalize on the default analyzer otherwise, which might be controversial.
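To make the "weird" integration concrete, here is a rough sketch of the fallback logic described above. It is purely illustrative and does not reflect Solr code: `MultiTermNormalization` and `normalizeMultiTerm` are hypothetical names, and both analyzers are reduced to `UnaryOperator<String>` functions (the real `Analyzer#normalize` works on a field name and returns a `BytesRef`).

```java
import java.util.function.UnaryOperator;

// Hypothetical helper, not part of Solr or Lucene.
class MultiTermNormalization {

    // multiTermAnalyzer: assumed stand-in for running Analyzer.tokenStream on an
    //                    explicitly configured multi-term analyzer; null if none.
    // defaultNormalizer: assumed stand-in for Analyzer.normalize on the field's
    //                    default analyzer.
    static String normalizeMultiTerm(String text,
                                     UnaryOperator<String> multiTermAnalyzer,
                                     UnaryOperator<String> defaultNormalizer) {
        if (multiTermAnalyzer != null) {
            // An explicit multi-term analyzer takes precedence.
            return multiTermAnalyzer.apply(text);
        }
        // Otherwise fall back to normalization by the default analyzer.
        return defaultNormalizer.apply(text);
    }
}
```

The two branches go through different code paths for the same kind of query term, which is the asymmetry the comment flags as potentially controversial.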