We only need getTokenizer because TokenStream.reset() does not accept a Reader. If we could introduce such a method on TokenStream, we wouldn't need to refer to Tokenizer directly.
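For context, a minimal sketch of the constraint; the helper class and method names are hypothetical, but the two reset signatures are today's API:

```java
import java.io.IOException;
import java.io.Reader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;

class ReuseSketch {
  // Only Tokenizer.reset(Reader) can point a chain at new input;
  // TokenStream.reset() takes no arguments. Reuse code therefore has to
  // keep a Tokenizer reference alongside the end of the filter chain.
  static TokenStream reuse(Tokenizer source, TokenStream chain, Reader input) throws IOException {
    source.reset(input); // would be unnecessary if TokenStream grew a reset(Reader)
    return chain;
  }
}
```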
Do you have any ideas on the back-compat issues?
Well, it's a bit trickier ... today we call reusableTokenStream in our indexing code and get back either a new instance or a reused one. We cannot change Analyzer's default behavior, which returns a new instance, unless we're willing to break back-compat: Analyzers that did not override reusableTokenStream may break if we start reusing the instance by default (for example, if they add two fields to a document with reusableTokenStream called twice; see the sketch below).
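Concretely, something like this (a hypothetical caller, not our actual indexing code) is what would silently break:

```java
import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;

class TwoFieldsHazard {
  // A caller that (legitimately, today) assumes each call returns an
  // independent stream when the Analyzer didn't override reusableTokenStream.
  static void addTwoFields(Analyzer analyzer) throws IOException {
    TokenStream body  = analyzer.reusableTokenStream("body",  new StringReader("body text"));
    TokenStream title = analyzer.reusableTokenStream("title", new StringReader("a title"));
    // If the default impl suddenly reused one instance, 'body' and 'title'
    // would be the same object here, and the body stream's state would be
    // clobbered before it was ever consumed.
  }
}
```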
Also, deprecating reusableTokenStream, defining a new method (say, reuseTokenStream) and moving to use it is no good either: we'd want the new method's default impl to reuse the token stream, so impls that did not override it may break just the same.
So how about we create a new abstract ReusingAnalyzer whose reusableTokenStream impl always reuses the stream, and add Streams to Analyzer as a protected static class. That way, Analyzers that don't care about reuse can still extend Analyzer; Analyzers that care about reuse and are fine with ReusingAnalyzer's impl can move to extend it; and Analyzers that want their reuse done differently can extend either ReusingAnalyzer or Analyzer and override as needed.
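A rough sketch of what I have in mind, leaning on Analyzer's existing getPreviousTokenStream/setPreviousTokenStream thread-local plumbing (createStreams is a placeholder name, and Streams is shown nested here for brevity, though the idea is to host it on Analyzer itself):

```java
import java.io.IOException;
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;

public abstract class ReusingAnalyzer extends Analyzer {

  /** Per-thread holder for the reusable components (the proposal would
   *  host this on Analyzer as a protected static class). */
  protected static class Streams {
    protected Tokenizer source;   // the only component that accepts a new Reader
    protected TokenStream result; // the end of the filter chain
  }

  /** Subclasses build the tokenizer + filter chain exactly once per thread. */
  protected abstract Streams createStreams(String fieldName, Reader reader) throws IOException;

  @Override
  public TokenStream reusableTokenStream(String fieldName, Reader reader) throws IOException {
    Streams streams = (Streams) getPreviousTokenStream();
    if (streams == null) {
      streams = createStreams(fieldName, reader);
      setPreviousTokenStream(streams);
    } else {
      streams.source.reset(reader); // point the existing chain at the new input
    }
    return streams.result;
  }
}
```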
Back-compat-wise, we're safe since:
- Existing Lucene Analyzers that reuse can be changed to extend ReusingAnalyzer.
- Existing Analyzers (outside the Lucene code base) either override reusableTokenStream or they don't; in both cases their current behavior is preserved, so they won't break.
- Our indexing code will still call reusableTokenStream; no change here.
- Any code out there that drives an Analyzer by calling reusableTokenStream doesn't need to change anything (see the call-site sketch below).
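The call site really is identical either way; a trivial sketch (helper name hypothetical):

```java
import java.io.IOException;
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;

class CallSite {
  // Identical whether 'analyzer' extends Analyzer or the proposed
  // ReusingAnalyzer; reuse is an implementation detail behind one method.
  static TokenStream getStream(Analyzer analyzer, String field, Reader reader) throws IOException {
    return analyzer.reusableTokenStream(field, reader);
  }
}
```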
I think that'd work?