Commenting on the stuff you edited away, because those are good confusions!
I'm curious why there shouldn't be some trimming in `end()` as well? Or is a `TokenStream` meant to be used only once (no multiple `reset()`, `incrementToken()`, `end()` cycles on the same `TokenStream`)?
The TokenStream API is confusing
I started with `end()` here too (it seemed correct), but it turns out `close()` is also called (internally, in Lucene's IndexWriter) after all tokens are iterated for a single input. More importantly, `close()` is called even on exception, while `end()` is not necessarily, I think.
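To make the lifecycle concrete, here's a minimal sketch of the standard consumer loop (the field name and input text are made up for illustration); if `incrementToken()` throws, `end()` is skipped, but `close()` still runs:

```java
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class ConsumeTokens {
  public static void main(String[] args) throws IOException {
    Analyzer analyzer = new WhitespaceAnalyzer();
    try (TokenStream ts = analyzer.tokenStream("field", new StringReader("hello world"))) {
      CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
      ts.reset();                    // required before the first incrementToken()
      while (ts.incrementToken()) {  // if this throws, end() below never runs
        System.out.println(term.toString());
      }
      ts.end();                      // reached only when iteration completes normally
    }                                // close() always runs, exception or not
  }
}
```

That asymmetry is why trimming in `close()` covers the exception path, while trimming only in `end()` would not.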
The TokenStream instance is typically thread-private, and re-used (by that one thread) for analyzing future docs.
Elasticsearch seems to never reinstantiate Tokenizers, instead reusing them for each field in an index, though I may be wrong. Or is elasticsearch using TokenStream the wrong way?
ES is using Lucene's Analyzer (well, DelegatingAnalyzerWrapper, I think), which (by default) reuses the Tokenizer instance, per thread.
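For illustration, a sketch of where that reuse comes from; `MyPatternAnalyzer` is hypothetical, but the caching behavior is Lucene's default `ReuseStrategy`:

```java
import java.util.regex.Pattern;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.pattern.PatternTokenizer;

// Hypothetical analyzer, for illustration only. With the default
// GLOBAL_REUSE_STRATEGY, Lucene caches the TokenStreamComponents in a
// per-thread CloseableThreadLocal, so createComponents() runs once per
// thread and the same PatternTokenizer instance then sees every document
// analyzed on that thread.
public class MyPatternAnalyzer extends Analyzer {
  private static final Pattern COMMA = Pattern.compile(",");

  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    // group = -1 splits on pattern matches instead of extracting a group
    Tokenizer tokenizer = new PatternTokenizer(COMMA, -1);
    return new TokenStreamComponents(tokenizer);
  }
}
```

So if the Tokenizer buffers its whole input (as PatternTokenizer does), that buffer lives as long as the thread keeps analyzing.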
It'd be great if this could get added to 4.10 so elasticsearch 1.x can pull it in too.
I think it's unlikely Lucene will have another 4.10.x release, and ES is releasing 2.0.0 (using Lucene 5.3.x) shortly.
Can you describe what impact you're seeing from this bug? How many PatternTokenizer instances is ES keeping in your case, how large are your docs, etc.? You could probably lower the ES bulk indexing thread pool size (if you don't in fact need so much concurrency) to reduce the impact of the bug ...
I think this bug means PatternTokenizer holds onto heap equal to the largest doc it ever saw, right? Does StringBuilder ever reduce its allocated space by itself?
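As far as I know it never shrinks on its own; you'd have to call `trimToSize()` or replace the instance. A quick sketch (the 16M-char buffer is an arbitrary stand-in for a huge doc):

```java
public class StringBuilderRetention {
  public static void main(String[] args) {
    StringBuilder sb = new StringBuilder();
    sb.append(new char[16 * 1024 * 1024]);  // buffer one very large "doc" (~32 MB of chars)
    sb.setLength(0);                        // logical length back to zero...
    // ...but the backing char[] keeps its full capacity, so the memory
    // stays reachable for as long as the StringBuilder field does.
    System.out.println("capacity after setLength(0): " + sb.capacity());
  }
}
```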