The cause is known, and the issue reported here is just broken usage of NumericTokenStream in user code that looks totally broken anyway.
Hmm, I don't think this usage is broken? The public TokenStream API has a contract, and the public NumericTokenStream fails to implement it properly.
Or are you saying one can never call captureState after calling end? But then how does one hang onto the "final" offset and posInc? And if that's the rule, shouldn't we fix MockTokenizer to enforce it? Yet CachingTokenFilter does exactly this, so maybe the bug is there?
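For context, Lucene's documented consumer workflow for a TokenStream is roughly: reset(), then incrementToken() until it returns false, then end() (which exposes the final offset and position increment), then close(). Here is a minimal self-contained sketch of a consumer that captures state after end(), the pattern under discussion; ToyTokenStream and its captureState are simplified stand-ins for illustration, not Lucene's actual classes:

```java
// Toy stand-in for a TokenStream; simplified, not Lucene's real API.
final class ToyTokenStream {
    private final String[] tokens;
    private int pos = -1;
    int offset;               // running end offset
    int positionIncrement = 1;

    ToyTokenStream(String... tokens) { this.tokens = tokens; }

    void reset() { pos = -1; offset = 0; }

    boolean incrementToken() {
        if (pos + 1 >= tokens.length) return false;
        pos++;
        offset += tokens[pos].length();
        return true;
    }

    // end() sets the "final" offset/posInc state that some consumers
    // need to hang onto, e.g. for multi-valued fields.
    void end() { positionIncrement = 0; }

    // Capturing state *after* end() is exactly what CachingTokenFilter
    // does, and what this thread is debating the legality of.
    int[] captureState() { return new int[] { offset, positionIncrement }; }
}

public class Main {
    public static void main(String[] args) {
        ToyTokenStream ts = new ToyTokenStream("foo", "bar");
        ts.reset();
        while (ts.incrementToken()) { /* consume tokens */ }
        ts.end();
        int[] finalState = ts.captureState(); // final offset and posInc
        System.out.println(finalState[0] + "," + finalState[1]); // prints "6,0"
    }
}
```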
Our analysis APIs are way too complex! They are like a 747 cockpit. We can't even agree on which buttons you are allowed to press, when.
I have no idea why Elasticsearch uses the class at all!
Well, Elasticsearch's simple query parser is just trying to create the right "equals" query from the incoming text, for a numeric field. Here's the query:
It does this today by using the appropriate tokenizer for the field type, and for numerics that's supposed to be NumericTokenStream. But then we hit this bug, originally seen in ES at https://github.com/elastic/elasticsearch/issues/16577
Yes, ES can work around this bug if we leave Lucene buggy, and that's exactly what that PR does: it "special cases" numerics by explicitly creating a TermQuery from the full-precision numeric term, instead of trusting the type-specific tokenizer to work correctly.
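To make the shape of that workaround concrete, here is a hedged, self-contained sketch; the FieldType enum, buildEqualsQuery method, and string-shaped "queries" are hypothetical stand-ins for illustration, not Elasticsearch's or Lucene's actual classes:

```java
import java.util.Locale;

public class Main {
    // Hypothetical stand-in for a mapped field's type.
    enum FieldType { TEXT, LONG }

    // Sketch of the special-case: for numeric fields, bypass the
    // per-type tokenizer (the buggy NumericTokenStream path) and build
    // an exact full-precision term query directly.
    static String buildEqualsQuery(FieldType type, String field, String text) {
        if (type == FieldType.LONG) {
            long value = Long.parseLong(text.trim()); // full-precision parse
            return String.format(Locale.ROOT, "TermQuery(%s:%d)", field, value);
        }
        // Text fields would still go through the normal analysis chain
        // (elided here).
        return String.format(Locale.ROOT, "AnalyzedQuery(%s:%s)", field, text);
    }

    public static void main(String[] args) {
        System.out.println(buildEqualsQuery(FieldType.LONG, "count", " 42 "));
        // prints "TermQuery(count:42)"
    }
}
```

The point of the sketch is only the branching: the numeric path never touches an analyzer at all, which is why it sidesteps the bug rather than fixing it.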
Because this is a corner case, and fixing it could carry performance costs, I'd tend to close this as won't fix.
Hmm I think correctness trumps performance?