SOLR-3961: rmuir and I were discussing the tests for LimitTokenCountFilter. It seems to me like there are at least two things that should be improved in the tests, and one possible improvement to the code itself...
A) fix TestLimitTokenCountAnalyzer to use MockTokenizer...
3) the closest thing I see to a test of LimitTokenCountFilter is TestLimitTokenCountAnalyzer - I realize now the reason its testLimitTokenCountAnalyzer doesn't get the same failure is that it wraps WhitespaceAnalyzer and StandardAnalyzer - should those be changed to use MockTokenizer?
I think we should always do this!
B) TestLimitTokenCountAnalyzer.testLimitTokenCountIndexWriter smells fishy to me; still discussing with rmuir...
4) TestLimitTokenCountAnalyzer also has a testLimitTokenCountIndexWriter that uses MockAnalyzer without calling setEnableChecks(false), which seems like it should trigger the same failure I got, since it uses MockTokenizer. But in general that test looks suspicious: it seems to add the exact number of tokens that the limit is configured for, and then asserts that the last token is in the index - but it never actually triggers the limiting logic, since exactly the allowed number of tokens are used.
Then that's fine, because when LimitTokenCountFilter consumes the whole stream, it's a "good consumer". It's only when it actually truncates that it breaks the TokenStream contract.
My concern wasn't that I thought that test was fishy just because it didn't call setEnableChecks(false) - my concern was that I don't see the point of the test at all as it's currently written, because it doesn't actually trigger the token limiting behavior (if it did, it would have to call setEnableChecks(false)) ... so what exactly is it supposed to be testing?
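To illustrate the boundary condition being discussed, here is a minimal, Lucene-free sketch (the LimitingIterator class and its fields are hypothetical names invented for this example, not Lucene APIs): a limiting wrapper over exactly limit-many tokens drains its input completely and never truncates, which is why the existing test never exercises the contract-violating path.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for the limiting-filter workflow: emits at most
// `limit` tokens, then stops, possibly leaving tokens unconsumed.
class LimitingIterator {
    private final Iterator<String> input;
    private final int limit;
    private int emitted = 0;
    boolean truncated = false; // did we stop while input still had tokens?

    LimitingIterator(List<String> tokens, int limit) {
        this.input = tokens.iterator();
        this.limit = limit;
    }

    String next() {
        if (emitted >= limit) {
            // The contract violation only happens here, when input remains.
            truncated = input.hasNext();
            return null;
        }
        if (!input.hasNext()) return null;
        emitted++;
        return input.next();
    }
}

public class LimitBoundaryDemo {
    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("a", "b", "c", "d");

        // Exactly as many tokens as the limit: the wrapper drains its input,
        // so it is a "good consumer" and stricter checks would pass.
        LimitingIterator exact = new LimitingIterator(tokens, 4);
        while (exact.next() != null) { }
        System.out.println("limit == size, truncated: " + exact.truncated); // false

        // One token over the limit: the input is cut off early - the case
        // the current test never exercises.
        LimitingIterator over = new LimitingIterator(tokens, 3);
        while (over.next() != null) { }
        System.out.println("limit < size, truncated: " + over.truncated); // true
    }
}
```

A test that actually exercised the truncating path would need an input longer than the configured limit (and, with MockTokenizer in play, a setEnableChecks(false) call).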
C) going back to a comment rmuir made earlier:
That's because LimitTokenCountFilter violates the workflow of the standard TokenStream API ... it cuts off the stream it consumes and calls end() even while that stream still has more tokens: this could easily cause unexpected side effects and bugs. ... But we knew about this anyway.
This makes me wonder if it would be worthwhile to add an option to LimitTokenCountFilter to make it always consume all the tokens from the stream it wraps - it could be off by default, so it would behave exactly like today, but if you are wrapping a TokenStream where it's really important to you that every token is consumed before end() is called, you could set it appropriately.
would that make sense?
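A rough, Lucene-free sketch of how such an option could behave (ConsumeAllLimitFilter and its field names are invented for illustration; this is not the actual Lucene implementation): when the flag is set, the filter keeps pulling and discarding tokens from the wrapped stream once the limit is hit, so end() is only reached after the input is fully drained.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch of the proposed consume-all option: with the flag set,
// the filter drains (and discards) the remaining input tokens after the
// limit is reached, instead of cutting the wrapped stream off early.
class ConsumeAllLimitFilter {
    private final Iterator<String> input;
    private final int limit;
    private final boolean consumeAllTokens;
    private int emitted = 0;
    boolean inputExhausted = false;

    ConsumeAllLimitFilter(List<String> tokens, int limit, boolean consumeAllTokens) {
        this.input = tokens.iterator();
        this.limit = limit;
        this.consumeAllTokens = consumeAllTokens;
    }

    String next() {
        if (emitted >= limit) {
            if (consumeAllTokens) {
                // Honor the stream contract: drain the wrapped input before
                // signaling the end of the stream.
                while (input.hasNext()) input.next();
            }
            inputExhausted = !input.hasNext();
            return null;
        }
        if (!input.hasNext()) { inputExhausted = true; return null; }
        emitted++;
        return input.next();
    }
}

public class ConsumeAllDemo {
    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("a", "b", "c", "d");

        ConsumeAllLimitFilter draining = new ConsumeAllLimitFilter(tokens, 2, true);
        while (draining.next() != null) { }
        System.out.println("consumeAllTokens=true, input exhausted: " + draining.inputExhausted);

        ConsumeAllLimitFilter cutting = new ConsumeAllLimitFilter(tokens, 2, false);
        while (cutting.next() != null) { }
        System.out.println("consumeAllTokens=false, input exhausted: " + cutting.inputExhausted);
    }
}
```

For what it's worth, later Lucene releases did add a boolean consumeAllTokens argument to the LimitTokenCountFilter constructor along these lines.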