I think it's used for both tokenized and un-tokenized... see line 1319.
It seems redundant to call clear() in both the consumer (DocumentsWriter) and producer (Tokenizer).
You're right again, Yonik; I missed line 1319.
But I think it would be cleaner/safer to move the responsibility for calling clear() from consumers to producers
(the producer being the deepest TokenStream in the call chain, i.e., the one that would instantiate a new Token if it implemented next()).
Otherwise you get bugs like the one I had in testStopPositons() in the patch.
The test chains two stop filters:
- a = WhitespaceAnalyzer()
- f1 = StopFilter(a)
- f2 = StopFilter(f1)
Now the test itself calls next().
StopFilter implements only next(Token).
So this is the sequence we get:
- the test calls f2.next()
- TokenStream.next() calls f2.next(new Token())
- f2.next(r) calls f1.next(r) repeatedly (until r is not stopped)
- f1.next(r) calls a's TokenStream's next(r) repeatedly (until r is not stopped)
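To make that sequence concrete, here is a minimal, self-contained sketch with hypothetical stand-in classes (Token, TokenStream, WordTokenizer, StopFilter are simplified models, not the real Lucene API). It follows the producer-side convention I'm proposing: the tokenizer calls clear() on the reused token before filling it, so every filter upstream sees a fresh position increment.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

// Simplified stand-ins, not the real Lucene classes.
class Token {
    String term;
    int posIncr = 1;
    void clear() { term = null; posIncr = 1; }
}

abstract class TokenStream {
    // Default next() wraps the reusable-token variant, as in the sequence above.
    Token next() { return next(new Token()); }
    abstract Token next(Token r);
}

class WordTokenizer extends TokenStream {
    private final Iterator<String> words;
    WordTokenizer(List<String> words) { this.words = words.iterator(); }
    Token next(Token r) {
        if (!words.hasNext()) return null;
        r.clear();            // producer-side clear(): reset the reused token first
        r.term = words.next();
        return r;
    }
}

class StopFilter extends TokenStream {
    private final TokenStream in;
    private final Set<String> stops;
    StopFilter(TokenStream in, Set<String> stops) { this.in = in; this.stops = stops; }
    Token next(Token r) {
        int skipped = 0;
        for (Token t = in.next(r); t != null; t = in.next(r)) {
            if (!stops.contains(t.term)) {
                t.posIncr += skipped;   // account for the stop words we dropped
                return t;
            }
            skipped += t.posIncr;
        }
        return null;
    }
}

class Demo {
    public static void main(String[] args) {
        TokenStream a = new WordTokenizer(Arrays.asList("the", "quick", "the", "the", "fox"));
        TokenStream f1 = new StopFilter(a, new HashSet<>(Arrays.asList("the")));
        TokenStream f2 = new StopFilter(f1, new HashSet<>(Arrays.asList("quick")));
        Token t = f2.next();    // TokenStream.next() wraps f2.next(new Token())
        System.out.println(t.term + " posIncr=" + t.posIncr);  // fox posIncr=5
    }
}
```

In this sketch, deleting the r.clear() call in WordTokenizer lets the stale increment carry over across f2's retries, and "fox" comes out with posIncr 8 instead of 5 — exactly the class of bug described here.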
The cause of the bug was that when f1 returns a token r, it may have set r's position increment to something other than 1 (because it skipped stop words). But when f2 calls f1 again (because f2 stopped r), that position increment should have been reset to 1.

This can also be fixed by changing StopFilter to clear() r before every call to f1.next(r) except the first, since on the first call it can assume that its own consumer already cleared r. Since most words are not stop words, this distinction between the first call and successive calls is important, performance-wise.
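As a sketch, here is what that consumer-side fix could look like, again using minimal hypothetical stand-in classes (Token, TokenStream, WordTokenizer, StopFilter as simplified models, not the real Lucene API). Here the tokenizer deliberately does not clear(), and StopFilter compensates: it clears the reused token before every retry, but skips the clear() on its first call downstream, trusting its own consumer.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

// Simplified stand-ins, not the real Lucene classes.
class Token {
    String term;
    int posIncr = 1;
    void clear() { term = null; posIncr = 1; }
}

abstract class TokenStream {
    Token next() { return next(new Token()); }  // the consumer hands down a fresh token
    abstract Token next(Token r);
}

class WordTokenizer extends TokenStream {
    private final Iterator<String> words;
    WordTokenizer(List<String> words) { this.words = words.iterator(); }
    Token next(Token r) {
        if (!words.hasNext()) return null;
        r.term = words.next();   // fills the term only; does NOT clear() r
        return r;
    }
}

class StopFilter extends TokenStream {
    private final TokenStream in;
    private final Set<String> stops;
    StopFilter(TokenStream in, Set<String> stops) { this.in = in; this.stops = stops; }
    Token next(Token r) {
        Token t = in.next(r);    // first call: assume our consumer already cleared r
        int skipped = 0;
        while (t != null && stops.contains(t.term)) {
            skipped += t.posIncr;
            r.clear();           // successive calls: we must clear the reused token
            t = in.next(r);
        }
        if (t != null) t.posIncr += skipped;  // account for the stop words we dropped
        return t;
    }
}

class Demo {
    public static void main(String[] args) {
        TokenStream a = new WordTokenizer(Arrays.asList("the", "quick", "the", "the", "fox"));
        TokenStream f2 = new StopFilter(new StopFilter(a, new HashSet<>(Arrays.asList("the"))),
                                        new HashSet<>(Arrays.asList("quick")));
        Token t = f2.next();
        System.out.println(t.term + " posIncr=" + t.posIncr);  // fox posIncr=5
    }
}
```

This works, but every filter in the chain has to carry the same first-call/retry bookkeeping, which is what makes the producer-side convention below feel simpler.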
Now, this is a little complicated (and not only because of my writing style :-) ),
so I think moving the clear() responsibility to the deepest producer would make things simpler and safer.