i worry about this clearAttributes solution though, perhaps WordTokenFilter should use captureState/restoreState api, like the ThaiWordFilter does (very similar analyzer).
If i use capture/restoreState this should not be a problem right?
I think the filter is fine how it is at the moment. The problem is only the missing clearAttributes when you produce more than one token out of one big one (the sentence). No need for captureState, because the tokens are new ones. If somebody adds custom attributes, they would have cleared, but would that be not correct?
I guess the only advantage would be that it would preserve any customAttributes or payloads that someone might add after the SentenceTokenizer, but before the WordTokenFilter propagating them downto the individual words.
Does this make sense to insert a filter between both? The transition from sentence tokens to word tokens creates totally different tokens, how should a payload or other custom att work correct here? Normally such payload filters should be inserted after the WordFilter. The problem of capture/restore state is addiional copy cost for nothing (the long sentence token is copied again and again and always reset to the text word).