Might this be much faster than CharArraySet?
I ran indexing tests a while ago (Reuters) with CharArraySet itself implemented with a DFA, and it was slightly faster, but not by much. I think this is because English words are usually not very long (average length = 5). For other languages this technique might save some CPU time, but there are some "problems" I imagine:
- Building an automaton from a list of words is more expensive, although Dawid Weiss has implemented an addition to automaton that does this fast.
- In general, building the Automaton, RunAutomaton, etc. is more "heavy" I would think, but Mike McCandless hacked away a lot of this heaviness when we converted to UTF-32.
- The CharacterRunAutomaton is not optimized right now: we disabled the classmap for chars because it consumes more RAM, so it currently does a binary search on each input character. I think if we were to care about performance on char, we should use a classmap for 0x0-0xffff and binary search the rest, or something similar.
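To make the last point concrete, here is a minimal sketch of the "classmap for 0x0-0xffff, binary search the rest" idea. It is not Lucene's CharacterRunAutomaton code: the class name `CharClassLookup` and its methods are hypothetical, and it assumes the automaton's transition labels have already been reduced to a sorted array of interval start points beginning at 0.

```java
import java.util.Arrays;

/** Hypothetical illustration only; this is not Lucene's CharacterRunAutomaton. */
final class CharClassLookup {
    private static final int BMP_SIZE = 0x10000;

    // Sorted, strictly ascending interval start points; points[0] must be 0.
    // Code point c belongs to class i when points[i] <= c < points[i + 1].
    private final int[] points;

    // Precomputed class for every code point in 0x0-0xffff (costs 256 KB as int[]).
    private final int[] bmpClassMap = new int[BMP_SIZE];

    CharClassLookup(int[] sortedStartPoints) {
        if (sortedStartPoints.length == 0 || sortedStartPoints[0] != 0) {
            throw new IllegalArgumentException("start points must begin with 0");
        }
        this.points = sortedStartPoints.clone();
        int cls = 0;
        for (int c = 0; c < BMP_SIZE; c++) {
            // Advance to the last interval whose start point is <= c.
            while (cls + 1 < points.length && points[cls + 1] <= c) {
                cls++;
            }
            bmpClassMap[c] = cls;
        }
    }

    /** Binary search on every lookup: O(log n) per input character. */
    int classOfBinarySearch(int codePoint) {
        int idx = Arrays.binarySearch(points, codePoint);
        // When not found, idx == -(insertionPoint) - 1; the class is insertionPoint - 1.
        return idx >= 0 ? idx : -idx - 2;
    }

    /** Table lookup for 0x0-0xffff, binary search only for supplementary code points. */
    int classOf(int codePoint) {
        return codePoint < BMP_SIZE ? bmpClassMap[codePoint] : classOfBinarySearch(codePoint);
    }
}
```

The table makes the common case a single array read, at the price of the extra RAM mentioned above; a `short[]` or `byte[]` table would reduce that cost when the number of classes is small.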
Somewhat related: a while ago I tested this with CharArraySet as a DFA and opened this issue: LUCENE-2227. But obviously this is not the only way, as this example shows filtering on the DFA itself (and not using CharArraySet at all).
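For readers unfamiliar with the idea, "filtering on the DFA itself" roughly means running the term's char buffer through a deterministic automaton instead of hashing it into a set. The sketch below uses a plain trie as the DFA; the class `WordDfa` is hypothetical and is far simpler than Lucene's automaton classes, which use compact transition tables rather than hash maps.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical trie-based DFA over a fixed word list; not a Lucene class. */
final class WordDfa {
    private static final class State {
        final Map<Character, State> next = new HashMap<>();
        boolean accept;
    }

    private final State start = new State();

    WordDfa(Iterable<String> words) {
        for (String w : words) {
            State s = start;
            for (int i = 0; i < w.length(); i++) {
                // Follow the existing transition or create a new state for this char.
                s = s.next.computeIfAbsent(w.charAt(i), k -> new State());
            }
            s.accept = true;
        }
    }

    /** Returns true if buf[offset, offset + length) is exactly one of the words. */
    boolean run(char[] buf, int offset, int length) {
        State s = start;
        for (int i = offset; i < offset + length; i++) {
            s = s.next.get(buf[i]);
            if (s == null) {
                return false; // dead state: no word continues with this char
            }
        }
        return s.accept;
    }
}
```

A term is rejected as soon as a character has no transition, so a non-matching term can fail early without being read to the end.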
So in general, I have those concerns right now, but maybe in the future, once some things are addressed, we could at least make an optional StopFilter impl or something similar.
One thing I like about this filter personally is that rejected terms always get (optionally) the posInc increased... I do not think our existing KeepWordFilter or LengthFilter do this, but maybe I am wrong.
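As an illustration of the posInc behavior, here is a minimal sketch that works on plain lists rather than a real TokenStream. The `Tok` and `PosIncFilterSketch` classes are hypothetical stand-ins, not Lucene APIs; the assumption is that the position increments of rejected terms are added to the next accepted term when the option is enabled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical token: a term plus its position increment. */
final class Tok {
    final String term;
    final int posInc;

    Tok(String term, int posInc) {
        this.term = term;
        this.posInc = posInc;
    }
}

/** Hypothetical sketch of a filter that preserves positions of rejected terms. */
final class PosIncFilterSketch {
    static List<Tok> filter(List<Tok> in, Predicate<String> accept, boolean enablePosInc) {
        List<Tok> out = new ArrayList<>();
        int skipped = 0; // increments of rejected terms since the last accepted one
        for (Tok t : in) {
            if (accept.test(t.term)) {
                int inc = enablePosInc ? t.posInc + skipped : t.posInc;
                out.add(new Tok(t.term, inc));
                skipped = 0;
            } else {
                skipped += t.posInc;
            }
        }
        return out;
    }
}
```

With the option enabled, a phrase query still sees a gap where rejected terms used to be; with it disabled, the surviving terms appear adjacent.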