From a brief inspection of the code I concluded that you're running a fresh search from each input position to find the maximum match; is that right?
Right. I mean, that's how the code currently works ... I'm just
replacing the recursive Map with an FST.
If so, the worst-case running time is quite bad for patterns like "aaa*b" and input strings like "aaaa*c" (replace * with a large number of repetitions).
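To make that worst case concrete, here is a hypothetical Python sketch (not the actual Lucene code, and using a plain set as the dictionary rather than an FST) of the restart-at-every-position approach being described. Each position probes forward independently, so on inputs where long prefixes keep almost matching, the same characters get re-scanned over and over:

```python
# Hypothetical sketch: naive longest-match scanning that restarts a
# dictionary probe at every input position. Worst case is quadratic in
# the input length when long prefixes repeatedly almost-match.
def longest_match_at(text, pos, dictionary):
    """Return the longest dictionary entry starting at text[pos], or None."""
    best = None
    for end in range(pos + 1, len(text) + 1):
        if text[pos:end] in dictionary:
            best = text[pos:end]
    return best

def scan(text, dictionary):
    """At each position, probe forward for the longest match."""
    matches = []
    for pos in range(len(text)):
        m = longest_match_at(text, pos, dictionary)
        if m is not None:
            matches.append((pos, m))
    return matches
```

On a string of n repeated characters with a dictionary of matching prefixes, each position re-reads up to n characters, which is the quadratic blowup described above.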
Yes... though I suspect in practice the impact is minimal for Lucene's
typical use cases. Still it would be nice to fix
The way I would try to implement it is to do state tracking much like in a non-deterministic automaton, where you keep a stack of "active" states that you follow in the automaton, moving each one from state to state on every input symbol.
OK! I think I understand... I think this amounts to putting a .*
cycle on the FST's start node, and then doing subset construction to
"determinize" as you traverse? I.e., track all states that the input
characters could be in, so that you traverse the input only once.
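If it helps, here is a rough sketch of that one-pass active-state idea (hypothetical Python over a plain dict trie, not Lucene's FST): a new walk is started at every position, but all live walks advance together on each input symbol, so the input is read exactly once:

```python
# Hypothetical sketch: NFA-style matching over a dict-based trie.
# All in-flight walks are advanced in lockstep per input character.
def build_trie(patterns):
    root = {}
    for p in patterns:
        node = root
        for ch in p:
            node = node.setdefault(ch, {})
        node["$"] = p  # terminal marker storing the matched pattern
    return root

def scan_once(text, patterns):
    root = build_trie(patterns)
    active = []   # (start_pos, trie_node) walks currently alive
    matches = []
    for i, ch in enumerate(text):
        active.append((i, root))  # start a new walk at each position
        next_active = []
        for start, node in active:
            nxt = node.get(ch)
            if nxt is None:
                continue          # this walk dies on a mismatch
            if "$" in nxt:
                matches.append((start, nxt["$"]))
            next_active.append((start, nxt))
        active = next_active
    return matches
```

The number of live walks is bounded by the length of the longest pattern (a walk dies within that many steps), so this is O(n * maxPatternLength) rather than the restart-quadratic behavior above.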
I suppose what Mihov et al. do is to statically (on the FST) determine which states can lead to this "deferred match queue" and simply eliminate them, but I haven't really looked into it.
Yeah, I think the paper (and Aho/Corasick) essentially does that
determinization "up front".
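For reference, a hedged sketch of what "up front" means in Aho/Corasick terms (again plain Python dicts, not Lucene's FST, and not what either the paper or Lucene literally implements): failure links are precomputed once, so scanning follows a single deterministic walk and never rewinds the input:

```python
from collections import deque

# Hypothetical sketch of Aho-Corasick: the trie plus precomputed
# failure links form a deterministic machine; matching is one pass
# with a single current state instead of a set of active walks.
def build_automaton(patterns):
    trie = [{"next": {}, "fail": 0, "out": []}]  # node 0 is the root
    for p in patterns:
        node = 0
        for ch in p:
            if ch not in trie[node]["next"]:
                trie.append({"next": {}, "fail": 0, "out": []})
                trie[node]["next"][ch] = len(trie) - 1
            node = trie[node]["next"][ch]
        trie[node]["out"].append(p)
    # BFS to set failure links; depth-1 nodes fail to the root (0).
    q = deque(trie[0]["next"].values())
    while q:
        node = q.popleft()
        for ch, child in trie[node]["next"].items():
            f = trie[node]["fail"]
            while f and ch not in trie[f]["next"]:
                f = trie[f]["fail"]
            fail = trie[f]["next"].get(ch, 0)
            trie[child]["fail"] = fail
            # inherit matches that end at the failure target
            trie[child]["out"] = trie[child]["out"] + trie[fail]["out"]
            q.append(child)
    return trie

def search(text, patterns):
    trie = build_automaton(patterns)
    node, matches = 0, []
    for i, ch in enumerate(text):
        while node and ch not in trie[node]["next"]:
            node = trie[node]["fail"]   # precomputed: no input rewind
        node = trie[node]["next"].get(ch, 0)
        for p in trie[node]["out"]:
            matches.append((i - len(p) + 1, p))
    return matches
```

The key difference from the on-the-fly version: the "which states could we be in" bookkeeping is baked into the failure links at build time, so the scan itself is a plain DFA walk.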
SynonymFilter also has the same limitation....