Description
I created a new query, called TermAutomatonQuery, that's a proximity
query to generalize MultiPhraseQuery/SpanNearQuery: it lets you
construct an arbitrary automaton whose transitions are whole terms, and
then find all documents that the automaton matches. This is different
from a "normal" automaton, whose transitions are usually
bytes/characters within a term.
So, if the automaton has just 1 transition, it's just an expensive
TermQuery. If you have two transitions in sequence, it's a phrase
query of two terms. You can express synonyms by using transitions
that overlap one another, but the automaton doesn't have to be a
"sausage" (as MultiPhraseQuery requires), i.e. it "respects" posLength
(at query time).
It also allows "any" transitions, to match any term, so you can do
sloppy matching and span-like queries, e.g. find "lucene" and "python"
with up to 3 other terms in between.
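
To make the idea concrete, here is a minimal, self-contained sketch of an automaton whose transitions are whole terms, including "any" transitions. It is a toy model over an in-memory token list, not the patch's implementation: the class name TermAutomatonSketch, its method names, and the brute-force matching loop are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only: not the patch's implementation and not Lucene's API.
class TermAutomatonSketch {
    // One transition; a null term means "any term".
    private static final class Transition {
        final int source, dest;
        final String term;
        Transition(int source, int dest, String term) {
            this.source = source;
            this.dest = dest;
            this.term = term;
        }
    }

    private final List<Transition> transitions = new ArrayList<>();
    private final Set<Integer> accept = new HashSet<>();
    private int numStates = 0;

    int createState() { return numStates++; }  // state 0 is the start state
    void setAccept(int state) { accept.add(state); }
    void addTransition(int source, int dest, String term) {
        transitions.add(new Transition(source, dest, term));
    }
    void addAnyTransition(int source, int dest) {
        transitions.add(new Transition(source, dest, null));
    }

    // True if some contiguous run of tokens drives the automaton
    // from the start state to an accept state.
    boolean matches(List<String> tokens) {
        for (int start = 0; start < tokens.size(); start++) {
            Set<Integer> current = new HashSet<>();
            current.add(0);
            for (int pos = start; pos < tokens.size() && !current.isEmpty(); pos++) {
                Set<Integer> next = new HashSet<>();
                for (Transition t : transitions) {
                    boolean termOk = t.term == null || t.term.equals(tokens.get(pos));
                    if (termOk && current.contains(t.source)) {
                        next.add(t.dest);
                    }
                }
                current = next;
                for (int s : current) {
                    if (accept.contains(s)) return true;
                }
            }
        }
        return false;
    }

    // "lucene" followed by "python" with up to 3 other terms in between.
    static TermAutomatonSketch luceneNearPython() {
        TermAutomatonSketch a = new TermAutomatonSketch();
        int start = a.createState();
        int afterLucene = a.createState();
        int end = a.createState();
        a.setAccept(end);
        a.addTransition(start, afterLucene, "lucene");
        a.addTransition(afterLucene, end, "python");
        int prev = afterLucene;
        for (int i = 0; i < 3; i++) {
            int gap = a.createState();        // one more intervening term consumed
            a.addAnyTransition(prev, gap);
            a.addTransition(gap, end, "python");
            prev = gap;
        }
        return a;
    }
}
```

With this sketch, the tokens [lucene, a, b, c, python] match, while four intervening terms do not. Dropping the "any" transitions leaves a plain two-term phrase.
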
I also added a class to convert a TokenStream directly to the
automaton for this query, preserving posLength. (Of course, the index
can't store posLength, so the matching won't be fully correct if any
indexed token has posLength != 1.) But if you do query-time-only
synonyms then the matching should finally be correct.
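
As a rough illustration of how posLength can be preserved, the sketch below turns a list of tokens into automaton transitions by treating each position as a state, so a token at position p with posLength n becomes a transition from state p to state p + n. The Token class, the TokenGraphSketch name, and the "wi fi" / "wifi" example are hypothetical; the sketch also assumes there are no position gaps (no posInc greater than 1), which a real converter would have to handle.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only: plain Java stand-ins for a TokenStream's attributes.
class TokenGraphSketch {
    static final class Token {
        final String term;
        final int posInc;     // position increment relative to the previous token
        final int posLength;  // how many positions this token spans
        Token(String term, int posInc, int posLength) {
            this.term = term;
            this.posInc = posInc;
            this.posLength = posLength;
        }
    }

    // Returns transitions rendered as "source -term-> dest".
    static List<String> toTransitions(List<Token> tokens) {
        List<String> out = new ArrayList<>();
        int pos = -1;  // the first token's posInc of 1 lands on position 0
        for (Token t : tokens) {
            pos += t.posInc;
            out.add(pos + " -" + t.term + "-> " + (pos + t.posLength));
        }
        return out;
    }

    // "wi fi network" with a query-time synonym "wifi" spanning two positions.
    static List<Token> wifiExample() {
        return Arrays.asList(
            new Token("wi", 1, 1),
            new Token("wifi", 0, 2),
            new Token("fi", 1, 1),
            new Token("network", 1, 1));
    }
}
```

Here "wifi" goes directly from state 0 to state 2, in parallel with the two-step path "wi", "fi". That shape is not a "sausage", which is why it needs posLength to be respected.
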
I haven't tested performance, but I suspect it's fairly slow: its
cost is O(sum of totalTF) over all terms "used" in the automaton. There
are some optimizations we could do, e.g. detecting that some terms in
the automaton can be upgraded to MUST (right now they are all
effectively SHOULD).
I'm not sure how it should assign scores (punted on that for now), but
the matching seems to be working.