Here's the dev thread that led to this issue, for context:
I think the syn filter here takes generally the same approach as
Solr's (now moved to modules/analyzer in trunk) SynonymFilter, ie
injecting overlapping words as the expanded synonyms unwind? Are
there salient differences between the two? Maybe we can merge them
and get the best of both?
There are tricky tradeoffs of index time vs search time – index time
is less flexible (you must re-index on changing the synonyms) but
gives better search perf (an OR'd-in TermQuery instead of expanding
to many PhraseQuerys); index time also gives better scoring (the IDF
is "true" if the syn is a term in the index, vs PhraseQuery which
necessarily approximates, possibly badly).
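To make the tradeoff concrete, here's a toy sketch (plain Python, not Lucene code) of the two expansion points; all names here (SYNONYMS, expand_query_time, expand_index_time, the SYN: prefix) are made up for illustration:

```python
SYNONYMS = {"restaurant": [["food", "place"]]}  # term -> list of synonym phrases

def expand_query_time(term):
    """Query-time: the term becomes an OR of a TermQuery plus one
    PhraseQuery per multi-word synonym (the expensive side)."""
    clauses = [("TermQuery", term)]
    for phrase in SYNONYMS.get(term, []):
        clauses.append(("PhraseQuery", tuple(phrase)))
    return ("BooleanQuery:OR", clauses)

def expand_index_time(tokens):
    """Index-time: rewrite any matching phrase to a single synonym-group
    token, so query time only needs a cheap TermQuery on the group.
    (A real filter would also fold the single word "restaurant" itself
    into the group; omitted to keep the sketch short.)"""
    out, i = [], 0
    while i < len(tokens):
        for term, phrases in SYNONYMS.items():
            hit = next((p for p in phrases if tokens[i:i + len(p)] == p), None)
            if hit:
                out.append("SYN:" + term)
                i += len(hit)
                break
        else:
            out.append(tokens[i])
            i += 1
    return out
```

So "a great food place in boston" indexes as a, great, SYN:restaurant, in, boston – but note the lost positions, which is exactly the problem discussed below.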
There is also the controversial question of whether using manually
defined synonyms even helps relevance. As Robert points out, doing
an iteration of feedback (take the top N docs that match the user's
query, extract their salient terms, and do a 2nd search expanded w/
those salient terms) sort of accomplishes something similar (and
perhaps better, since it's not just synonyms but also uncovers
"relationships", like Barack Obama is a US president), but w/o the
manual effort of creating the synonyms. And it's been shown to help
relevance.
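The feedback loop Robert describes can be sketched in a few lines; this is a toy model (made-up function names, frequency as a stand-in for salience – real systems would weight by something like tf-idf):

```python
from collections import Counter

def search(docs, query_terms, top_n=10):
    """Toy ranked retrieval: score = number of query terms in the doc."""
    scored = [(sum(t in doc for t in query_terms), doc) for doc in docs]
    return [d for s, d in sorted(scored, key=lambda x: -x[0]) if s > 0][:top_n]

def feedback_search(docs, query_terms, top_n=2, expand_k=2):
    """One feedback iteration: search, mine frequent ("salient") terms
    from the top hits, then search again with the expanded query."""
    top = search(docs, query_terms, top_n)
    counts = Counter(t for doc in top for t in doc if t not in query_terms)
    salient = [t for t, _ in counts.most_common(expand_k)]
    return search(docs, query_terms + salient)
```

With docs about "obama" and "president", a query for just "obama" can then also retrieve docs that only mention "president" – the uncovered "relationship", with no hand-built synonym list.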
Still, I think Lucene should make both index and query time expansion
feasible; at the plumbing level we don't have a horse in that race.
If you do index syns at index time, you really should just inject a
single syn token, representing any occurrence of a term/phrase that
this synonym accepts (and do the matching thing @ query time). But,
then, as Earwin pointed out, Lucene is missing the notion of a "span"
saying how many positions this term took up (we only encode the pos
incr, reflecting where this token begins relative to the last token's
start).
EG if "food place" is a syn for "restaurant", and you have a doc
"... a great food place in boston ...", and you inject RESTAURANT (syn
group) "over" the phrase "food place", then an exact phrase query
won't work right – you can't have "a great RESTAURANT in boston"
match.
One simple way to express this during analysis is as a new SpanAttr
(say), which expresses how many positions the token takes up. We
could then index this, doing so efficiently for the default case
(span==1), and then in addition to .nextPosition() you could also
ask for .span() from DocsAndPositionsEnum.
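A plain-Python model of that attribute (the Token/resolve names are invented; Lucene's real analysis chain uses attribute classes, and it later grew a PositionLengthAttribute that plays roughly this role):

```python
class Token:
    """Toy token: posIncr as Lucene encodes today, plus the proposed span."""
    def __init__(self, term, pos_incr=1, span=1):
        self.term, self.pos_incr, self.span = term, pos_incr, span

def resolve(stream):
    """Turn posIncr/span into absolute (term, start, end) positions."""
    pos, out = -1, []
    for t in stream:
        pos += t.pos_incr
        out.append((t.term, pos, pos + t.span))
    return out

# "... a great food place in boston ..." with RESTAURANT injected over
# "food place": posIncr=0 stacks it on "food", span=2 covers both words,
# so a consumer can tell that the next position after it is 4 ("in").
stream = [Token("a"), Token("great"), Token("food"),
          Token("RESTAURANT", pos_incr=0, span=2),
          Token("place"), Token("in"), Token("boston")]
```

Resolving the stream gives RESTAURANT the range (2, 4) – same start as "food", but ending where "in" begins, which is what the exact phrase query needs.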
But, generalizing this a bit, really we are indexing a graph, where
the nodes are positions and the edges are tokens connecting them.
With only posIncr & span, you restrict the nodes to be a single linear
chain; but if we generalize it, then nodes can be part of side
branches; eg the node in the middle of "food place" need not be a
"real" position if it were injected into a document / query containing
restaurant. Hard boundaries (eg b/w sentences) would be more cleanly
represented here – there would not even be an edge between the nodes.
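The graph view can be sketched directly (a toy adjacency map, nothing Lucene-specific): injecting RESTAURANT just adds a second edge out of node 2, skipping the node in the middle of "food place", and a hard sentence boundary would simply be a node with no outgoing edge.

```python
# Nodes are positions, edges are tokens.
graph = {
    0: [("a", 1)],
    1: [("great", 2)],
    2: [("food", 3), ("RESTAURANT", 4)],  # side branch over two positions
    3: [("place", 4)],
    4: [("in", 5)],
    5: [("boston", 6)],
}

def phrase_matches(graph, words):
    """Exact-phrase match = a path whose edge labels spell the phrase."""
    def walk(node, rest):
        if not rest:
            return True
        return any(term == rest[0] and walk(dest, rest[1:])
                   for term, dest in graph.get(node, []))
    return any(walk(node, words) for node in graph)
```

Both "a great food place in boston" and "a great RESTAURANT in boston" now match, because both are paths through the graph.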
We'd then need an AutomatonWordQuery – the same idea as
AutomatonQuery, except at the word level not at the character level.
MultiPhraseQuery would then be a special case of AutomatonWordQuery.
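A minimal sketch of that special case (toy code, not the proposed API): a MultiPhraseQuery is a word-level automaton whose states form a single chain, with possibly several accepted words per hop.

```python
# (big|great) restaurant, as a linear word-level automaton:
multi_phrase = [{"big", "great"}, {"restaurant"}]

def accepts(chain, words):
    """Accept iff the word sequence walks the chain end to end."""
    return (len(words) == len(chain) and
            all(w in allowed for w, allowed in zip(words, chain)))
```

A full AutomatonWordQuery would allow arbitrary states and transitions, not just this chain shape.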
Then analysis becomes the serializing of this graph... analysis would
have to flatten out the nodes into a single linear chain, and then
express the edges using position & span. I think position would no
longer be a hard relative position. EG when injecting "food place" (=
2 tokens) into the tokens that contain restaurant, both food and
restaurant would have the same start position, but food would have
span 1 and restaurant would have span 2.
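That flattening step can be sketched too (same toy graph shape as above; the flatten name and tuple layout are invented): assign each node a position on the linear chain, then emit each edge as (term, startPosition, span).

```python
def flatten(graph):
    """Serialize a token graph (node -> [(term, dest)]) into
    (term, startPosition, span) tuples: nodes collapse onto a linear
    chain of positions; span = how many positions the edge covers."""
    nodes = sorted(set(graph) | {d for es in graph.values() for _, d in es})
    pos = {n: i for i, n in enumerate(nodes)}
    return sorted((term, pos[src], pos[dst] - pos[src])
                  for src, es in graph.items() for term, dst in es)

# "food place" (2 tokens) injected alongside "restaurant": both edges
# leave the same node, so both tokens share start position 0, with
# food getting span 1 and restaurant span 2.
g = {0: [("restaurant", 2), ("food", 1)], 1: [("place", 2)]}
```

This reproduces the example in the text: food at position 0 with span 1, place at position 1 with span 1, restaurant at position 0 with span 2.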
(Sorry for the rambling... this is a complex topic!!).