[LUCENE-5815] Add TermAutomatonQuery, for proximity matching that generalizes MultiPhraseQuery/SpanNearQuery - ASF JIRA

Details

Type: New Feature
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 4.10, 6.0
Component/s: None
Labels: None

Lucene Fields: New

Description

I created a new query, called TermAutomatonQuery, that's a proximity
query generalizing MultiPhraseQuery/SpanNearQuery: it lets you
construct an arbitrary automaton whose transitions are whole terms, and
then finds all documents that the automaton matches. This is different
from a "normal" automaton, whose transitions are usually
bytes/characters within a term.

So, if the automaton has just 1 transition, it's just an expensive
TermQuery. If you have two transitions in sequence, it's a phrase
query of two terms. You can express synonyms by using transitions
that overlap one another, but the automaton doesn't have to be a
"sausage" (as MultiPhraseQuery requires), i.e. it "respects" posLength
(at query time).

It also allows "any" transitions, to match any term, so you can do
sloppy matching and span-like queries, e.g. find "lucene" and "python"
with up to 3 other terms in between.
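
To make the idea concrete, here is a minimal toy model in plain Java. It
does not use Lucene and is not the code from the patch: the class
TermAutomatonSketch and its methods are hypothetical names chosen for
illustration, and a document is assumed to be a flat list of tokens with
one token per position. It shows a non-"sausage" synonym graph and the
"lucene ... python" example with up to 3 "any" transitions in between.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model only: transitions consume whole terms; state 0 is the start state. */
class TermAutomatonSketch {
    private static final class Transition {
        final int source, dest;
        final String term; // null means "any term"
        Transition(int source, int dest, String term) {
            this.source = source; this.dest = dest; this.term = term;
        }
    }

    private final List<Transition> transitions = new ArrayList<>();
    private final Set<Integer> acceptStates = new HashSet<>();
    private int stateCount = 0;

    int createState() { return stateCount++; }
    void setAccept(int state) { acceptStates.add(state); }
    void addTransition(int source, int dest, String term) {
        transitions.add(new Transition(source, dest, term));
    }
    void addAnyTransition(int source, int dest) {
        transitions.add(new Transition(source, dest, null));
    }

    /** True if some contiguous run of tokens leads from state 0 to an accept state. */
    boolean matches(List<String> docTokens) {
        for (int start = 0; start < docTokens.size(); start++) {
            Set<Integer> current = new HashSet<>();
            current.add(0);
            for (int pos = start; pos < docTokens.size() && !current.isEmpty(); pos++) {
                Set<Integer> next = new HashSet<>();
                for (Transition t : transitions) {
                    boolean termOk = t.term == null || t.term.equals(docTokens.get(pos));
                    if (termOk && current.contains(t.source)) next.add(t.dest);
                }
                for (int s : next) if (acceptStates.contains(s)) return true;
                current = next;
            }
        }
        return false;
    }

    /** "dns" spans the same positions as "domain name system" (not a sausage). */
    static TermAutomatonSketch dnsServer() {
        TermAutomatonSketch q = new TermAutomatonSketch();
        for (int i = 0; i < 5; i++) q.createState();
        q.addTransition(0, 3, "dns");
        q.addTransition(0, 1, "domain");
        q.addTransition(1, 2, "name");
        q.addTransition(2, 3, "system");
        q.addTransition(3, 4, "server");
        q.setAccept(4);
        return q;
    }

    /** "lucene", then 0 to 3 arbitrary terms, then "python". */
    static TermAutomatonSketch luceneNearPython() {
        TermAutomatonSketch q = new TermAutomatonSketch();
        for (int i = 0; i < 6; i++) q.createState();
        q.addTransition(0, 1, "lucene");
        for (int s = 1; s <= 4; s++) {
            q.addTransition(s, 5, "python");        // stop skipping here
            if (s < 4) q.addAnyTransition(s, s + 1); // skip one more term
        }
        q.setAccept(5);
        return q;
    }

    public static void main(String[] args) {
        System.out.println(dnsServer().matches(Arrays.asList("dns", "server")));                      // true
        System.out.println(dnsServer().matches(Arrays.asList("domain", "name", "system", "server"))); // true
        System.out.println(luceneNearPython().matches(Arrays.asList("lucene", "has", "a", "python", "port"))); // true
    }
}
```

The matcher simply restarts the automaton at every token position, which
is enough to show the semantics but says nothing about how the real
query iterates postings.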

I also added a class to convert a TokenStream directly to the
automaton for this query, preserving posLength. (Of course, the index
can't store posLength, so the matching won't be fully correct if any
indexed token has posLength != 1.) But if you do query-time-only
synonyms then the matching should finally be correct.
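
The conversion can be pictured roughly as follows. This is a sketch
under stated assumptions, not the class from the patch: Token is a
hypothetical stand-in for one analyzed token carrying its position
increment and position length, states are simply token positions, and
the token stream is assumed to have no holes (no increment that skips
over a position nothing else covers).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model only: turns a token graph into term-labeled transitions. */
class TokenGraphSketch {
    /** Hypothetical stand-in for one token plus its position attributes. */
    static final class Token {
        final String term;
        final int posInc;    // position increment relative to the previous token
        final int posLength; // how many positions this token spans
        Token(String term, int posInc, int posLength) {
            this.term = term; this.posInc = posInc; this.posLength = posLength;
        }
    }

    /** Each token becomes one transition from its position to position + posLength. */
    static List<String> toTransitions(List<Token> tokens) {
        List<String> out = new ArrayList<>();
        int pos = -1; // the first token's increment of 1 moves this to position 0
        for (Token t : tokens) {
            pos += t.posInc;
            int dest = pos + t.posLength;
            out.add(pos + " -" + t.term + "-> " + dest);
        }
        return out;
    }

    /** "dns" injected as a query-time synonym spanning "domain name system". */
    static List<Token> example() {
        return Arrays.asList(
            new Token("dns", 1, 3),
            new Token("domain", 0, 1),
            new Token("name", 1, 1),
            new Token("system", 1, 1),
            new Token("server", 1, 1));
    }

    public static void main(String[] args) {
        for (String line : toTransitions(example())) {
            System.out.println(line);
        }
        // 0 -dns-> 3
        // 0 -domain-> 1
        // 1 -name-> 2
        // 2 -system-> 3
        // 3 -server-> 4
    }
}
```

Because "dns" jumps straight from state 0 to state 3, the multi-word
synonym lines up with the three-token spelling, which is what preserving
posLength buys at query time.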

I haven't tested performance but I suspect it's quite slow: its
cost is O(sum of totalTF) across all terms "used" in the automaton.
There are some optimizations we could do, e.g. detecting that some
terms in the automaton can be upgraded to MUST (right now they are all
effectively SHOULD).
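
One possible way to detect such MUST terms, sketched here as an
assumption rather than anything the patch does: a term is required if
removing every transition labeled with it leaves the accept state
unreachable from the start state. The class RequiredTermsSketch, the
Edge type, and the single-accept-state restriction are all hypothetical
simplifications.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Toy model only: finds terms that every accepting path must use. */
class RequiredTermsSketch {
    static final class Edge {
        final int source, dest;
        final String term; // null means "any term"
        Edge(int source, int dest, String term) {
            this.source = source; this.dest = dest; this.term = term;
        }
    }

    /** Breadth-first search from state 0, ignoring edges labeled skippedTerm. */
    static boolean acceptReachable(List<Edge> edges, int accept, String skippedTerm) {
        Set<Integer> seen = new HashSet<>();
        Deque<Integer> queue = new ArrayDeque<>();
        seen.add(0);
        queue.add(0);
        while (!queue.isEmpty()) {
            int state = queue.poll();
            if (state == accept) return true;
            for (Edge e : edges) {
                if (e.source == state && !skippedTerm.equals(e.term) && seen.add(e.dest)) {
                    queue.add(e.dest);
                }
            }
        }
        return false;
    }

    /** A term is required if the accept state is unreachable without it. */
    static Set<String> requiredTerms(List<Edge> edges, int accept) {
        Set<String> required = new TreeSet<>();
        for (Edge e : edges) {
            if (e.term != null && !acceptReachable(edges, accept, e.term)) {
                required.add(e.term);
            }
        }
        return required;
    }

    /** Synonym graph: "dns" or "domain name system", followed by "server". */
    static List<Edge> example() {
        return Arrays.asList(
            new Edge(0, 3, "dns"),
            new Edge(0, 1, "domain"),
            new Edge(1, 2, "name"),
            new Edge(2, 3, "system"),
            new Edge(3, 4, "server"));
    }

    public static void main(String[] args) {
        System.out.println(requiredTerms(example(), 4)); // [server]
    }
}
```

In this graph only "server" lies on every accepting path, so it is the
only candidate for MUST; the synonym branches stay optional.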

I'm not sure how it should assign scores (punted on that for now), but
the matching seems to be working.

Attachments

LUCENE-5815.patch, 11/Jul/14 11:30, 46 kB, Michael McCandless
LUCENE-5815.patch, 14/Jul/14 14:10, 53 kB, Michael McCandless

People

Assignee: Michael McCandless

Reporter: Michael McCandless

Votes: 0

Watchers: 4

Dates

Created: 11/Jul/14 11:28

Updated: 28/Aug/22 14:11

Resolved: 20/Jul/14 11:43