  1. Lucene - Core
  2. LUCENE-5815

Add TermAutomatonQuery, for proximity matching that generalizes MultiPhraseQuery/SpanNearQuery

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.10, 6.0
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

      Description

      I created a new query, called TermAutomatonQuery, that's a proximity
      query to generalize MultiPhraseQuery/SpanNearQuery: it lets you
      construct an arbitrary automaton whose transitions are whole terms, and
      then find all documents that the automaton matches. This is different
      from a "normal" automaton, whose transitions are usually
      bytes/characters within a term.

      So, if the automaton has just 1 transition, it's just an expensive
      TermQuery. If you have two transitions in sequence, it's a phrase
      query of two terms. You can express synonyms by using transitions
      that overlap one another, but the automaton doesn't have to be a
      "sausage" (as MultiPhraseQuery requires), i.e. it "respects"
      posLength (at query time).

      It also allows "any" transitions, to match any term, so you can do
      sloppy matching and span-like queries, e.g. find "lucene" and "python"
      with up to 3 other terms in between.
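
      To make the idea concrete, here is a minimal, self-contained sketch of
      an automaton whose transitions are whole terms, including "any"
      transitions. It is a toy model and not the code in the patch: the
      class TermAutomatonSketch, its method names, and the brute-force
      matching over an in-memory token list are all assumptions made for
      illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of an automaton whose transitions are whole terms (hypothetical, not Lucene's class). */
final class TermAutomatonSketch {

    /** One labeled edge; a null term stands for an "any" transition. */
    private static final class Transition {
        final int source;
        final int dest;
        final String term;

        Transition(int source, int dest, String term) {
            this.source = source;
            this.dest = dest;
            this.term = term;
        }
    }

    private final List<Transition> transitions = new ArrayList<>();
    private final Set<Integer> acceptStates = new HashSet<>();
    private int stateCount = 0;

    /** The first state created (state 0) is treated as the initial state. */
    int createState() {
        return stateCount++;
    }

    void setAccept(int state) {
        acceptStates.add(state);
    }

    void addTransition(int source, int dest, String term) {
        transitions.add(new Transition(source, dest, term));
    }

    void addAnyTransition(int source, int dest) {
        transitions.add(new Transition(source, dest, null));
    }

    /** Returns true if some contiguous run of tokens drives the automaton into an accept state. */
    boolean matches(List<String> tokens) {
        for (int start = 0; start < tokens.size(); start++) {
            Set<Integer> current = new HashSet<>();
            current.add(0);
            for (int pos = start; pos < tokens.size() && !current.isEmpty(); pos++) {
                Set<Integer> next = new HashSet<>();
                for (Transition t : transitions) {
                    boolean labelOk = t.term == null || t.term.equals(tokens.get(pos));
                    if (labelOk && current.contains(t.source)) {
                        next.add(t.dest);
                    }
                }
                for (int state : next) {
                    if (acceptStates.contains(state)) {
                        return true;
                    }
                }
                current = next;
            }
        }
        return false;
    }

    /** Builds: first, then between 0 and maxGap arbitrary terms, then last. */
    static TermAutomatonSketch withGap(String first, String last, int maxGap) {
        TermAutomatonSketch a = new TermAutomatonSketch();
        int init = a.createState();
        int accept = a.createState();
        a.setAccept(accept);
        int prev = a.createState();
        a.addTransition(init, prev, first);
        a.addTransition(prev, accept, last);
        for (int i = 0; i < maxGap; i++) {
            int next = a.createState();
            a.addAnyTransition(prev, next); // skip one arbitrary term
            a.addTransition(next, accept, last);
            prev = next;
        }
        return a;
    }
}
```

      With withGap("lucene", "python", 3), a document whose tokens are
      "lucene a b c python" matches, while "lucene a b c d python" does
      not, since four terms separate the two. With maxGap = 0 the same
      builder degenerates to a two-term phrase.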

      I also added a class to convert a TokenStream directly to the
      automaton for this query, preserving posLength. (Of course, the index
      can't store posLength, so the matching won't be fully correct if any
      indexed token has posLength != 1.) But if you do query-time-only
      synonyms then the matching should finally be correct.
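
      As a rough illustration of how posLength can be preserved, the sketch
      below turns a token graph into term-labeled transitions by using
      token positions as automaton states. It is not the converter from the
      patch: TokenGraphSketch, Token, and Transition are hypothetical names,
      and the sketch assumes the first token has posInc = 1 and that there
      are no holes (every posInc is 0 or 1).

```java
import java.util.ArrayList;
import java.util.List;

/** Toy conversion of a token graph into term-labeled transitions (hypothetical names). */
final class TokenGraphSketch {

    /** A token with its position increment and position length. */
    static final class Token {
        final String term;
        final int posInc;
        final int posLength;

        Token(String term, int posInc, int posLength) {
            this.term = term;
            this.posInc = posInc;
            this.posLength = posLength;
        }
    }

    /** An edge from one position (state) to another, labeled with a whole term. */
    static final class Transition {
        final int source;
        final int dest;
        final String term;

        Transition(int source, int dest, String term) {
            this.source = source;
            this.dest = dest;
            this.term = term;
        }

        @Override
        public String toString() {
            return source + " -" + term + "-> " + dest;
        }
    }

    /** Each token becomes one edge: it starts at its position and ends posLength positions later. */
    static List<Transition> toTransitions(List<Token> tokens) {
        List<Transition> out = new ArrayList<>();
        int pos = -1; // assumes the first token has posInc = 1, so it lands on position 0
        for (Token t : tokens) {
            pos += t.posInc;
            out.add(new Transition(pos, pos + t.posLength, t.term));
        }
        return out;
    }
}
```

      For the query-time synonym "wifi" = "wireless network" followed by
      "router", the tokens wireless(posInc=1, posLength=1),
      wifi(posInc=0, posLength=2), network(1, 1), router(1, 1) become the
      edges 0 -wireless-> 1, 0 -wifi-> 2, 1 -network-> 2, 2 -router-> 3.
      The "wifi" edge spans two positions, so the graph is not a "sausage".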

      I haven't tested performance, but I suspect it's quite slowish: its
      cost is O(sum of totalTF) over all terms "used" in the automaton,
      i.e. it visits every indexed occurrence of each such term. There
      are some optimizations we could do, e.g. detecting that some terms in
      the automaton can be upgraded to MUST (right now they are all
      effectively SHOULD).
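
      The cost estimate can be sketched as a sum of per-term total
      frequencies. This is only a back-of-the-envelope model: CostSketch is
      a hypothetical helper, and the frequency map stands in for index
      statistics that a real implementation would read from the index.

```java
import java.util.Map;
import java.util.Set;

/** Back-of-the-envelope cost model: one unit of work per indexed occurrence of each used term. */
final class CostSketch {

    /**
     * @param termsUsed     distinct terms that label some transition in the automaton
     * @param totalTermFreq hypothetical stand-in for index statistics: occurrences per term
     */
    static long cost(Set<String> termsUsed, Map<String, Long> totalTermFreq) {
        long sum = 0;
        for (String term : termsUsed) {
            // Terms absent from the index contribute nothing.
            sum += totalTermFreq.getOrDefault(term, 0L);
        }
        return sum;
    }
}
```

      For example, if "lucene" occurs 1200 times and "python" 800 times in
      the index (made-up numbers), an automaton using just those two terms
      has an estimated cost of 2000, no matter how few documents contain
      both terms.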

      I'm not sure how it should assign scores (punted on that for now), but
      the matching seems to be working.

        Attachments

        1. LUCENE-5815.patch
          53 kB
          Michael McCandless
        2. LUCENE-5815.patch
          46 kB
          Michael McCandless

            People

            • Assignee: Michael McCandless (mikemccand)
            • Reporter: Michael McCandless (mikemccand)
            • Votes: 0
            • Watchers: 4
