Lucene - Core
  1. Lucene - Core
  2. LUCENE-5815

Add TermAutomatonQuery, for proximity matching that generalizes MultiPhraseQuery/SpanNearQuery

    Details

    • Type: New Feature New Feature
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.10, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      I created a new query, called TermAutomatonQuery, that's a proximity
      query to generalize MultiPhraseQuery/SpanNearQuery: it lets you
      construct an arbitrary automaton whose transitions are whole terms, and
      then find all documents that the automaton matches. This is different
      from a "normal" automaton whose transitions are usually
      bytes/characters within a term/s.

      So, if the automaton has just 1 transition, it's just an expensive
      TermQuery. If you have two transitions in sequence, it's a phrase
      query of two terms. You can express synonyms by using transitions
      that overlap one another but the automaton doesn't have to be a
      "sausage" (as MultiPhraseQuery requires) i.e. it "respects" posLength
      (at query time).

      It also allows "any" transitions, to match any term, so you can do
      sloppy matching and span-like queries, e.g. find "lucene" and "python"
      with up to 3 other terms in between.

      I also added a class to convert a TokenStream directly to the
      automaton for this query, preserving posLength. (Of course, the index
      can't store posLength, so the matching won't be fully correct if any
      indexed tokens has posLength != 1). But if you do query-time-only
      synonyms then the matching should finally be correct.

      I haven't tested performance but I suspect it's quite slowish ... its
      cost is O(sum-totalTF) of all terms "used" in the automaton. There
      are some optimizations we could do, e.g. detecting that some terms in
      the automaton can be upgraded to MUST (right now they are all
      effectively SHOULD).

      I'm not sure how it should assign scores (punted on that for now), but
      the matching seems to be working.

      1. LUCENE-5815.patch
        53 kB
        Michael McCandless
      2. LUCENE-5815.patch
        46 kB
        Michael McCandless

        Activity

        Hide
        Michael McCandless added a comment -

        Patch, work in progress, lots of nocommits...

        Show
        Michael McCandless added a comment - Patch, work in progress, lots of nocommits...
        Hide
        Robert Muir added a comment -

        (Of course, the index
        can't store posLength, so the matching won't be fully correct if any
        indexed tokens has posLength != 1).

        Why not? Cant we just have a tokenfilter that encodes positionLengthAttribute as a vInt payload (will always be one byte, unless you are crazy)? The custom scorer here could optionally support it.

        Personally: not sure its worth it. I think its better to fix QP to parse correctly in common cases like word-delimiter etc (first: those tokenfilters must be fixed!).

        And I'm a little confused if this approach is faster than rewrite() to booleans of phrase queries?

        Show
        Robert Muir added a comment - (Of course, the index can't store posLength, so the matching won't be fully correct if any indexed tokens has posLength != 1). Why not? Cant we just have a tokenfilter that encodes positionLengthAttribute as a vInt payload (will always be one byte, unless you are crazy)? The custom scorer here could optionally support it. Personally: not sure its worth it. I think its better to fix QP to parse correctly in common cases like word-delimiter etc (first: those tokenfilters must be fixed!). And I'm a little confused if this approach is faster than rewrite() to booleans of phrase queries?
        Hide
        Michael McCandless added a comment -

        Cant we just have a tokenfilter that encodes positionLengthAttribute as a vInt payload (will always be one byte, unless you are crazy)? The custom scorer here could optionally support it.

        Yes, I think that would work! And would be pretty simple to build... and the changes to this scorer would be simple: right now it just hardwires that a given token goes from pos to pos+1, but with this vInt in the payload it would decode that and use it instead of +1.

        I think its better to fix QP to parse correctly in common cases like word-delimiter etc (first: those tokenfilters must be fixed!).

        Right, QP needs to use posLength to build the correct queries... this new query just makes it "easy" since any arbitrary graph TokenStream can be directly translated into the equivalent query.

        And I'm a little confused if this approach is faster than rewrite() to booleans of phrase queries?

        We can only rewrite to BQ of PQ if the automaton doesn't use the ANY token transition, and if it's finite, right? Or maybe we could somehow take ANY and map it to slop on the phrase queries?

        But in those restricted cases, it's probably faster, I guess depending on what the automaton looks like. Ie, you could make a biggish automaton that rewrites to many many phrase queries. I'll add a TODO to maybe do this rewriting for this query ...

        Show
        Michael McCandless added a comment - Cant we just have a tokenfilter that encodes positionLengthAttribute as a vInt payload (will always be one byte, unless you are crazy)? The custom scorer here could optionally support it. Yes, I think that would work! And would be pretty simple to build... and the changes to this scorer would be simple: right now it just hardwires that a given token goes from pos to pos+1, but with this vInt in the payload it would decode that and use it instead of +1. I think its better to fix QP to parse correctly in common cases like word-delimiter etc (first: those tokenfilters must be fixed!). Right, QP needs to use posLength to build the correct queries... this new query just makes it "easy" since any arbitrary graph TokenStream can be directly translated into the equivalent query. And I'm a little confused if this approach is faster than rewrite() to booleans of phrase queries? We can only rewrite to BQ of PQ if the automaton doesn't use the ANY token transition, and if it's finite, right? Or maybe we could somehow take ANY and map it to slop on the phrase queries? But in those restricted cases, it's probably faster, I guess depending on what the automaton looks like. Ie, you could make a biggish automaton that rewrites to many many phrase queries. I'll add a TODO to maybe do this rewriting for this query ...
        Hide
        Robert Muir added a comment -

        We can only rewrite to BQ of PQ if the automaton doesn't use the ANY token transition, and if it's finite, right? Or maybe we could somehow take ANY and map it to slop on the phrase queries?

        Hmm, ok i see what you are getting at.

        I guess I was immediately only considering the case where the automaton is actually coming from query analysis chain: it would always be finite and so on in this case... right?

        Show
        Robert Muir added a comment - We can only rewrite to BQ of PQ if the automaton doesn't use the ANY token transition, and if it's finite, right? Or maybe we could somehow take ANY and map it to slop on the phrase queries? Hmm, ok i see what you are getting at. I guess I was immediately only considering the case where the automaton is actually coming from query analysis chain: it would always be finite and so on in this case... right?
        Hide
        Michael McCandless added a comment -

        I guess I was immediately only considering the case where the automaton is actually coming from query analysis chain: it would always be finite and so on in this case... right?

        Ahh, yes it would! We could always use BQ(PQ) in that case, and "typically" the Automaton would be smallish, unless app makes crazy synonyms, etc.?

        Show
        Michael McCandless added a comment - I guess I was immediately only considering the case where the automaton is actually coming from query analysis chain: it would always be finite and so on in this case... right? Ahh, yes it would! We could always use BQ(PQ) in that case, and "typically" the Automaton would be smallish, unless app makes crazy synonyms, etc.?
        Hide
        Michael McCandless added a comment -

        New patch, I think it's ready. I stopped using RollingBuffer: it was
        overkill. For scoring I just did what PhraseQuery does; this is
        clearly not "right" since an app could make an automaton that matches
        wildly different "phrases", but I think it's OK for starters.

        I added a testRandom, it does [simulated] index-time synonyms, applies
        filters (so we test .advance), even tests ANY transitions, etc.

        I think performance will be not-so-great: this query does a
        disjunction for all terms when often you could make some or all of the
        terms MUST'd. We can easily detect simple cases (startswith /
        endswith a fixed phrase) so we could start with that ... but I think
        we pursue that later.

        Also I didn't explore putting posLength into the index (in a payload)
        and then using it at search time, but that should be pretty easy
        future improvement too.

        Show
        Michael McCandless added a comment - New patch, I think it's ready. I stopped using RollingBuffer: it was overkill. For scoring I just did what PhraseQuery does; this is clearly not "right" since an app could make an automaton that matches wildly different "phrases", but I think it's OK for starters. I added a testRandom, it does [simulated] index-time synonyms, applies filters (so we test .advance), even tests ANY transitions, etc. I think performance will be not-so-great: this query does a disjunction for all terms when often you could make some or all of the terms MUST'd. We can easily detect simple cases (startswith / endswith a fixed phrase) so we could start with that ... but I think we pursue that later. Also I didn't explore putting posLength into the index (in a payload) and then using it at search time, but that should be pretty easy future improvement too.
        Hide
        ASF subversion and git services added a comment -

        Commit 1612076 from Michael McCandless in branch 'dev/trunk'
        [ https://svn.apache.org/r1612076 ]

        LUCENE-5815: add TermAutomatonQuery

        Show
        ASF subversion and git services added a comment - Commit 1612076 from Michael McCandless in branch 'dev/trunk' [ https://svn.apache.org/r1612076 ] LUCENE-5815 : add TermAutomatonQuery
        Hide
        ASF subversion and git services added a comment -

        Commit 1612077 from Michael McCandless in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1612077 ]

        LUCENE-5815: add TermAutomatonQuery

        Show
        ASF subversion and git services added a comment - Commit 1612077 from Michael McCandless in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1612077 ] LUCENE-5815 : add TermAutomatonQuery

          People

          • Assignee:
            Michael McCandless
            Reporter:
            Michael McCandless
          • Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development