Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-7638

Optimize graph query produced by QueryBuilder

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 6.5
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      The QueryBuilder creates a graph query when the underlying TokenStream contains token with PositionLengthAttribute greater than 1.
      These TokenStreams are in fact graphs (lattice to be more precise) where synonyms can span on multiple terms.
      Currently the graph query is built by visiting all the path of the graph TokenStream. For instance if you have a synonym like "ny, new york" and you search for "new york city", the query builder would produce two pathes:
      "new york city", "ny city"
      This can quickly explode when the number of multi terms synonyms increase.
      The query "ny ny" for instance would produce 4 pathes and so on.
      For boolean queries with should or must clauses it should be more efficient to build a boolean query that merges all the intersections in the graph. So instead of "new york city", "ny city" we could produce:
      "+((+new +york) ny) +city"

      The attached patch is a proposal to do that instead of the all path solution.
      The patch transforms multi terms synonyms in graph query for each intersection in the graph. This is not done in this patch but we could also create a specialized query that gives equivalent scores to multi terms synonyms like the SynonymQuery does for single term synonyms.
      For phrase query this patch does not change the current behavior but we could also use the new method to create optimized graph SpanQuery.

      Matt Weber I think this patch could optimize a lot of cases where multiple muli-terms synonyms are present in a single request. Could you take a look ?

      1. LUCENE-7638.patch
        48 kB
        Jim Ferenczi
      2. LUCENE-7638.patch
        16 kB
        Jim Ferenczi

        Activity

        Hide
        jpountz Adrien Grand added a comment -

        Maybe TermAutomatonQuery would be a good fit for that problem?

        Show
        jpountz Adrien Grand added a comment - Maybe TermAutomatonQuery would be a good fit for that problem?
        Hide
        mattweber Matt Weber added a comment -

        I think the problem here is that we lose minimum should match support as that is applied AFTER query generation by building a new boolean query. Same thing for phrase slop even though that would not be affected by this patch. If we can move this logic into rewrite method of GraphQuery then we could take all that information into consideration to build a more efficient query.

        Show
        mattweber Matt Weber added a comment - I think the problem here is that we lose minimum should match support as that is applied AFTER query generation by building a new boolean query. Same thing for phrase slop even though that would not be affected by this patch. If we can move this logic into rewrite method of GraphQuery then we could take all that information into consideration to build a more efficient query.
        Hide
        jim.ferenczi Jim Ferenczi added a comment - - edited

        Maybe TermAutomatonQuery would be a good fit for that problem?

        For pure phrase query it's a good fit because it's a proximity query but for boolean queries the problem is different. We cannot build the TermAutomatonQuery directly, first we need to find the start and end state of each multi-term synonyms in the graph. That's what the attached patch is doing lazily, for each intersection point it creates a multi-term synonym query. Currently the multi-term synonym query is a boolean query but we could change the logic and use the TermAutomatonQuery instead or even create a PhaseQuery for each path in the multi-term synonym. This patch also handles nested multi-term synonyms which makes the detection of intersection points harder.
        Bottom point is that if we are able to extract the multi-term synonyms of the graph then we can choose more easily how we want to search and score these inner graph. Does this makes sense ?

        Show
        jim.ferenczi Jim Ferenczi added a comment - - edited Maybe TermAutomatonQuery would be a good fit for that problem? For pure phrase query it's a good fit because it's a proximity query but for boolean queries the problem is different. We cannot build the TermAutomatonQuery directly, first we need to find the start and end state of each multi-term synonyms in the graph. That's what the attached patch is doing lazily, for each intersection point it creates a multi-term synonym query. Currently the multi-term synonym query is a boolean query but we could change the logic and use the TermAutomatonQuery instead or even create a PhaseQuery for each path in the multi-term synonym. This patch also handles nested multi-term synonyms which makes the detection of intersection points harder. Bottom point is that if we are able to extract the multi-term synonyms of the graph then we can choose more easily how we want to search and score these inner graph. Does this makes sense ?
        Hide
        jim.ferenczi Jim Ferenczi added a comment -

        Matt Weber I don't think we lose minimum should match support. It will be different but interestingly it would also solve some problems. For instance with the all path solution, a synonym like "ny, new york" with a minimum should match of 1, searching for "ny" would not return documents matching "new york". With the proposed solution each multi-term synonyms is considered as a single clause so "ny" and "new york" count for 1.
        I like the finite strings solution because expressing the minimum should match in percentage gives you correct hits. This is great though it requires to duplicate a lot of terms so I wonder if this is something that we should really target. By considering each multi-term synonyms as 1 clause we could simplify the problem and produce more optimized query ?

        Show
        jim.ferenczi Jim Ferenczi added a comment - Matt Weber I don't think we lose minimum should match support. It will be different but interestingly it would also solve some problems. For instance with the all path solution, a synonym like "ny, new york" with a minimum should match of 1, searching for "ny" would not return documents matching "new york". With the proposed solution each multi-term synonyms is considered as a single clause so "ny" and "new york" count for 1. I like the finite strings solution because expressing the minimum should match in percentage gives you correct hits. This is great though it requires to duplicate a lot of terms so I wonder if this is something that we should really target. By considering each multi-term synonyms as 1 clause we could simplify the problem and produce more optimized query ?
        Hide
        mattweber Matt Weber added a comment -

        Jim Ferenczi I have mixed feelings about that as I can see plus and minus of both. When I was originally working on this I essentially decided that everything should be passed to each path as if it was the original query. What do you think Michael McCandless? Also, there are additional use cases that we handle in elasticsearch that have not made their way into Lucene yet that might be affected by this. Boolean with cutoff frequency, prefix queries, etc.

        Show
        mattweber Matt Weber added a comment - Jim Ferenczi I have mixed feelings about that as I can see plus and minus of both. When I was originally working on this I essentially decided that everything should be passed to each path as if it was the original query. What do you think Michael McCandless ? Also, there are additional use cases that we handle in elasticsearch that have not made their way into Lucene yet that might be affected by this. Boolean with cutoff frequency, prefix queries, etc.
        Hide
        mikemccand Michael McCandless added a comment -

        This is an impressive patch! Thanks Jim Ferenczi.

        I agree the combinatoric explosion is a serious problem/vulnerability/adversary and I love how this patch solves it.

        Can this handle "side paths on side paths" graph structure (I think you called this "nested multi-term synonyms")? While no analysis chain can naturally produce this today (that I know of!), the TokenStream attributes can easily express it. And you could imagine it happening in the future, e.g. if you use Kuromoji tokenizer or WordDelimiterGraphFilter followed by a SynonymGraphFilter (except we'd need to e.g. use the synonym graph filter from LUCENE-5012, which can correctly consume a graph). If this is expected to work maybe we should add a test case showing that?

        It seems like you don't need to be using Term here, except at the end to pass to the newXXXQuery, since everything is in a single field here, and we are hoping to move away from Term entirely (LUCENE-7632)?

        Holes are challenging for graph token streams ... can you add a test case that encounters holes, e.g. simulated StopFilter? There are at least two "fun" cases: a hole that cuts the graph entirely into two partitions, and a synonym spanning over a hole ... CannedTokenStream is useful for feeding such "interesting" cases.

        The Path.id seems to be used only for tie-breaking on compare, not for lookup in the TreeSet as the comment suggests?

        On minShouldMatch: I feel it's actually more correct to define its semantics as operating on whole synonyms, not the individual tokens in each synonym, as this patch does. Really, had you done the same synonym reduction at indexing time instead, so that "new york" in a document caused "ny" to be indexed as well, and then searched on "ny", this is the behavior you would see ("new york" counts as the 1 minShouldMatch).

        Of course, in general I think we are in new (to Lucene) territory here, on how graph structured queries should be matched/scored "properly", and I don't really know what the answer should be. TermAutomatonQuery faces similar challenges. Maybe there are some IR papers out there that have studied this?

        I even think it's odd that we don't use a PhraseQuery to match that syn-expanded "new york" even when the user didn't put double quotes around the query: had you done that syn at index time, it would "act like" a PhraseQuery ... but this is an unrelated issue!

        Show
        mikemccand Michael McCandless added a comment - This is an impressive patch! Thanks Jim Ferenczi . I agree the combinatoric explosion is a serious problem/vulnerability/adversary and I love how this patch solves it. Can this handle "side paths on side paths" graph structure (I think you called this "nested multi-term synonyms")? While no analysis chain can naturally produce this today (that I know of!), the TokenStream attributes can easily express it. And you could imagine it happening in the future, e.g. if you use Kuromoji tokenizer or WordDelimiterGraphFilter followed by a SynonymGraphFilter (except we'd need to e.g. use the synonym graph filter from LUCENE-5012 , which can correctly consume a graph). If this is expected to work maybe we should add a test case showing that? It seems like you don't need to be using Term here, except at the end to pass to the newXXXQuery , since everything is in a single field here, and we are hoping to move away from Term entirely ( LUCENE-7632 )? Holes are challenging for graph token streams ... can you add a test case that encounters holes, e.g. simulated StopFilter ? There are at least two "fun" cases: a hole that cuts the graph entirely into two partitions, and a synonym spanning over a hole ... CannedTokenStream is useful for feeding such "interesting" cases. The Path.id seems to be used only for tie-breaking on compare, not for lookup in the TreeSet as the comment suggests? On minShouldMatch : I feel it's actually more correct to define its semantics as operating on whole synonyms, not the individual tokens in each synonym, as this patch does. Really, had you done the same synonym reduction at indexing time instead, so that "new york" in a document caused "ny" to be indexed as well, and then searched on "ny", this is the behavior you would see ("new york" counts as the 1 minShouldMatch ). Of course, in general I think we are in new (to Lucene) territory here, on how graph structured queries should be matched/scored "properly", and I don't really know what the answer should be. TermAutomatonQuery faces similar challenges . Maybe there are some IR papers out there that have studied this? I even think it's odd that we don't use a PhraseQuery to match that syn-expanded "new york" even when the user didn't put double quotes around the query: had you done that syn at index time, it would "act like" a PhraseQuery ... but this is an unrelated issue!
        Hide
        jim.ferenczi Jim Ferenczi added a comment - - edited

        Thanks for taking a look Matt Weber and Michael McCandless

        Can this handle "side paths on side paths" graph structure (I think you called this "nested multi-term synonyms")? While no analysis chain can naturally produce this today (that I know of!), the TokenStream attributes can easily express it. And you could imagine it happening in the future, e.g. if you use Kuromoji tokenizer or WordDelimiterGraphFilter followed by a SynonymGraphFilter (except we'd need to e.g. use the synonym graph filter from LUCENE-5012, which can correctly consume a graph). If this is expected to work maybe we should add a test case showing that?

        I'll add a test case because it's expected to work . This is also the reason why this patch does not produce

        PhraseQuery

        for synonyms. For simple "side paths" this is easy to do but we would need to switch to Span queries for "side paths on side paths" so I though that it could be done in another issue.

        It seems like you don't need to be using Term here, except at the end to pass to the newXXXQuery, since everything is in a single field here, and we are hoping to move away from Term entirely (LUCENE-7632)?

        Thanks, I'll simplify the patch.

        Holes are challenging for graph token streams ... can you add a test case that encounters holes, e.g. simulated StopFilter? There are at least two "fun" cases: a hole that cuts the graph entirely into two partitions, and a synonym spanning over a hole ... CannedTokenStream is useful for feeding such "interesting" cases.

        I though that holes would not be a problem for boolean queries but now I am not sure. I'll test that.

        The Path.id seems to be used only for tie-breaking on compare, not for lookup in the TreeSet as the comment suggests?

        The comment is misleading. It is needed because I use

        TreeSet#remove

        which uses compare to check object equality. So the

        Path.id

        is the unique identifier for the path.

        Show
        jim.ferenczi Jim Ferenczi added a comment - - edited Thanks for taking a look Matt Weber and Michael McCandless Can this handle "side paths on side paths" graph structure (I think you called this "nested multi-term synonyms")? While no analysis chain can naturally produce this today (that I know of!), the TokenStream attributes can easily express it. And you could imagine it happening in the future, e.g. if you use Kuromoji tokenizer or WordDelimiterGraphFilter followed by a SynonymGraphFilter (except we'd need to e.g. use the synonym graph filter from LUCENE-5012 , which can correctly consume a graph). If this is expected to work maybe we should add a test case showing that? I'll add a test case because it's expected to work . This is also the reason why this patch does not produce PhraseQuery for synonyms. For simple "side paths" this is easy to do but we would need to switch to Span queries for "side paths on side paths" so I though that it could be done in another issue. It seems like you don't need to be using Term here, except at the end to pass to the newXXXQuery, since everything is in a single field here, and we are hoping to move away from Term entirely ( LUCENE-7632 )? Thanks, I'll simplify the patch. Holes are challenging for graph token streams ... can you add a test case that encounters holes, e.g. simulated StopFilter? There are at least two "fun" cases: a hole that cuts the graph entirely into two partitions, and a synonym spanning over a hole ... CannedTokenStream is useful for feeding such "interesting" cases. I though that holes would not be a problem for boolean queries but now I am not sure. I'll test that. The Path.id seems to be used only for tie-breaking on compare, not for lookup in the TreeSet as the comment suggests? The comment is misleading. It is needed because I use TreeSet#remove which uses compare to check object equality. So the Path.id is the unique identifier for the path.
        Hide
        mattweber Matt Weber added a comment -

        Ok this is great. So going forward we should assume that synonyms are to treated together (single token or multi-token) and ideally multi-token synonyms as a phrase. Would it be best to move this logic into GraphQuery itself? This would make it so we can still detect when we are working with graph related queries and be easier to make the various optimizations talked about here. Maybe make GraphQuery store the graph token stream instead of the processed queries and then do the graph processing / query generation when rewrite it called?

        Show
        mattweber Matt Weber added a comment - Ok this is great. So going forward we should assume that synonyms are to treated together (single token or multi-token) and ideally multi-token synonyms as a phrase. Would it be best to move this logic into GraphQuery itself? This would make it so we can still detect when we are working with graph related queries and be easier to make the various optimizations talked about here. Maybe make GraphQuery store the graph token stream instead of the processed queries and then do the graph processing / query generation when rewrite it called?
        Hide
        jim.ferenczi Jim Ferenczi added a comment -

        I pushed a new patch that changes how we build boolean graph query with multi-term synonyms. It first finds the articulation points of the graph and builds a boolean query for each point. The articulation points (or cut vertices) are computed using the algorithm described in:
        https://en.wikipedia.org/wiki/Biconnected_component
        This means that each time we find a state where side paths of different lengths start, we generate all path that start at this state and end at the next articulation points. If

        QueryBuilder#autoGenerateMultiTermSynonymsPhraseQuery

        is set to true, a phrase query is generated for each path, otherwise a boolean query.

        Matt Weber Michael McCandless can you take a look ?

        Show
        jim.ferenczi Jim Ferenczi added a comment - I pushed a new patch that changes how we build boolean graph query with multi-term synonyms. It first finds the articulation points of the graph and builds a boolean query for each point. The articulation points (or cut vertices) are computed using the algorithm described in: https://en.wikipedia.org/wiki/Biconnected_component This means that each time we find a state where side paths of different lengths start, we generate all path that start at this state and end at the next articulation points. If QueryBuilder#autoGenerateMultiTermSynonymsPhraseQuery is set to true, a phrase query is generated for each path, otherwise a boolean query. Matt Weber Michael McCandless can you take a look ?
        Hide
        mikemccand Michael McCandless added a comment -

        +1, the new patch looks great! Thanks Jim Ferenczi.

        Show
        mikemccand Michael McCandless added a comment - +1, the new patch looks great! Thanks Jim Ferenczi .
        Hide
        jim.ferenczi Jim Ferenczi added a comment -

        Thanks Michael McCandless.
        Just to be clear this patch creates a BooleanQuery with MUST clauses (or a PhraseQuery if autoGenerateMultiTermSynonymsPhraseQuery is set to true) for each synonym path.
        I'll commit shortly if there are no objections.

        Show
        jim.ferenczi Jim Ferenczi added a comment - Thanks Michael McCandless . Just to be clear this patch creates a BooleanQuery with MUST clauses (or a PhraseQuery if autoGenerateMultiTermSynonymsPhraseQuery is set to true) for each synonym path. I'll commit shortly if there are no objections.
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit e764d3e0590e998676a18f2b11364cecced4a8ad in lucene-solr's branch refs/heads/master from Jim Ferenczi
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=e764d3e ]

        LUCENE-7638: Query parsers now analyze the token graph for articulation points (or cut vertices) in order to create more efficient queries for multi-token synonyms.

        Show
        jira-bot ASF subversion and git services added a comment - Commit e764d3e0590e998676a18f2b11364cecced4a8ad in lucene-solr's branch refs/heads/master from Jim Ferenczi [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=e764d3e ] LUCENE-7638 : Query parsers now analyze the token graph for articulation points (or cut vertices) in order to create more efficient queries for multi-token synonyms.
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit 5386116e9a3b54db4674b21e39a41cc4d43553f8 in lucene-solr's branch refs/heads/branch_6x from Jim Ferenczi
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=5386116 ]

        LUCENE-7638: Query parsers now analyze the token graph for articulation points (or cut vertices) in order to create more efficient queries for multi-token synonyms.

        Show
        jira-bot ASF subversion and git services added a comment - Commit 5386116e9a3b54db4674b21e39a41cc4d43553f8 in lucene-solr's branch refs/heads/branch_6x from Jim Ferenczi [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=5386116 ] LUCENE-7638 : Query parsers now analyze the token graph for articulation points (or cut vertices) in order to create more efficient queries for multi-token synonyms.
        Hide
        mattweber Matt Weber added a comment -

        Jim Ferenczi Sorry so late been swamped. Anyways, this is great! I really like this approach, awesome job man!

        Show
        mattweber Matt Weber added a comment - Jim Ferenczi Sorry so late been swamped. Anyways, this is great! I really like this approach, awesome job man!
        Hide
        mikemccand Michael McCandless added a comment -

        Jim Ferenczi can this be closed now?

        Show
        mikemccand Michael McCandless added a comment - Jim Ferenczi can this be closed now?
        Hide
        jim.ferenczi Jim Ferenczi added a comment -

        Sure, thanks Michael McCandless and Matt Weber !

        Show
        jim.ferenczi Jim Ferenczi added a comment - Sure, thanks Michael McCandless and Matt Weber !
        Hide
        mattweber Matt Weber added a comment - - edited

        Jim Ferenczi Michael McCandless Could we apply the same articulation points logic in analyzeGraphPhrase but generate span queries to essentially act like a phrase?

        SpanNear[
          SpanOr[
            SpanNear[SpanTerm[new], SpanTerm[york]]
            SpanTerm[ny]
          ],
          SpanTerm[city]
        ]
        
        Show
        mattweber Matt Weber added a comment - - edited Jim Ferenczi Michael McCandless Could we apply the same articulation points logic in analyzeGraphPhrase but generate span queries to essentially act like a phrase? SpanNear[ SpanOr[ SpanNear[SpanTerm[ new ], SpanTerm[york]] SpanTerm[ny] ], SpanTerm[city] ]
        Hide
        mbraun688 Michael Braun added a comment -

        This ticket is missing fix versions, but it looks like there were commits.

        Show
        mbraun688 Michael Braun added a comment - This ticket is missing fix versions, but it looks like there were commits.
        Hide
        ehatcher Erik Hatcher added a comment -

        Michael Braun looks like from CHANGES and commits this was 6.5 so I marked it as so.

        Show
        ehatcher Erik Hatcher added a comment - Michael Braun looks like from CHANGES and commits this was 6.5 so I marked it as so.

          People

          • Assignee:
            Unassigned
            Reporter:
            jim.ferenczi Jim Ferenczi
          • Votes:
            0 Vote for this issue
            Watchers:
            10 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development