Lucene - Core / LUCENE-7824

Multi-word synonym rules with common terms at the same position are buggy

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Resolved
    • Affects Version/s: 7.0, 6.5.1
    • Fix Version/s: 7.0, 6.6
    • Component/s: None
    • Labels: None
    • Lucene Fields: New, Patch Available

      Description

      The automaton built from the graph token stream tries to pack common terms of multi-word synonyms that appear at the same position. This means that some states inside a multi-word synonym can have multiple outgoing transitions.
      As a result, the intersection points of the graph are not computed correctly.

      For example, the synonym rule "ny, new york city, new york" is not applied correctly to the query "ny police".
      In this case "police" is detected as part of the multi-word synonym path, and we create the disjunction between:
      "ny police", "new york police", ...

      I pushed a patch that removes this optimization (and creates a single transition from each state) to ensure that the intersection points of the graph always show up at the end of the multi-word synonym paths.
      Matt Weber, can you take a look?
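
      The intended behaviour for the example above can be sketched with a small, self-contained model. This is not Lucene's implementation: the state numbers, the TRANSITIONS layout, and the helper names all_paths and intersection_points are assumptions made for illustration. Each synonym path gets its own intermediate states, so the two "new" tokens are separate transitions, and the only intersection point is the state where all paths rejoin, just before "police".

```python
from collections import defaultdict

# Hypothetical token graph for the query "ny police" with the rule
# "ny, new york city, new york". State numbers are illustrative only.
# Every synonym path has its own intermediate states: the "new" of
# "new york city" goes to state 1, the "new" of "new york" to state 5.
TRANSITIONS = [
    (0, "ny", 3),
    (0, "new", 1), (1, "york", 2), (2, "city", 3),
    (0, "new", 5), (5, "york", 3),
    (3, "police", 4),
]
START, END = 0, 4


def all_paths(transitions, start, end):
    """Enumerate every start-to-end path as (terms, visited_states)."""
    outgoing = defaultdict(list)
    for src, term, dst in transitions:
        outgoing[src].append((term, dst))

    paths = []

    def walk(state, terms, states):
        if state == end:
            paths.append((terms, states))
            return
        for term, dst in outgoing[state]:
            walk(dst, terms + [term], states + [dst])

    walk(start, [], [start])
    return paths


def intersection_points(paths, start, end):
    """States that every path goes through, excluding the two endpoints."""
    common = set.intersection(*(set(states) for _, states in paths))
    return sorted(common - {start, end})


paths = all_paths(TRANSITIONS, START, END)
finite_strings = sorted(" ".join(terms) for terms, _ in paths)
points = intersection_points(paths, START, END)

print(finite_strings)
print(points)
```

      In this model the synonym alternatives end at state 3, so "police" is a plain term that follows the disjunction rather than a member of it.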

        Activity

        mattweber Matt Weber added a comment -

        Jim Ferenczi, maybe use a BytesRefHash and maintain an id-to-hash map, so that we still keep only a single copy of each common term in memory and still have a unique id?

        jimczi Jim Ferenczi added a comment -

        I don't think we should try to optimize here. The number of terms should be small in a query so I would prefer to keep it simple and just create a new entry for each token like the cached token stream does.
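
        The two id schemes under discussion can be contrasted with a minimal sketch. It does not use Lucene's BytesRefHash or the cached token stream; the function names ids_shared_by_term and ids_per_token and the token list are hypothetical, chosen only to show the difference between one id per distinct term and one id per token.

```python
# Hypothetical tokens for "ny police" after synonym expansion, listed
# path by path; "new" and "york" each occur twice.
tokens = ["ny", "new", "york", "city", "new", "york", "police"]


def ids_shared_by_term(tokens):
    """One id per distinct term: repeated terms collapse to the same id."""
    term_to_id = {}
    return [term_to_id.setdefault(term, len(term_to_id)) for term in tokens]


def ids_per_token(tokens):
    """One id per token: every occurrence gets its own entry."""
    return list(range(len(tokens)))


shared = ids_shared_by_term(tokens)
unique = ids_per_token(tokens)

# With per-token ids the token list itself is the id-to-term lookup table.
id_to_term = dict(zip(unique, tokens))

print(shared)
print(unique)
```

        With per-token ids, the two "new" tokens stay distinguishable, at the cost of storing each repeated term once per occurrence, which is cheap when a query holds only a few terms.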

        mattweber Matt Weber added a comment -

        Sure, looks good then!

        jira-bot ASF subversion and git services added a comment -

        Commit 21362a3ba4c1e936416635667f257b36235b00ab in lucene-solr's branch refs/heads/master from Jim Ferenczi
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=21362a3 ]

        LUCENE-7824: Fix graph query analysis for multi-word synonym rules with common terms (eg. new york, new york city).

        jira-bot ASF subversion and git services added a comment -

        Commit 84b8b5a1d895ba2fa2d7fbad8cd4ea50321e0dd3 in lucene-solr's branch refs/heads/branch_6x from Jim Ferenczi
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=84b8b5a ]

        LUCENE-7824: Fix graph query analysis for multi-word synonym rules with common terms (eg. new york, new york city).

        jira-bot ASF subversion and git services added a comment -

        Commit 55bad6fec3c984d4ef56f94f0f50b9f1b2e6dba3 in lucene-solr's branch refs/heads/branch_6_6 from Jim Ferenczi
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=55bad6f ]

        LUCENE-7824: Fix graph query analysis for multi-word synonym rules with common terms (eg. new york, new york city).

        jimczi Jim Ferenczi added a comment -

        Thanks Matt Weber


          People

          • Assignee: Unassigned
          • Reporter: jimczi Jim Ferenczi