Uploaded image for project: 'Solr'
  1. Solr
  2. SOLR-10264

ManagedSynonymFilterFactory does not parse multi-term synonyms

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 6.4.2
    • Fix Version/s: 6.6, 7.0
    • Component/s: Schema and Analysis
    • Security Level: Public (Default Security Level. Issues are Public)
    • Labels:
      None

      Description

      The parser that the ManagedSynonymFilterFactory uses to parse the JSON resource into a synonym map does not parse multi-term synonyms in the expected way.

      If the synonym

      {"foo bar":"baz"}

      is added to the managed resource, the expected behavior is that the multi-term synonym "foo bar" will be mapped to the synonym "baz".

      In the analyze method of SynonymMap.Parser, multiple origin terms are concatenated with a separating SynonymMap.WORD_SEPARATOR, but the analyze method is not used by the parser in the ManagedSynonymFilterFactory.

      As a workaround, multi-term synonyms can be uploaded separated by a null character, i.e., uploading

      {"foo\u0000bar":"baz"}

      works.

      1. SOLR-10264.patch
        7 kB
        Christine Poerschke
      2. SOLR-10264.patch
        7 kB
        Christine Poerschke
      3. SOLR-10264.patch
        6 kB
        Jörg Rathlev
      4. SOLR-10264.patch
        2 kB
        Jörg Rathlev
      5. SOLR-10264-test.patch
        4 kB
        Christine Poerschke

        Issue Links

          Activity

          Hide
          joergr Jörg Rathlev added a comment -

          I've attached a patch that modifies the ManagedSynonymParser to call the parent class's analyze method on the mapped terms.

          Show
          joergr Jörg Rathlev added a comment - I've attached a patch that modifies the ManagedSynonymParser to call the parent class's analyze method on the mapped terms.
          Hide
          tboeghk Torsten Bøgh Köster added a comment -

          +1 on this one. To be a little more open for extension, it would be great if the inner classes of the `ManagedSynonymFilterFactory` would be `protected` instead of `private`. Then, a caching implementation would be straight forward and we'd contribute that one as well.

          Show
          tboeghk Torsten Bøgh Köster added a comment - +1 on this one. To be a little more open for extension, it would be great if the inner classes of the `ManagedSynonymFilterFactory` would be `protected` instead of `private`. Then, a caching implementation would be straight forward and we'd contribute that one as well.
          Hide
          mkhludnev Mikhail Khludnev added a comment -

          I'm sorry, but the patch lacks a test.

          Show
          mkhludnev Mikhail Khludnev added a comment - I'm sorry, but the patch lacks a test.
          Hide
          tboeghk Torsten Bøgh Köster added a comment -

          Yeah well, that's a way to cut down contributions. We'll contribute a test later ...

          Show
          tboeghk Torsten Bøgh Köster added a comment - Yeah well, that's a way to cut down contributions. We'll contribute a test later ...
          Hide
          cpoerschke Christine Poerschke added a comment -

          Hi Jörg Rathlev - thanks for opening this ticket with initial patch. It's totally fine for an initial patch to have no test.

          From a quick look TestManagedSynonymFilterFactory might be a good home for the test.

          Show
          cpoerschke Christine Poerschke added a comment - Hi Jörg Rathlev - thanks for opening this ticket with initial patch. It's totally fine for an initial patch to have no test. From a quick look TestManagedSynonymFilterFactory might be a good home for the test.
          Hide
          cpoerschke Christine Poerschke added a comment -

          Attaching partial (not yet multi-term) test changes.

          Show
          cpoerschke Christine Poerschke added a comment - Attaching partial (not yet multi-term) test changes.
          Hide
          joergr Jörg Rathlev added a comment -

          Thank you, Christine, that almost-complete test case was very helpful!

          I've attached an updated patch with a working test case for multi-term origins.

          Compared to Christine's test, I've changed it so that the multi-term origin is in the indexed document, and the single-term synonym in the query, because multi-term origins in the query are problematic for unrelated reasons.

          Show
          joergr Jörg Rathlev added a comment - Thank you, Christine, that almost-complete test case was very helpful! I've attached an updated patch with a working test case for multi-term origins. Compared to Christine's test, I've changed it so that the multi-term origin is in the indexed document, and the single-term synonym in the query, because multi-term origins in the query are problematic for unrelated reasons.
          Hide
          steve_rowe Steve Rowe added a comment -

          multi-term origins in the query are problematic for unrelated reasons.

          FYI I committed SOLR-9185 last week. It adds the sow (split-on-whitespace) param to the edismax and standard/"Lucene" query parsers, which when set to false will allow query-time multi-term origins.

          Show
          steve_rowe Steve Rowe added a comment - multi-term origins in the query are problematic for unrelated reasons. FYI I committed SOLR-9185 last week. It adds the sow (split-on-whitespace) param to the edismax and standard/"Lucene" query parsers, which when set to false will allow query-time multi-term origins.
          Hide
          cpoerschke Christine Poerschke added a comment -

          Attaching patch with slightly revised test, key differences compared to Jörg's patch:

          • the document is not re-added after the core reload
          • multi-term or single-term origin or synonym is randomly decided
          • sow (split-on-whitespace) is set to false as Steve suggested

          However, for some reason this error is being hit

          o.a.s.h.RequestHandlerBase org.apache.solr.search.QueryParserConfigurationException: Field 'text': autoGeneratePhraseQueries == true is disallowed when sow/splitOnWhitespace == false
          

          which is quite unexpected since firstly schema-rest.xml was changed to

          <fieldType name="managed_en" class="solr.TextField" autoGeneratePhraseQueries="false">
          

          and secondly the test itself doesn't actually search on the text field?

          Show
          cpoerschke Christine Poerschke added a comment - Attaching patch with slightly revised test, key differences compared to Jörg's patch: the document is not re-added after the core reload multi-term or single-term origin or synonym is randomly decided sow (split-on-whitespace) is set to false as Steve suggested However, for some reason this error is being hit o.a.s.h.RequestHandlerBase org.apache.solr.search.QueryParserConfigurationException: Field 'text': autoGeneratePhraseQueries == true is disallowed when sow/splitOnWhitespace == false which is quite unexpected since firstly schema-rest.xml was changed to <fieldType name= "managed_en" class= "solr.TextField" autoGeneratePhraseQueries= " false " > and secondly the test itself doesn't actually search on the text field?
          Hide
          cpoerschke Christine Poerschke added a comment -

          Dug a little further re: the sow (split-on-whitespace) error encountered as per above and found at schema-rest.xml#L623 <defaultSearchField>text</defaultSearchField> which would partly explain the consideration of the text field though not fully since the search is asking for the managed_en_field field, hmm.

          Show
          cpoerschke Christine Poerschke added a comment - Dug a little further re: the sow (split-on-whitespace) error encountered as per above and found at schema-rest.xml#L623 <defaultSearchField>text</defaultSearchField> which would partly explain the consideration of the text field though not fully since the search is asking for the managed_en_field field, hmm.
          Hide
          cpoerschke Christine Poerschke added a comment -

          from Steve Rowe in SOLR-10348:

          The query managed_en_field:happy test, with default field = text, is interpreted by the standard query parser (and Lucene classic query parser) as managed_en_field:happy text:test. If you want test to also be against managed_en_field, you have to enclose both words in parens: managed_en_field:(happy test).

          Show
          cpoerschke Christine Poerschke added a comment - from Steve Rowe in SOLR-10348 : The query managed_en_field:happy test , with default field = text , is interpreted by the standard query parser (and Lucene classic query parser) as managed_en_field:happy text:test . If you want test to also be against managed_en_field , you have to enclose both words in parens: managed_en_field:(happy test) .
          Hide
          cpoerschke Christine Poerschke added a comment -

          Attaching final(?) patch, will commit early next week or so if no further comments or concerns.

          Show
          cpoerschke Christine Poerschke added a comment - Attaching final(?) patch, will commit early next week or so if no further comments or concerns.
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit adabfdd9c2a50b4141f485655df0d048df21bd23 in lucene-solr's branch refs/heads/master from Christine Poerschke
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=adabfdd ]

          SOLR-10264: Fixes multi-term synonym parsing in ManagedSynonymFilterFactory.
          (Jörg Rathlev, Steve Rowe, Christine Poerschke)

          Show
          jira-bot ASF subversion and git services added a comment - Commit adabfdd9c2a50b4141f485655df0d048df21bd23 in lucene-solr's branch refs/heads/master from Christine Poerschke [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=adabfdd ] SOLR-10264 : Fixes multi-term synonym parsing in ManagedSynonymFilterFactory. (Jörg Rathlev, Steve Rowe, Christine Poerschke)
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 3ac32eea3509495256c4b74a695e798f783d7434 in lucene-solr's branch refs/heads/branch_6x from Christine Poerschke
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=3ac32ee ]

          SOLR-10264: Fixes multi-term synonym parsing in ManagedSynonymFilterFactory.
          (Jörg Rathlev, Steve Rowe, Christine Poerschke)

          Show
          jira-bot ASF subversion and git services added a comment - Commit 3ac32eea3509495256c4b74a695e798f783d7434 in lucene-solr's branch refs/heads/branch_6x from Christine Poerschke [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=3ac32ee ] SOLR-10264 : Fixes multi-term synonym parsing in ManagedSynonymFilterFactory. (Jörg Rathlev, Steve Rowe, Christine Poerschke)
          Hide
          cpoerschke Christine Poerschke added a comment -

          Thanks Jörg and Steve!

          Show
          cpoerschke Christine Poerschke added a comment - Thanks Jörg and Steve!

            People

            • Assignee:
              cpoerschke Christine Poerschke
              Reporter:
              joergr Jörg Rathlev
            • Votes:
              2 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development