SOLR-6878

solr.ManagedSynonymFilterFactory all-to-all synonym switch (aka. expand)

    Details

      Description

      Hi,

      After switching from SynonymFilterFactory to ManagedSynonymFilterFactory I found that there is no way to set an all-to-all synonym relation. Basically (judging from a Google search) there is a need for an "expand" switch (known from SynonymFilterFactory) which treats a keyword and all of its synonyms as equal.

      For example: if we define a "car":["wagen","ride"] relation it would translate a query that includes one of the synonyms or keyword to "car or wagen or ride" independently of which word was used from those three.
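A rough illustration of the requested behaviour, in plain Java rather than Solr code. The class and method names are hypothetical; the point is only that every member of the group resolves to the same OR-ed set of terms:

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; this is not part of Solr.
final class AllToAllExample {

    // Returns the OR-ed query for a term: the whole group if the term
    // belongs to it, otherwise the term unchanged.
    static String expandTerm(String term, String keyword, List<String> synonyms) {
        Set<String> group = new LinkedHashSet<>();
        group.add(keyword);
        group.addAll(synonyms);
        return group.contains(term) ? String.join(" OR ", group) : term;
    }

    public static void main(String[] args) {
        List<String> synonyms = Arrays.asList("wagen", "ride");
        System.out.println(expandTerm("ride", "car", synonyms)); // car OR wagen OR ride
        System.out.println(expandTerm("bike", "car", synonyms)); // bike
    }
}
```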

      1. SOLR-6878.patch
        8 kB
        Timothy Potter
      2. SOLR-6878.patch
        13 kB
        Vitaliy Zhovtyuk

          Activity

          Vitaliy Zhovtyuk added a comment -

          Added support for expand parameter and tests for both cases.

          Timothy Potter added a comment -

          Thanks for the patch Vitaliy, I'll get this into 5.2

          Timothy Potter added a comment -

          I started going through this patch and I have some questions about how to support the "equivalent" synonyms feature for managed synonyms.

          NOTE: I'm using the term "equivalent" synonyms based on the doc here:
          https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory

          Specifically, here are a couple of issues I see with supporting equivalent synonyms lists at the managed API level:

          1) The default value for expand is true (in the patch), but what if the user changes it to false after already having added equivalent synonym lists, or vice versa? What do we do about existing equivalent mappings? We could store the equivalent lists in a separate data structure and then apply the correct behavior depending on the expand flag when the managed data is "viewed", i.e. either on a GET request from the API or when updating the data used to initialize the underlying SynonymMap. This is similar to what we do with ignoreCase; however, ignoreCase was easy to handle, whereas I think allowing expand to be changed through the API is much more complicated.

          Of course we could punt on this issue altogether and just make the expand flag immutable, i.e. you can set it initially to true or false, but cannot change it with the API. If we make it immutable, then we can apply the mapping on update and not have to maintain any additional data structures to remember the raw state of equiv lists.

          2) Let's say we allow users to send in equivalent synonym lists to the API, such as:

          curl -v -X PUT \
            -H 'Content-type:application/json' \
            --data-binary '["funny","entertaining","whimsical","jocular"]' \
            'http://localhost:8983/solr/techproducts/schema/analysis/synonyms/english'
          

          If expand is true, then you end up with the following mappings (pardon the Java code syntax as I didn't want to clean that up for this example):

              assertJQ(endpoint + "/funny",
                  "/funny==['entertaining','jocular','whimsical']");
              assertJQ(endpoint + "/entertaining",
                  "/entertaining==['funny','jocular','whimsical']");
              assertJQ(endpoint + "/jocular",
                  "/jocular==['entertaining','funny','whimsical']");
              assertJQ(endpoint + "/whimsical",
                  "/whimsical==['entertaining','funny','jocular']");
          

          What should the API do if the user then decides to update the specific mappings for "funny" by sending in a request such as:

          curl -v -X PUT \
            -H 'Content-type:application/json' \
            --data-binary '{"funny":["hilarious"]}' \
            'http://localhost:8983/solr/techproducts/schema/analysis/synonyms/english'
          

          Does the API treat explicit mappings as having precedence over equivalent lists? Or does it fail with some weird error most users won't understand? Seems to get complicated pretty fast ...

          I didn't go too far down the path of implementing this, so there are probably more questions that will come up. To reiterate my original design assumption for managed synonyms: the API was not intended for humans to interact with directly; rather, there should be some sort of UI layer on top of this API that translates synonym mappings into low-level API calls. For me, it's much clearer to send in explicit mappings for each synonym than to send a flat list and then interpret that list differently based on some flag.

          The only advantage I can see is for huge synonym lists, where expanding the list out explicitly makes the request larger. Other than that, are there other use cases that require this expand functionality and cannot be achieved with the current implementation? If so, we need to decide whether expand should be immutable and what the API should do if an explicit mapping is received for a term that is already used in an equivalent synonym list. Tomasz Sulkowski, your thoughts on this?

          Hoss Man added a comment -

          the "expand" option in the original SynonymFilterFactory was/is really just about allowing brevity for symmetric synonyms in the data file – the best approach for the API is to tackle the same problem.

          Instead of thinking about "expand" as a stateful option in ManagedSynonymFilterFactory (or worse, an immutable stateful option), i would suggest that instead it should just be a (transient) property of the request to add to / create the synonym mappings – one that doesn't even need to be explicit, since the list syntax already makes it clear.

          today, if someone sends a map of "KEY => LIST-OF(VALUES)" to the API, we interpret that as "for each KEY, for each VALUE in LIST-OF(VALUES), add a synonym mapping of KEY=>VALUE" and later if the user asks to GET mappings or delete mappings they do so by KEY.

          why not let the new "expand" feature just be syntactic sugar on adding symmetric sets of KEY=>VALUE mappings via lists of lists?

          if a user is creating or adding to a synonym mapping with a "LIST-OF(LIST-OF(VALUES))" then let the logic be: "for each LIST-OF(VALUES) in the outer LIST, loop over the inner LIST and add a mapping from every VALUE => every other VALUE in the same inner LIST"

          it should be purely syntactic sugar – GET requests should make it clear how the data is internally modeled.
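The list-of-lists logic described above can be sketched in a few lines of plain Java. This is not the code from the SOLR-6878 patch; the class name `SymmetricSynonyms` and method `addSymmetric` are hypothetical, and a sorted map of sorted sets is assumed as the internal model purely so the output is deterministic:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch; not the code from the SOLR-6878 patch.
final class SymmetricSynonyms {

    // For each inner list, map every value to every other value in that list.
    static Map<String, SortedSet<String>> addSymmetric(
            Map<String, SortedSet<String>> mappings, List<List<String>> lists) {
        for (List<String> group : lists) {
            for (String key : group) {
                SortedSet<String> values =
                        mappings.computeIfAbsent(key, k -> new TreeSet<>());
                for (String value : group) {
                    if (!value.equals(key)) {
                        values.add(value);
                    }
                }
            }
        }
        return mappings;
    }

    public static void main(String[] args) {
        Map<String, SortedSet<String>> mappings = new TreeMap<>();
        addSymmetric(mappings, Arrays.asList(
                Arrays.asList("funny", "entertaining", "whimsical", "jocular")));
        System.out.println(mappings.get("funny")); // [entertaining, jocular, whimsical]
    }
}
```

Because the expansion happens once, at update time, nothing about the original list needs to be remembered afterwards: a later read sees four ordinary KEY=>VALUES entries.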

          What should the API do if the user then decides to update the specific mappings for "funny" by sending in a request such as ...

          we update that exact mapping, and no other mappings are changed – update/delete requests should operate on individual keys, regardless of what type of request added those keys.
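A minimal sketch of what "operate on individual keys" could look like, assuming the additive KEY=>VALUE semantics described earlier in this comment. The names `KeyScopedUpdate` and `putExplicit` are hypothetical and this is not Solr's implementation:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch of key-scoped updates; not Solr's implementation.
final class KeyScopedUpdate {

    // Adds KEY=>VALUE entries for the given keys only; no other key is touched.
    static void putExplicit(Map<String, SortedSet<String>> mappings,
                            Map<String, List<String>> update) {
        for (Map.Entry<String, List<String>> e : update.entrySet()) {
            mappings.computeIfAbsent(e.getKey(), k -> new TreeSet<>())
                    .addAll(e.getValue());
        }
    }

    public static void main(String[] args) {
        // State left behind by an earlier symmetric list (abbreviated to two keys).
        Map<String, SortedSet<String>> mappings = new TreeMap<>();
        mappings.put("funny",
                new TreeSet<>(Arrays.asList("entertaining", "jocular", "whimsical")));
        mappings.put("jocular",
                new TreeSet<>(Arrays.asList("entertaining", "funny", "whimsical")));

        Map<String, List<String>> update = new TreeMap<>();
        update.put("funny", Arrays.asList("hilarious"));
        putExplicit(mappings, update);

        System.out.println(mappings.get("funny"));   // [entertaining, hilarious, jocular, whimsical]
        System.out.println(mappings.get("jocular")); // [entertaining, funny, whimsical]
        System.out.println(mappings.containsKey("hilarious")); // false
    }
}
```

Note that the explicit update is one-way: "hilarious" gains no reverse mapping, and the entries created from the earlier list are left exactly as they were.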


          The (more complex) alternative is to create a much more general abstraction of "synonym dictionary entries" where each entry is either a "one way mapping" or a "multi directional mapping" ... so that we internally track & remember that the user gave us some set of one way mappings like {'mad': ['angry']} and also gave us a set of multi directional mappings as lists like ['funny','jocular','whimsical'] and support some new syntax for saying "i want to edit the list i previously gave you which contains 'jocular' such that it no longer contains 'whimsical' but now contains 'happy'" and also have sanity checks in place to prevent people from trying to mix the two.

          but i think (as you alluded to above) that way leads to madness.

          Timothy Potter added a comment -

          why not let the new "expand" feature just be syntactic sugar on adding symmetric sets of KEY=>VALUE mappings via lists of lists?

          Good idea! I'll start down that path as it seems pretty straight-forward to implement w/o all the state management issues as you mentioned. Thanks Hoss.

          Timothy Potter added a comment -

          Here is an updated patch that implements the idea Hossman laid out in his comment. Basically, if the client sends in a list instead of a map, the expand=true logic is applied at the time of update, i.e. this is syntactic sugar for building up the mappings from a list of symmetric synonyms.

          There's no need to support a list for expand=false because that is simply a mapping of all the terms to the last term in the list, which is already supported by the API. Thus, expand=true is implied when the update request contains a list and not a map.
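The "list means expand=true, map means explicit mappings" rule amounts to a dispatch on the shape of the parsed request body. A minimal sketch, assuming the JSON has already been parsed into `java.util.List` or `java.util.Map`; the class `UpdateDispatch`, the enum `UpdateKind`, and the method `classify` are hypothetical names, not taken from the patch:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the list-vs-map dispatch; not the patch's code.
final class UpdateDispatch {

    enum UpdateKind { SYMMETRIC_LIST, EXPLICIT_MAP }

    // Decides how an update should be interpreted from the parsed JSON body alone.
    static UpdateKind classify(Object parsedBody) {
        if (parsedBody instanceof List) {
            return UpdateKind.SYMMETRIC_LIST; // expand=true is implied
        }
        if (parsedBody instanceof Map) {
            return UpdateKind.EXPLICIT_MAP;   // KEY=>VALUES applied as given
        }
        throw new IllegalArgumentException("Expected a JSON list or map");
    }

    public static void main(String[] args) {
        System.out.println(classify(Arrays.asList("funny", "jocular")));
        System.out.println(classify(
                Collections.singletonMap("funny", Arrays.asList("hilarious"))));
    }
}
```

With this shape-based rule no stored flag is needed, which is what avoids the state-management questions raised earlier in the thread.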

          ASF subversion and git services added a comment -

          Commit 1677923 from Timothy Potter in branch 'dev/trunk'
          [ https://svn.apache.org/r1677923 ]

          SOLR-6878: support adding symmetric synonym lists using the managed synonym API

          ASF subversion and git services added a comment -

          Commit 1677924 from Timothy Potter in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1677924 ]

          SOLR-6878: support adding symmetric synonym lists using the managed synonym API

          Anshum Gupta added a comment -

          Bulk close for 5.2.0.


            People

            • Assignee:
              Timothy Potter
              Reporter:
              Tomasz Sulkowski
            • Votes:
              3
              Watchers:
              9
