Solr
  1. Solr
  2. SOLR-5200 Add REST support for reading and modifying Solr configuration
  3. SOLR-5654

Create a synonym filter factory that is (re)configurable, and capable of reporting its configuration, via REST API

    Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Schema and Analysis
    • Labels:
      None

      Description

      A synonym filter factory could be (re)configurable via REST API by registering with the RESTManager described in SOLR-5653, and then responding to REST API calls to modify its init params and its synonyms resource file.

      Read-only (GET) REST API calls should also be provided, both for init params and the synonyms resource file.

      It should be possible to add/remove/modify one or more entries in the synonyms resource file.

      We should probably use JSON for the REST request body, as is done in the Schema REST API methods.

      1. SOLR-5654.patch
        25 kB
        Steve Rowe
      2. SOLR-5654.patch
        25 kB
        Timothy Potter
      3. SOLR-5654.patch
        22 kB
        Timothy Potter
      4. SOLR-5654.patch
        18 kB
        Timothy Potter
      5. SOLR-5654.patch
        18 kB
        Timothy Potter

          Activity

          Robert Muir added a comment -

          Why do we need new factories for synonyms and stopwords? I don't understand this design at all; this seems like duplication of all the analysis factories!

          Instead, just pass a different resourceloader or something to the existing ones!

          Michael McCandless added a comment -

          In Lucene server (LUCENE-5376) I just created a custom ResourceLoader to pull stopwords that were specified (inlined) in the JSON when the field is registered.

          But separately, I think it's ... dangerous to allow changing stopwords / syns on an already created field / running index? I.e., such changes won't fully "take effect" until you re-index all content ... I know it's convenient to be able to make such changes, but it's also trappy.

          Steve Rowe added a comment -

          > Why do we need new factories for synonyms and stopwords? I don't understand this design at all; this seems like duplication of all the analysis factories!
          >
          > Instead, just pass a different resourceloader or something to the existing ones!

          The point of this issue is to provide REST API methods to interrogate and modify/persist synonym config and mappings. A different resourceloader would only allow for this info to be pulled from an alternate persistence store - it wouldn't do anything for the REST API and persistence part.

          > But separately, I think it's ... dangerous to allow changing stopwords / syns on an already created field / running index? I.e., such changes won't fully "take effect" until you re-index all content ... I know it's convenient to be able to make such changes, but it's also trappy.

          That's already true today for people who manually modify config and restart/reload. I guess your point is that we shouldn't be making this easier. I disagree: the point of the issue is to allow people more fine-grained control over an already-existing freedom. I think documentation warning people about the danger of modifying config with an existing index is sufficient to help people who want this capability avoid creating indexes with mixed analysis config.

          Robert Muir added a comment -

          > The point of this issue is to provide REST API methods to interrogate and modify/persist synonym config and mappings. A different resourceloader would only allow for this info to be pulled from an alternate persistence store - it wouldn't do anything for the REST API and persistence part.

          It wouldn't prevent it either.

          Reworded: why is a custom factory necessary?

          Michael McCandless added a comment -

          > I guess your point is that we shouldn't be making this easier.

          Right.

          > I disagree: the point of the issue is to allow people more fine-grained control over an already-existing freedom.

          Just because there's already an existing (not necessarily good) freedom doesn't mean it must be made easier. Optimize is an existing freedom.

          Does Solr at least record somewhere that "full re-index required"? So the user (if s/he knows to look in the right place on the admin UI) is informed that inconsistent results might be because they didn't fully re-index yet...

          Jack Krupansky added a comment -

          Two reasonable and reliable use cases I have encountered:

          1. Update or replace query-time synonyms - no risk for existing indexed data.

          2. Add new index-time synonyms that will apply to new indexed documents - again, no expectation that they would apply to existing documents, but reindexing would of course apply them anyway.
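
To make the trade-off in this exchange concrete, here is a toy Python model of index-time synonym expansion. It is not Solr code: the `ToyIndex` class, the `analyze` function, and the sample documents are all made up for illustration. It shows that a document indexed before a synonym change keeps its old postings until it is re-indexed, which is the inconsistency Michael McCandless warns about and the behaviour the second use case above accepts.

```python
from collections import defaultdict


def analyze(text, synonyms):
    """Lowercase, split on whitespace, and expand each token with its synonyms."""
    tokens = []
    for token in text.lower().split():
        tokens.append(token)
        tokens.extend(synonyms.get(token, []))
    return tokens


class ToyIndex:
    """A tiny inverted index: term -> set of document ids."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text, synonyms):
        for term in analyze(text, synonyms):
            self.postings[term].add(doc_id)

    def search(self, term):
        return sorted(self.postings.get(term.lower(), set()))


index = ToyIndex()

# doc1 is indexed before any synonym for "tv" exists.
index.add("doc1", "cheap TV", synonyms={})

# The synonym map is then changed; only documents indexed later see it.
new_synonyms = {"tv": ["television"]}
index.add("doc2", "large TV", synonyms=new_synonyms)

print(index.search("tv"))          # ['doc1', 'doc2']
print(index.search("television"))  # ['doc2']
```

Re-adding doc1 with the new synonym map (that is, re-indexing it) makes it match "television" as well. A query-time synonym change would not have this problem, because nothing stored in the index depends on it.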

          Timothy Potter added a comment - - edited

          Basic implementation which depends on my patch for SOLR-5653.

          It only supports the "solr" format for now and basically uses an adapter to provide a SolrResourceLoader to the existing SynonymFilterFactory which is backed by the managed synonym mappings.

          Also, I should mention that I'm not thrilled with how I handle ignoreCase changes right now, so will probably clean that up a bit in a subsequent patch.

          Yonik Seeley added a comment -

          Cool stuff! Can you give some examples of the full URLs? Do they match the JSON storage format?

          Timothy Potter added a comment -

          Thanks Yonik. Here are some examples:

          In schema.xml, you'd activate this using something like:

          <fieldType name="managed_en" class="solr.TextField" positionIncrementGap="100">
            <analyzer>
              <tokenizer class="solr.StandardTokenizerFactory"/>
              <filter class="org.apache.solr.rest.schema.analysis.ManagedSynonymFilterFactory" managed="english" />
            </analyzer>
          </fieldType>

          GET a list of managed synonym mappings for managed handle "english" using:

          curl -i -v "http://localhost:8984/solr/<collection>/schema/analysis/synonyms/english"

          This would return a JSON structure that looks like (which is pretty much the same as the JSON backed storage structure):

          {
            "initArgs": { "ignoreCase":"true", "format":"solr" },
            "managedMap": { "GB":[ "GiB", "Gigabyte"], "TV":["Television"], "happy":[ "glad", "joyful"] }
          }
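
As a rough illustration of how a client might consume this response, the Python sketch below parses the sample JSON from this comment and reads its two top-level keys. The helper name `synonyms_for` is hypothetical, and treating `ignoreCase` as the string "true" simply mirrors the sample; the committed API may differ.

```python
import json

# The sample response from the comment, unchanged apart from whitespace.
SAMPLE_RESPONSE = """
{
  "initArgs": { "ignoreCase": "true", "format": "solr" },
  "managedMap": {
    "GB": ["GiB", "Gigabyte"],
    "TV": ["Television"],
    "happy": ["glad", "joyful"]
  }
}
"""


def synonyms_for(response, term):
    """Return the synonym list for `term`, or an empty list if it is not mapped.

    Hypothetical helper: does a case-insensitive key lookup when the
    response's initArgs report ignoreCase as "true" (a string in the sample).
    """
    flag = str(response["initArgs"].get("ignoreCase", "false")).lower()
    managed = response["managedMap"]
    if flag == "true":
        managed = {key.lower(): values for key, values in managed.items()}
        term = term.lower()
    return managed.get(term, [])


response = json.loads(SAMPLE_RESPONSE)
print(response["initArgs"]["format"])
print(synonyms_for(response, "gb"))
print(synonyms_for(response, "unknown"))
```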

          btw ... I'm not in love with the managedMap or managedList thing, so I'm open to suggestions. My thinking there was that the property name gives some hint as to what type of data structure the value is.

          PUT: Add a mapping using PUT/POST

          curl -v -X PUT \
            -H 'Content-type:application/json' \
            --data-binary '{"sad":["unhappy"]}' \
            'http://localhost:8984/solr/<collection>/schema/analysis/synonyms/english'

          There's some question in my mind whether PUT should merge new values into the existing synonym mappings or replace them. I chose to merge, which puts a burden on the client to DELETE (not yet working) synonym mappings they don't want to keep around. In other words, there's no way to wholesale replace the existing mappings with another set, but merging seems closer to how users will use the feature, i.e., adding a synonym here and there as needs evolve.
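
The merge-versus-replace choice can be modelled in a few lines of Python. This is only a sketch of the two semantics, not the patch's code: the function names are invented, and the assumption that merging into an existing key appends the new synonyms to its list (rather than overwriting that key's list) is mine, since the comment does not say.

```python
def put_merge(existing, update):
    """Merge `update` into a copy of `existing`: keep old keys, add new ones.

    Assumption: for a key present in both, new synonyms are appended to the
    old list, skipping duplicates and preserving order.
    """
    merged = {key: list(values) for key, values in existing.items()}
    for key, values in update.items():
        target = merged.setdefault(key, [])
        for value in values:
            if value not in target:
                target.append(value)
    return merged


def put_replace(existing, update):
    """Alternative semantics: the request body becomes the whole mapping."""
    return {key: list(values) for key, values in update.items()}


current = {"happy": ["glad", "joyful"], "TV": ["Television"]}
body = {"sad": ["unhappy"], "happy": ["cheerful"]}

print(put_merge(current, body))
print(put_replace(current, body))
```

Under the merge model, "TV" survives the PUT and can only be dropped by a separate DELETE; under the replace model, the client would have to resend every mapping it wants to keep.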

          You can also GET a specific mapping (or 404 if one doesn't exist) using:

          curl -i -v "http://localhost:8984/solr/<collection>/schema/analysis/synonyms/english/happy"

          would return:

          { "happy":["glad"], ... }

          curl -i -v "http://localhost:8984/solr/<collection>/schema/analysis/synonyms/english/yappy"

          would return a 404.
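
For clients that are not using curl, the same lookup could be sketched in Python with the standard library. The URL layout follows the curl examples in this comment; the function names, the `collection1` placeholder, and the choice to map a 404 to `None` are assumptions made for illustration.

```python
import json
import urllib.error
import urllib.parse
import urllib.request


def mapping_url(base_url, collection, handle, term=None):
    """Build the endpoint URL following the pattern in the curl examples."""
    parts = [collection, "schema", "analysis", "synonyms", handle]
    if term is not None:
        parts.append(term)
    # Percent-encode each path segment so terms with spaces stay valid.
    path = "/".join(urllib.parse.quote(part, safe="") for part in parts)
    return base_url.rstrip("/") + "/" + path


def get_mapping(base_url, collection, handle, term):
    """Return the parsed JSON for one term, or None if the server answers 404."""
    url = mapping_url(base_url, collection, handle, term)
    try:
        with urllib.request.urlopen(url) as resp:
            return json.loads(resp.read().decode("utf-8"))
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None
        raise


# Only the URL is built here; calling get_mapping() needs a running Solr.
print(mapping_url("http://localhost:8984/solr", "collection1", "english", "happy"))
```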

          Lastly, I'm planning to support a GET request to get all known handles:

          curl -i -v "http://localhost:8984/solr/<collection>/schema/analysis/synonyms"

          Currently this would return a JSON list of known managed synonym mappings:

          [ { "english": { some stats / metadata here, such as whether it is 'dirty' } } ]

          Timothy Potter added a comment -

          Updated patch to work with the changes in the latest patch for SOLR-5653

          Timothy Potter added a comment -

          Apply after applying the patch for SOLR-5655 (not quite sure how to handle that otherwise?)

          Timothy Potter added a comment -

          One other thing to notice is that I had to copy and paste the getIgnoreCase methods from ManagedWordSetResource. Would it be better to make that static and re-use it in the synonym code?

          Timothy Potter added a comment -

          Here's an updated patch that uses a custom SynonymMap.Parser implementation instead of the SolrResourceLoader adapter approach, based on the excellent suggestion by Steve Rowe. The only caveat is that this requires making the SynonymFilterFactory.loadSynonyms method protected instead of private, which seemed like a good trade-off for being able to plug in a different parser implementation.

          I've also improved the test logic to verify synonyms get applied correctly after core reload. Lastly, I cleaned up a bit of the ignoreCase handling code, such as lowercasing keys / values when building the internal data structures when ignoreCase == true.
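
The lowercasing step mentioned above can be pictured with a small Python sketch. It is a model of the idea only, not the patch's Java code: `normalize_mappings` is a hypothetical name, and combining the synonym lists of keys that collide after lowercasing is an assumption about how such collisions might be handled.

```python
def normalize_mappings(mappings, ignore_case):
    """Return a copy of `mappings`, lowercased when `ignore_case` is true.

    Keys that collide after lowercasing have their synonym lists combined,
    with duplicates dropped (an assumption; the patch may differ).
    """
    if not ignore_case:
        return {key: list(values) for key, values in mappings.items()}
    normalized = {}
    for key, values in mappings.items():
        target = normalized.setdefault(key.lower(), [])
        for value in values:
            lowered = value.lower()
            if lowered not in target:
                target.append(lowered)
    return normalized


raw = {"GB": ["GiB", "Gigabyte"], "gb": ["gigabyte"], "TV": ["Television"]}
print(normalize_mappings(raw, ignore_case=True))
print(normalize_mappings(raw, ignore_case=False))
```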

          Steve Rowe added a comment -

          > Here's an updated patch that uses a custom SynonymMap.Parser implementation instead of the SolrResourceLoader adapter approach, based on the excellent suggestion by Steve Rowe. The only caveat is that this requires making the SynonymFilterFactory.loadSynonyms method protected instead of private, which seemed like a good trade-off for being able to plug in a different parser implementation.

          Looks great, thanks for making this change. I see no problem with making SynonymFilterFactory.loadSynonyms() protected.

          > I've also improved the test logic to verify synonyms get applied correctly after core reload.

          Cool, good test addition.

          The attached patch includes a CHANGES.txt entry and some minor cleanups:

          • Removed the custom boolean parsing logic in ManagedSynonymFilterFactory.getIgnoreCase() in favor of the new NamedList.getBooleanArg() method.
          • Added missing braces around single-line statement blocks after if and for.
          • Added Locale.ROOT as the first arg to several String.format() invocations to make ant precommit calm down.
          • Converted explicit types in generic constructor invocations to the diamond operator.
          • Converted schema references to ManagedSynonymFilterFactory from using the full package to the short form prefix "solr.".

          I think it's ready to go. I'll commit to trunk shortly.

          ASF subversion and git services added a comment -

          Commit 1584211 from Steve Rowe in branch 'dev/trunk'
          [ https://svn.apache.org/r1584211 ]

          SOLR-5654: Create a synonym filter factory that is (re)configurable, and capable of reporting its configuration, via REST API

          ASF subversion and git services added a comment -

          Commit 1585147 from sarowe@apache.org in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1585147 ]

          SOLR-5654: Create a synonym filter factory that is (re)configurable, and capable of reporting its configuration, via REST API (merged trunk r1584211)

          ASF subversion and git services added a comment -

          Commit 1585148 from sarowe@apache.org in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1585148 ]

          SOLR-5654: Add CHANGES.txt entry

          Steve Rowe added a comment -

          Committed to trunk and branch_4x.

          Thanks Tim!


            People

            • Assignee:
              Steve Rowe
              Reporter:
              Steve Rowe
            • Votes:
              1
              Watchers:
              6
