Solr
  1. Solr
  2. SOLR-6015

managed synonyms don't gracefully handle lowercasing

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.9, 6.0
    • Component/s: None
    • Labels:
      None

      Description

      I've been having bad luck testing new functionallity lately - the first thing I try never works

      /opt/code/lusolr48/solr/example/solr/collection1/conf$ curl -XPUT "http://localhost:8983/solrsis/synonyms/english"   -H 'Content-type:application/json'   --data-binary '{"MB":["MiB","Megabyte"]}'
      {
        "responseHeader":{
          "status":500,
          "QTime":3},
        "error":{
          "msg":"Bad Request",
          "trace":"Bad Request (400) - Unsupported value null for mb; expected single value or a JSON array!\n
      [...]
      

      I finally figured out that if I lowercased "MB" to "mb", then it works as expected.
      Also, it looks like because ignoreCase is true in the initArgs, everything is stored as lower case in the managed map (losing information). Ideally case would be preserved in the synonym file (esp if one wanted to change ignoreCase either way). I imagine stopwords may have the same issue, but I haven't checked.

      1. SOLR-6015.patch
        14 kB
        Timothy Potter

        Activity

        Hide
        Timothy Potter added a comment -

        Here's a patch that addresses the issue with MB vs. mb and handles ignoreCase setting better (I think). Would appreciate some feedback on this before committing.

        The main change here is that I'm preserving case using an inner class named CasePreservingSynonymMappings regardless of the ignoreCase setting. Here's an example of how that looks at runtime:

        {
        "mb" :

        { "MB":["Megabyte","MiB"], "mb":["megabyte"] }

        }

        Thus, the ignoreCase setting isn't applied on store, rather it is applied when the managed synonym mappings data is "viewed". For instance, a get request for the "MB" child with ignoreCase==true would yield a merged list, such as:

        "MB":["Megabyte","MiB","megabyte"]

        This brings me to my first question. Should we only return one form when mappings overlap as is the case with "Megabyte" and "megabyte"? Right now, it returns both forms but with ignoreCase==true, maybe it should return only one of those? Again this is a view and both are stored so if you switch ignoreCase, then there's no information lost.

        If ignoreCase is false and you request the mappings for "MB", then you just get:

        "MB":["Megabyte","MiB"]

        It follows that if ignoreCase == false and the client asks for "Mb", then they get a 404.

        The second question is about switching the ignoreCase setting, which the API allows. The previous code used to rebuild the map, but now that's not needed since we store the data as it was added and only apply the ignoreCase setting when the data is viewed. Am I overlooking something here?

        Lastly, you'll notice I'm applying the ignoreCase setting in the ManagedSynonymParser, which is done to match the behavior of the current SynonymMap parser/builder. I've compared the results of the text analysis performed by the existing SynonymFilterFactory and the ManagedSynonymFilterFactory and they create the same tokens.

        Show
        Timothy Potter added a comment - Here's a patch that addresses the issue with MB vs. mb and handles ignoreCase setting better (I think). Would appreciate some feedback on this before committing. The main change here is that I'm preserving case using an inner class named CasePreservingSynonymMappings regardless of the ignoreCase setting. Here's an example of how that looks at runtime: { "mb" : { "MB":["Megabyte","MiB"], "mb":["megabyte"] } } Thus, the ignoreCase setting isn't applied on store, rather it is applied when the managed synonym mappings data is "viewed". For instance, a get request for the "MB" child with ignoreCase==true would yield a merged list, such as: "MB": ["Megabyte","MiB","megabyte"] This brings me to my first question. Should we only return one form when mappings overlap as is the case with "Megabyte" and "megabyte"? Right now, it returns both forms but with ignoreCase==true, maybe it should return only one of those? Again this is a view and both are stored so if you switch ignoreCase, then there's no information lost. If ignoreCase is false and you request the mappings for "MB", then you just get: "MB": ["Megabyte","MiB"] It follows that if ignoreCase == false and the client asks for "Mb", then they get a 404. The second question is about switching the ignoreCase setting, which the API allows. The previous code used to rebuild the map, but now that's not needed since we store the data as it was added and only apply the ignoreCase setting when the data is viewed. Am I overlooking something here? Lastly, you'll notice I'm applying the ignoreCase setting in the ManagedSynonymParser, which is done to match the behavior of the current SynonymMap parser/builder. I've compared the results of the text analysis performed by the existing SynonymFilterFactory and the ManagedSynonymFilterFactory and they create the same tokens.
        Hide
        Steve Rowe added a comment -

        Tim,

        The approach looks great - cool that all ignoreCase changes are now supported.

        Two things I noticed:

        • One optimization you might consider in CasePreservingSynonymMappings.getMappings(): in the (very likely overwhelmingly ordinary) case of only one cased version of a key, you could just return the mappings from CasePreservingSynonymMappings rather than copying them to a new map, since no merging will be required.
        • You can drop the overridden ManagedSynonymFilterFactory.updateInitArgs(), since it just calls the superclass method of the same name (after validating params, which the superclass method also does).
        Show
        Steve Rowe added a comment - Tim, The approach looks great - cool that all ignoreCase changes are now supported. Two things I noticed: One optimization you might consider in CasePreservingSynonymMappings.getMappings() : in the (very likely overwhelmingly ordinary) case of only one cased version of a key, you could just return the mappings from CasePreservingSynonymMappings rather than copying them to a new map, since no merging will be required. You can drop the overridden ManagedSynonymFilterFactory.updateInitArgs() , since it just calls the superclass method of the same name (after validating params, which the superclass method also does).
        Hide
        ASF subversion and git services added a comment -

        Commit 1596928 from Timothy Potter in branch 'dev/trunk'
        [ https://svn.apache.org/r1596928 ]

        SOLR-6015: improved strategy for handling managed synonyms when ignoreCase=true

        Show
        ASF subversion and git services added a comment - Commit 1596928 from Timothy Potter in branch 'dev/trunk' [ https://svn.apache.org/r1596928 ] SOLR-6015 : improved strategy for handling managed synonyms when ignoreCase=true
        Hide
        ASF subversion and git services added a comment -

        Commit 1602948 from Timothy Potter in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1602948 ]

        SOLR-6015: Backport fixes from trunk to branch_4x.

        Show
        ASF subversion and git services added a comment - Commit 1602948 from Timothy Potter in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1602948 ] SOLR-6015 : Backport fixes from trunk to branch_4x.
        Hide
        ASF subversion and git services added a comment -

        Commit 1602950 from Timothy Potter in branch 'dev/trunk'
        [ https://svn.apache.org/r1602950 ]

        SOLR-6015: Moving change note to 4.9 bugfix section.

        Show
        ASF subversion and git services added a comment - Commit 1602950 from Timothy Potter in branch 'dev/trunk' [ https://svn.apache.org/r1602950 ] SOLR-6015 : Moving change note to 4.9 bugfix section.

          People

          • Assignee:
            Timothy Potter
            Reporter:
            Yonik Seeley
          • Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development