Solr
  1. Solr
  2. SOLR-2866

Marked synonym filter for selective token expansion

    Details

      Description

      Hi everybody,

      My name is Victor van der Wolf and since recently I work for the Royal Library in the Netherlands. One of my first assignments here was to see if I could implement some stemming algorithm for our websites. Our search engine is solr/lucene 3.4.

      Basically I had 2 requirements to work with:

      1) It should be possible to switch the stemming functionality on and off in the front end
      2) No extra storage should be required (no extra indexing).

      I shortly came to the conclusion that it would be practical to use the SynonymFilter to do that. I got hold of a dutch library and used a stemming algorithm to generate a synonym file on that.

      Then I thought that I could maybe use 2 different query analyzers under the "field type" and then call one or the other depending if I want stemming or not, like this q=<field>:<analyzer>:<search term>. Unfortunately this did not seem possible.

      Then, after some discussions with Erick Erickson, it became clear that a good approach could be to write my own SynonymFilter and apply some kind of token marking to decide it that token should be "synonymized" or not. Well, I did just that and it works like a charm.

      I would like to contribute this MarkedSynonymFilter class to the project.

      I used the SynonymFilter class as a starting point and added some extra functionality to that. First of all, I added 3 new parameters called lookup, preMark and postmark. The preMark and postmark parameters contain some kind of pre- and suffix to recognize if a token should be "synonymized" or not. A simple regex is used to determine this. Then the lookup parameter determines the behaviour of the MarkedSynonymFilter:

      lookup=marked - marked tokens will be synonymized
      lookup=unmarked - unmarked tokens will be synonymized
      lookup=all - all tokens should be synonymized
      lookup=none - none of the tokens should be synonymized

      I started out writing this based on version 3.3, later I discovered that we were using 3.4 and I had to upgrade it. Unfortunately the whole SynonymFilter code has been revised and for the moment there is the Slow and the Fast synonym filter where the Slow one if depricated. My addition is based on the slow version I am afraid.

      Anyway, I am curious about your comments. Please let me know if I should go forward with this and create a JIRA issue + my code as a patch.

      Cheers,

      Victor van der Wolf

      1. SlowMarkedSynonymFilterFactory.java
        7 kB
        Victor van der Wolf
      2. SlowMarkedSynonymFilter.java
        11 kB
        Victor van der Wolf
      3. MarkedSynonymFilterFactory.java
        3 kB
        Victor van der Wolf

        Activity

        Hide
        Uwe Schindler added a comment -

        Move issue to Solr 4.9.

        Show
        Uwe Schindler added a comment - Move issue to Solr 4.9.
        Hide
        Steve Rowe added a comment -

        Bulk move 4.4 issues to 4.5 and 5.0

        Show
        Steve Rowe added a comment - Bulk move 4.4 issues to 4.5 and 5.0
        Hide
        Hoss Man added a comment -

        Bulk of fixVersion=3.6 -> fixVersion=4.0 for issues that have no assignee and have not been updated recently.

        email notification suppressed to prevent mass-spam
        psuedo-unique token identifying these issues: hoss20120321nofix36

        Show
        Hoss Man added a comment - Bulk of fixVersion=3.6 -> fixVersion=4.0 for issues that have no assignee and have not been updated recently. email notification suppressed to prevent mass-spam psuedo-unique token identifying these issues: hoss20120321nofix36
        Hide
        Mike added a comment -

        Hi. FYI, I've created a new issue, SOLR-3099, that is requesting that this feature be supported in the index and the edismax parser. I don't think the overlap is huge, but that seemed like a better approach to me, so I've created a branch of the conversation over there.

        Show
        Mike added a comment - Hi. FYI, I've created a new issue, SOLR-3099 , that is requesting that this feature be supported in the index and the edismax parser. I don't think the overlap is huge, but that seemed like a better approach to me, so I've created a branch of the conversation over there.
        Hide
        Mike added a comment -

        This seems like a strange solution to me. The main problem I see is that this requires a huge synonym file, which has to be created before the filter can be used, and which will inevitably fail some of the time.

        If I understand this correctly, I see three parts to the solution:
        1. We need a new flag when indexing. Something like stems=True on the string type. When this flag is enabled, the index stores the stemmed versions of terms in addition to the unstemmed versions.
        2. On the query side, a new operator is needed, as is mentioned by Robert Muir. In Sphinx search, they use the equals sign (=), so that queries like ="signing agreement" or =signing can be made. The query parser can then identify the operator, and decide which word map to use, stemmed or not.
        3. The "lookup" parameter makes sense to include as well, though I'd suggest we call it exactMatch instead, if possible. I don't see the value in the "unmarked" option though. How is this different than "all"?

        This is probably a more complicated solution than the one proposed, and I'm fairly new to Solr, but I'd hate to see a solution involving long text files land, and for the correct solution to be put off as a result (though I know this is code we have now).

        A possibly-related issue is SOLR-1980, which is implementing "boundary match support". Almost seems like that feature could do double duty as exact match somehow (haven't thought that entirely through though).

        Show
        Mike added a comment - This seems like a strange solution to me. The main problem I see is that this requires a huge synonym file, which has to be created before the filter can be used, and which will inevitably fail some of the time. If I understand this correctly, I see three parts to the solution: 1. We need a new flag when indexing. Something like stems=True on the string type. When this flag is enabled, the index stores the stemmed versions of terms in addition to the unstemmed versions. 2. On the query side, a new operator is needed, as is mentioned by Robert Muir. In Sphinx search, they use the equals sign (=), so that queries like ="signing agreement" or =signing can be made. The query parser can then identify the operator, and decide which word map to use, stemmed or not. 3. The "lookup" parameter makes sense to include as well, though I'd suggest we call it exactMatch instead, if possible. I don't see the value in the "unmarked" option though. How is this different than "all"? This is probably a more complicated solution than the one proposed, and I'm fairly new to Solr, but I'd hate to see a solution involving long text files land, and for the correct solution to be put off as a result (though I know this is code we have now ). A possibly-related issue is SOLR-1980 , which is implementing "boundary match support". Almost seems like that feature could do double duty as exact match somehow (haven't thought that entirely through though).
        Hide
        Steve Rowe added a comment -

        I re-read the issue description and looked closer at the filter code and I can see that I was wrong about several things - sorry for the noise:

        1. I was wrong about the naming - "Marked" is exactly right; I misinterpreted the premark and postmark matching/stripping as a form of stemming, which it isn't.
        2. My suggestions about combining premark and postmark options and making them regexes don't make much sense either - it's very likely that whatever generates the marks will use a fixed string.

        However, I still don't understand the need for the "marked" ctor param - why use this filter if you don't want its behavior?

        Show
        Steve Rowe added a comment - I re-read the issue description and looked closer at the filter code and I can see that I was wrong about several things - sorry for the noise: I was wrong about the naming - "Marked" is exactly right; I misinterpreted the premark and postmark matching/stripping as a form of stemming, which it isn't. My suggestions about combining premark and postmark options and making them regexes don't make much sense either - it's very likely that whatever generates the marks will use a fixed string. However, I still don't understand the need for the "marked" ctor param - why use this filter if you don't want its behavior?
        Hide
        Robert Muir added a comment -

        This trick can only work for "expanding" stemming. Investigate possibility of extending HunSpell stemmer to work in expand mode?

        Its true stuff like this can be done... though in my opinion the cleanest way is to add something like a 'stem' operator to
        a queryparser that would support this.

        you can solve it with hunspell by analyzing the term then generating, (like 'unmunch')
        or, you can implement an algorithmic approach with automatonquery (expanding to scoring boolean query, even boosting exact match)

        With approaches like these you don't do anything special at index time, just leave the words unstemmed.

        Show
        Robert Muir added a comment - This trick can only work for "expanding" stemming. Investigate possibility of extending HunSpell stemmer to work in expand mode? Its true stuff like this can be done... though in my opinion the cleanest way is to add something like a 'stem' operator to a queryparser that would support this. you can solve it with hunspell by analyzing the term then generating, (like 'unmunch') or, you can implement an algorithmic approach with automatonquery (expanding to scoring boolean query, even boosting exact match) With approaches like these you don't do anything special at index time, just leave the words unstemmed.
        Hide
        Steve Rowe added a comment -

        This sounds interesting.

        Some questions:

        1. Couldn’t you combine and generalize the Premark and Postmark options, to be a single regular expression option? Full regex capability, including matching on things that are neither prefix nor postfix, would increase the usefulness of this filter.
        2. Under what circumstances would lookup=all and lookup=none option values be useful? Seems like you could just use SynonymFilter (for lookup=all) or no filter at all (for lookup=none) instead? Looking at your code, this question becomes: why do you need the marked parameter to the filter ctor? If that were eliminated, the invert option would be sufficient to enable lookup=marked and lookup=unmarked.

        Also, a naming question: "Marked" to me implies a separate process that only adds a mark for later processing, but I think you mean something like "matching" instead? My suggestion: SelectiveSynonymFilter.

        One more naming issue: I don't think "Slow" should be part of the class names.

        Show
        Steve Rowe added a comment - This sounds interesting. Some questions: Couldn’t you combine and generalize the Premark and Postmark options, to be a single regular expression option? Full regex capability, including matching on things that are neither prefix nor postfix, would increase the usefulness of this filter. Under what circumstances would lookup=all and lookup=none option values be useful? Seems like you could just use SynonymFilter (for lookup=all ) or no filter at all (for lookup=none ) instead? Looking at your code, this question becomes: why do you need the marked parameter to the filter ctor? If that were eliminated, the invert option would be sufficient to enable lookup=marked and lookup=unmarked . Also, a naming question: "Marked" to me implies a separate process that only adds a mark for later processing, but I think you mean something like "matching" instead? My suggestion: SelectiveSynonymFilter. One more naming issue: I don't think "Slow" should be part of the class names.
        Hide
        Jan Høydahl added a comment -

        A few thoughts:
        We should investigate a better way of passing paramters to Analysis than encoding terms in a special way
        q=cars&enableQuerySideSynonyms=true looks better than e.g. q=cars_syn

        You could optimize the regular expression matching by pre-compiling the regex

        This trick can only work for "expanding" stemming. Investigate possibility of extending HunSpell stemmer to work in expand mode?

        Show
        Jan Høydahl added a comment - A few thoughts: We should investigate a better way of passing paramters to Analysis than encoding terms in a special way q=cars&enableQuerySideSynonyms=true looks better than e.g. q=cars_syn You could optimize the regular expression matching by pre-compiling the regex This trick can only work for "expanding" stemming. Investigate possibility of extending HunSpell stemmer to work in expand mode?
        Hide
        Jan Høydahl added a comment -

        Thanks a lot for your contibution. Would you care to upload your patch as SOLR-2866.patch? This is easier for others to try out. Here's an instruction on how to create a patch file: http://wiki.apache.org/solr/HowToContribute#Generating_a_patch

        Show
        Jan Høydahl added a comment - Thanks a lot for your contibution. Would you care to upload your patch as SOLR-2866 .patch? This is easier for others to try out. Here's an instruction on how to create a patch file: http://wiki.apache.org/solr/HowToContribute#Generating_a_patch

          People

          • Assignee:
            Unassigned
            Reporter:
            Victor van der Wolf
          • Votes:
            1 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:

              Development