Lucene - Core
  1. Lucene - Core
  2. LUCENE-2198

support protected words in Stemming TokenFilters

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 3.0
    • Fix Version/s: 4.0-ALPHA
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      This is from LUCENE-1515

      I propose that all stemming TokenFilters have an 'exclusion set' that bypasses any stemming for words in this set.
      Some stemming tokenfilters have this, some do not.

      This would be one way for Karl to implement his new swedish stemmer (as a text file of ignore words).
      Additionally, it would remove duplication between lucene and solr, as they reimplement snowballfilter since it does not have this functionality.
      Finally, I think this is a pretty common use case, where people want to ignore things like proper nouns in the stemming.

      As an alternative design I considered a case where we generalized this to CharArrayMap (and ignoring words would mean mapping them to themselves), which would also provide a mechanism to override the stemming algorithm. But I think this is too expert, could be its own filter, and the only example of this i can find is in the Dutch stemmer.

      So I think we should just provide ignore with CharArraySet, but if you feel otherwise please comment.

      1. LUCENE-2198.patch
        74 kB
        Simon Willnauer
      2. LUCENE-2198.patch
        16 kB
        Simon Willnauer

        Issue Links

          Activity

          Hide
          Uwe Schindler added a comment -

          Committed revision: 903608

          Thanks Simon!

          Show
          Uwe Schindler added a comment - Committed revision: 903608 Thanks Simon!
          Hide
          Uwe Schindler added a comment -

          I take this one and will commit this in a day or two.

          When we get additional boolean attributes, I can open an issue, to provide an "combined" AttributeImpl for them to support heavy cloning.

          Show
          Uwe Schindler added a comment - I take this one and will commit this in a day or two. When we get additional boolean attributes, I can open an issue, to provide an "combined" AttributeImpl for them to support heavy cloning.
          Hide
          Uwe Schindler added a comment -

          I also don't understand the arguments about type safety. Token bundles multiple attributes together w/o loss of type safety, right?

          With "type safety" I mean, not attributes in general (they are type safe withy any impl). FlagAttribute itsself is not type safe, because everybody can store/update any bit in this integer. If you have two different filters updating the same bit but mean something other with the bit, it gets broken. Maybe other Filters just update the flags using no bit operations (because we have no support for these in the API).

          If we have more than one boolean attribute in lucene in future, we can extend DEFAULT_ATTRIBUTE_FACTORY to support this.

          My idea is to have a default AttributeImpl for boolean attributes that support things like set/get of a bit (like BitSet). You subclass it e.g. to generate a combined impl for 4 boolean interfaces we may have in futrure in Lucene core. In the ctor you pass the bitmasks and the impl of all boolean get/setters delegate to the generic BitSet-like methods. Clone and copyTo it then simple, as it only copies the word if the target AttributeImpl is the same class (like Token.copyTo).

          Show
          Uwe Schindler added a comment - I also don't understand the arguments about type safety. Token bundles multiple attributes together w/o loss of type safety, right? With "type safety" I mean, not attributes in general (they are type safe withy any impl). FlagAttribute itsself is not type safe, because everybody can store/update any bit in this integer. If you have two different filters updating the same bit but mean something other with the bit, it gets broken. Maybe other Filters just update the flags using no bit operations (because we have no support for these in the API). If we have more than one boolean attribute in lucene in future, we can extend DEFAULT_ATTRIBUTE_FACTORY to support this. My idea is to have a default AttributeImpl for boolean attributes that support things like set/get of a bit (like BitSet). You subclass it e.g. to generate a combined impl for 4 boolean interfaces we may have in futrure in Lucene core. In the ctor you pass the bitmasks and the impl of all boolean get/setters delegate to the generic BitSet-like methods. Clone and copyTo it then simple, as it only copies the word if the target AttributeImpl is the same class (like Token.copyTo).
          Hide
          Yonik Seeley added a comment -

          Only in the case that having multiple boolean attributes in the same stream, there is additional cost.

          But that's exactly what I was talking about when I said "if we should "bundle" this with other (future) binary flags". No, it doesn't look like we have any others now, I was talking about if we were going to have any others in the future.

          I also don't understand the arguments about type safety. Token bundles multiple attributes together w/o loss of type safety, right?

          If we have more than one boolean attribute in lucene in future, we can extend DEFAULT_ATTRIBUTE_FACTORY to support this.

          Hmm, ok - that was really my only concern - the number of clone() calls due to increasing numbers of attributes.

          Show
          Yonik Seeley added a comment - Only in the case that having multiple boolean attributes in the same stream, there is additional cost. But that's exactly what I was talking about when I said "if we should "bundle" this with other (future) binary flags". No, it doesn't look like we have any others now, I was talking about if we were going to have any others in the future. I also don't understand the arguments about type safety. Token bundles multiple attributes together w/o loss of type safety, right? If we have more than one boolean attribute in lucene in future, we can extend DEFAULT_ATTRIBUTE_FACTORY to support this. Hmm, ok - that was really my only concern - the number of clone() calls due to increasing numbers of attributes.
          Hide
          Uwe Schindler added a comment -

          Surely the native clone() invoked for every additional attribute counts for something?

          FlagsAttribute is not used anywhere in Lucene. So it does not matter if you have a separate attribute or the FlagsAttribute for cloning in this issue. Only in the case that having multiple boolean attributes in the same stream, there is additional cost. But this is really seldom, so type safety is more important and helps preventing bugs.

          And by the way, you can combine all attributes using a special AttributeFactory into the same AttributeImpl if you need speed (e.g. Token). Then you can have lots of boolean attributes with getters/setters, but all use the same AttributeImpl with the same bitset. If we have more than one boolean attribute in lucene in future, we can extend DEFAULT_ATTRIBUTE_FACTORY to support this.

          Show
          Uwe Schindler added a comment - Surely the native clone() invoked for every additional attribute counts for something? FlagsAttribute is not used anywhere in Lucene. So it does not matter if you have a separate attribute or the FlagsAttribute for cloning in this issue. Only in the case that having multiple boolean attributes in the same stream, there is additional cost. But this is really seldom, so type safety is more important and helps preventing bugs. And by the way, you can combine all attributes using a special AttributeFactory into the same AttributeImpl if you need speed (e.g. Token). Then you can have lots of boolean attributes with getters/setters, but all use the same AttributeImpl with the same bitset. If we have more than one boolean attribute in lucene in future, we can extend DEFAULT_ATTRIBUTE_FACTORY to support this.
          Hide
          Robert Muir added a comment -

          Surely the native clone() invoked for every additional attribute counts for something?

          But it looks to me like only each Attribute*Impl* is cloned, so if you are worried about this and using a lot of attributes you could use your own AttributeFactory, similar to Token.TOKEN_ATTRIBUTE_FACTORY to pack everything however you see fit, right?

          I think the interface as boolean is correct, and I think the reference implementation is correct too.

          Show
          Robert Muir added a comment - Surely the native clone() invoked for every additional attribute counts for something? But it looks to me like only each Attribute*Impl* is cloned, so if you are worried about this and using a lot of attributes you could use your own AttributeFactory, similar to Token.TOKEN_ATTRIBUTE_FACTORY to pack everything however you see fit, right? I think the interface as boolean is correct, and I think the reference implementation is correct too.
          Hide
          Uwe Schindler added a comment -

          My problem with FlagsAttribute is missing "type safety". You have to choose an bit mask for "your" attribute but another TokenFilter in your streanm could use the same bit mask. So in my opinion, FlagsAttribute should be deprectated and replaced by simple boolean attributes everybody can define type safe.

          About the speed of cloning: Clone was slow in old java versions, but now it is done directly in the JVM. Cloning an Attribute using the following code is much slower than invoking clone() [please note, not because of the reflections, it there to show how it should be implemented):

          AttributeImpl clone = this.getClass().newInstance();
          this.copyTo(clone);
          return clone;
          

          Michi Busch and me are currently investigating fast Attribute Proxies (needed for flex MultiEnums) and also fast capturing of states using CGLIB.

          Show
          Uwe Schindler added a comment - My problem with FlagsAttribute is missing "type safety". You have to choose an bit mask for "your" attribute but another TokenFilter in your streanm could use the same bit mask. So in my opinion, FlagsAttribute should be deprectated and replaced by simple boolean attributes everybody can define type safe. About the speed of cloning: Clone was slow in old java versions, but now it is done directly in the JVM. Cloning an Attribute using the following code is much slower than invoking clone() [please note, not because of the reflections, it there to show how it should be implemented): AttributeImpl clone = this .getClass().newInstance(); this .copyTo(clone); return clone; Michi Busch and me are currently investigating fast Attribute Proxies (needed for flex MultiEnums) and also fast capturing of states using CGLIB.
          Hide
          Simon Willnauer added a comment -

          I kind of agree with both of you. When I started implementing this attribute I had FlagAttribute in mind but I didn't choose it because users can randomly choose a bit of the word which might lead to unexpected behavior.

          Another solution I had in mind is to introduce another Attribute (or extend FlagAttribute) holding a Lucene private (not the java visibility keyword) Enum that can be extended in the future. Internally this could use a word or a Bitset (a word will do I guess) where bits can be set according to the enum ord. That way we could encode way more than only one single boolean and the cost of adding new "flags" / enum values would be minimal.

          booleanAttribute.isSet(BooelanAttributeEnum.Keyword)
          

          something like that, thoughts?

          Show
          Simon Willnauer added a comment - I kind of agree with both of you. When I started implementing this attribute I had FlagAttribute in mind but I didn't choose it because users can randomly choose a bit of the word which might lead to unexpected behavior. Another solution I had in mind is to introduce another Attribute (or extend FlagAttribute) holding a Lucene private (not the java visibility keyword) Enum that can be extended in the future. Internally this could use a word or a Bitset (a word will do I guess) where bits can be set according to the enum ord. That way we could encode way more than only one single boolean and the cost of adding new "flags" / enum values would be minimal. booleanAttribute.isSet(BooelanAttributeEnum.Keyword) something like that, thoughts?
          Hide
          Yonik Seeley added a comment -

          If we start to discuss about such topics, we should not have used Attributes at all.

          Attributes are great for allowing extensions. But for attributes that often appear together, it seems like we should at least consider bundling them.
          You seem to be implying that we shouldn't question the cost of doing things one way vs another?

          There is no additional cost

          Surely the native clone() invoked for every additional attribute counts for something?

          Because the State object is build internally before and just cloned. So the number of attributes vs. the size of the captured states do not affect performance in neglectible manner.

          Not sure I understand... if captureState() is invoked, isn't .clone() called on each attribute?

          Show
          Yonik Seeley added a comment - If we start to discuss about such topics, we should not have used Attributes at all. Attributes are great for allowing extensions. But for attributes that often appear together, it seems like we should at least consider bundling them. You seem to be implying that we shouldn't question the cost of doing things one way vs another? There is no additional cost Surely the native clone() invoked for every additional attribute counts for something? Because the State object is build internally before and just cloned. So the number of attributes vs. the size of the captured states do not affect performance in neglectible manner. Not sure I understand... if captureState() is invoked, isn't .clone() called on each attribute?
          Hide
          Uwe Schindler added a comment - - edited

          If we start to discuss about such topics, we should not have used Attributes at all.

          There is no additional cost, because with 3.0, the FlagAttribute is used nowhere, so it is not captured, too. We would just move this to another attribute, the cost is the same. And capturing a boolean costs no much. Because the State object is build internally before and just cloned. So the number of attributes vs. the size of the captured states do not affect performance in neglectible manner.

          Show
          Uwe Schindler added a comment - - edited If we start to discuss about such topics, we should not have used Attributes at all. There is no additional cost, because with 3.0, the FlagAttribute is used nowhere, so it is not captured, too. We would just move this to another attribute, the cost is the same. And capturing a boolean costs no much. Because the State object is build internally before and just cloned. So the number of attributes vs. the size of the captured states do not affect performance in neglectible manner.
          Hide
          Yonik Seeley added a comment -

          One random thought is to wonder how much this adds to the cost of capturing state - and if we should "bundle" this with other (future) binary flags into a word (this is what the original "flags" field on Token was for).

          Show
          Yonik Seeley added a comment - One random thought is to wonder how much this adds to the cost of capturing state - and if we should "bundle" this with other (future) binary flags into a word (this is what the original "flags" field on Token was for).
          Hide
          Uwe Schindler added a comment -

          The KeywordAttribute and *Impl looks correct. copyTo() behaves also correctly (this is important in some cases!).

          Show
          Uwe Schindler added a comment - The KeywordAttribute and *Impl looks correct. copyTo() behaves also correctly (this is important in some cases!).
          Hide
          Simon Willnauer added a comment -

          This patch ports all stemmers in core and contrib/analyzers to make use of the KeywordAttribute.
          I did not include snowball yet.

          Show
          Simon Willnauer added a comment - This patch ports all stemmers in core and contrib/analyzers to make use of the KeywordAttribute. I did not include snowball yet.
          Hide
          Robert Muir added a comment -

          the patch looks great to me Simon. I especially like the naming.

          Show
          Robert Muir added a comment - the patch looks great to me Simon. I especially like the naming.
          Hide
          Simon Willnauer added a comment -

          This patch contains an intial design proposal. I tried to name the new attribute a little bit more generic as this could easily be used outside of the stemming domain.

          all tests pass – comments welcome.

          Show
          Simon Willnauer added a comment - This patch contains an intial design proposal. I tried to name the new attribute a little bit more generic as this could easily be used outside of the stemming domain. all tests pass – comments welcome.
          Hide
          Erik Hatcher added a comment -

          +1 on the StemAttribute approach. I've just encountered this exact need in some custom code I've been reviewing, where the decision to stem or not is dynamic per term (with the approach I'm looking at using a custom term type string and a custom stem filter).

          Show
          Erik Hatcher added a comment - +1 on the StemAttribute approach. I've just encountered this exact need in some custom code I've been reviewing, where the decision to stem or not is dynamic per term (with the approach I'm looking at using a custom term type string and a custom stem filter).
          Hide
          Robert Muir added a comment -

          hi Simon, the more i think about it, the more i like your design.

          this way, for example, if someone has pos tagging/named entity recognition or a similar sophisticated NLP process, they could change the value of this attribute and prevent these from being stemmed at runtime. with static sets this is not possible.

          Show
          Robert Muir added a comment - hi Simon, the more i think about it, the more i like your design. this way, for example, if someone has pos tagging/named entity recognition or a similar sophisticated NLP process, they could change the value of this attribute and prevent these from being stemmed at runtime. with static sets this is not possible.
          Hide
          Simon Willnauer added a comment -

          So I think we should just provide ignore with CharArraySet, but if you feel otherwise please comment.

          While I read your proposal a possibly more flexible design came to my mind. We could introduce a StemAttribute that has a method public boolean stem() used by every stemmer to decide if a token should be stemmed. That way we decouple the decision if a token should be stemmed from the stemming algorithm. This also enables custom filters to set the values based on other reasons aside from a term being in a set.
          The default value for sure it true but can be set on any condition. inside an analyzer we can add a filter right before the stemmer based on a CharArraySet. Yet if the set is empty or null we simply leave the filter out.

          Show
          Simon Willnauer added a comment - So I think we should just provide ignore with CharArraySet, but if you feel otherwise please comment. While I read your proposal a possibly more flexible design came to my mind. We could introduce a StemAttribute that has a method public boolean stem() used by every stemmer to decide if a token should be stemmed. That way we decouple the decision if a token should be stemmed from the stemming algorithm. This also enables custom filters to set the values based on other reasons aside from a term being in a set. The default value for sure it true but can be set on any condition. inside an analyzer we can add a filter right before the stemmer based on a CharArraySet. Yet if the set is empty or null we simply leave the filter out.

            People

            • Assignee:
              Uwe Schindler
              Reporter:
              Robert Muir
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development